MeCab is a tokenizer and POS (part-of-speech) tagger jointly developed by the Kyoto University Graduate School of Informatics and NTT. Development stopped in 2013, but it is still a useful tool. For exhaustive details, see http://taku910.github.io/mecab/. By default, MeCab works with the IPAdic dictionary, which was designed for modern Japanese, but it can be configured to work with more specialized dictionaries as well.
Installing MeCab and RMeCab on Windows is straightforward.
First, go to http://taku910.github.io/mecab and download the installer (mecab-0.996.exe), then install. When asked to choose a character set for the dictionary, select UTF-8.
Then start (or restart) RStudio and run the following line:
install.packages("RMeCab", repos = "http://rmecab.jp/R")
Installing MeCab on a Mac is substantially more complicated. If you are using a university computer, you will need local support to give you “admin” privileges, or “sudo” privileges. If they ask why, just show them this page. Better yet, have them do it for you. In fact, unless the following bash scripts make sense, just have local support do the installs.
First, either:
Install Xcode from the App Store, or
Open Terminal in the Utilities folder. Copy this line, paste it at the prompt, and hit return:
xcode-select --install
$ ls -la /usr/local
$ sudo chown $YOUR_USER_NAME:admin /usr/local && sudo chown -R $YOUR_USER_NAME:admin /usr/local
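Next, download mecab-0.996.tar.gz and mecab-ipadic-2.7.0-20070801.tar.gz from http://taku910.github.io/mecab/ into your Downloads folder, then build and install MeCab followed by the IPAdic dictionary.
$ cd ~/Downloads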
$ tar xf mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure --with-charset=utf8
$ make
$ sudo make install
$ cd ~/Downloads
$ tar xf mecab-ipadic-2.7.0-20070801.tar.gz
$ cd mecab-ipadic-2.7.0-20070801
$ ./configure --with-charset=utf-8
$ make
$ sudo make install
Test the installation by starting MeCab and typing a sentence.
$ mecab
これはペンです
これ 代名詞,,,,,,これ,コレ,コレ
は 助詞,係助詞,,,,,は,ワ,ワ
ペン 名詞,普通名詞,一般,,,,ペン,ペン,ペン
です 助動詞,,,,助動詞-デス,終止形-一般,です,デス,デス
EOS
Press Control-C to quit MeCab.
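Each line of MeCab output holds the surface form, a tab, and then a comma-separated list of features; EOS marks the end of a sentence. As a minimal sketch of how that output can be picked apart, the helper below prints the surface form and the first feature, which is the top-level part of speech in the output shown above. The name surface_and_pos is hypothetical and is not part of MeCab.

```shell
# Hypothetical helper (not part of MeCab): read MeCab output on stdin and
# print "surface<TAB>first feature" for every token, skipping EOS lines.
surface_and_pos() {
  awk -F'\t' '
    $0 == "EOS" { next }            # sentence boundary, not a token
    NF >= 2 {
      split($2, feat, ",")          # features are comma-separated
      print $1 "\t" feat[1]
    }'
}

# Usage, assuming mecab is on your PATH:
#   echo "これはペンです" | mecab | surface_and_pos
```

The function only reads text, so it can be tried on saved MeCab output as well as on a live pipe.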
Finally, start (or restart) RStudio and run the following line:
install.packages("RMeCab", repos = "http://rmecab.jp/R")
UniDic is a series of dictionary files developed by NINJAL (National Institute for Japanese Language and Linguistics). They allow more fine-tuned tokenizing of Japanese. MeCab can be directed to use UniDic dictionaries instead of IPAdic.
Download the file unidic-mecab-2.1.2_windows.exe from https://osdn.net/projects/unidic/releases/. Run the INSTALL.exe file. The installer will create a program called “unidic”.
There are several specialized versions of UniDic for early-modern and medieval Japanese. See https://unidic.ninjal.ac.jp/ for details.
Make a copy of the file Makefile.bat (as a backup) and change the name slightly. Using a text editor, rewrite the eleventh line of Makefile.bat: use a semicolon to “comment out” the line directing MeCab to ipadic, and point it instead to the directory where you have UniDic. You can also point it to a specialized UniDic dictionary.
; dicdir = $(rcpath)\..\dic\ipadic
dicdir = $(rcpath)\..\dic\unidic
You will need to build UniDic from source and run some shell commands.
Download the source files from http://sourceforge.jp/projects/unidic/downloads/58338/unidic-mecab-2.1.2_src.zip
In Terminal, run the following:
$ unzip unidic-mecab-2.1.2_src.zip
$ cd unidic-mecab-2.1.2_src/
$ ./configure
$ make
$ sudo make install
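Next, open the dicrc file in the installed UniDic directory (probably /usr/local/lib/mecab/dic/unidic) in a text editor. Comment out the three original lines below with ;; and add the replacement shown after each, so that UniDic output uses the same nine-field layout as IPAdic.
;; bos-feature = BOS/EOS,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*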
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*
;; node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-unidic = %m\t%f[0],%f[1],%f[2],%f[3],%f[4],%f[5],%f[10],%f[9],%f[11]\n
;; unk-format-unidic = %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
unk-format-unidic = %m\t%f[0],%f[1],%f[2],%f[3],%f[4],%f[5],%f[10],%f[9],%f[11]\n
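Then copy mecabrc to your home directory and edit the copy so that dicdir points to UniDic instead of IPAdic.
$ cp /usr/local/etc/mecabrc /Users/NAME/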
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
; Comment out the ipadic line with ; and add one line in its place
;dicdir = /usr/local/lib/mecab/dic/ipadic
dicdir = /usr/local/lib/mecab/dic/unidic
$ mv /Users/NAME/mecabrc /Users/NAME/.mecabrc
Now test MeCab with UniDic.
$ echo "どの方法を用ゐても良い、用ふことを怠るくらゐなら。" | mecab
どの 連体詞,,,,,,どの,ドノ,ドノ
方法 名詞,普通名詞,一般,,,,方法,ホーホー,ホーホー
を 助詞,格助詞,,,,,を,オ,オ
用ゐ 動詞,一般,,,文語上一段-ワ行,連用形-一般,用ゐる,モチー,モチール
て 助詞,接続助詞,,,,,て,テ,テ
も 助詞,係助詞,,,,,も,モ,モ
良い 形容詞,非自立可能,,,形容詞,連体形-一般,良い,ヨイ,ヨイ
、 補助記号,読点,,,,,、,,
用 名詞,普通名詞,一般,,,,用,ヨー,ヨー
ふ 接尾辞,名詞的,一般,,,,ふ,フ,フ
こと 名詞,普通名詞,一般,,,,こと,コト,コト
を 助詞,格助詞,,,,,を,オ,オ
怠る 動詞,一般,,,五段-ラ行,終止形-一般,怠る,オコタル,オコタル
くらゐ 助詞,副助詞,,,,,くらゐ,クライ,クライ
なら 助動詞,,,,助動詞-ダ,仮定形-一般,だ,ナラ,ダ
。 補助記号,句点,,,,,。,,
EOS
For comparison, here is the same sentence tokenized with IPAdic.
$ echo "どの方法を用ゐても良い、用ふことを怠るくらゐなら。" | mecab -d /usr/local/lib/mecab/dic/ipadic/
どの 連体詞,*,*,*,*,*,どの,ドノ,ドノ
方法 名詞,一般,*,*,*,*,方法,ホウホウ,ホーホー
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
用 名詞,一般,*,*,*,*,用,ヨウ,ヨー
ゐ 動詞,自立,*,*,一段,連用形,ゐる,ヰ,イ
て 助詞,接続助詞,*,*,*,*,て,テ,テ
も 助詞,係助詞,*,*,*,*,も,モ,モ
良い 形容詞,非自立,*,*,形容詞・アウオ段,基本形,良い,ヨイ,ヨイ
、 記号,読点,*,*,*,*,、,、,、
用 名詞,一般,*,*,*,*,用,ヨウ,ヨー
ふ 動詞,自立,*,*,五段・ラ行,体言接続特殊2,ふる,フ,フ
こと 名詞,非自立,一般,*,*,*,こと,コト,コト
を 助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
怠る 動詞,自立,*,*,五段・ラ行,基本形,怠る,オコタル,オコタル
くら 名詞,一般,*,*,*,*,くら,クラ,クラ
ゐ 動詞,自立,*,*,一段,連用形,ゐる,ヰ,イ
なら 助動詞,*,*,*,特殊・ダ,仮定形,だ,ナラ,ナラ
。 記号,句点,*,*,*,*,。,。,。
EOS
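The two outputs above differ mainly in where the token boundaries fall: UniDic keeps 用ゐ as one verb, while IPAdic splits it into 用 and ゐ. A rough way to see such differences at a glance is to reduce each output to its surface forms. The sketch below does that with a hypothetical helper named segmentation, which is not part of MeCab and assumes the default tab between surface form and features.

```shell
# Hypothetical helper: collapse MeCab output on stdin into one line per
# sentence, with the surface forms joined by "|".
segmentation() {
  awk -F'\t' '
    $0 == "EOS" { print line; line = ""; next }   # end of sentence: flush
    $1 != ""    { line = (line == "" ? $1 : line "|" $1) }'
}

# Usage, assuming both dictionaries are installed at the paths used above:
#   text="どの方法を用ゐても良い、用ふことを怠るくらゐなら。"
#   echo "$text" | mecab | segmentation
#   echo "$text" | mecab -d /usr/local/lib/mecab/dic/ipadic/ | segmentation
```

Saving the two results to files and comparing them with diff shows exactly which stretches of text the dictionaries segment differently.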
NINJAL’s many specialized dictionaries can be used (with some effort) with RMeCab. The files are at https://unidic.ninjal.ac.jp/download_all#unidic_chj
Unzip the files and, for each dictionary (e.g. 近代文語 or 中古和文), duplicate the folder unidic-mecab. Each copied folder should contain the dictionary files, including dicrc.
Rename the folders unidic_kindai and unidic_chuko as appropriate. Copy or move these two folders to the directory containing the base unidic folder, probably /usr/local/lib/mecab/dic. Search for the file dicrc to be sure.
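The search for dicrc can be done from Terminal. The sketch below lists every directory under a given root that contains a dicrc file, so it should show the base unidic folder alongside any copies you have added. The function name list_dictionaries is hypothetical, and the default root is only the path assumed in the text. If mecab-config is on your PATH, mecab-config --dicdir should print the dictionary root MeCab was built with.

```shell
# Hypothetical helper: list every directory under ROOT that contains a
# dicrc file, i.e. every directory that looks like a MeCab dictionary.
list_dictionaries() {
  root=${1:-/usr/local/lib/mecab/dic}   # default is the path assumed in the text
  find "$root" -type f -name dicrc 2>/dev/null |
    while IFS= read -r rc; do
      dirname "$rc"                     # keep the folder, drop the file name
    done | sort
}

# Usage:
#   list_dictionaries                   # search the default location
#   list_dictionaries /some/other/dir   # or search somewhere else
```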
You can now direct MeCab to either of these alternative dictionaries by editing the .mecabrc file, which is at /Users/NAME/.mecabrc. For example, to use 近代文語UniDic:
;; dicdir = /usr/local/lib/mecab/dic/unidic
dicdir = /usr/local/lib/mecab/dic/unidic_kindai
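If you switch dictionaries often, the same edit can be scripted. The following is only a sketch: set_dicdir is a hypothetical helper, not part of MeCab, and it assumes the active setting is a line beginning with dicdir, as in the example above. It comments out any active dicdir line and writes the new one in its place.

```shell
# Hypothetical helper: comment out every active dicdir line in RCFILE and
# put a new dicdir line pointing at NEWDIR where the first one was.
set_dicdir() {
  rcfile=$1
  newdir=$2
  tmp=$(mktemp) || return 1
  awk -v d="$newdir" '
    /^[ \t]*dicdir[ \t]*=/ {
      print ";" $0                                # disable the old setting
      if (!done) { print "dicdir = " d; done = 1 }
      next
    }
    { print }                                     # keep everything else
    END { if (!done) print "dicdir = " d }        # no active line: append
  ' "$rcfile" > "$tmp" && mv "$tmp" "$rcfile"
}

# Usage (NAME is the placeholder user name from the text):
#   cp /Users/NAME/.mecabrc /Users/NAME/.mecabrc.bak   # keep a backup first
#   set_dicdir /Users/NAME/.mecabrc /usr/local/lib/mecab/dic/unidic_kindai
```

For a one-off run there is no need to edit the file at all: the -d option used in the IPAdic comparison above overrides dicdir for a single command.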
It’s worth noting that the differences among these dictionaries can be rather small. For example, on the test case of Kitamura Tokoku’s Naibu seimeiron (北村透谷著、内部生命論), suggested by NINJAL, the token counts are:
Tokenizer | Number of tokens |
---|---|
現代語 | 5447 |
近代文語 | 5309 |
中古和文 | 5319 |
IPAdic | 5363 |
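Counts of this kind can be approximated from the command line. The sketch below is not necessarily how the figures in the table were produced: it assumes that every non-empty output line other than EOS is one token, punctuation included, so your numbers may differ. The helper name count_tokens and the file name naibu_seimeiron.txt are hypothetical.

```shell
# Hypothetical helper: count tokens in MeCab output read from stdin.
# Every non-empty line other than the EOS marker is counted as one token.
count_tokens() {
  awk '$0 != "" && $0 != "EOS" { n++ } END { print n + 0 }'
}

# Usage, assuming the text is saved as naibu_seimeiron.txt and the
# dictionaries are installed under /usr/local/lib/mecab/dic:
#   for dic in unidic unidic_kindai unidic_chuko ipadic; do
#     printf '%s\t' "$dic"
#     mecab -d "/usr/local/lib/mecab/dic/$dic" naibu_seimeiron.txt | count_tokens
#   done
```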
Nonetheless, if you want exacting tokenizing and POS tagging, you have three options. Of course, NINJAL’s newest web-based GUI, http://chamame.ninjal.ac.jp/, offers ten options.