What is MeCab?

MeCab is a tokenizer and POS (part-of-speech) tagger jointly developed by the Kyoto University Graduate School of Informatics and NTT. Development stopped in 2013, but it is still a useful tool. For exhaustive details see http://taku910.github.io/mecab/. By default, MeCab works with the IPAdic dictionary, which was designed for modern Japanese, but it can be configured to work with more specialized dictionaries as well.

Installing on Windows (32/64)

Installing MeCab and RMeCab on Windows is straightforward.

  • First, go to http://taku910.github.io/mecab and download the installer (mecab-0.996.exe), then install. When asked to choose a character set for the dictionary, select UTF-8.

  • Then start (or restart) RStudio and run the following line

install.packages("RMeCab", repos = "http://rmecab.jp/R")

Installing on Mac

Installing MeCab on a Mac is substantially more complicated. If you are using a university computer, you will need local support to give you “admin” privileges, or “sudo” privileges. If they ask why, just show them this page. Better yet, have them do it for you. In fact, unless the following bash scripts make sense, just have local support do the installs.

  • First, either:

    • Install Xcode from the AppStore, or

    • Open Terminal in the Utilities folder. Copy this line, paste at the prompt, and hit return

xcode-select --install
  • Now from Terminal, confirm that the directory /usr/local exists
$ ls -la /usr/local
  • If there’s no error message, get write privileges to the relevant directories with the command below, replacing $YOUR_USER_NAME with your account name on the Mac. Enter your password at the prompt.
$ sudo chown $YOUR_USER_NAME:admin /usr/local && sudo chown -R $YOUR_USER_NAME:admin /usr/local
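  • To install MeCab, download the source archive mecab-0.996.tar.gz from the MeCab site above into your Downloads folder, then build and install it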
$ cd  ~/Downloads
$ tar xf mecab-0.996.tar.gz
$ cd mecab-0.996
$ ./configure --with-charset=utf8
$ make
$ sudo make install
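  • To install IPAdic, the basic dictionary for MeCab, download mecab-ipadic-2.7.0-20070801.tar.gz from the same site into your Downloads folder, then build and install it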
 $ cd ~/Downloads 
 $ tar xf mecab-ipadic-2.7.0-20070801.tar.gz
 $ cd mecab-ipadic-2.7.0-20070801
 $ ./configure --with-charset=utf-8
 $ make
 $ sudo make install
  • To check that MeCab is correctly installed, in Terminal, run the first line below to start MeCab. Then run the second line to give MeCab some text to parse.
$ mecab
これはペンです
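  • You should see MeCab’s tokenization of the text, including POS (part of speech) tagging. The sample below uses UniDic-style tags (UniDic is covered later on this page); with only IPAdic installed, the tag names and field layout will look different, but you should still get one line per token followed by EOS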
これ  代名詞,,,,,,これ,コレ,コレ
は   助詞,係助詞,,,,,は,ワ,ワ
ペン  名詞,普通名詞,一般,,,,ペン,ペン,ペン
です  助動詞,,,,助動詞-デス,終止形-一般,です,デス,デス
EOS
  • Press Control-C to quit MeCab

  • Finally, start (or restart) RStudio and run the following line

install.packages("RMeCab", repos = "http://rmecab.jp/R")
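
If the RMeCab install fails, it is worth confirming from Terminal that MeCab itself is where the system expects it. Here is a minimal sketch, assuming a POSIX shell. The helper names have_cmd and check_mecab are made up for this example; mecab-config is the small configuration script installed alongside MeCab.

```shell
# Hypothetical helper: succeed if a command is available on PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Hypothetical helper: report where mecab lives and which dictionary
# directory it was configured with.
check_mecab() {
  if ! have_cmd mecab; then
    echo "mecab not found on PATH"
    return 1
  fi
  echo "mecab: $(command -v mecab)"
  # mecab-config --dicdir prints the configured dictionary directory.
  if have_cmd mecab-config; then
    echo "dicdir: $(mecab-config --dicdir)"
  fi
}
```

Paste the functions into Terminal and run check_mecab. If it reports that mecab is missing, revisit the make and sudo make install steps above.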

UniDic and RMeCab

UniDic is a series of dictionary files developed by NINJAL (National Institute for Japanese Language and Linguistics). They allow more fine-tuned tokenizing of Japanese. MeCab can be directed to use UniDic dictionaries instead of IPAdic.

Installing UniDic on Windows

  • Download the file unidic-mecab-2.1.2_windows.exe from https://osdn.net/projects/unidic/releases/. Run the INSTALL.exe file. The installer will create a program called “unidic”.

  • There are several specialized versions of UniDic for early-modern and medieval Japanese. See https://unidic.ninjal.ac.jp/

  • Make a backup copy of the file Makefile.bat under a slightly different name. Then, using a text editor, rewrite the eleventh line of Makefile.bat: use a semicolon to “comment out” the line directing MeCab to ipadic, and add a line pointing to the directory where you have UniDic instead. You can also point it to a specialized UniDic dictionary

; dicdir =  $(rcpath)\..\dic\ipadic
dicdir =  $(rcpath)\..\dic\unidic

Installing on OSX - UniDic (現代語版, the contemporary Japanese edition)

You will need to build UniDic from the source archive and run some shell commands.

 $  unzip unidic-mecab-2.1.2_src.zip
 $  cd unidic-mecab-2.1.2_src/
 $  ./configure
 $  make
 $  sudo make install
  • Now make three minor edits in the file /usr/local/lib/mecab/dic/unidic/dicrc using a simple text editor, such as TextEdit, Emacs, or Vim. If using TextEdit, make sure to save as plain text. “Comment out” the original lines with semicolons and replace them with the examples below.
;;  bos-feature = BOS/EOS,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*,*
bos-feature = BOS/EOS,*,*,*,*,*,*,*,*

;;  node-format-unidic = %m\t%f[9]\t%f[6]\t%f[7]\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
node-format-unidic = %m\t%f[0],%f[1],%f[2],%f[3],%f[4],%f[5],%f[10],%f[9],%f[11]\n
;;  unk-format-unidic  = %m\t%m\t%m\t%m\t%F-[0,1,2,3]\t%f[4]\t%f[5]\n
unk-format-unidic = %m\t%f[0],%f[1],%f[2],%f[3],%f[4],%f[5],%f[10],%f[9],%f[11]\n
  • Make a copy of the file mecabrc in your home directory (replace NAME with your user name)
 $  cp /usr/local/etc/mecabrc  /Users/NAME/
  • As above, edit the copy: comment out the ipadic line with a semicolon and add a line pointing to unidic
; $Id: mecabrc.in,v 1.3 2006/05/29 15:36:08 taku-ku Exp $;
; Comment out the ipadic line with ; and add one line in its place
;dicdir =  /usr/local/lib/mecab/dic/ipadic
dicdir =  /usr/local/lib/mecab/dic/unidic
  • Rename the copy to .mecabrc, the per-user configuration file that MeCab reads from your home directory
$ mv /Users/NAME/mecabrc /Users/NAME/.mecabrc
  • In Terminal, run the following to test
$ echo "どの方法を用ゐても良い、用ふことを怠るくらゐなら。" | mecab
  • You should see . . .
どの    連体詞,,,,,,どの,ドノ,ドノ
方法    名詞,普通名詞,一般,,,,方法,ホーホー,ホーホー
を    助詞,格助詞,,,,,を,オ,オ
用ゐ    動詞,一般,,,文語上一段-ワ行,連用形-一般,用ゐる,モチー,モチール
て    助詞,接続助詞,,,,,て,テ,テ
も    助詞,係助詞,,,,,も,モ,モ
良い    形容詞,非自立可能,,,形容詞,連体形-一般,良い,ヨイ,ヨイ
、    補助記号,読点,,,,,、,,
用    名詞,普通名詞,一般,,,,用,ヨー,ヨー
ふ    接尾辞,名詞的,一般,,,,ふ,フ,フ
こと    名詞,普通名詞,一般,,,,こと,コト,コト
を    助詞,格助詞,,,,,を,オ,オ
怠る    動詞,一般,,,五段-ラ行,終止形-一般,怠る,オコタル,オコタル
くらゐ    助詞,副助詞,,,,,くらゐ,クライ,クライ
なら    助動詞,,,,助動詞-ダ,仮定形-一般,だ,ナラ,ダ
。    補助記号,句点,,,,,。,,
EOS
  • For comparison, run the same sentence through IPAdic
$ echo "どの方法を用ゐても良い、用ふことを怠るくらゐなら。" | mecab -d /usr/local/lib/mecab/dic/ipadic/
  • Which should generate . . .
どの    連体詞,*,*,*,*,*,どの,ドノ,ドノ
方法    名詞,一般,*,*,*,*,方法,ホウホウ,ホーホー
を    助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
用    名詞,一般,*,*,*,*,用,ヨウ,ヨー
ゐ    動詞,自立,*,*,一段,連用形,ゐる,ヰ,イ
て    助詞,接続助詞,*,*,*,*,て,テ,テ
も    助詞,係助詞,*,*,*,*,も,モ,モ
良い    形容詞,非自立,*,*,形容詞・アウオ段,基本形,良い,ヨイ,ヨイ
、    記号,読点,*,*,*,*,、,、,、
用    名詞,一般,*,*,*,*,用,ヨウ,ヨー
ふ    動詞,自立,*,*,五段・ラ行,体言接続特殊2,ふる,フ,フ
こと    名詞,非自立,一般,*,*,*,こと,コト,コト
を    助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
怠る    動詞,自立,*,*,五段・ラ行,基本形,怠る,オコタル,オコタル
くら    名詞,一般,*,*,*,*,くら,クラ,クラ
ゐ    動詞,自立,*,*,一段,連用形,ゐる,ヰ,イ
なら    助動詞,*,*,*,特殊・ダ,仮定形,だ,ナラ,ナラ
。    記号,句点,*,*,*,*,。,。,。
EOS
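
Reading two full listings side by side is tedious. The sketch below assumes MeCab's default output of one token per line, with EOS closing each sentence. It boils a listing down to its surface forms on a single line, so two segmentations can be compared at a glance. The function name segmentation is made up for this example.

```shell
# Hypothetical helper: read MeCab output on stdin and print only the surface
# forms (first whitespace-delimited field of each line), joined with "|".
# EOS lines are skipped, so multi-sentence input ends up on one line.
segmentation() {
  awk '$0 != "EOS" { printf "%s%s", sep, $1; sep = "|" } END { print "" }'
}
```

To use it, pipe the test sentence through mecab and then through segmentation, once with the default dictionary and once with -d /usr/local/lib/mecab/dic/ipadic/. In the listings above, UniDic keeps 用ゐ as a single verb, while IPAdic splits it into 用 and ゐ.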

Pairing specialized dictionaries with RMeCab

NINJAL’s many specialized dictionaries can be used (with some effort) with RMeCab. The files are at https://unidic.ninjal.ac.jp/download_all#unidic_chj

  • Unzip the files and, for each dictionary (e.g. 近代文語, modern literary Japanese, or 中古和文, Heian-period Japanese), duplicate the folder unidic-mecab. The folder(s) should contain these files.

    • char.bin
    • dicrc
    • matrix.bin
    • sys.dic
    • unk.dic
  • Rename the folders unidic_kindai and unidic_chuko as appropriate. Copy or move these two folders to the directory containing the base unidic folder, probably /usr/local/lib/mecab/dic. Search for the file dicrc to be sure.

  • You can now direct MeCab to either of these alternative dictionaries by editing the .mecabrc file, which is at /Users/NAME/.mecabrc. For example, to use 近代文語UniDic . . .

;; dicdir =  /usr/local/lib/mecab/dic/unidic
dicdir =  /usr/local/lib/mecab/dic/unidic_kindai
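
Editing .mecabrc by hand every time you switch dictionaries gets old quickly. Here is a minimal sketch of automating the switch, assuming a POSIX shell and directory paths without | or & characters. The names DIC_FILES, check_dic, and set_dicdir are made up for this example, and the variables inside the functions are not local, since plain POSIX sh has no local keyword.

```shell
# Files a compiled MeCab dictionary folder is expected to contain
# (unk.dic is the unknown-word dictionary).
DIC_FILES="char.bin dicrc matrix.bin sys.dic unk.dic"

# Hypothetical helper: succeed only if every expected file is in the folder.
check_dic() {
  dir=$1
  missing=0
  for f in $DIC_FILES; do
    if [ ! -f "$dir/$f" ]; then
      echo "missing: $dir/$f" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Hypothetical helper: rewrite the active dicdir line of a mecabrc-style file.
# Lines commented out with ; are left untouched.
set_dicdir() {
  rcfile=$1
  newdir=$2
  # Refuse to switch to a folder that does not look like a dictionary.
  check_dic "$newdir" || return 1
  tmp="$rcfile.tmp.$$"
  sed "s|^[[:space:]]*dicdir[[:space:]]*=.*|dicdir = $newdir|" "$rcfile" > "$tmp" &&
    mv "$tmp" "$rcfile"
}
```

With these defined, set_dicdir /Users/NAME/.mecabrc /usr/local/lib/mecab/dic/unidic_kindai would make the same edit as the two lines above. It refuses to change anything if one of the dictionary files is missing.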

It’s worth noting that the difference between these dictionaries can be rather small. For example, on the test case of Kitamura Tokoku’s Naibu seimeiron (北村透谷著、内部生命論), suggested by NINJAL:

Tokenizer     Number of tokens
現代語        5447
近代文語      5309
中古和文      5319
IPAdic        5363
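
How the counts above were produced is not stated here. One plausible way to get comparable numbers, assuming MeCab's default one-token-per-line output, is to count every output line that is not an EOS marker. The function name count_tokens and the file name in the usage note are made up for this example.

```shell
# Hypothetical helper: read MeCab output on stdin and print the number of
# tokens, i.e. the number of lines that are not the EOS sentence marker.
# Note: grep exits with status 1 when the count is 0.
count_tokens() {
  grep -vc '^EOS$'
}
```

For a text saved as naibu.txt, mecab < naibu.txt | count_tokens would give the count for whichever dictionary .mecabrc currently points to. Adding -d with a dictionary path to the mecab call gives the count for another dictionary.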

Nonetheless, if you want exacting tokenizing and POS tagging, the three UniDic dictionaries above give you three options. NINJAL’s newest web-based GUI, http://chamame.ninjal.ac.jp/, goes further and offers ten options:

  • 現代語
  • 現代語話し言葉
  • 旧仮名口語
  • 近代文語
  • 近世口語(洒落本)
  • 中世口語(狂言)
  • 中世文語(説話・随筆)
  • 中古和文
  • 上代(万葉集)
  • IPAdic(現代語