Project I have in mind – (1) A Japanese text tokenizer online

Photo by Charles Deluvio 🇵🇭🇨🇦 on Unsplash


I recently attended Japanese classes and came across an open-source tool called MeCab, which grew out of a joint research project between Kyoto University and NTT. It is offline software, so you have to install it on your own computer. I want to put it online so that other people can use it without installing anything.

Here is an example of using MeCab to tokenize Japanese text:

% mecab
すもも  名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
も      助詞,係助詞,*,*,*,*,も,モ,モ
もも    名詞,一般,*,*,*,*,もも,モモ,モモ
の      助詞,連体化,*,*,*,*,の,ノ,ノ
うち    名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ

Because Japanese is written without spaces between words, it is hard to tell where one word ends and the next begins. For example, the sentence below contains eight も (mo) characters in a row. It is hard for the average learner to parse, and even Google Translate could not make sense of it.

The sentence すもももももももものうち (sumomomomomomomomonouchi) actually means "Both plums and peaches are kinds of peach", that is, both belong to the genus Prunus. From Wikipedia: "Prunus is a genus of trees and shrubs, which includes the plums, cherries, peaches, nectarines, apricots, and almonds." Split into words, the sentence is "すもも　も　もも　も　もも　の　うち", which reads word by word as: plum / also / peach / also / peach / of / kind.

What I want to do now is put this Japanese tokenizer online with a UI similar to Google Translate: two text boxes, where you type text into the left-hand box and get the result in the right-hand box.

It would involve two components: a frontend UI served by a web server and built with Semantic UI or jQuery UI, and a backend RESTful API written in Java or Python. I haven't decided yet which of the two languages to use.
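To get a feel for the Python option, here is a rough sketch of what the backend could look like using only the standard library. Everything in it is an assumption on my part: the names `run_mecab`, `make_app` and `main`, the `text` query parameter, the port, and the JSON response shape are all made up for illustration, and it expects the `mecab` command to be on the PATH with a UTF-8 dictionary. The tokenizer is passed in as a function so the web part can be tried without MeCab installed.

```python
import json
import subprocess
from urllib.parse import parse_qs
from wsgiref.simple_server import make_server

JSON_HEADERS = [("Content-Type", "application/json; charset=utf-8")]


def run_mecab(text):
    """Pipe text through the mecab CLI and return its raw output."""
    result = subprocess.run(
        ["mecab"], input=text, capture_output=True, encoding="utf-8", check=True
    )
    return result.stdout


def make_app(tokenize=run_mecab):
    """Build a WSGI app that tokenizes the `text` query parameter."""

    def app(environ, start_response):
        query = parse_qs(environ.get("QUERY_STRING", ""))
        text = query.get("text", [""])[0]
        if not text:
            start_response("400 Bad Request", JSON_HEADERS)
            return [json.dumps({"error": "missing 'text' parameter"}).encode("utf-8")]
        raw = tokenize(text)
        # Keep only the surface form of each token; drop blanks and EOS.
        tokens = [
            line.split("\t")[0]
            for line in raw.splitlines()
            if line and line != "EOS"
        ]
        body = json.dumps({"tokens": tokens}, ensure_ascii=False).encode("utf-8")
        start_response("200 OK", JSON_HEADERS)
        return [body]

    return app


def main():
    """Serve the API locally; stop it with Ctrl+C."""
    with make_server("127.0.0.1", 8000, make_app()) as server:
        server.serve_forever()
```

Calling `main()` would start a local server, and the frontend could then request something like `http://127.0.0.1:8000/?text=...` with the text URL-encoded. A real deployment would probably use a proper web framework and limit the input length, but the overall flow of request, tokenizer, and JSON response would stay the same.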

Let’s continue later.
