I have been attending Japanese classes recently and came across an open-source tool called MeCab ( http://taku910.github.io/mecab/ ), which was developed in a joint project between Kyoto University and NTT. It runs offline, so you have to install it on your own computer. I want to put it online so that other people can use it without installing anything.
Here is an example of using MeCab to tokenize Japanese text:
% mecab
すもももももももものうち
すもも	名詞,一般,*,*,*,*,すもも,スモモ,スモモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
も	助詞,係助詞,*,*,*,*,も,モ,モ
もも	名詞,一般,*,*,*,*,もも,モモ,モモ
の	助詞,連体化,*,*,*,*,の,ノ,ノ
うち	名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ
EOS
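Each line of the output above pairs a word (the surface form) with a comma-separated feature list, and EOS marks the end of the sentence. Here is a minimal Python sketch of how that text could be turned into structured data. The function name `parse_mecab_output` and the field names in `FIELDS` are my own labels, and they assume the IPA dictionary column order seen in this sample; other dictionaries lay out their features differently.

```python
# Hypothetical field names for the nine feature columns in the sample output.
# They assume the IPA dictionary layout; other dictionaries differ.
FIELDS = (
    "pos", "pos_detail1", "pos_detail2", "pos_detail3",
    "conj_type", "conj_form", "base_form", "reading", "pronunciation",
)


def parse_mecab_output(output):
    """Turn MeCab's default text output into a list of token dicts."""
    tokens = []
    for line in output.splitlines():
        line = line.strip()
        if not line or line == "EOS":
            continue  # skip blank lines and the end-of-sentence marker
        # The surface form and the features are separated by whitespace (a tab).
        parts = line.split(None, 1)
        if len(parts) != 2:
            continue  # not a token line, e.g. the echoed input sentence
        surface, features = parts
        token = {"surface": surface}
        # zip() stops early if a line has fewer features than FIELDS.
        token.update(zip(FIELDS, features.split(",")))
        tokens.append(token)
    return tokens


sample = "\n".join([
    "すもも\t名詞,一般,*,*,*,*,すもも,スモモ,スモモ",
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ",
    "も\t助詞,係助詞,*,*,*,*,も,モ,モ",
    "もも\t名詞,一般,*,*,*,*,もも,モモ,モモ",
    "の\t助詞,連体化,*,*,*,*,の,ノ,ノ",
    "うち\t名詞,非自立,副詞可能,*,*,*,うち,ウチ,ウチ",
    "EOS",
])

tokens = parse_mecab_output(sample)
# Joining the surface forms with spaces shows where the word boundaries are.
print(" ".join(token["surface"] for token in tokens))
```

The sample string is copied from the output above, so the sketch runs without MeCab installed.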
Because Japanese text has no spaces between words, it is hard to tell where one word ends and the next begins. For example, this sentence contains eight も ( mo ) characters in a row, which is confusing for the average learner, and even Google Translate could not understand it.
Actually, すもももももももものうち ( sumomomomomomomomonouchi ) translates to “Both plums and peaches are kinds of Prunus.” From Wikipedia: “Prunus is a genus of trees and shrubs, which includes the plums, cherries, peaches, nectarines, apricots, and almonds.” The actual sentence structure is “すもも も もも も もも の うち”, which reads roughly as “plums, too, and peaches, too, are among the peaches”.
What I want to do now is provide this Japanese tokenizer online, with a UI similar to Google Translate: two text boxes, where you type text into the left-hand one and get the result in the right-hand one.
It would involve two components: a frontend UI served by a web server and built with Semantic UI or jQuery UI, and a backend RESTful API written in Java or Python. I still haven’t decided which of the two languages to use.
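To get a feel for the backend half, here is a rough Python sketch using only the standard library. It is not a decision on the final design. It assumes the `mecab` command is installed with a UTF-8 dictionary, and it uses MeCab's `-Owakati` output mode, which prints the words separated by spaces. The `/tokenize` path, the port, the JSON shape and all function names are made up for illustration.

```python
import json
import subprocess
from http.server import BaseHTTPRequestHandler, HTTPServer


def run_mecab(text):
    """Run the local mecab binary in wakati (space-separated) mode.

    Assumes mecab is on PATH and its dictionary is UTF-8.
    """
    result = subprocess.run(
        ["mecab", "-Owakati"],
        input=text, capture_output=True, encoding="utf-8", check=True,
    )
    return result.stdout


def tokenize(text, runner=run_mecab):
    """Split text into words; runner is injectable so it can be faked."""
    return runner(text).split()


def build_response(body, runner=run_mecab):
    """Map a JSON request body (bytes) to a JSON response body (bytes)."""
    text = json.loads(body.decode("utf-8"))["text"]
    payload = {"tokens": tokenize(text, runner)}
    return json.dumps(payload, ensure_ascii=False).encode("utf-8")


class TokenizeHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        if self.path != "/tokenize":  # hypothetical endpoint name
            self.send_error(404)
            return
        length = int(self.headers.get("Content-Length", 0))
        try:
            response = build_response(self.rfile.read(length))
        except (ValueError, KeyError, TypeError):
            # Malformed JSON, bad encoding, or no "text" field.
            self.send_error(400, 'expected JSON like {"text": "..."}')
            return
        self.send_response(200)
        self.send_header("Content-Type", "application/json; charset=utf-8")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)


def serve(host="127.0.0.1", port=8000):
    """Start the server; blocks until interrupted."""
    HTTPServer((host, port), TokenizeHandler).serve_forever()
```

Calling `serve()` starts the server, and the frontend would then POST something like {"text": "すもももももももものうち"} to /tokenize and show the returned token list in the right-hand box. Shelling out to `mecab` for every request is the simplest thing that could work; a real service would probably keep a tagger in memory through a language binding instead.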
Let’s continue later.