
Improve Japanese word dictionary lookup with MeCab #11728

Open
leonard-slass opened this issue Apr 28, 2024 · 4 comments
@leonard-slass

I read Japanese, and looking up words is a bit touch and go: sometimes it works great, sometimes it comes up with nonsense.

Japanese is tough because words are not separated by spaces. MeCab is a text segmenter that figures out where the word boundaries are.

For KOReader, the input sentence would be:

太郎はこの本を二郎を見た女性に渡した。

MeCab would return:

太郎 は この 本 を 二郎 を 見 た 女性 に 渡し た 。

It's a C++ library, so it will take a bit of work. If I worked on it, would it be considered for inclusion?

@Frenzie
Member

Frenzie commented Apr 28, 2024

The plugin mentions MeCab. Pinging @cyphar

2. Text segmentation support without needing MeCab or any other binary helper,
by re-using the users' installed dictionaries to exhaustively try every
length of text and select the longest match which is present in the
dictionary. This is similar to how Yomichan does MeCab-less segmentation.
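The "exhaustively try every length" approach the plugin describes can be sketched roughly as follows. This is a toy illustration, not the plugin's actual code: the dictionary here is a plain Python set standing in for the user's installed dictionaries, and deinflection (which the real plugin performs before lookup) is omitted.

```python
def longest_match(text, dictionary, max_len=20):
    """Return the longest prefix of `text` that appears in `dictionary`,
    or None if no prefix matches. `dictionary` is any set-like
    collection of headwords (a stand-in for the user's dictionaries)."""
    for length in range(min(max_len, len(text)), 0, -1):
        candidate = text[:length]
        if candidate in dictionary:
            return candidate
    return None

# Toy headword set; a real lookup would also try deinflected forms.
dictionary = {"気をつける", "気", "を", "つける"}
print(longest_match("気をつける人", dictionary))  # → 気をつける
```

Because the search runs from the longest prefix down, a set phrase like 気をつける wins over its component morphemes.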

@cyphar
Contributor

cyphar commented Apr 29, 2024

@leonard-slass Can you give an example of a bad lookup? The Japanese plugin uses the same logic as Yomichan which (while not perfect) almost always gives reasonable results for standard Japanese. It will struggle with non-standard or "muddled" Japanese though.


As for using MeCab: When doing dictionary lookups, we are actually not interested in segmenting the text. What you really need is just the dictionary form of a given word or phrase. This is something that MeCab also gives you (and it is better at dealing with "muddled" Japanese), but it is a separate problem to segmenting. Yes, the lack of spaces in most Japanese text makes lookups a bit harder, but because of set phrases it would be necessary to deal with multi-word lookups anyway.

The main issue with using MeCab is that the dictionary it uses is somewhat large (>20MB for the smallest dictionary, almost as big as KOReader itself; UniDic is ~500MB) and would require you to download a completely separate dictionary that is only used for segmenting. The dictionaries used by the current Japanese support plugin are also usable as general-purpose dictionaries, so there is no "wasted" space when using the existing deinflector.

It's a C++ library so it is going to take a bit of work.

You are almost certainly going to want to execute MeCab as a binary rather than dealing with C++ bindings. You will also need to make it optional, since the current de-inflection works in the vast majority of cases without needing extra dictionaries.
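Shelling out to the binary could look roughly like this. This is a sketch, not KOReader code (which is Lua): it assumes a `mecab` binary on the PATH and only captures the raw analysis lines.

```python
import shutil
import subprocess

def mecab_analyze(text, mecab_path="mecab"):
    """Run the mecab CLI on `text` and return its raw output lines.
    `mecab_path` is an assumption: wherever the binary is installed."""
    proc = subprocess.run([mecab_path], input=text,
                          capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

# Only attempt the call if the binary is actually available,
# mirroring the "make it optional" requirement above.
if shutil.which("mecab"):
    for line in mecab_analyze("太郎はこの本を渡した。"):
        print(line)
else:
    print("mecab binary not found; falling back to built-in deinflector")
```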

太郎 は この 本 を 二郎 を 見 た 女性 に 渡し た 。

FYI, this breakdown is not useful for dictionary lookups because the dictionary only contains the dictionary forms of words. You need to parse the default MeCab output to get the dictionary form of the word:

太郎    名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
この    連体詞,*,*,*,*,*,この,コノ,コノ
本      名詞,一般,*,*,*,*,本,ホン,ホン
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
二      名詞,数,*,*,*,*,二,ニ,ニ
郎      名詞,一般,*,*,*,*,郎,ロウ,ロー
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
見      動詞,自立,*,*,一段,連用形,見る,ミ,ミ
た      助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
女性    名詞,一般,*,*,*,*,女性,ジョセイ,ジョセイ
に      助詞,格助詞,一般,*,*,*,に,ニ,ニ
渡し    動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ
た      助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。      記号,句点,*,*,*,*,。,。,。
EOS
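Parsing that default (IPAdic-format) output is mechanical: each line is `surface<TAB>comma,separated,features`, and the seventh feature field is the dictionary form. A minimal sketch, using the 渡し line from the output above:

```python
def dictionary_forms(mecab_output):
    """Parse MeCab's default (IPAdic) output into (surface, base) pairs.
    Feature field 6 (0-based) is the dictionary form; '*' means MeCab
    could not determine it, in which case we keep the surface form."""
    forms = []
    for line in mecab_output.splitlines():
        if line == "EOS" or not line.strip():
            continue
        surface, features = line.split("\t")
        fields = features.split(",")
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        forms.append((surface, base))
    return forms

sample = "渡し\t動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ\nEOS"
print(dictionary_forms(sample))  # → [('渡し', '渡す')]
```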

There are also a few other complications with using MeCab, which require a little bit of extra work. The primary one is that MeCab breaks the text into morphemes, but in a lot of cases you actually want to find the most precise entry in the dictionary. MeCab will break apart 気をつける into separate morphemes:

気      名詞,一般,*,*,*,*,気,キ,キ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
つける  動詞,自立,*,*,一段,基本形,つける,ツケル,ツケル

But obviously when you're looking up 気をつける you want to get the actual dictionary entry for 気をつける. So you will need to do multiple dictionary lookups with each morph-bounded prefix of the search string (気、を、つける in the above example), but because you are not using the existing deinflector you will need to make sure that you use the as-written form until the last morph (which you switch to the dictionary form).
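Generating those lookup candidates from MeCab's morphemes could be sketched like this (an illustrative helper, not existing KOReader code): for each morph-bounded prefix, concatenate the as-written surfaces of all but the last morph, then append the last morph's dictionary form.

```python
def lookup_candidates(morphs):
    """Given (surface, dictionary_form) pairs from MeCab, build the
    lookup string for each morph-bounded prefix: as-written surfaces
    up to the final morph, which is swapped for its dictionary form."""
    candidates = []
    for i in range(len(morphs), 0, -1):  # try the longest prefix first
        prefix = morphs[:i]
        text = "".join(s for s, _ in prefix[:-1]) + prefix[-1][1]
        candidates.append(text)
    return candidates

# 気をつけた: the inflected last morph つけ has dictionary form つける,
# so the set phrase 気をつける appears among the candidates.
morphs = [("気", "気"), ("を", "を"), ("つけ", "つける"), ("た", "た")]
print(lookup_candidates(morphs))  # → ['気をつけた', '気をつける', '気を', '気']
```

Each candidate would then be tried against the dictionary, longest first, so the most precise entry (the set phrase) wins over the bare morphemes.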

But yes, it is definitely possible to slot MeCab into the existing deinflection logic. Yomichan also has pluggable MeCab support, and I suspect it works in a similar way. If you do go about implementing it, I would suggest looking at how they implemented it as well.

@leonard-slass
Author

@cyphar I guess the thing that bothers me the most is popping up the dictionary for a single hiragana. I think beginners know that に, を, etc. are particles. There could be an option not to open the dictionary on those, and instead highlight the character to indicate that no better match was found. Maybe the UX is all messed up :) Shall I make a different ticket for this?

I have started to import MeCab as a library; I will shift it to a binary as you suggested.

@leonard-slass
Author

@cyphar to get the ball rolling I have made the changes to build MeCab.
