
Improve Japanese word dictionary lookup with MeCab #11728

Open
leonard-slass opened this issue Apr 28, 2024 · 4 comments
@leonard-slass

I read Japanese, and looking up words is a bit touch and go: sometimes it works great, sometimes it comes up with nonsense.

Japanese is tough because words are not separated by spaces. MeCab is a text segmenter that figures out where the word boundaries are.

For KOReader, the input sentence would be:

太郎はこの本を二郎を見た女性に渡した。

MeCab would return:

太郎 は この 本 を 二郎 を 見 た 女性 に 渡し た 。

It's a C++ library, so it will take a bit of work. If I worked on it, would it be considered for inclusion?

@Frenzie
Member

Frenzie commented Apr 28, 2024

The plugin mentions MeCab. Pinging @cyphar

2. Text segmentation support without needing MeCab or any other binary helper,
by re-using the users' installed dictionaries to exhaustively try every
length of text and select the longest match which is present in the
dictionary. This is similar to how Yomichan does MeCab-less segmentation.
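The "exhaustively try every length" approach the plugin describes can be sketched roughly as follows. This is a toy illustration, not the plugin's actual code: the dictionary here is a plain Python set standing in for the user's installed dictionaries, and deinflection (which the real plugin performs before lookup) is omitted.

```python
def longest_match(text, dictionary, max_len=20):
    """Return the longest prefix of `text` that appears in `dictionary`,
    or None if no prefix matches. `dictionary` is any set-like
    collection of headwords (a stand-in for the user's dictionaries)."""
    for length in range(min(max_len, len(text)), 0, -1):
        candidate = text[:length]
        if candidate in dictionary:
            return candidate
    return None

# Toy headword set; a real lookup would also try deinflected forms.
dictionary = {"気をつける", "気", "を", "つける"}
print(longest_match("気をつける人", dictionary))  # → 気をつける
```

Because the search runs from the longest prefix down, a set phrase like 気をつける wins over its component morphemes.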

@cyphar
Contributor

cyphar commented Apr 29, 2024

@leonard-slass Can you give an example of a bad lookup? The Japanese plugin uses the same logic as Yomichan which (while not perfect) almost always gives reasonable results for standard Japanese. It will struggle with non-standard or "muddled" Japanese though.


As for using MeCab: When doing dictionary lookups, we are actually not interested in segmenting the text. What you really need is just the dictionary form of a given word or phrase. This is something that MeCab also gives you (and it is better at dealing with "muddled" Japanese), but it is a separate problem to segmenting. Yes, the lack of spaces in most Japanese text makes lookups a bit harder, but because of set phrases it would be necessary to deal with multi-word lookups anyway.

The main issue with using MeCab is that the dictionary it uses is somewhat large (>20MB for the smallest dictionary, almost as big as KOReader itself; UniDic is ~500MB) and would require you to download a completely separate dictionary that is only used for segmenting. The dictionaries used by the current Japanese support plugin are also usable as general-purpose dictionaries, so there is no "wasted" space when using the existing deinflector.

It's a C++ library so it is going to take a bit of work.

You are almost certainly going to want to execute MeCab as a binary rather than dealing with C++ bindings. You will also need to make it optional, since the current de-inflection works in the vast majority of cases without needing extra dictionaries.
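Shelling out to the binary could look roughly like this. This is a sketch, not KOReader code (which is Lua): it assumes a `mecab` binary on the PATH and only captures the raw analysis lines.

```python
import shutil
import subprocess

def mecab_analyze(text, mecab_path="mecab"):
    """Run the mecab CLI on `text` and return its raw output lines.
    `mecab_path` is an assumption: wherever the binary is installed."""
    proc = subprocess.run([mecab_path], input=text,
                          capture_output=True, text=True, check=True)
    return proc.stdout.splitlines()

# Only attempt the call if the binary is actually available,
# mirroring the "make it optional" requirement above.
if shutil.which("mecab"):
    for line in mecab_analyze("太郎はこの本を渡した。"):
        print(line)
else:
    print("mecab binary not found; falling back to built-in deinflector")
```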

太郎 は この 本 を 二郎 を 見 た 女性 に 渡し た 。

FYI, this breakdown is not useful for dictionary lookups because the dictionary only contains the dictionary forms of words. You need to parse the default MeCab output to get the dictionary form of the word:

太郎    名詞,固有名詞,人名,名,*,*,太郎,タロウ,タロー
は      助詞,係助詞,*,*,*,*,は,ハ,ワ
この    連体詞,*,*,*,*,*,この,コノ,コノ
本      名詞,一般,*,*,*,*,本,ホン,ホン
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
二      名詞,数,*,*,*,*,二,ニ,ニ
郎      名詞,一般,*,*,*,*,郎,ロウ,ロー
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
見      動詞,自立,*,*,一段,連用形,見る,ミ,ミ
た      助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
女性    名詞,一般,*,*,*,*,女性,ジョセイ,ジョセイ
に      助詞,格助詞,一般,*,*,*,に,ニ,ニ
渡し    動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ
た      助動詞,*,*,*,特殊・タ,基本形,た,タ,タ
。      記号,句点,*,*,*,*,。,。,。
EOS
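Parsing that default (IPAdic-format) output is mechanical: each line is `surface<TAB>comma,separated,features`, and the seventh feature field is the dictionary form. A minimal sketch, using the 渡し line from the output above:

```python
def dictionary_forms(mecab_output):
    """Parse MeCab's default (IPAdic) output into (surface, base) pairs.
    Feature field 6 (0-based) is the dictionary form; '*' means MeCab
    could not determine it, in which case we keep the surface form."""
    forms = []
    for line in mecab_output.splitlines():
        if line == "EOS" or not line.strip():
            continue
        surface, features = line.split("\t")
        fields = features.split(",")
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        forms.append((surface, base))
    return forms

sample = "渡し\t動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ\nEOS"
print(dictionary_forms(sample))  # → [('渡し', '渡す')]
```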

There are also a few other complications with using MeCab, which require a little bit of extra work. The primary one is that MeCab breaks the text into morphemes, but in a lot of cases you actually want to find the most precise entry in the dictionary. MeCab will break apart 気をつける into separate morphemes:

気      名詞,一般,*,*,*,*,気,キ,キ
を      助詞,格助詞,一般,*,*,*,を,ヲ,ヲ
つける  動詞,自立,*,*,一段,基本形,つける,ツケル,ツケル

But obviously when you're looking up 気をつける you want to get the actual dictionary entry for 気をつける. So you will need to do multiple dictionary lookups with each morph-bounded prefix of the search string (気、を、つける in the above example), but because you are not using the existing deinflector you will need to make sure that you use the as-written form until the last morph (which you switch to the dictionary form).
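Generating those lookup candidates from MeCab's morphemes could be sketched like this (an illustrative helper, not existing KOReader code): for each morph-bounded prefix, concatenate the as-written surfaces of all but the last morph, then append the last morph's dictionary form.

```python
def lookup_candidates(morphs):
    """Given (surface, dictionary_form) pairs from MeCab, build the
    lookup string for each morph-bounded prefix: as-written surfaces
    up to the final morph, which is swapped for its dictionary form."""
    candidates = []
    for i in range(len(morphs), 0, -1):  # try the longest prefix first
        prefix = morphs[:i]
        text = "".join(s for s, _ in prefix[:-1]) + prefix[-1][1]
        candidates.append(text)
    return candidates

# 気をつけた: the inflected last morph つけ has dictionary form つける,
# so the set phrase 気をつける appears among the candidates.
morphs = [("気", "気"), ("を", "を"), ("つけ", "つける"), ("た", "た")]
print(lookup_candidates(morphs))  # → ['気をつけた', '気をつける', '気を', '気']
```

Each candidate would then be tried against the dictionary, longest first, so the most precise entry (the set phrase) wins over the bare morphemes.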

But yes, it is definitely possible to slot MeCab into the existing deinflection logic. Yomichan also has pluggable MeCab support, and I suspect it works in a similar way. If you do go about implementing it, I would suggest looking at how they implemented it as well.

@leonard-slass
Author

@cyphar I guess the thing that bothers me the most is popping up the dictionary for a single hiragana. I think beginners know that に, を, etc. are particles. There could be an option not to open the dictionary on those, and instead highlight the character to indicate that no better match was found. Maybe the UX is all messed up :) Shall I make a different ticket for this?

I have started to import MeCab as a library; I will shift it to a binary as you suggested.

@leonard-slass
Author

@cyphar to get the ball rolling I have made the changes to build MeCab.
