Improve Japanese word dictionary lookup with MeCab #11728
The plugin mentions MeCab. Pinging @cyphar (see `koreader/plugins/japanese.koplugin/README.md`, lines 10 to 13 at 34abb4e).
@leonard-slass Can you give an example of a bad lookup? The Japanese plugin uses the same logic as Yomichan, which (while not perfect) almost always gives reasonable results for standard Japanese. It will struggle with non-standard or "muddled" Japanese, though.

As for using MeCab: when doing dictionary lookups, we are actually not interested in segmenting the text. What you really need is just the dictionary form of a given word or phrase. This is something that MeCab also gives you (and it is better at dealing with "muddled" Japanese), but it is a separate problem from segmenting. Yes, the lack of spaces in most Japanese text makes lookups a bit harder, but because of set phrases it would be necessary to deal with multi-word lookups anyway.

The main issue with using MeCab is that the dictionary it uses is somewhat large (>20 MB for the smallest dictionary, almost as big as KOReader itself; UniDic is ~500 MB) and would require you to download a completely separate dictionary that is only used for segmenting. The dictionaries used by the current Japanese support plugin are also usable as general-purpose dictionaries, so there is no "wasted" space when using the existing deinflector.
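For illustration, the rule-based deinflection approach mentioned above can be sketched in a few lines. The rule table here is hypothetical and far smaller than the plugin's (or Yomichan's) real rule set, which is recursive and constrained by part of speech:

```python
# Toy sketch of Yomichan-style rule-based deinflection.
# RULES is a hypothetical, drastically reduced table: each entry maps an
# inflected suffix to a possible dictionary-form suffix.
RULES = [
    ("した", "す"),    # past tense: 渡した -> 渡す
    ("かった", "い"),  # i-adjective past tense: 高かった -> 高い
    ("って", "う"),    # te-form: 言って -> 言う
]

def deinflection_candidates(word: str) -> set[str]:
    """Return the word itself plus every candidate dictionary form
    produced by a single rule application."""
    out = {word}
    for suffix, base in RULES:
        if word.endswith(suffix):
            out.add(word[: -len(suffix)] + base)
    return out

print(deinflection_candidates("渡した"))  # contains '渡す'
```

Each candidate is then checked against the dictionary, so over-generating is harmless: candidates with no entry are simply discarded.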
You are almost certainly going to want to execute MeCab as a binary rather than deal with C++ bindings. You are also going to need to make using it optional, since the current de-inflection works in the vast majority of cases without needing extra dictionaries.
FYI, this breakdown is not useful for dictionary lookups because the dictionary only contains the dictionary forms of words. You need to parse the default MeCab output to get the dictionary form of the word:
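A minimal sketch of that parsing step, assuming the IPAdic feature layout (where the seventh comma-separated feature is the base form; other dictionaries such as UniDic order their fields differently, so this is illustrative, not universal):

```python
# Sketch: extract dictionary (base) forms from MeCab's default output.
# Each node line is "surface\tfeature,feature,...", ending with an "EOS"
# line. With IPAdic, feature index 6 is the base form ("*" if unknown).

def base_forms(mecab_output: str):
    """Yield (surface, dictionary_form) pairs from MeCab node lines."""
    for line in mecab_output.splitlines():
        if line == "EOS" or not line.strip():
            continue
        surface, _, features = line.partition("\t")
        fields = features.split(",")
        base = fields[6] if len(fields) > 6 and fields[6] != "*" else surface
        yield surface, base

# Example: MeCab (IPAdic) output for 「渡した」
sample = (
    "渡し\t動詞,自立,*,*,五段・サ行,連用形,渡す,ワタシ,ワタシ\n"
    "た\t助動詞,*,*,*,特殊・タ,基本形,た,タ,タ\n"
    "EOS"
)
print(list(base_forms(sample)))  # [('渡し', '渡す'), ('た', 'た')]
```

So the inflected surface form 渡し maps back to the dictionary entry 渡す, which is what the lookup actually needs.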
There are also a few other complications with using MeCab, which require a little bit of extra work. The primary one is that MeCab breaks the text into morphemes, but in a lot of cases you actually want to find the most precise entry in the dictionary: MeCab will break apart set phrases and compound words that have their own dictionary entries.

But yes, it is definitely possible to slot MeCab into the existing deinflection logic. Yomichan also has pluggable MeCab support, and I suspect it works in a similar way. If you do go about implementing it, I would suggest looking at how they implemented it as well.
@cyphar I guess the thing that bothers me the most is popping up the dictionary for a single hiragana. I think beginners know that に, を, etc. are particles. There could be an option to not open the dictionary on those, and instead highlight the character to indicate that no better match has been found. Maybe the UX is all messed up :) Shall I make a different ticket for this? I have started to import MeCab as a library; I will shift it to a binary as you suggested.
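The option being proposed could be sketched like this (the particle list, function name, and threshold behavior are all hypothetical, just to show the idea; the plugin itself is written in Lua):

```python
# Hypothetical sketch: skip opening the dictionary popup when the best
# match is a lone particle, and signal the caller to just highlight it.
PARTICLES = {"は", "が", "を", "に", "で", "と", "も", "へ", "や", "の"}

def should_open_dictionary(best_match: str) -> bool:
    """Return False when the only match is a single-character particle,
    so the UI can highlight it instead of popping up the dictionary."""
    return best_match not in PARTICLES

print(should_open_dictionary("を"))    # False -> highlight only
print(should_open_dictionary("渡す"))  # True  -> open dictionary
```

Whether this belongs behind a setting (for beginners who do want particle definitions) is exactly the UX question raised above.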
I read Japanese and looking up words is a bit touch and go. Sometimes it works great, sometimes it comes up with nonsense.
Japanese is tough because words are not separated with spaces. MeCab is a text segmenter that figures out where the word boundaries are.
For KOReader, the input sentence would be:
太郎はこの本を二郎を見た女性に渡した。
MeCab would return:
太郎 は この 本 を 二郎 を 見 た 女性 に 渡し た 。
It's a C++ library, so it is going to take a bit of work. If I worked on it, would it be considered for inclusion?