Adding an option for kindle output to work around the bugs in their dictionary lookup algorithm #375
Comments
Where can I see your changes to PyGlossary?
It is in the function I pasted in above my own code.
You don't use this function in your repo.
I took the format_group_content function I pasted in the linked file and used it to replace the version in the site-packages folder of my venv (I know, this is probably quite stupid, but at least it worked for me locally).
Oops, sorry, I really did change it:
I worked more on this and created a fork with changes that allow setting completely separate HTML for each word: https://github.com/Vuizur/pyglossary. The remaining steps are to find a way to add a Kindle generation option such as "fix_kindle_not_finding_inflections", and then to run a function that converts the input glossary into an intermediate "fixed" glossary, as is done here: https://github.com/Vuizur/pyglossary-kindle-test/blob/master/pyglossary_kindle_test/edit_dictionary.py#L38
Can you create a pull request?
Is your feature request related to a problem? Please describe.
Kindle's lookup algorithm has been implemented very badly. The first problem is that you cannot turn off fuzzy lookup: the docs say that you can, but this applies only to inflections, not to headwords. The second (even worse) problem is that if the algorithm finds a result among the headwords, it stops searching. This combination leads to stupid behaviour. For example, if you look up the word "osó", which is a past-tense form of the Spanish verb "osar" ("to dare"), you only get the dictionary entry for "oso", which means "bear". Nothing else is shown, even though your dictionary correctly contains "osó" as an inflection of "osar".
A related problem is that if a word is an inflection of two headwords, the lookup returns only the first headword and ignores the second, which is also annoying.
Describe the solution you'd like
I have lost hope that Amazon will ever fix these bugs, as they have apparently existed for more than 11 years. It is possible to create a dictionary that works around them: for each inflection that might conflict with another inflection or with another headword, you simply create a new headword with a duplicated definition. So for osó, we create the headword osó, set the headword HTML to (bolded) osar, and simply copy the definition content from osar.
The result is that we get a dictionary that is not really slower (as far as I could tell), but always finds all relevant headwords and is simply a much better experience.
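The workaround described above can be sketched roughly as follows. This is a minimal illustration on plain Python data, not the code from the linked function and not PyGlossary's API: the function name `add_conflict_entries`, the `(lookup_key, display_html, definition)` tuple layout, and the Spanish-only `fold_spanish` helper are all assumptions made for the example. It also assumes that two forms "conflict" when they are equal after lowercasing and accent removal, which is a guess at how the fuzzy lookup behaves.

```python
from collections import defaultdict

# Hypothetical Spanish-only folding used to detect conflicts (assumption:
# forms that differ only by case or these accents collide in fuzzy lookup).
_SPANISH_FOLD = str.maketrans("áéíóúü", "aeiouu")


def fold_spanish(text):
    return text.lower().translate(_SPANISH_FOLD)


def add_conflict_entries(entries, fold=fold_spanish):
    """entries maps lemma -> (definition, [inflections]).

    Returns a list of (lookup_key, display_html, definition) tuples: one per
    lemma, plus one extra entry for every inflection whose folded form is
    shared with a different lemma.
    """
    # Record which lemmas lay claim to each folded form,
    # either as the headword itself or as one of its inflections.
    owners = defaultdict(set)
    for lemma, (_, inflections) in entries.items():
        owners[fold(lemma)].add(lemma)
        for form in inflections:
            owners[fold(form)].add(lemma)

    result = [
        (lemma, f"<b>{lemma}</b>", definition)
        for lemma, (definition, _) in entries.items()
    ]
    for lemma, (definition, inflections) in entries.items():
        for form in inflections:
            if form != lemma and len(owners[fold(form)]) > 1:
                # Promote the inflection to its own headword, but display
                # the lemma in bold and copy the lemma's definition.
                result.append((form, f"<b>{lemma}</b>", definition))
    return result


entries = {
    "oso": ("bear", ["osos"]),
    "osar": ("to dare", ["osó", "osan"]),
}
for entry in add_conflict_entries(entries):
    print(entry)
```

Here "osó" is promoted to a separate headword that displays "osar", while "osan" and "osos" are left alone because no other lemma claims them. A real implementation would also keep the non-conflicting inflections attached to their original entries.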
I made an attempt to implement this in my function here. This solution works really well for the Spanish dictionary I generated. It currently uses unidecode, but this is a bad idea for languages other than Spanish, so that would have to be replaced by a generic function that simply removes all diacritics from a Unicode string.
It also uses a patched version of a PyGlossary function to support setting the headword HTML independently of the headword, but this "patching" is of course a hacky solution, and I don't know how one would properly model this to fit into the PyGlossary architecture. So if you give me some pointers, I could also try to open a pull request.
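One possible shape for the generic diacritic-removal function mentioned above, using only the standard library's `unicodedata` module. The name `strip_diacritics` is made up for this sketch. It decomposes each character and drops the combining marks, so unlike unidecode it leaves non-Latin scripts untransliterated.

```python
import unicodedata


def strip_diacritics(text):
    # NFD splits precomposed characters into base character + combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Drop every combining mark (characters with a non-zero combining class).
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Recompose whatever is left into the usual precomposed form.
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("osó"))
print(strip_diacritics("über"))
```

Letters that have no canonical decomposition, such as "ł" or "ø", pass through unchanged, so whether this matches Kindle's own folding for a given language would still need to be checked on a device.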