Adding an option for kindle output to work around the bugs in their dictionary lookup algorithm #375

Vuizur · 2022-05-16T10:46:48Z

Is your feature request related to a problem? Please describe.
Kindle's lookup algorithm is implemented very badly. The first problem is that you cannot turn off fuzzy lookup: the docs say that you can, but this applies only to inflections, not to headwords. The second (even worse) problem is that if the algorithm finds a result among the headwords, it stops searching. This combination leads to stupid behaviour. For example, if you look up the word "osó", which is a past-tense form of the Spanish verb "osar" ("to dare"), you only get the dictionary entry for "oso", which means "bear". And nothing else, even though your dictionary correctly contains "osó" as an inflection of "osar".
A related problem is that if a word is for example the inflection of two headwords, it only returns the first headword and ignores the second, which is also annoying.

Describe the solution you'd like
I have lost the hope that Amazon will ever fix these bugs, as they have apparently existed for more than 11 years. It is possible to create a dictionary that works around them: For each inflection that might conflict with another inflection or with another headword, you simply create a new headword with a duplicated definition. So for osó, we create the headword osó, set the headword HTML to (bolded) osar, and simply copy the definition content from osar.

The result is that we get a dictionary that is not really slower (as far as I could tell), but always finds all relevant headwords and is simply a much better experience.
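
The workaround described above can be sketched roughly as follows. This is a minimal illustration, not the code from my repository: the `Entry` class, `fold`, and `add_workaround_entries` are hypothetical names, and `fold` is only an assumed approximation of Kindle's fuzzy matching (lowercasing plus diacritic removal).

```python
import unicodedata
from collections import defaultdict
from dataclasses import dataclass, field
from html import escape


@dataclass
class Entry:
    headword: str       # key that the lookup index matches against
    display_html: str   # HTML shown as the headword
    definition: str
    inflections: list = field(default_factory=list)


def fold(word: str) -> str:
    """Assumed approximation of fuzzy matching: lowercase, drop diacritics."""
    decomposed = unicodedata.normalize("NFD", word.lower())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def add_workaround_entries(entries):
    # Record which entries claim each folded form, as headword or inflection.
    owners = defaultdict(set)
    for index, entry in enumerate(entries):
        for form in [entry.headword, *entry.inflections]:
            owners[fold(form)].add(index)

    result = []
    for entry in entries:
        kept, promoted = [], []
        for infl in entry.inflections:
            if len(owners[fold(infl)]) > 1:
                # Conflicting inflection: promote it to its own headword that
                # displays the lemma in bold and copies the lemma's definition.
                promoted.append(
                    Entry(infl, f"<b>{escape(entry.headword)}</b>", entry.definition)
                )
            else:
                kept.append(infl)
        result.append(Entry(entry.headword, entry.display_html, entry.definition, kept))
        result.extend(promoted)
    return result


entries = [
    Entry("oso", "<b>oso</b>", "bear", ["osos"]),
    Entry("osar", "<b>osar</b>", "to dare", ["osó", "osamos"]),
]
result = add_workaround_entries(entries)
for e in result:
    print(e.headword, e.display_html, e.inflections)
```

Here "osó" folds to the same form as the headword "oso", so it is promoted to its own headword, while "osos" and "osamos" stay ordinary inflections. An inflection shared by two lemmas is promoted once per lemma in this sketch.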

I made an attempt to implement this in my function here. This solution works really well for the Spanish dictionary I generated. It currently uses unidecode, which is a bad fit for languages other than Spanish, so it would have to be replaced by a generic function that simply removes all diacritics from a Unicode string.
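
One possible shape for such a generic function, using only the standard library `unicodedata` module. The name `strip_diacritics` is just for illustration. It decomposes each character (NFD), drops the combining marks, and recomposes the rest.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (accents, umlauts, ...) from a Unicode string."""
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return unicodedata.normalize("NFC", stripped)


print(strip_diacritics("osó"))   # oso
print(strip_diacritics("über"))  # uber
print(strip_diacritics("łódź"))  # łodz
```

Unlike unidecode, this does not transliterate: non-Latin scripts stay in their script, and letters without a Unicode decomposition (such as "ł" or "ø") are left unchanged. Whether that matches Kindle's own folding for every language is something I have not verified.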

It also uses a replaced version of a PyGlossary function to support setting the headword HTML independently of the headword, but this "patching" is of course a hacky solution, and I don't know how one would properly model this to fit into the PyGlossary architecture. So if you give me some pointers, I could also try to open a pull request.

ilius · 2022-05-18T04:29:06Z

> It also uses a replaced version of a Pyglossary function to support the setting of the headword HTML independently of the headword

Where can I see your changes to PyGlossary?
I can't find any fork on your account.

Vuizur · 2022-05-18T08:10:07Z

It is in the function I pasted in above my own code.

ilius · 2022-05-20T07:28:15Z

You don't use this function in your repo.
And you seem to have changed GROUP_XHTML_WORD_DEFINITION_TEMPLATE, but again not in that repo.

Vuizur · 2022-05-20T08:28:53Z

I took the format_group_content function I pasted in the linked file and used it to replace the version in the site-packages folder of my venv (I know, this is probably quite stupid, but at least it worked for me locally).

~~I didn't change anything else I think. I checked GROUP_XHTML_WORD_DEFINITION_TEMPLATE and it is the same on my end as in the pyglossary current repo.~~

Vuizur · 2022-05-20T11:51:33Z

Oops, sorry, I actually did change it:

	GROUP_XHTML_WORD_DEFINITION_TEMPLATE = """<idx:entry \
scriptable="yes"{spellcheck_str}>
<idx:orth{headword_hide}>{headword_html}{infl}
</idx:orth>
<br/>{definition}
</idx:entry>
<hr/>"""
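
To illustrate what this template produces, here is a small sketch that fills it with `str.format`. The values passed in are placeholders chosen for the "osó" example, not what PyGlossary actually generates. The point is that the `value` attribute on `idx:orth` can carry the lookup key while the element content shows different HTML.

```python
# Template as quoted above; the backslash joins the first two lines.
GROUP_XHTML_WORD_DEFINITION_TEMPLATE = """<idx:entry \
scriptable="yes"{spellcheck_str}>
<idx:orth{headword_hide}>{headword_html}{infl}
</idx:orth>
<br/>{definition}
</idx:entry>
<hr/>"""

rendered = GROUP_XHTML_WORD_DEFINITION_TEMPLATE.format(
    spellcheck_str="",               # placeholder: no spellcheck attribute
    headword_hide=' value="osó"',    # placeholder: lookup key
    headword_html="<b>osar</b>",     # displayed headword HTML
    infl="",                         # placeholder: no inflection block
    definition="to dare",
)
print(rendered)
```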

Vuizur · 2022-07-26T13:51:02Z

I worked more on this and created a fork with the changes that allow setting a completely separate HTML for each word: https://github.com/Vuizur/pyglossary.
This fork should remain fully compatible with normal usage. The only difference: if a lemma/inflection list begins with the string "HTML_HEAD", the item that follows is displayed as the headword HTML, and the next entry is then used as the value_headword.
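
As a rough sketch of that convention (not the fork's actual code), a word list could be split like this. The function name `split_html_head` and the returned tuple layout are assumptions made for illustration.

```python
from typing import List, Optional, Tuple

HTML_HEAD_MARKER = "HTML_HEAD"


def split_html_head(words: List[str]) -> Tuple[Optional[str], str, List[str]]:
    """Return (display_html, value_headword, remaining_inflections).

    display_html is None when the list does not use the marker, in which
    case the first word is treated as the headword as usual.
    """
    if len(words) >= 3 and words[0] == HTML_HEAD_MARKER:
        return words[1], words[2], words[3:]
    return None, words[0], words[1:]


print(split_html_head(["HTML_HEAD", "<b>osar</b>", "osó"]))
print(split_html_head(["osar", "osó", "osamos"]))
```

Encoding the extra HTML inside the word list keeps the entry structure unchanged, at the cost of reserving the string "HTML_HEAD" as a marker.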

The remaining steps to get it working are to add a Kindle generation option such as "fix_kindle_not_finding_inflections", and then to run a function that converts the input glossary into an intermediate "fixed" glossary, as is done here: https://github.com/Vuizur/pyglossary-kindle-test/blob/master/pyglossary_kindle_test/edit_dictionary.py#L38
(I have not yet found a way to iterate over the glossary data itself, so I used a list with essentially the same structure.)
The project at https://github.com/Vuizur/pyglossary-kindle-test shows how to convert a tabfile to a fixed Kindle dictionary.

ilius · 2022-07-30T15:49:56Z

Can you create a pull request?

Vuizur mentioned this issue Jun 3, 2022

Help: Code that fixes the issue of inflections on Kindle dictionaries Vuizur/ebook_dictionary_creator#1

Open

ilius added a commit that referenced this issue Jan 16, 2023

WIP: ebook_mobi.py: issue #375

316cb6c

ilius added Improvement Feature labels Jan 27, 2023
