Adding an option for kindle output to work around the bugs in their dictionary lookup algorithm #375

Open
Vuizur opened this issue May 16, 2022 · 7 comments

Comments

@Vuizur

Vuizur commented May 16, 2022

Is your feature request related to a problem? Please describe.
Kindle's lookup algorithm is implemented very badly. The first problem is that you cannot turn off fuzzy lookup: the docs say that you can, but this applies only to inflections, not to headwords. The second (even worse) problem is that if the algorithm finds a result among the headwords, it stops searching. This combination leads to stupid behaviour. For example, if you look up the word "osó", a past-tense form of the Spanish verb "osar" ("to dare"), you only get the dictionary entry for "oso", which means "bear", and nothing else, even though your dictionary correctly contains "osó" as an inflection of "osar".
A related problem is that if a word is an inflection of two headwords, the lookup returns only the first headword and ignores the second, which is also annoying.
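
To make the two problems concrete, here is a toy model of the behaviour described above. It is only a sketch of what the lookup appears to do from the outside, not Amazon's actual code; the function names `fold` and `kindle_like_lookup` and the data layout are made up for illustration.

```python
import unicodedata


def fold(word: str) -> str:
    """Strip combining marks and lowercase, standing in for the fuzzy match."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def kindle_like_lookup(query, headwords, inflections):
    """Toy model: fuzzy headword match first, inflections only as a fallback."""
    key = fold(query)
    for headword in headwords:
        if fold(headword) == key:
            # Problem 1: a fuzzy headword hit ends the search immediately.
            return [headword]
    for headword, forms in inflections.items():
        if query in forms:
            # Problem 2: only the first headword owning this form is returned.
            return [headword]
    return []


headwords = ["oso", "osar"]
inflections = {"osar": ["osó", "osé"], "oso": ["osos"]}

print(kindle_like_lookup("osó", headwords, inflections))  # ['oso']
print(kindle_like_lookup("osé", headwords, inflections))  # ['osar']
```

"osé" is found through the inflection table because no headword collides with it, while "osó" never gets that far.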

Describe the solution you'd like
I have lost hope that Amazon will ever fix these bugs, as they have apparently existed for more than 11 years. It is possible to create a dictionary that works around them: for each inflection that might conflict with another inflection or with another headword, you simply create a new headword with a duplicated definition. So for osó, we create the headword osó, set the headword HTML to (bolded) osar, and simply copy the definition content from osar.

The result is that we get a dictionary that is not really slower (as far as I could tell), but always finds all relevant headwords and is simply a much better experience.
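
A minimal sketch of that duplication step, assuming entries are plain dicts; the keys `headword`, `headword_html`, `inflections` and `definition` and the function `add_duplicate_headwords` are hypothetical and do not correspond to PyGlossary's data model.

```python
import unicodedata
from collections import defaultdict


def fold(word: str) -> str:
    """Strip combining marks and lowercase, approximating the fuzzy match."""
    decomposed = unicodedata.normalize("NFD", word)
    return "".join(c for c in decomposed if not unicodedata.combining(c)).lower()


def add_duplicate_headwords(entries):
    """Promote every conflicting inflection to a headword of its own."""
    # For each folded form, record which lemmas lay claim to it.
    claims = defaultdict(set)
    for entry in entries:
        claims[fold(entry["headword"])].add(entry["headword"])
        for form in entry["inflections"]:
            claims[fold(form)].add(entry["headword"])

    fixed = []
    for entry in entries:
        kept = []
        for form in entry["inflections"]:
            if len(claims[fold(form)]) > 1:
                # Conflict: new headword, lemma shown in bold, definition copied.
                fixed.append({
                    "headword": form,
                    "headword_html": f"<b>{entry['headword']}</b>",
                    "inflections": [],
                    "definition": entry["definition"],
                })
            else:
                kept.append(form)
        fixed.append({
            "headword": entry["headword"],
            "headword_html": f"<b>{entry['headword']}</b>",
            "inflections": kept,
            "definition": entry["definition"],
        })
    return fixed


entries = [
    {"headword": "oso", "inflections": ["osos"], "definition": "bear"},
    {"headword": "osar", "inflections": ["osó", "osé"], "definition": "to dare"},
]
fixed = add_duplicate_headwords(entries)
print([e["headword"] for e in fixed])  # ['oso', 'osó', 'osar']
```

Only "osó" is promoted, because it is the only inflection whose folded form is also claimed by another lemma; "osé" and "osos" stay ordinary inflections.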

I made an attempt to implement this in my function here. This solution works really well for the Spanish dictionary I generated. It currently uses unidecode, but that is a bad idea for languages other than Spanish, so it would have to be replaced by a generic function that simply removes all diacritics from a Unicode string.
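
One possible shape for such a generic function, using only the standard library's `unicodedata` module. The name `remove_diacritics` is made up for this sketch, and whether this matches what Kindle's fuzzy matching treats as equivalent is an assumption.

```python
import unicodedata


def remove_diacritics(text: str) -> str:
    """Drop combining marks but leave the base characters untouched."""
    # NFD splits a precomposed character such as "ó" into "o" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    # Recompose whatever is left so the result is in a predictable form.
    return unicodedata.normalize("NFC", stripped)


print(remove_diacritics("osó"))        # oso
print(remove_diacritics("žluťoučký"))  # zlutoucky
print(remove_diacritics("Straße"))     # Straße
```

Unlike unidecode, this does not transliterate: "ß" stays "ß", and letters without a canonical decomposition, such as "ø" or "ł", are left as they are.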

It also uses a replaced version of a PyGlossary function to support setting the headword HTML independently of the headword, but this "patching" is of course a hacky solution, and I don't know how one would properly model this to fit into the PyGlossary architecture. If you give me some pointers, I could also try to open a pull request.

@ilius
Owner

ilius commented May 18, 2022

> It also uses a replaced version of a Pyglossary function to support the setting of the headword HTML independently of the headword

Where can I see your changes to PyGlossary?
I can't find any fork on your account.

@Vuizur
Author

Vuizur commented May 18, 2022

It is in the function I pasted in above my own code.

@ilius
Owner

ilius commented May 20, 2022

You don't use this function in your repo.
And you seem to have changed GROUP_XHTML_WORD_DEFINITION_TEMPLATE, but again not in that repo.

@Vuizur
Author

Vuizur commented May 20, 2022

I took the format_group_content function I pasted into the linked file and used it to replace the version in the site-packages folder of my venv (I know, this is probably quite stupid, but at least it worked for me locally).

I don't think I changed anything else. I checked GROUP_XHTML_WORD_DEFINITION_TEMPLATE and it is the same on my end as in the current PyGlossary repo.

@Vuizur
Author

Vuizur commented May 20, 2022

Oops, sorry, I did change it:

	GROUP_XHTML_WORD_DEFINITION_TEMPLATE = """<idx:entry \
scriptable="yes"{spellcheck_str}>
<idx:orth{headword_hide}>{headword_html}{infl}
</idx:orth>
<br/>{definition}
</idx:entry>
<hr/>"""
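
For illustration, this is roughly how the template could be filled for the duplicated "osó" entry. The values passed to `format` are assumptions made for this sketch (in particular, that `headword_hide` carries the `value` attribute used for lookup); they are not taken from the fork.

```python
GROUP_XHTML_WORD_DEFINITION_TEMPLATE = """<idx:entry \
scriptable="yes"{spellcheck_str}>
<idx:orth{headword_hide}>{headword_html}{infl}
</idx:orth>
<br/>{definition}
</idx:entry>
<hr/>"""

entry_xhtml = GROUP_XHTML_WORD_DEFINITION_TEMPLATE.format(
    spellcheck_str="",             # assumed: no spellcheck attribute
    headword_hide=' value="osó"',  # assumed: the form that gets looked up
    headword_html="<b>osar</b>",   # what the reader sees as the headword
    infl="",                       # the duplicate carries no inflections
    definition="to dare",
)
print(entry_xhtml)
```

The lookup key ("osó") and the displayed headword ("osar") are now independent of each other.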

@Vuizur
Author

Vuizur commented Jul 26, 2022

I worked more on this and created a fork with the changes that allow setting a completely separate HTML for each word: https://github.com/Vuizur/pyglossary.
The fork should be fully compatible with normal usage. Only if a lemma/inflection list begins with the string "HTML_HEAD" is the HTML that follows it displayed as the headword, and the next entry is then used as the value_headword.
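
One way to read that convention, as a sketch only: it assumes the marker, the HTML and the value_headword are three separate list items, which may not be how the fork lays them out, and `split_word_list` is a hypothetical helper, not a function from the fork.

```python
HTML_HEAD = "HTML_HEAD"


def split_word_list(words):
    """Return (headword_html, value_headword, remaining_forms)."""
    if len(words) >= 3 and words[0] == HTML_HEAD:
        # Marker present: item 1 is the HTML to show, item 2 the lookup key.
        return words[1], words[2], words[3:]
    # No marker: normal usage, the first item is the headword itself.
    return None, words[0], words[1:]


print(split_word_list(["HTML_HEAD", "<b>osar</b>", "osó"]))
# ('<b>osar</b>', 'osó', [])
print(split_word_list(["osar", "osé", "osamos"]))
# (None, 'osar', ['osé', 'osamos'])
```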

The remaining steps to get it working are finding a way to add a Kindle generation option such as "fix_kindle_not_finding_inflections" and then running a function that converts the input glossary into an intermediate "fixed" glossary, as is done here: https://github.com/Vuizur/pyglossary-kindle-test/blob/master/pyglossary_kindle_test/edit_dictionary.py#L38
(I just have not found a way to iterate over the glossary data itself, so I used a list that has essentially the same structure.)
The project at https://github.com/Vuizur/pyglossary-kindle-test shows how to convert a tabfile to a fixed Kindle dictionary.

@ilius
Owner

ilius commented Jul 30, 2022

Can you create a pull request?
