Thai chars are in the wrong charsets #2103
Comments
@maximium Thanks for bringing this to our attention and for the PR. @Nick-S-2018 pls review the PR.
@maximium The reason Thai chars are currently in 'cjk' is that our 'cjk' charset in fact comprises the languages that don't use spaces between words. We did that to help users set ngram_chars for such languages. @sanikolaev It appears that the optimal solution is to move the Thai script to a separate charset that can be used along with cjk and non_cjk.
Indeed. The problem with having Thai in cjk is that every letter is tokenized as a word, which is incorrect and produces a lot of irrelevant results. The only way I see to handle Thai text is to split it into separate words in the app before indexing, using some dictionary-based tokenizer. A separate Thai charset fits this perfectly.
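To make the difference concrete, here is a minimal sketch of the two approaches: per-character tokenization (what n-grams of length 1 effectively do) versus greedy longest-match segmentation against a word list in the app. The function names and the three-word `THAI_WORDS` set are hypothetical and only for illustration; a real application would use a proper Thai dictionary or a dedicated segmentation library.

```python
# Hypothetical toy dictionary; a real app needs a full Thai word list.
THAI_WORDS = {"ฉัน", "รัก", "แมว"}


def unigram_tokens(text):
    """Per-character tokenization: every non-space code point becomes a token."""
    return [ch for ch in text if not ch.isspace()]


def segment_longest_match(text, dictionary):
    """Greedy longest-match segmentation against a set of known words.

    Characters that do not start any known word fall back to
    single-character tokens.
    """
    max_len = max(map(len, dictionary), default=1)
    words = []
    i = 0
    while i < len(text):
        if text[i].isspace():
            i += 1
            continue
        # Try the longest possible candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                break
        else:
            # No dictionary word starts here: emit one character.
            candidate = text[i]
        words.append(candidate)
        i += len(candidate)
    return words


if __name__ == "__main__":
    sentence = "ฉันรักแมว"
    print(unigram_tokens(sentence))
    # Join with spaces so the text can be indexed as ordinary words.
    print(" ".join(segment_longest_match(sentence, THAI_WORDS)))
```

The space-joined output is what the app would send for indexing, so that the engine sees whole words instead of single letters. Greedy longest match is only one simple strategy and can mis-segment ambiguous text.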
We've added a new Now we need to update the respective parts of our code and tests to match these changes: #2151
What's important is to make sure the older names are still accessible as aliases to the new ones (or vice versa).
@Nick-S-2018 I don't quite understand the idea of what's done in the PR:
The breaking change related with
Done in 536491f
@klirichek pls continue with "alias non_cjk and non_cont in code" in this PR #2151
Bug Description:
I believe that Thai characters should be in the non_cjk charset, not in cjk, because Thai is written with letters, not logograms like Chinese.
Currently I can see these Thai characters in the cjk, chinese, japanese, and korean charsets.
Manticore Search Version:
6.2.12, master
Operating System Version:
any
Have you tried the latest development version?
Internal Checklist:
To be completed by the assignee. Check off tasks that have been completed or are not applicable.