[WIP] Multi-Lingual Tokenization #147
Conversation
pyproject.toml (outdated)
    @@ -54,7 +54,15 @@ processing = [
         "tldextract",
         "trafilatura>=1.8.0",
         "tokenizers",
    -    "ftfy"
    +    "ftfy",
    +    "stanza",
I suggest keeping only the dependency necessary for Korean, and only the code to support Korean besides English for now. That will make the PR easier to review, and we can easily add more languages once the structure is agreed on.
    }


    def get_word_tokenizer(language: str):
This code is currently dead code. Can we add a test that guarantees that the behavior for English does not change, and then replace:

    from nltk.tokenize import word_tokenize
    words = word_tokenize(doc.text)

with automatic language-specific tokenisation?
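For concreteness, here is a minimal sketch of what such a dispatch could look like. get_word_tokenizer appears in the diff above, but its body, the Korean/stanza branch, and the fallback behavior are assumptions for illustration, not the PR's actual implementation:

    # Hypothetical sketch, not the PR's code. Assumes NLTK's punkt data is
    # available and that stanza's Korean model was fetched once via
    # stanza.download("ko").
    from nltk.tokenize import word_tokenize


    def get_word_tokenizer(language: str):
        if language == "en":
            # Preserve the existing NLTK behavior for English.
            return lambda text: word_tokenize(text, language="english")
        if language == "ko":
            import stanza

            nlp = stanza.Pipeline(lang="ko", processors="tokenize")
            # Flatten stanza's sentence/token structure into a word list.
            return lambda text: [
                token.text for sent in nlp(text).sentences for token in sent.tokens
            ]
        raise ValueError(f"No word tokenizer registered for language {language!r}")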
Currently, the tests in tests/pipeline/test_filters.py don't include language metadata, so enabling automatic language-specific tokenization in the filters makes them fail. Should we also modify the tests to support the change?
@vsabolcec @beme248 please let me know when you are ready for me to review
    text = doc.text
    words = word_tokenize(text)  # TODO we should use language id filter
    language = doc.metadata.get("language", "en")
To keep backward compatibility, the English tokenizer is used if no language metadata is provided. We could also use the multi-language tokenizer from spaCy, or require that language metadata be present.
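A minimal sketch of that fallback, assuming a doc object with a metadata dict and a text attribute as in the diff above, and reusing the hypothetical get_word_tokenizer from the earlier sketch:

    def tokenize_document(doc) -> list[str]:
        # Fall back to English when no language metadata is present,
        # preserving the old single-language behavior ("en" is assumed
        # to be the default language code).
        language = doc.metadata.get("language", "en")
        return get_word_tokenizer(language)(doc.text)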
    def test_english_tokenizer(self):
        nltk_words = word_tokenize(SAMPLE_TEXT, language="english")
        tokenizer_words = default_tokenizer.tokenize(SAMPLE_TEXT, language=Languages.english)

        self.assertEqual(nltk_words, tokenizer_words, "NLTK tokenizer and multilingual tokenizer differ")
This test ensures that the "old" default tokenizer and the "new" default tokenizer produce the same output. Is this test enough to ensure the correctness of the switch, or do we also want to add integration tests?
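One possible shape for a broader equivalence check, a sketch only: the sample strings are invented here, and it reuses the hypothetical get_word_tokenizer from the earlier sketch rather than the PR's real API:

    import unittest

    from nltk.tokenize import word_tokenize


    class TestEnglishTokenizerEquivalence(unittest.TestCase):
        # Illustrative inputs; the real suite would use its own fixtures.
        SAMPLES = [
            "Hello, world!",
            "Dr. Smith isn't here; he left at 5 p.m.",
        ]

        def test_matches_nltk_on_varied_text(self):
            for text in self.SAMPLES:
                with self.subTest(text=text):
                    self.assertEqual(
                        word_tokenize(text, language="english"),
                        get_word_tokenizer("en")(text),
                    )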
    multilingual = [
        "spacy",
    ]
We separate the tokenization dependencies into a new group. In the future, as more tokenization dependencies are added, this lets users avoid downloading dependencies for language tokenizers they don't need. Alternatively, the tokenization dependencies could be moved to the "processing" group.
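With a separate extras group, users who need multilingual tokenization would opt in explicitly, e.g. pip install "datatrove[multilingual]" (the package name is an assumption from the repository context; the extras name comes from the diff above), while everyone else keeps the lighter default install.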
@guipenedo the PR is ready for review
Great work! Some changes:

- move word_tokenizers.py to utils (tools is more for CLI tools)
- at some point we were planning to load the language from the metadata, but we have moved a bit to a more explicit approach: users should pass a language option to each block that uses tokenization of some sort. We believe this is less error prone than using the language in the metadata. The default value here should still be en
- given the above, not sure having the MultilingualTokenizer makes a lot of sense
- instead of a tokenizer option being passed to the blocks, users should instead pass the language option, and there could be a load_tokenizer function in word_tokenizers.py that would return the correct tokenizer (and possibly cache it in a dictionary or something so it can be reused across blocks); a sketch of this follows below
- there are many other blocks using sent_tokenize and word_tokenize, but we can change those later
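A minimal sketch of what that load_tokenizer helper could look like; the use of functools.lru_cache instead of an explicit dictionary and the per-language branches are assumptions, not the reviewer's specification:

    from functools import lru_cache

    from nltk.tokenize import word_tokenize


    @lru_cache(maxsize=None)  # one cached tokenizer per language, shared across blocks
    def load_tokenizer(language: str = "en"):
        if language == "en":
            # The default stays English, matching the old behavior.
            return lambda text: word_tokenize(text, language="english")
        if language == "ko":
            import stanza  # optional dependency; model via stanza.download("ko")

            nlp = stanza.Pipeline(lang="ko", processors="tokenize")
            return lambda text: [
                token.text for sent in nlp(text).sentences for token in sent.tokens
            ]
        raise ValueError(f"Unsupported language: {language!r}")

Keying the cache on the language string gives the "reused across blocks" behavior the reviewer describes without a hand-rolled dictionary.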
LGTM!
Is there any need for the cache?

@guipenedo just explained. Approved
No description provided.