
[WIP] Multi-Lingual Tokenization #147

Merged · 18 commits · May 21, 2024
Conversation

@beme248 (Contributor) commented on Apr 2, 2024:

No description provided.
pyproject.toml Outdated
@@ -54,7 +54,15 @@ processing = [
     "tldextract",
     "trafilatura>=1.8.0",
     "tokenizers",
-    "ftfy"
+    "ftfy",
+    "stanza",
@beme248 (Contributor, Author) commented:

I suggest keeping only the dependency needed for Korean, and only the code to support Korean alongside English, for now. That will make the PR easier to review, and we can easily add more languages once the structure is agreed on.

}


def get_word_tokenizer(language: str):
@beme248 (Contributor, Author) commented:

This is currently dead code. Can we add a test that guarantees the behavior for English does not change, and then replace:

        from nltk.tokenize import word_tokenize

        words = word_tokenize(doc.text)

with automatic language-specific tokenization?
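As a rough sketch of that replacement, assuming `get_word_tokenizer(language)` from the diff above returns an object exposing a `tokenize` method and that the module still lives under `datatrove.tools` at this point in the PR (both assumptions):

    # Sketch only: the import path and the tokenizer interface are assumptions.
    from datatrove.tools.word_tokenizers import get_word_tokenizer


    def tokenize_document(doc) -> list[str]:
        # Fall back to English when no language metadata is present, to keep the
        # current behaviour for documents produced by existing pipelines.
        language = doc.metadata.get("language", "en")
        tokenizer = get_word_tokenizer(language)
        return tokenizer.tokenize(doc.text)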

Contributor reply:

Currently, the tests in tests/pipeline/test_filters.py don't include language metadata, and adding automatic language-specific tokenization to the filters makes them fail. Should we also modify the tests to support this change?
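One possible way to adapt the tests, sketched under the assumption that `datatrove.data.Document` accepts a `metadata` dict; the helper name below is hypothetical:

    from datatrove.data import Document


    def make_doc(text: str, language: str = "en") -> Document:
        # Hypothetical test helper: attach language metadata so filters pick the
        # matching tokenizer instead of falling back to (or failing on) English.
        return Document(text=text, id="0", metadata={"language": language})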

@guipenedo mentioned this pull request on Apr 10, 2024
@guipenedo (Collaborator) commented:

@vsabolcec @beme248 please let me know when you are ready for me to review.

text = doc.text
words = word_tokenize(text) # TODO we should use language id filter
language = doc.metadata.get("language", "en")
Contributor comment on the lines above:

To keep backward compatibility, the English tokenizer is used when no language metadata is provided. We could also use the multi-language tokenizer from spaCy, or require that language metadata be present.
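For reference, the spaCy multi-language option mentioned here could look roughly like the sketch below; whether it fits the tokenizer interface introduced in this PR is left open.

    import spacy

    # spaCy's language-agnostic "xx" pipeline can tokenize text when the document
    # language is unknown, instead of silently assuming English.
    nlp = spacy.blank("xx")


    def tokenize_unknown_language(text: str) -> list[str]:
        return [token.text for token in nlp(text)]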

Comment on lines 25 to 29
    def test_english_tokenizer(self):
        nltk_words = word_tokenize(SAMPLE_TEXT, language="english")
        tokenizer_words = default_tokenizer.tokenize(SAMPLE_TEXT, language=Languages.english)

        self.assertEqual(nltk_words, tokenizer_words, "NLTK tokenizer and multilingual tokenizer differ")
Contributor comment on the lines above:

This test ensures that the "old" default tokenizer and the "new" default tokenizer produce the same output. Is this enough to ensure the correctness of the switch, or do we also want to add integration tests?
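If more coverage is wanted without full integration tests, the same parity check could be run over several sample texts. A sketch, meant to sit in the same test class and relying on the same imports as the test above (the sample strings are placeholders, and only word tokenization is exercised):

    EXTRA_SAMPLES = [
        "Dr. Smith arrived at 3 p.m. on Jan. 5th, 2024.",
        "Hyphenated words, numbers (3.14) and punctuation!",
    ]


    def test_english_tokenizer_more_samples(self):
        for text in EXTRA_SAMPLES:
            nltk_words = word_tokenize(text, language="english")
            tokenizer_words = default_tokenizer.tokenize(text, language=Languages.english)
            self.assertEqual(nltk_words, tokenizer_words, f"Tokenizers differ on: {text!r}")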

Comment on lines +59 to +61
multilingual = [
    "spacy",
]
Contributor comment on the lines above:

We separate the tokenization dependencies into a new group. As more tokenization dependencies are added in the future, this lets users avoid downloading dependencies they don't need if they don't use all the language tokenizers. Alternatively, the tokenization dependencies could be moved to the "processing" group.
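One way the optional dependency could be guarded at import time so users get an actionable error if the extra group is missing; the `datatrove[multilingual]` install hint assumes the package name, and the helper below is hypothetical:

    def _require_spacy():
        # Import spacy lazily so the base install works without the "multilingual"
        # extra; raise a clear error pointing at the extra group otherwise.
        try:
            import spacy
        except ImportError as e:
            raise ImportError(
                "spacy is required for multilingual tokenization; "
                "install it with `pip install datatrove[multilingual]`"
            ) from e
        return spacy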

@vsabolcec (Contributor) commented:

@guipenedo the PR is ready for review.

@guipenedo (Collaborator) left a review:
Great work! Some changes:

  • move word_tokenizers.py to utils (tools is more for CLI tools)
  • at some point we were planning to load the language from the metadata, but we have moved to a more explicit approach: users should pass a language option to each block that uses tokenization of some sort. We believe this is less error-prone than reading the language from the metadata. The default value here should still be en
  • given the above, not sure having the MultilingualTokenizer makes a lot of sense
  • instead of a tokenizer option being passed to the blocks, users should pass a language option, and there could be a load_tokenizer function in word_tokenizers.py that returns the correct tokenizer (and possibly caches it in a dictionary or similar so it can be reused across blocks); see the sketch after this list
  • there are many other blocks using sent_tokenize and word_tokenize, but we can change those later
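A minimal sketch of the load_tokenizer idea with a cache, assuming each tokenizer exposes word_tokenize and sent_tokenize methods (the interface and registry below are illustrative, not the final API):

    from functools import lru_cache

    from nltk.tokenize import sent_tokenize, word_tokenize


    class EnglishTokenizer:
        """Keeps the current NLTK-based behaviour as the default ("en")."""

        def word_tokenize(self, text: str) -> list[str]:
            return word_tokenize(text, language="english")

        def sent_tokenize(self, text: str) -> list[str]:
            return sent_tokenize(text, language="english")


    @lru_cache()
    def load_tokenizer(language: str = "en"):
        # Cached so the same tokenizer instance is reused across blocks.
        if language == "en":
            return EnglishTokenizer()
        raise ValueError(f"No word tokenizer registered for language {language!r}")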

@guipenedo requested a review from hynky1999 on May 21, 2024 at 15:56
@hynky1999 (Contributor) left a review:

LGTM!
Is there any need for the cache?

@hynky1999 (Contributor) commented:

> LGTM! Is there any need for the cache?

@guipenedo just explained. Approved.

@guipenedo merged commit 71c77a4 into huggingface:main on May 21, 2024
4 checks passed