
[WIP] Multi-Lingual Tokenization #147

Merged · 18 commits · May 21, 2024
Conversation

@beme248 (Contributor) commented on Apr 2, 2024:

No description provided.
pyproject.toml Outdated
@@ -54,7 +54,15 @@ processing = [
     "tldextract",
     "trafilatura>=1.8.0",
     "tokenizers",
-    "ftfy"
+    "ftfy",
+    "stanza",
@beme248 (Contributor, Author) commented:

I suggest keeping only the dependency needed for Korean, and only the code to support Korean alongside English, for now. That will make the PR easier to review, and we can easily add more languages once the structure is agreed on.

}


def get_word_tokenizer(language: str):
@beme248 (Contributor, Author) commented:

This is currently dead code. Can we add a test that guarantees the behavior for English does not change, and then replace:

        from nltk.tokenize import word_tokenize

        words = word_tokenize(doc.text)

with automatic language-specific tokenization?
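As a rough sketch of that replacement, assuming `get_word_tokenizer(language)` from the diff above returns an object exposing a `tokenize` method and that the module still lives under `datatrove.tools` at this point in the PR (both assumptions):

    # Sketch only: the import path and the tokenizer interface are assumptions.
    from datatrove.tools.word_tokenizers import get_word_tokenizer


    def tokenize_document(doc) -> list[str]:
        # Fall back to English when no language metadata is present, to keep the
        # current behaviour for documents produced by existing pipelines.
        language = doc.metadata.get("language", "en")
        tokenizer = get_word_tokenizer(language)
        return tokenizer.tokenize(doc.text)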

Contributor reply:

Currently, the tests in tests/pipeline/test_filters.py don't include language metadata, and adding automatic language-specific tokenization to the filters makes them fail. Should we also modify the tests to support this change?
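One possible way to adapt the tests, sketched under the assumption that `datatrove.data.Document` accepts a `metadata` dict; the helper name below is hypothetical:

    from datatrove.data import Document


    def make_doc(text: str, language: str = "en") -> Document:
        # Hypothetical test helper: attach language metadata so filters pick the
        # matching tokenizer instead of falling back to (or failing on) English.
        return Document(text=text, id="0", metadata={"language": language})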

@guipenedo mentioned this pull request on Apr 10, 2024
@guipenedo (Collaborator) commented:

@vsabolcec @beme248 please let me know when you are ready for me to review.

text = doc.text
words = word_tokenize(text) # TODO we should use language id filter
language = doc.metadata.get("language", "en")
Contributor comment on the lines above:

To keep backward compatibility, the English tokenizer is used when no language metadata is provided. We could also use the multi-language tokenizer from spaCy, or require that language metadata be present.
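For reference, the spaCy multi-language option mentioned here could look roughly like the sketch below; whether it fits the tokenizer interface introduced in this PR is left open.

    import spacy

    # spaCy's language-agnostic "xx" pipeline can tokenize text when the document
    # language is unknown, instead of silently assuming English.
    nlp = spacy.blank("xx")


    def tokenize_unknown_language(text: str) -> list[str]:
        return [token.text for token in nlp(text)]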

Comment on lines 25 to 29
    def test_english_tokenizer(self):
        nltk_words = word_tokenize(SAMPLE_TEXT, language="english")
        tokenizer_words = default_tokenizer.tokenize(SAMPLE_TEXT, language=Languages.english)

        self.assertEqual(nltk_words, tokenizer_words, "NLTK tokenizer and multilingual tokenizer differ")
Contributor comment on the lines above:

This test ensures that the "old" default tokenizer and the "new" default tokenizer produce the same output. Is this enough to ensure the correctness of the switch, or do we also want to add integration tests?
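If more coverage is wanted without full integration tests, the same parity check could be run over several sample texts. A sketch, meant to sit in the same test class and relying on the same imports as the test above (the sample strings are placeholders, and only word tokenization is exercised):

    EXTRA_SAMPLES = [
        "Dr. Smith arrived at 3 p.m. on Jan. 5th, 2024.",
        "Hyphenated words, numbers (3.14) and punctuation!",
    ]


    def test_english_tokenizer_more_samples(self):
        for text in EXTRA_SAMPLES:
            nltk_words = word_tokenize(text, language="english")
            tokenizer_words = default_tokenizer.tokenize(text, language=Languages.english)
            self.assertEqual(nltk_words, tokenizer_words, f"Tokenizers differ on: {text!r}")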

Comment on lines +59 to +61
multilingual = [
    "spacy",
]
Contributor comment on the lines above:

We separate the tokenization dependencies into a new group. As more tokenization dependencies are added in the future, this lets users avoid downloading dependencies they don't need if they don't use all the language tokenizers. Alternatively, the tokenization dependencies could be moved to the "processing" group.
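One way the optional dependency could be guarded at import time so users get an actionable error if the extra group is missing; the `datatrove[multilingual]` install hint assumes the package name, and the helper below is hypothetical:

    def _require_spacy():
        # Import spacy lazily so the base install works without the "multilingual"
        # extra; raise a clear error pointing at the extra group otherwise.
        try:
            import spacy
        except ImportError as e:
            raise ImportError(
                "spacy is required for multilingual tokenization; "
                "install it with `pip install datatrove[multilingual]`"
            ) from e
        return spacy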

@vsabolcec (Contributor) commented:

@guipenedo the PR is ready for review.

@guipenedo (Collaborator) left a review:
Great work! Some changes:

  • move word_tokenizers.py to utils (tools is more for CLI tools)
  • at some point we were planning to load the language from the metadata, but we have moved to a more explicit approach: users should pass a language option to each block that uses tokenization of some sort. We believe this is less error-prone than reading the language from the metadata. The default value here should still be en
  • given the above, not sure having the MultilingualTokenizer makes a lot of sense
  • instead of a tokenizer option being passed to the blocks, users should pass a language option, and there could be a load_tokenizer function in word_tokenizers.py that returns the correct tokenizer (and possibly caches it in a dictionary or similar so it can be reused across blocks); see the sketch after this list
  • there are many other blocks using sent_tokenize and word_tokenize, but we can change those later
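A minimal sketch of the load_tokenizer idea with a cache, assuming each tokenizer exposes word_tokenize and sent_tokenize methods (the interface and registry below are illustrative, not the final API):

    from functools import lru_cache

    from nltk.tokenize import sent_tokenize, word_tokenize


    class EnglishTokenizer:
        """Keeps the current NLTK-based behaviour as the default ("en")."""

        def word_tokenize(self, text: str) -> list[str]:
            return word_tokenize(text, language="english")

        def sent_tokenize(self, text: str) -> list[str]:
            return sent_tokenize(text, language="english")


    @lru_cache()
    def load_tokenizer(language: str = "en"):
        # Cached so the same tokenizer instance is reused across blocks.
        if language == "en":
            return EnglishTokenizer()
        raise ValueError(f"No word tokenizer registered for language {language!r}")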

@guipenedo requested a review from hynky1999 on May 21, 2024 at 15:56
@hynky1999 (Contributor) left a review:

LGTM!
Is there any need for the cache?

@hynky1999 (Contributor) commented:

> LGTM! Is there any need for the cache?

@guipenedo just explained. Approved.

@guipenedo merged commit 71c77a4 into huggingface:main on May 21, 2024
4 checks passed