TextVectorization does not convert Cyrillic characters to lowercase #19668

Open
Ybisalt opened this issue May 5, 2024 · 3 comments

Ybisalt commented May 5, 2024

keras.layers.TextVectorization does not convert Cyrillic characters to lowercase with standardize='lower_and_strip_punctuation'.
The deprecated keras.preprocessing.text.Tokenizer does lowercase them.

#==========================================

from tensorflow.keras.layers import TextVectorization

tokenizer = TextVectorization(split='character', standardize='lower_and_strip_punctuation')

tokenizer.adapt(["Zz, Aa"])   # Latin
print(tokenizer.get_vocabulary())   # ['', '[UNK]', 'z', 'a', ' ']

tokenizer.adapt(["Яя, Аа"])   # Cyrillic
print(tokenizer.get_vocabulary())   # ['', '[UNK]', 'я', 'а', 'Я', 'А', ' ']

from tensorflow.keras.preprocessing.text import Tokenizer  # deprecated
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(["Яя, Аа"])   # Cyrillic
print(tokenizer.index_word)   # {1: 'я', 2: 'а', 3: ',', 4: ' '}

#==========================================

fchollet (Member) commented May 5, 2024

The lowercasing is simply done via the TensorFlow operation tf.strings.lower, and since it needs to be a TF op, we are not at liberty to change it. You could open the same issue on the TensorFlow repo instead. A workaround you could use is to express lowercasing via a regex and then use tf.strings.regex_replace, inside your own standardize function passed to TextVectorization.

Ybisalt (Author) commented May 7, 2024

> The lowercasing is simply done via the TensorFlow operation tf.strings.lower, and since it needs to be a TF op, we are not at liberty to change it. You could open the same issue on the TensorFlow repo instead.

It's not because of tf.strings.lower()!
tf.strings.lower() works properly with encoding='utf-8'.

import tensorflow as tf

t = tf.constant("Ff Zz Бб Яя")
print(t)     # tf.Tensor(b'Ff Zz \xd0\x91\xd0\xb1 \xd0\xaf\xd1\x8f', shape=(), dtype=string)

tl_1 = tf.strings.lower(t)
tl_2 = tf.strings.lower(t, encoding='utf-8')

print(tl_1.numpy().decode('utf-8'))     # ff zz Бб Яя
print(tl_2.numpy().decode('utf-8'))     # ff zz бб яя
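
For intuition, plain Python shows the same split between byte-wise and Unicode-aware lowercasing. This is only an analogy, not a description of how TensorFlow implements tf.strings.lower: Python's bytes.lower() changes ASCII letters only, which matches the tl_1 output above, while str.lower() is Unicode-aware and matches tl_2.

```python
text = "Ff Zz Бб Яя"

# Byte-wise lowercasing: only ASCII letters change, and the multi-byte
# UTF-8 sequences that encode the Cyrillic letters are left untouched.
raw = text.encode("utf-8")
ascii_only = raw.lower().decode("utf-8")

# Unicode-aware lowercasing: every cased letter is converted.
unicode_aware = text.lower()

print(ascii_only)     # ff zz Бб Яя
print(unicode_aware)  # ff zz бб яя
```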

By default, tf.keras.layers.TextVectorization is created with encoding='utf-8'.
It looks like TextVectorization does not pass that encoding to tf.strings.lower().
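
Until that is confirmed, one possible workaround is a custom standardize callable that passes the encoding explicitly. The sketch below is an assumption-laden illustration, not Keras's own implementation: the function name lower_and_strip_punctuation_utf8 is hypothetical, and the [[:punct:]] pattern covers ASCII punctuation only, which may differ from the character set that the built-in 'lower_and_strip_punctuation' mode strips.

```python
import tensorflow as tf
from tensorflow.keras.layers import TextVectorization


def lower_and_strip_punctuation_utf8(inputs):
    """Hypothetical helper: UTF-8-aware lowercasing, then punctuation removal."""
    # Pass the encoding explicitly so non-ASCII letters are lowercased too.
    lowered = tf.strings.lower(inputs, encoding="utf-8")
    # Remove ASCII punctuation (assumption: enough for this example).
    return tf.strings.regex_replace(lowered, r"[[:punct:]]", "")


tokenizer = TextVectorization(
    split="character",
    standardize=lower_and_strip_punctuation_utf8,
)
tokenizer.adapt(["Яя, Аа"])  # Cyrillic
vocab = tokenizer.get_vocabulary()
print(vocab)  # uppercase 'Я' and 'А' should no longer appear
```

Because the callable uses only TensorFlow string ops, it should still be usable inside a tf.data pipeline, which is the constraint mentioned earlier in the thread.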

Ybisalt (Author) commented May 19, 2024

Can someone check whether TextVectorization passes the encoding argument to tf.strings.lower()?
