TextVectorization does not convert Cyrillic characters to lowercase #19668

Ybisalt · 2024-05-05T16:46:42Z

keras.layers.TextVectorization does not convert Cyrillic characters to lowercase with 'lower_and_strip_punctuation'.
Deprecated keras.preprocessing.text.Tokenizer does this.

#==========================================

from tensorflow.keras.layers import TextVectorization

tokenizer = TextVectorization(split='character', standardize='lower_and_strip_punctuation')

tokenizer.adapt(["Zz, Aa"])   # Latin
print(tokenizer.get_vocabulary())   # ['', '[UNK]', 'z', 'a', ' ']

tokenizer.adapt(["Яя, Аа"])   # Cyrillic
print(tokenizer.get_vocabulary())   # ['', '[UNK]', 'я', 'а', 'Я', 'А', ' ']

from tensorflow.keras.preprocessing.text import Tokenizer  # deprecated
tokenizer = Tokenizer(char_level=True)
tokenizer.fit_on_texts(["Яя, Аа"])   # Cyrillic
print(tokenizer.index_word)   # {1: 'я', 2: 'а', 3: ',', 4: ' '}

#==========================================

fchollet · 2024-05-05T22:18:00Z

The lowercasing is simply done via the TensorFlow operation tf.strings.lower, and since it needs to be a TF op, we are not at liberty to change it. You could open the same issue on the TensorFlow repo instead. A workaround is to express the lowercasing as a regex and apply it with tf.strings.regex_replace inside your own standardize function passed to TextVectorization.
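A minimal sketch of such a custom standardize function follows. Note that it swaps the regex-based lowercasing suggested above for tf.strings.lower with encoding='utf-8', and it strips punctuation with the RE2 class [[:punct:]] (ASCII punctuation only), which is an assumption and not the exact pattern Keras uses internally. The function name lower_utf8_and_strip_punctuation is hypothetical.

```python
import tensorflow as tf


def lower_utf8_and_strip_punctuation(text):
    """Hypothetical replacement for 'lower_and_strip_punctuation'."""
    # Lowercase with Unicode awareness instead of the default encoding=''.
    lowered = tf.strings.lower(text, encoding="utf-8")
    # Remove ASCII punctuation (assumption: not Keras's exact strip pattern).
    return tf.strings.regex_replace(lowered, r"[[:punct:]]", "")


tokenizer = tf.keras.layers.TextVectorization(
    split="character",
    standardize=lower_utf8_and_strip_punctuation,
)
tokenizer.adapt(["Яя, Аа"])   # Cyrillic
vocab = tokenizer.get_vocabulary()
print(vocab)
```

With this callable the vocabulary should contain only the lowercase Cyrillic characters plus the space, alongside the '' and '[UNK]' entries.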

Ybisalt · 2024-05-07T14:42:13Z

The lowercasing is simply done via the TensorFlow operation tf.strings.lower, and since it needs to be a TF op, we are not at liberty to change it. You could open the same issue on the TensorFlow repo instead.

It's not because of tf.strings.lower()!
tf.strings.lower() works properly with encoding='utf-8'.

import tensorflow as tf

t = tf.constant("Ff Zz Бб Яя")
print(t)     # tf.Tensor(b'Ff Zz \xd0\x91\xd0\xb1 \xd0\xaf\xd1\x8f', shape=(), dtype=string)

tl_1 = tf.strings.lower(t)
tl_2 = tf.strings.lower(t, encoding='utf-8')

print(tl_1.numpy().decode('utf-8'))     # ff zz Бб Яя
print(tl_2.numpy().decode('utf-8'))     # ff zz бб яя

By default: tf.keras.layers.TextVectorization(encoding='utf-8')
It looks like TextVectorization does not pass the encoding to tf.strings.lower()
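The difference between the two calls can be illustrated in plain Python, without TensorFlow. This is only an analogy, assuming the default encoding='' treats the input as ASCII bytes: bytes.lower() changes ASCII letters only, while str.lower() is Unicode-aware.

```python
text = "Ff Zz Бб Яя"

# ASCII-only lowercasing on the raw UTF-8 bytes: Cyrillic bytes are left untouched.
ascii_only = text.encode("utf-8").lower().decode("utf-8")

# Unicode-aware lowercasing on the decoded string.
unicode_aware = text.lower()

print(ascii_only)      # ff zz Бб Яя
print(unicode_aware)   # ff zz бб яя
```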

Ybisalt · 2024-05-19T17:40:02Z

Can someone check whether TextVectorization passes the encoding argument to tf.strings.lower()?

github-actions bot assigned SuryanarayanaY May 5, 2024

SuryanarayanaY added backend:tensorflow stat:awaiting response from contributor labels May 6, 2024

google-ml-butler bot removed the stat:awaiting response from contributor label May 7, 2024
