Flexible normalization #14174

frankiedrake · 2024-02-15T14:53:53Z

Description

I noticed that when using Normalizer, if my sentence contains punctuation that is not surrounded by spaces, the words get joined together.
For example:

"My dog is quite fast/furious and when hungry he can chew furniture,flowers and other things"
Becomes:
"My dog is quite fastfurious and when hungry he can chew furnitureflowers and other things"

I don't know which way would be the most efficient, but it would be good if we could somehow tune the behaviour of the normalizer. Although this is quite an easy step (for example, preprocessing the data with a regular expression), it seems like part of normalization, and it is not something we want to do before the actual (?) normalization.
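
For illustration, the regular-expression preprocessing mentioned above could look roughly like the sketch below. It is plain Python using only the standard `re` module and runs on the raw string before it reaches the pipeline. The helper name `space_out_punctuation` is hypothetical and not part of Spark NLP, and the sketch assumes that any punctuation directly between two word characters should become a single space.

```python
import re

# Hypothetical helper, NOT part of Spark NLP: a workaround sketch only.
# Matches one or more non-word, non-space characters that sit directly
# between two word characters, e.g. the "/" in "fast/furious".
_JOINED_PUNCT = re.compile(r"(?<=\w)[^\w\s]+(?=\w)")


def space_out_punctuation(text: str) -> str:
    """Replace punctuation squeezed between two words with a single space."""
    return _JOINED_PUNCT.sub(" ", text)


sentence = (
    "My dog is quite fast/furious and when hungry he can chew "
    "furniture,flowers and other things"
)
print(space_out_punctuation(sentence))
```

With this step applied first, the two word pairs from the example stay separated ("fast furious", "furniture flowers"). Note that the same pattern would also split contractions such as "don't" and decimals such as "3.14", so the character class would probably need narrowing for real data.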

Preferred Solution

This could be a boolean parameter that respects the presence of spaces (adding them if needed), or maybe a cleanup stage that we can execute before the normalization?

Additional Context


maziyarpanahi · 2024-02-15T15:14:54Z

Hi @frankiedrake
Good catch! I will look into whether any existing parameters already cover this edge case. In the meantime, may I ask what your initial requirement for using the Normalizer was? (Lowercasing, cleaning? If cleaning, which parts of that example should be cleaned?)

frankiedrake added the Feature request label Feb 15, 2024

frankiedrake assigned maziyarpanahi Feb 15, 2024
