Flexible normalization #14174

Open
frankiedrake opened this issue Feb 15, 2024 · 1 comment

@frankiedrake

Description

I noticed that when using Normalizer, if my sentence contains punctuation that is not surrounded by spaces, the words on either side get joined together.
For example:

"My dog is quite fast/furious and when hungry he can chew furniture,flowers and other things"
Becomes:
"My dog is quite fastfurious and when hungry he can chew furnitureflowers and other things"

I don't know which approach would be the most efficient, but it would be good if we could somehow tune the behaviour of the normalizer. Although this is an easy step to do by hand (for example, by preprocessing the data with a regular expression), it seems like part of normalization, so it shouldn't be something we have to do before the actual (?) normalization.
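For reference, here is a minimal sketch of the regular-expression preprocessing mentioned above, in plain Python. The function name `space_out_punctuation` and the chosen separator set (`/ , ; :`) are assumptions made for illustration and are not part of any Normalizer API. It only touches separators that sit directly between two word characters, so punctuation that is already spaced is left for the normalizer to handle.

```python
import re

# Assumption: these separators mark a word boundary when they appear
# directly between two word characters (no surrounding spaces).
_JOINED_SEPARATORS = re.compile(r"(?<=\w)[/,;:]+(?=\w)")


def space_out_punctuation(text: str) -> str:
    """Replace un-spaced separators between words with a single space."""
    return _JOINED_SEPARATORS.sub(" ", text)


sentence = (
    "My dog is quite fast/furious and when hungry he can chew "
    "furniture,flowers and other things"
)
print(space_out_punctuation(sentence))
```

Apostrophes and periods are deliberately left out of the separator set, since treating them as boundaries would split words like "don't" or numbers like "3.14".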

Preferred Solution

This could be a boolean parameter that respects word boundaries (adding spaces where needed), or maybe a cleanup stage that we can run before the normalization?
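To illustrate what such a boolean parameter might do, here is a toy stand-alone function in plain Python. It is not the real Normalizer; the names `normalize` and `keep_word_boundaries` are hypothetical. With the flag off it reproduces the joining behaviour described above, and with the flag on it replaces punctuation with a space instead of deleting it.

```python
import re

# Anything that is neither a word character nor whitespace.
_PUNCTUATION = re.compile(r"[^\w\s]")


def normalize(text: str, keep_word_boundaries: bool = False) -> str:
    """Toy normalizer: strips punctuation from `text`.

    keep_word_boundaries (hypothetical flag): when True, punctuation is
    replaced by a space so the neighbouring words stay separate.
    """
    if keep_word_boundaries:
        spaced = _PUNCTUATION.sub(" ", text)
        # Collapse any runs of whitespace introduced by the substitution.
        return " ".join(spaced.split())
    return _PUNCTUATION.sub("", text)


sentence = (
    "My dog is quite fast/furious and when hungry he can chew "
    "furniture,flowers and other things"
)
print(normalize(sentence))                             # words get joined
print(normalize(sentence, keep_word_boundaries=True))  # words stay separate
```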

Additional Context

@maziyarpanahi
Member

Hi @frankiedrake
Good catch! I will look into this to see whether any existing parameters already handle this edge case. However, may I ask what the initial requirement for using a Normalizer was? (Lowercasing, cleaning? If so, which parts of that example should be cleaned, etc.)
