I noticed that when using Normalizer, if my sentence contains punctuation that is not surrounded by spaces, the words on either side get joined together.
For example:
"My dog is quite fast/furious and when hungry he can chew furniture,flowers and other things"
Becomes:
"My dog is quite fastfurious and when hungry he can chew furnitureflowers and other things"
I don't know which way would be the most efficient, but it would be good if we could somehow tune the behaviour of the normalizer. Although this is quite an easy step (for example, preprocessing the data with a regular expression), it seems like part of normalization, and it's not something we want to do before the actual (?) normalization.
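To illustrate the regular-expression workaround mentioned above, here is a minimal sketch in plain Python. The helper name `pad_punctuation` and the set of punctuation characters are my own assumptions for illustration, not part of the Normalizer or any library API. It only pads the listed characters with spaces so that the words stay separate tokens.

```python
import re

# Hypothetical helper (not part of any library): the character set below is an
# assumption and should be adjusted to the punctuation found in your data.
_PUNCT = re.compile(r"\s*([/,;:!?])\s*")


def pad_punctuation(text: str) -> str:
    """Surround each listed punctuation mark with single spaces."""
    padded = _PUNCT.sub(r" \1 ", text)
    # Collapse any repeated whitespace produced by adjacent punctuation marks.
    return " ".join(padded.split())


sentence = (
    "My dog is quite fast/furious and when hungry he can chew "
    "furniture,flowers and other things"
)
print(pad_punctuation(sentence))
```

With the punctuation separated like this, a later cleanup step that drops punctuation should no longer glue the neighbouring words together, assuming it works on whitespace-separated tokens.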
Preferred Solution
This could be a boolean parameter that respects the presence of spaces (adding them if needed), or maybe a cleanup stage that we can execute before the normalization?
Additional Context
Hi @frankiedrake
Good catch! I will look into this to see whether any existing parameters already handle this edge case. However, may I ask what your initial requirement for using the Normalizer was? (Lowercasing, cleaning? If so, which parts of that example should be cleaned, etc.)