TokenClassificationPipeline support is_split_into_words tokeniser parameter #30757

Open
swtb3 opened this issue May 11, 2024 · 2 comments
Labels
Core: Pipeline (Internals of the library; Pipeline.), Feature request (Request for a new feature)

Comments

@swtb3

swtb3 commented May 11, 2024

Feature request

The TokenClassificationPipeline currently sets a hardcoded tokeniser config within its sanitiser method. This prevents users from passing their own config to the tokeniser.

It would be good to support some user input for the tokeniser config, especially for is_split_into_words, as input data may already be split into words.

Motivation

It is common for token classification datasets to be split into words already so that they match their labels.
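To illustrate the mismatch, here is a minimal sketch of the kind of workaround that is needed today: join the pre-split words into one string, run the pipeline on that string, then map the character offsets in the output back to word indices. It assumes the pipeline returns dicts with "entity", "start" and "end" keys. The helper names (word_char_spans, labels_per_word) are hypothetical, and the predictions in the usage example are hand-written for illustration, not real model output.

```python
from typing import Dict, List, Optional, Tuple


def word_char_spans(words: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join pre-split words with single spaces and record each word's (start, end) span."""
    spans = []
    cursor = 0
    for word in words:
        spans.append((cursor, cursor + len(word)))
        cursor += len(word) + 1  # +1 for the joining space
    return " ".join(words), spans


def labels_per_word(
    words: List[str], predictions: List[Dict], default: str = "O"
) -> List[str]:
    """Give each word the label of the first prediction overlapping its character span."""
    _, spans = word_char_spans(words)
    labels: List[Optional[str]] = [None] * len(words)
    for pred in predictions:
        for i, (start, end) in enumerate(spans):
            overlaps = pred["start"] < end and pred["end"] > start
            if overlaps and labels[i] is None:
                labels[i] = pred["entity"]
    # Words with no overlapping prediction fall back to the default label.
    return [label if label is not None else default for label in labels]


words = ["Hugging", "Face", "is", "in", "New", "York"]
text, spans = word_char_spans(words)

# Hand-written stand-in for pipeline output on `text` (assumed shape, not real output).
predictions = [
    {"entity": "B-ORG", "start": 0, "end": 3},
    {"entity": "I-ORG", "start": 3, "end": 7},
    {"entity": "I-ORG", "start": 8, "end": 12},
    {"entity": "B-LOC", "start": 19, "end": 22},
    {"entity": "I-LOC", "start": 23, "end": 27},
]

word_labels = labels_per_word(words, predictions)
print(list(zip(words, word_labels)))
```

If is_split_into_words could be passed through to the tokeniser, this re-alignment step would presumably be unnecessary, since the input would stay aligned with the word-level labels from the start.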

Your contribution

I naively anticipate this being a simple change, so I am happy to submit a PR for it. First, though, it would be nice to see some discussion of the feature and whether it fits with the goals of Transformers.

@amyeroberts
Collaborator

cc @ArthurZucker @Rocketknight1

@amyeroberts amyeroberts added the Core: Pipeline Internals of the library; Pipeline. label May 13, 2024
@Rocketknight1
Member

This makes sense to me, but I'm not super-familiar with that pipeline. I'd support a PR to allow some options to be passed through to the tokenizer, though, since that shouldn't have any backward compatibility issues!

@ArthurZucker ArthurZucker added the Feature request Request for a new feature label May 23, 2024

4 participants