TokenClassificationPipeline support is_split_into_words tokeniser parameter #30757

swtb3 · 2024-05-11T11:15:38Z

Feature request

The TokenClassificationPipeline currently sets a hardcoded tokeniser config within its sanitiser method. This prevents users from passing their own config to the tokeniser.

It would be good to support some user input for the tokeniser config, especially for is_split_into_words, as input data may already be split into words.

Motivation

It is common for token classification datasets to already be split into words, so that each word lines up with its label.
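To make the pain point concrete, here is a rough sketch of the workaround needed today when the data is already split into words: join the words into one string, remember each word's character span, and map the pipeline's character offsets back onto words. The helper names (`join_words_with_spans`, `entities_to_word_labels`) and the `mock_entities` list are hypothetical and written by hand for illustration. The `entity`, `start` and `end` keys are assumed to match the un-aggregated pipeline output. No model is actually run here.

```python
from typing import Dict, List, Tuple


def join_words_with_spans(words: List[str]) -> Tuple[str, List[Tuple[int, int]]]:
    """Join pre-split words with single spaces and record each word's character span."""
    spans = []
    cursor = 0
    for word in words:
        spans.append((cursor, cursor + len(word)))
        cursor += len(word) + 1  # +1 accounts for the joining space
    return " ".join(words), spans


def entities_to_word_labels(
    entities: List[Dict],
    spans: List[Tuple[int, int]],
    default: str = "O",
) -> List[str]:
    """Give each word the label of the first predicted entity that overlaps its span."""
    labels = [default] * len(spans)
    for entity in entities:
        for i, (start, end) in enumerate(spans):
            overlaps = entity["start"] < end and entity["end"] > start
            if overlaps and labels[i] == default:
                labels[i] = entity["entity"]
    return labels


words = ["Hugging", "Face", "is", "based", "in", "New", "York"]
text, spans = join_words_with_spans(words)

# Hypothetical pipeline output, hand-written for this example (not produced by a model).
mock_entities = [
    {"entity": "B-ORG", "start": 0, "end": 7},
    {"entity": "I-ORG", "start": 8, "end": 12},
    {"entity": "B-LOC", "start": 25, "end": 28},
    {"entity": "I-LOC", "start": 29, "end": 33},
]

word_labels = entities_to_word_labels(mock_entities, spans)
print(list(zip(words, word_labels)))
```

If the pipeline could forward is_split_into_words=True to the tokeniser, this bookkeeping would presumably no longer be needed on the user side, since the tokeniser already accepts pre-split input when called directly.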

Your contribution

I naively anticipate this being a simple change, so I am happy to submit a PR for it. It would first be nice to see some discussion of the feature and whether it fits with the goals of Transformers.

amyeroberts · 2024-05-13T09:17:03Z

cc @ArthurZucker @Rocketknight1

Rocketknight1 · 2024-05-13T13:47:07Z

This makes sense to me, but I'm not super-familiar with that pipeline. I'd support a PR to allow some options to be passed through to the tokenizer, though, since that shouldn't have any backward compatibility issues!

amyeroberts added the Core: Pipeline label May 13, 2024

ArthurZucker added the Feature request label May 23, 2024
