
Unknown PostProcessor type: Sequence #739

Open
zcbenz opened this issue May 6, 2024 · 7 comments · May be fixed by #771
Labels: bug (Something isn't working)

Comments


zcbenz commented May 6, 2024

System Info

Using Node.js 20 with transformers.js 2.17.1.

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

It seems the following post-processor in tokenizer.json is not supported:

  "post_processor": {
    "type": "Sequence",
    "processors": [
      {
        "type": "ByteLevel",
        "add_prefix_space": true,
        "trim_offsets": false,
        "use_regex": true
      },
      {
        "type": "TemplateProcessing",
        "single": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          }
        ],
        "pair": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          },
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 1
            }
          },
          {
            "Sequence": {
              "id": "B",
              "type_id": 1
            }
          }
        ],
        "special_tokens": {
          "<|begin_of_text|>": {
            "id": "<|begin_of_text|>",
            "ids": [
              128000
            ],
            "tokens": [
              "<|begin_of_text|>"
            ]
          }
        }
      }
    ]
  },

Reproduction

import {AutoTokenizer} from '@xenova/transformers'

await AutoTokenizer.from_pretrained('yujiepan/llama-3-tiny-random')

This throws:

                throw new Error(`Unknown PostProcessor type: ${config.type}`);
                      ^

Error: Unknown PostProcessor type: Sequence
    at PostProcessor.fromConfig (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:1596:23)
    at new PreTrainedTokenizer (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:2465:45)
    at AutoTokenizer.from_pretrained (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:4424:16)
zcbenz added the bug label on May 6, 2024
xenova (Owner) commented May 6, 2024

Hi there 👋 Thanks for the report!

Luckily, we already support the ByteLevel and TemplateProcessing post-processors, so the only thing needed is to implement the Sequence post-processor.

Similarly, we already support sequences of normalizers, decoders, and pre-tokenizers, and a similar pattern can be adapted for post-processors. Is this something you'd be interested in adding? If so, I'd be happy to review a PR.
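The pattern described above could look roughly like the sketch below. This is a hand-written illustration, not transformers.js code: the stand-in processor classes and the `{ tokens }` return shape are assumptions made for the example.

```javascript
// Self-contained sketch of a Sequence post-processor: it chains its
// sub-processors, feeding each one's output tokens into the next.
// These classes are simplified stand-ins, not transformers.js internals.

class ByteLevelPostProcessor {
  // The real ByteLevel processor mainly adjusts offsets;
  // token ids pass through unchanged in this sketch.
  post_process(tokens) {
    return { tokens };
  }
}

class TemplateProcessor {
  constructor(bosToken) {
    this.bosToken = bosToken;
  }
  // Mirrors the "single" template in the tokenizer.json above:
  // prepend the <|begin_of_text|> special token.
  post_process(tokens) {
    return { tokens: [this.bosToken, ...tokens] };
  }
}

class SequencePostProcessor {
  constructor(processors) {
    this.processors = processors;
  }
  // Run each sub-processor in order on the previous result.
  post_process(tokens) {
    let result = { tokens };
    for (const processor of this.processors) {
      result = processor.post_process(result.tokens);
    }
    return result;
  }
}

const seq = new SequencePostProcessor([
  new ByteLevelPostProcessor(),
  new TemplateProcessor('<|begin_of_text|>'),
]);
console.log(seq.post_process(['Hello', 'Ġworld']).tokens);
// → [ '<|begin_of_text|>', 'Hello', 'Ġworld' ]
```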

zcbenz (Author) commented May 6, 2024

Sorry, I don't plan to work on this issue; I was just reporting one I ran into.

xenova (Owner) commented May 6, 2024

No worries! It's super simple, so I'll add it soon. Thanks again for reporting!

domleboss97 commented

@xenova I am also encountering this issue. I was going to take a pass at it, but I don't understand the internals well enough to know how to meaningfully accumulate the token_type_ids generated by post-processors (ref).

domleboss97 commented

@xenova Any thoughts on this? It's preventing Llama 3 8B from loading, which is a bummer.

xenova (Owner) commented May 14, 2024

Here's the Rust code for it: https://github.com/huggingface/tokenizers/blob/25aee8b88c8de3c5a52e2f9cb6281d6df00ad516/tokenizers/src/processors/sequence.rs#L18-L36 — it should be straightforward to translate into JS.
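For reference, the linked Rust `Sequence` just threads the encodings through each sub-processor in turn and sums their added special tokens. A rough JS translation sketch follows; method names mirror the Rust trait, and the shape of the sub-processor objects is an assumption for illustration, not the actual transformers.js interface.

```javascript
// Rough JS translation of the linked Rust Sequence processor.
// Each sub-processor is assumed to expose added_tokens() and
// process_encodings(), mirroring the Rust PostProcessor trait.
class Sequence {
  constructor(processors) {
    this.processors = processors;
  }

  // Total number of special tokens added by the whole chain
  // (the Rust version sums added_tokens over self.processors).
  added_tokens(isPair) {
    return this.processors.reduce((sum, p) => sum + p.added_tokens(isPair), 0);
  }

  // Thread the encodings through each processor in order,
  // like the Rust for-loop in process_encodings.
  process_encodings(encodings, addSpecialTokens) {
    for (const processor of this.processors) {
      encodings = processor.process_encodings(encodings, addSpecialTokens);
    }
    return encodings;
  }
}
```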

xenova linked a pull request (#771) on May 23, 2024 that will close this issue
xenova (Owner) commented May 23, 2024

I added support for it in #771. See here for example usage.

3 participants