
Unknown PostProcessor type: Sequence #739

Open
zcbenz opened this issue May 6, 2024 · 7 comments · May be fixed by #771
Labels: bug (Something isn't working)

Comments


zcbenz commented May 6, 2024

System Info

Using Node.js 20 with transformers.js 2.17.1.

Environment/Platform

  • Website/web-app
  • Browser extension
  • Server-side (e.g., Node.js, Deno, Bun)
  • Desktop app (e.g., Electron)
  • Other (e.g., VSCode extension)

Description

It seems the following post-processor in tokenizer.json is not supported:

  "post_processor": {
    "type": "Sequence",
    "processors": [
      {
        "type": "ByteLevel",
        "add_prefix_space": true,
        "trim_offsets": false,
        "use_regex": true
      },
      {
        "type": "TemplateProcessing",
        "single": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          }
        ],
        "pair": [
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 0
            }
          },
          {
            "Sequence": {
              "id": "A",
              "type_id": 0
            }
          },
          {
            "SpecialToken": {
              "id": "<|begin_of_text|>",
              "type_id": 1
            }
          },
          {
            "Sequence": {
              "id": "B",
              "type_id": 1
            }
          }
        ],
        "special_tokens": {
          "<|begin_of_text|>": {
            "id": "<|begin_of_text|>",
            "ids": [
              128000
            ],
            "tokens": [
              "<|begin_of_text|>"
            ]
          }
        }
      }
    ]
  },

Reproduction

import {AutoTokenizer} from '@xenova/transformers'

await AutoTokenizer.from_pretrained('yujiepan/llama-3-tiny-random')

This throws:

                throw new Error(`Unknown PostProcessor type: ${config.type}`);
                      ^

Error: Unknown PostProcessor type: Sequence
    at PostProcessor.fromConfig (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:1596:23)
    at new PreTrainedTokenizer (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:2465:45)
    at AutoTokenizer.from_pretrained (file:///private/tmp/tinyllama/node_modules/@xenova/transformers/src/tokenizers.js:4424:16)
zcbenz added the bug label on May 6, 2024
xenova (Owner) commented May 6, 2024

Hi there 👋 Thanks for the report!

Luckily, we already support the ByteLevel and TemplateProcessing post-processors, so the only thing needed is to implement the Sequence post-processor.

Similarly, we already support sequences of normalizers, decoders, and pre-tokenizers, and a similar pattern can be adapted for post-processors. Is this something you'd be interested in adding? If so, I'd be happy to review a PR.
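The pattern described above could look roughly like the sketch below. This is a hand-written illustration, not transformers.js code: the stand-in processor classes and the `{ tokens }` return shape are assumptions made for the example.

```javascript
// Self-contained sketch of a Sequence post-processor: it chains its
// sub-processors, feeding each one's output tokens into the next.
// These classes are simplified stand-ins, not transformers.js internals.

class ByteLevelPostProcessor {
  // The real ByteLevel processor mainly adjusts offsets;
  // token ids pass through unchanged in this sketch.
  post_process(tokens) {
    return { tokens };
  }
}

class TemplateProcessor {
  constructor(bosToken) {
    this.bosToken = bosToken;
  }
  // Mirrors the "single" template in the tokenizer.json above:
  // prepend the <|begin_of_text|> special token.
  post_process(tokens) {
    return { tokens: [this.bosToken, ...tokens] };
  }
}

class SequencePostProcessor {
  constructor(processors) {
    this.processors = processors;
  }
  // Run each sub-processor in order on the previous result.
  post_process(tokens) {
    let result = { tokens };
    for (const processor of this.processors) {
      result = processor.post_process(result.tokens);
    }
    return result;
  }
}

const seq = new SequencePostProcessor([
  new ByteLevelPostProcessor(),
  new TemplateProcessor('<|begin_of_text|>'),
]);
console.log(seq.post_process(['Hello', 'Ġworld']).tokens);
// → [ '<|begin_of_text|>', 'Hello', 'Ġworld' ]
```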

zcbenz (Author) commented May 6, 2024

Sorry, I don't plan to work on this issue; I was just reporting one I ran into.

xenova (Owner) commented May 6, 2024

No worries! It's super simple, so I'll add it soon. Thanks again for reporting!

domleboss97 commented

@xenova I am also encountering this issue. I was going to take a pass at it, but I don't understand the internals well enough to know how to meaningfully accumulate the token_type_ids generated by post-processors (ref).

domleboss97 commented

@xenova Any thoughts on this? It's preventing Llama 3 8B from loading, which is a bummer.

xenova (Owner) commented May 14, 2024

Here's the Rust code for it: https://github.com/huggingface/tokenizers/blob/25aee8b88c8de3c5a52e2f9cb6281d6df00ad516/tokenizers/src/processors/sequence.rs#L18-L36 — it should be straightforward to translate into JS.
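For reference, the linked Rust `Sequence` just threads the encodings through each sub-processor in turn and sums their added special tokens. A rough JS translation sketch follows; method names mirror the Rust trait, and the shape of the sub-processor objects is an assumption for illustration, not the actual transformers.js interface.

```javascript
// Rough JS translation of the linked Rust Sequence processor.
// Each sub-processor is assumed to expose added_tokens() and
// process_encodings(), mirroring the Rust PostProcessor trait.
class Sequence {
  constructor(processors) {
    this.processors = processors;
  }

  // Total number of special tokens added by the whole chain
  // (the Rust version sums added_tokens over self.processors).
  added_tokens(isPair) {
    return this.processors.reduce((sum, p) => sum + p.added_tokens(isPair), 0);
  }

  // Thread the encodings through each processor in order,
  // like the Rust for-loop in process_encodings.
  process_encodings(encodings, addSpecialTokens) {
    for (const processor of this.processors) {
      encodings = processor.process_encodings(encodings, addSpecialTokens);
    }
    return encodings;
  }
}
```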

xenova linked a pull request (#771) on May 23, 2024 that will close this issue
xenova (Owner) commented May 23, 2024

I added support for it in #771. See here for example usage.

3 participants