
Use a trie to speed up index construction #887

Draft: wants to merge 10 commits into base: main

lapp0
Contributor

lapp0 commented May 10, 2024

Replaces #507

Fixes #795

Awaiting #904 before I continue working on this

In regex.py, Outlines compiles an index of the legal tokens for each state of the FSM (state_scan_tokens).

On the main branch we use this naive approach:

  • for each state S_n in the FSM
    • for each token in vocabulary
      • simulate FSM traversal character by character starting from S_n; if successful, add the token to S_n

This calls _walk_fsm once for every (state, token) pair, i.e. num_tokens * num_states_per_token times, which is inefficient and is currently the bottleneck in index construction.
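For illustration, the naive loop can be sketched roughly as follows. This is a simplified toy, not the actual code in regex.py: the `walk` and `build_index_naive` helpers, the transition-dict FSM, and the tiny vocabulary are all assumptions made up for the example.

```python
from typing import Dict, Iterable, List, Optional, Set, Tuple

# Hypothetical toy FSM representation: (state, character) -> next state.
Transitions = Dict[Tuple[int, str], int]


def walk(transitions: Transitions, state: int, token: str) -> Optional[int]:
    """Follow `token` character by character; return the end state, or None on failure."""
    for char in token:
        next_state = transitions.get((state, char))
        if next_state is None:
            return None
        state = next_state
    return state


def build_index_naive(
    transitions: Transitions, states: Iterable[int], vocabulary: List[str]
) -> Dict[int, Set[str]]:
    """One full walk per (state, token) pair."""
    index: Dict[int, Set[str]] = {}
    for state in states:
        index[state] = set()
        for token in vocabulary:
            if walk(transitions, state, token) is not None:
                index[state].add(token)
    return index


# Toy FSM for the regex `ab*`: 0 --a--> 1, 1 --b--> 1
TRANSITIONS: Transitions = {(0, "a"): 1, (1, "b"): 1}
VOCABULARY = ["a", "ab", "abb", "b", "bb", "c"]

naive_index = build_index_naive(TRANSITIONS, [0, 1], VOCABULARY)
```

Every token is re-walked from scratch for every state, so "a", "ab" and "abb" each repeat the same first transition.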

We can improve this by using a Trie (implementation details to come)
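Until those details land, here is a rough sketch of one way a trie could help. It is not this PR's implementation: `TrieNode`, `build_trie`, `build_index_trie` and the transition-dict FSM are hypothetical names and structures chosen for the example. The idea is that tokens sharing a prefix share the FSM walk for that prefix, and a failed transition prunes the whole subtree of tokens below it.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical toy FSM representation: (state, character) -> next state.
Transitions = Dict[Tuple[int, str], int]


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.tokens: List[str] = []  # tokens that end exactly at this node


def build_trie(vocabulary: Iterable[str]) -> TrieNode:
    root = TrieNode()
    for token in vocabulary:
        node = root
        for char in token:
            node = node.children.setdefault(char, TrieNode())
        node.tokens.append(token)
    return root


def build_index_trie(
    transitions: Transitions, states: Iterable[int], vocabulary: Iterable[str]
) -> Dict[int, Set[str]]:
    """Walk the trie and the FSM together, once per start state."""
    root = build_trie(vocabulary)
    index: Dict[int, Set[str]] = {}
    for start in states:
        index[start] = set()
        stack: List[Tuple[TrieNode, int]] = [(root, start)]
        while stack:
            node, fsm_state = stack.pop()
            # Reaching this node means every token ending here is legal from `start`.
            index[start].update(node.tokens)
            for char, child in node.children.items():
                next_state = transitions.get((fsm_state, char))
                if next_state is not None:  # otherwise the whole subtree is skipped
                    stack.append((child, next_state))
    return index


# Toy FSM for the regex `ab*`: 0 --a--> 1, 1 --b--> 1
TRIE_TRANSITIONS: Transitions = {(0, "a"): 1, (1, "b"): 1}
TRIE_VOCABULARY = ["a", "ab", "abb", "b", "bb", "c"]

trie_index = build_index_trie(TRIE_TRANSITIONS, [0, 1], TRIE_VOCABULARY)
```

In this toy, the transition on "a" from state 0 is taken once and reused for "a", "ab" and "abb", and tokens starting with "b" or "c" are rejected at the first character without being walked individually.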

rlouf
Member

rlouf commented May 10, 2024

Could you add a high-level description of what the PR does so the PR is self-contained? Is there an issue we could link to this PR? Also there is no need to add "WIP" to the title, this is what "Draft PR" means :)

rlouf changed the title from "WIP: Vocab Trie To Speed Up regex.py" to "Use a trie to speed up index construction" on May 10, 2024

lapp0
Contributor Author

lapp0 commented May 15, 2024

I believe I've found a bug in regex.py's reduced_vocabulary()

For token 188 in the gpt2 tokenizer ('\x00'), token_tuple_np is empty (array([''], dtype='<U2')), but it isn't added to empty_token_ids.

Edit: appears it's being addressed in #904

Labels
None yet
Projects
Status: Todo
Development

Successfully merging this pull request may close these issues.

Accelerate the index construction process
2 participants