Use a trie to speed up index construction #887

lapp0 · 2024-05-10T12:12:06Z

Replaces #507

Fixes #795

Awaiting #904 before I continue working on this

In regex.py, Outlines compiles an index of legal tokens for each state of the FSM (state_scan_tokens).

On the main branch we use this naive approach:

for each state S_n in the FSM
- for each token in vocabulary
  - simulate FSM traversal character by character starting from S_n; if it succeeds, add the token to S_n
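The nested loop above can be sketched roughly as follows. This is a toy illustration, not the code in regex.py: the dictionary-based FSM, the `walk` and `naive_index` helpers, and the five-token vocabulary are all hypothetical stand-ins for `_walk_fsm` and `state_scan_tokens`.

```python
from typing import Dict, Iterable, Optional, Tuple

# Hypothetical FSM encoding: (state, character) -> next state.
Transitions = Dict[Tuple[int, str], int]


def walk(transitions: Transitions, start: int, token: str) -> Optional[int]:
    """Walk the FSM one character at a time; return the end state or None on failure."""
    state = start
    for ch in token:
        nxt = transitions.get((state, ch))
        if nxt is None:
            return None
        state = nxt
    return state


def naive_index(
    transitions: Transitions, states: Iterable[int], vocabulary: Dict[str, int]
) -> Dict[int, Dict[int, int]]:
    """For every state, try every token: one full walk per (state, token) pair."""
    index: Dict[int, Dict[int, int]] = {}
    for s in states:
        for token, token_id in vocabulary.items():
            end = walk(transitions, s, token)
            if end is not None:
                index.setdefault(s, {})[token_id] = end
    return index


# Toy FSM for the regex "ab*": state 0 --a--> state 1, state 1 --b--> state 1.
transitions = {(0, "a"): 1, (1, "b"): 1}
vocabulary = {"a": 0, "b": 1, "ab": 2, "bb": 3, "c": 4}
index = naive_index(transitions, [0, 1], vocabulary)
print(index)
```

Note that tokens sharing a prefix (for example "a" and "ab") repeat the same leading transitions on every walk.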

Calling _walk_fsm num_tokens * num_states_per_token times is inefficient and is currently the bottleneck in index construction.

We can improve this by using a trie (implementation details to come).
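Since the implementation details are still to come, the following is only a sketch of the general idea, under the assumption that the vocabulary is stored in a character trie and traversed together with the FSM. `TrieNode`, `build_trie`, `trie_index`, and the dictionary-based toy FSM are hypothetical names for illustration, not the PR's actual code.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical FSM encoding: (state, character) -> next state.
Transitions = Dict[Tuple[int, str], int]


class TrieNode:
    """One node per vocabulary prefix; token_ids lists tokens ending exactly here."""

    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.token_ids: List[int] = []


def build_trie(vocabulary: Dict[str, int]) -> TrieNode:
    root = TrieNode()
    for token, token_id in vocabulary.items():
        node = root
        for ch in token:
            node = node.children.setdefault(ch, TrieNode())
        node.token_ids.append(token_id)
    return root


def trie_index(
    transitions: Transitions, states: Iterable[int], vocabulary: Dict[str, int]
) -> Dict[int, Dict[int, int]]:
    """Walk the trie and the FSM in lockstep, so shared prefixes are traversed once."""
    root = build_trie(vocabulary)
    index: Dict[int, Dict[int, int]] = {}
    for s in states:
        stack = [(root, s)]
        while stack:
            node, fsm_state = stack.pop()
            for ch, child in node.children.items():
                nxt = transitions.get((fsm_state, ch))
                if nxt is None:
                    continue  # no transition: every token under this prefix is illegal
                for token_id in child.token_ids:
                    index.setdefault(s, {})[token_id] = nxt
                stack.append((child, nxt))
    return index


# Toy FSM for the regex "ab*": state 0 --a--> state 1, state 1 --b--> state 1.
transitions = {(0, "a"): 1, (1, "b"): 1}
vocabulary = {"a": 0, "b": 1, "ab": 2, "bb": 3, "c": 4}
index = trie_index(transitions, [0, 1], vocabulary)
print(index)
```

In this sketch a failed transition prunes an entire subtree of tokens at once, and each shared prefix is walked once per state rather than once per token.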

rlouf · 2024-05-10T13:10:14Z

Could you add a high-level description of what the PR does so the PR is self-contained? Is there an issue we could link to this PR? Also there is no need to add "WIP" to the title, this is what "Draft PR" means :)

lapp0 · 2024-05-15T20:31:58Z

I believe I've found a bug in regex.py's reduced_vocabulary()

For token 188 in the gpt2 tokenizer ('\x00'), token_tuple_np is empty (array([''], dtype='<U2')), but the token isn't added to empty_token_ids.

Edit: this appears to be addressed in #904.
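One NumPy behavior consistent with the empty token_tuple_np reported above is that fixed-width unicode arrays drop trailing null characters when an element is read back. The snippet below only demonstrates that behavior in isolation. Whether this is the exact path taken inside reduced_vocabulary() is an assumption, and the variable names are reused here purely for illustration.

```python
import numpy as np

# A single null character stored in a fixed-width unicode array.
token_tuple_np = np.array(["\x00"], dtype="<U2")

# NumPy strips trailing nulls from fixed-width strings on retrieval,
# so the element reads back as an empty string.
element = token_tuple_np[0]
is_empty = len(element) == 0
print(repr(token_tuple_np), is_empty)
```

A check that relies on the original Python string being empty would therefore miss this token, while a check on the array element would catch it.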

re-introduce vocab trie to optimize index construction

955ee4e

lapp0 mentioned this pull request May 10, 2024

Use a trie for scanning during index construction #507

Closed

misc fixes

2644ddc

rlouf changed the title ~~WIP: Vocab Trie To Speed Up regex.py~~ Use a trie to speed up index construction May 10, 2024

rlouf assigned lapp0 May 10, 2024

Andrew Lapp added 6 commits May 11, 2024 21:48

add vocab_trie

3bead88

functioning vocab trie, byte capabilities WIP

6e183f3

revert to previous FSM traversal code to minimize diff

269935d

add unicode char seq dict

d0bc17b

checkpoint

edab6d7

working byte-token vocab trie

392288e

Andrew Lapp added 2 commits May 15, 2024 15:45

fix null byte error

d2066ad

ensure null bytes added to empty_token_ids

884f921
