
Update create_states_mapping function to include tokenizer parameter #873

Closed
wants to merge 6 commits

Conversation

br3no (Contributor) commented May 7, 2024

Fixes #872

brandonwillard (Contributor) left a comment


I don't know how well the tokenizer instance will hash/serialize. We need to at least make sure that there's a test confirming that we get cache hits when appropriate.

br3no (Author) commented May 7, 2024

The class inherits Hashable

class Tokenizer(Hashable, Protocol):
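(For context, an abbreviated, illustrative sketch of that protocol; the exact member list in outlines may differ. Note that inheriting Hashable only requires implementers to define __hash__; it does not by itself make hashing stable across instances or sessions.)

```python
from typing import Dict, Hashable, List, Protocol, Set


class Tokenizer(Hashable, Protocol):
    # Member names below are illustrative, not an exact copy of the outlines interface.
    eos_token: str
    eos_token_id: int
    pad_token_id: int
    vocabulary: Dict[str, int]
    special_tokens: Set[str]

    def encode(self, prompt: str): ...
    def decode(self, token_ids) -> List[str]: ...
    def convert_token_to_string(self, token: str) -> str: ...
```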

But yeah, a test would be good. Is there documentation for contributors somewhere that could help me set up the dev env?

brandonwillard (Contributor) commented

> The class inherits Hashable
>
> class Tokenizer(Hashable, Protocol):
>
> But yeah, a test would be good.

Yeah, we have some tests for in-session hash consistency (e.g. here), but I recall some issues with cross-session/serialization consistency. That might not be the case anymore, though.

> Is there documentation for contributors somewhere that could help me set up the dev env?

https://outlines-dev.github.io/outlines/community/contribute/

ekagra-ranjan commented May 7, 2024

Passing the tokenizer directly as part of the cache key will not work, since each instance of the same tokenizer object is hashed to a different value, so there will always be a cache miss. I didn't see this PR earlier, so I raised a PR with my local fix in #876, which caches based on the tokenizer name or path string.
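(To illustrate the point, a toy example, not outlines code: the class and member names are hypothetical.)

```python
# Two instances that represent the *same* underlying tokenizer fall back to the
# default identity-based __hash__, so they never produce the same cache key.
class FakeTokenizer:  # hypothetical stand-in
    def __init__(self, name_or_path: str):
        self.name_or_path = name_or_path


a = FakeTokenizer("gpt2")
b = FakeTokenizer("gpt2")

print(hash(a) == hash(b))                 # False: id-based hashes differ per instance
print(a.name_or_path == b.name_or_path)   # True: the name/path string is a stable identifier
```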

br3no (Author) commented May 10, 2024

I've changed the PR to explicitly set the cache key computation using the key_function parameter of the @cache() decorator. This is in line with the discussion in #876.
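(Conceptually, the approach looks something like the following self-contained sketch. It is not the actual outlines.caching.cache implementation, whose interface later changed again in #911, and the tokenizer fields used in the key are assumptions.)

```python
import functools


# Minimal sketch of a cache decorator that derives the key from an explicit
# key_function over the arguments, instead of relying on hash(tokenizer).
def cache(key_function=None):
    def decorator(fn):
        store = {}

        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if key_function is not None:
                key = key_function(*args, **kwargs)
            else:
                key = (args, tuple(sorted(kwargs.items())))
            if key not in store:
                store[key] = fn(*args, **kwargs)
            return store[key]

        return wrapper

    return decorator


# Hypothetical key: the regex plus the fields that identify the tokenizer.
def states_mapping_key(regex_string, tokenizer):
    return (
        regex_string,
        type(tokenizer).__name__,
        tuple(sorted(tokenizer.vocabulary.items())),  # assumes a dict-like `vocabulary` field
    )


@cache(key_function=states_mapping_key)
def create_states_mapping(regex_string, tokenizer):
    ...  # expensive FSM construction happens here
```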

@brandonwillard what do you think?

brandonwillard (Contributor) commented

> I've changed the PR to explicitly set the cache key computation using the key_function parameter of the @cache() decorator. This is in line with the discussion in #876.
>
> @brandonwillard what do you think?

We need tests that confirm that the hashing function works as intended.

br3no (Author) commented May 10, 2024

@brandonwillard I have added unit tests for the cache decorator when used in conjunction with the key_function. Let me know if this is sufficient for you.
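(A test in that spirit might look like the following; it is hypothetical and written against the cache sketch above, not the code actually added in the PR.)

```python
def test_cache_key_function_hits_across_instances():
    calls = []

    class FakeTokenizer:  # hypothetical stand-in, not an outlines class
        def __init__(self, name_or_path):
            self.name_or_path = name_or_path

    @cache(key_function=lambda regex, tok: (regex, tok.name_or_path))
    def build(regex, tok):
        calls.append((regex, tok))
        return object()

    first = build("[0-9]+", FakeTokenizer("gpt2"))
    second = build("[0-9]+", FakeTokenizer("gpt2"))  # new instance, same key

    assert first is second  # cache hit despite a different tokenizer instance
    assert len(calls) == 1  # the wrapped function ran only once
```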

brandonwillard (Contributor) left a comment


> @brandonwillard I have added unit tests for the cache decorator when used in conjunction with the key_function. Let me know if this is sufficient for you.

I like where you're going with those tests! I was originally referring to tests for the choice of key/hash function for the tokenizers, though.

We need to confirm that this choice of key will work between sessions and across different tokenizer types (at the very least the transformer types). Since you're using all the members/fields of the generic Tokenizer interface, it's probably fine; if not, we need to update the interface and/or force subclasses to implement this key function themselves.

Speaking of which, we should first consider/try updating the __hash__ and __eq__ methods of Tokenizer to follow this key. That would make things a lot more consistent; however, for the few times we need to compute a hash, it will be a little expensive. Since the interface already implies some form of immutability via Hashable, and the tokenizers should conceptually be immutable in this scenario, we could reasonably cache the hash value and mostly remove that concern.

Anyway, if we fix Tokenizer, then we only need to add that extra argument to create_states_mapping and a direct test guaranteeing that the disk cache is hit for—say—an example using GPT2's tokenizer (at the very least).
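(A rough sketch of that suggestion, not outlines code: the member names and the choice of hashed fields are assumptions made for illustration.)

```python
import functools


class TransformerTokenizer:
    def __init__(self, hf_tokenizer):
        self.tokenizer = hf_tokenizer
        self.vocabulary = hf_tokenizer.get_vocab()

    @functools.cached_property
    def _cached_hash(self) -> int:
        # Computed once and cached, since the tokenizer is conceptually immutable.
        # Note: Python's built-in string hashing is randomized per process, so a
        # persistent disk cache would need a stable digest (e.g. hashlib over a
        # serialized form) rather than hash() itself.
        return hash(tuple(sorted(self.vocabulary.items())))

    def __hash__(self) -> int:
        return self._cached_hash

    def __eq__(self, other) -> bool:
        return (
            isinstance(other, TransformerTokenizer)
            and self.vocabulary == other.vocabulary
        )
```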

br3no (Author) commented May 11, 2024

> We need to confirm that this choice of key will work between sessions and across different tokenizer types (at the very least the transformer types). Since you're using all the members/fields of the generic Tokenizer interface, it's probably fine;

The key_function returns the fields that, together, "define" a particular tokenizer (these fields are all used in later code, so we can safely assume it's okay to use them for this purpose). Assuming the hash is computed correctly, I believe we can also safely assume that the caching will work, since we verified in a unit test that the key_function is used the way it should be.

> if not, we need to update the interface and/or force subclasses to implement this key function themselves.

I agree this would be the best solution, but it would require a large refactoring. The different integrations work with tokenizers in different ways. The only implementation of the Tokenizer protocol class is

class TransformerTokenizer(Tokenizer):

For vLLM, the tokenizer is patched with the right methods and members; in that case we would also need to patch the hashing function into the tokenizer, which I don't think is something we should do.

> Speaking of which, we should first consider/try updating the __hash__ and __eq__ methods of Tokenizer to follow this key.

The situation described above is also the reason why this is, unfortunately, not a viable solution at the moment.

I haven't looked into the matter in depth, but I believe there are some refactoring opportunities to consolidate the usage of tokenizers, aligned with what is done in the transformers integration. If all integrations provided implementations of the Tokenizer protocol, we could make the hash computation much simpler and easier to test.


I opened the original issue because I realized that the cache uses only the regex as its key, which leads to errors because different tokenizers end up sharing the same cached state machine. Since the cache is persistent, I believe many users will run into this issue. The symptoms are hard to diagnose; the LLMs simply generate gibberish.

I think the right thing to do now is to merge this PR so that people don't run into this problem; this is not a new feature, it's a bug fix. We can then open a new issue to address the refactoring needs described above and, if necessary, clean up the change introduced in this PR.

brandonwillard (Contributor) left a comment


I've created a PR (i.e. #911) that synthesizes the things we've talked about and outlines the kind of test we need before merging anything that closes #872.

The outlines.caching.cache tests here are useful, so we can update this PR so that it only introduces those changes (in one separate commit), or we can merge the changes from #911 into this PR if you want to finish that work here.

br3no (Author) commented May 23, 2024

@brandonwillard great work on #911. If I read it correctly, there is no longer a key_function argument in the cache decorator, right? So the tests here don't make sense if #911 is merged.

I don't mind closing this PR; I'd just like to have the behavior fixed. #911 seems like a better increment than this PR here.

Merging this pull request may close issue #872: State mapping cache ignores the tokenizer used to build the state machine.