Llama3 tokenizer with Incorrect offset_mapping #1517

justin-shao · 2024-04-27T01:33:56Z

When tokenizing with the Llama 3 tokenizer and return_offsets_mapping=True, the resulting offset_mapping does not match the behavior described in the docs.

Example:

from transformers import AutoTokenizer

model_name = "meta-llama/Meta-Llama-3-8B"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, padding_side="left")
# Request character offsets alongside the token ids
print(tokenizer(["Sample input"], return_offsets_mapping=True))

will yield:

{'input_ids': [[128000, 18031, 1988]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 0), (6, 6)]]}

offset_mapping should contain a (char_start, char_end) tuple for each token.
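A quick way to spot the problem is to flag every non-special token whose span is empty. The snippet below is only a sketch: the helper name `zero_width_tokens` is made up for illustration, the values are copied from the output above, and it assumes 128000 is the BOS token id.

```python
def zero_width_tokens(input_ids, offsets, special_ids):
    """Return indices of non-special tokens whose (start, end) span is empty."""
    return [
        i
        for i, (tok, (start, end)) in enumerate(zip(input_ids, offsets))
        if tok not in special_ids and start == end
    ]


# Values copied from the output reported above.
input_ids = [128000, 18031, 1988]
offsets = [(0, 0), (0, 0), (6, 6)]
special_ids = {128000}  # assumption: 128000 is the BOS token id

print(zero_width_tokens(input_ids, offsets, special_ids))
```

With a loaded tokenizer, `tokenizer.all_special_ids` could presumably be used in place of the hard-coded set.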

ArthurZucker · 2024-04-30T15:15:04Z

Hey! This seems to be expected, no? The documentation might be wrong, but there are no offsets here (trim_offsets is set to False, I think):
['Sample', 'Ġinput'] are the two tokens.
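For reference, here is a rough sketch of the spans one might expect for these two tokens, with and without trimming. It is not the tokenizers implementation: `expected_offsets` is a hypothetical helper, and it assumes ASCII input where each token character (including the 'Ġ' space marker) corresponds to exactly one source character, and that trimming only drops leading 'Ġ' markers.

```python
def expected_offsets(tokens, trim_offsets=False):
    """Compute (start, end) spans by laying the tokens end to end.

    Simplifying assumption: one token character == one source character,
    with 'Ġ' standing in for a single leading space.
    """
    spans = []
    pos = 0
    for tok in tokens:
        start, end = pos, pos + len(tok)
        pos = end
        if trim_offsets:
            # Skip leading space markers so the span starts at the first visible character
            start += len(tok) - len(tok.lstrip("Ġ"))
        spans.append((start, end))
    return spans


tokens = ["Sample", "Ġinput"]
print(expected_offsets(tokens))                     # [(0, 6), (6, 12)]
print(expected_offsets(tokens, trim_offsets=True))  # [(0, 6), (7, 12)]
```

Either way, both tokens get non-empty spans, unlike the (0, 0) and (6, 6) entries in the reported output.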
