Llama-3 offset-mapping needs fixing #1553

Open
davidb-cerebras opened this issue Jun 14, 2024 · 5 comments


@davidb-cerebras

Opening a new issue to follow up on the previously opened issue here -- #1517

Here we can see the desired behavior: return_offsets_mapping from Mistral gives character indices corresponding to each token:

(Pdb) from transformers import AutoTokenizer
(Pdb) tok_mistral = AutoTokenizer.from_pretrained("mistralai/Mistral-7B-Instruct-v0.1")
(Pdb) tok_mistral(["Sample input"], return_offsets_mapping=True)
{'input_ids': [[1, 27797, 2787]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 6), (6, 12)]]}
(Pdb) tok_mistral.convert_ids_to_tokens([1, 27797, 2787])
['<s>', '▁Sample', '▁input']
(Pdb) "Sample input"[0:6]
'Sample'
(Pdb) "Sample input"[6:12]
' input'

But for Llama-3 the offsets are not correct (following the pattern above, the expected mapping for this input would be [(0, 0), (0, 6), (6, 12)]):

(Pdb) tok_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct") 
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
(Pdb) tok_llama3(["Sample input"], return_offsets_mapping=True)
{'input_ids': [[128000, 18031, 1988]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 0), (6, 6)]]}

We can also see Llama-2 and GPT-2 behaving the same as Mistral, so Llama-3 is definitely the one exhibiting the unexpected behavior:

(Pdb) tok_llama2 = AutoTokenizer.from_pretrained("NousResearch/Llama-2-7b-hf")
(Pdb) tok_llama2(["Sample input"], return_offsets_mapping=True)
{'input_ids': [[1, 21029, 1881]], 'attention_mask': [[1, 1, 1]], 'offset_mapping': [[(0, 0), (0, 6), (6, 12)]]}
(Pdb) tok_gpt2 = AutoTokenizer.from_pretrained("openai-community/gpt2") 
(Pdb) tok_gpt2(["Sample input"], return_offsets_mapping=True)
{'input_ids': [[36674, 5128]], 'attention_mask': [[1, 1]], 'offset_mapping': [[(0, 6), (6, 12)]]}
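
For completeness, a small reproduction script along these lines (assuming access to the same checkpoints used above) prints the decoded ids and offset spans side by side, making the Llama-3 mismatch visible at a glance:

from transformers import AutoTokenizer

text = "Sample input"
checkpoints = [
    "mistralai/Mistral-7B-Instruct-v0.1",
    "NousResearch/Llama-2-7b-hf",
    "openai-community/gpt2",
    "meta-llama/Meta-Llama-3-8B-Instruct",
]

for name in checkpoints:
    tok = AutoTokenizer.from_pretrained(name)
    enc = tok(text, return_offsets_mapping=True)
    print(name)
    for token_id, (start, end) in zip(enc["input_ids"], enc["offset_mapping"]):
        # The substring covered by each offset pair should line up with the token;
        # with Llama-3 the spans collapse to (0, 0) and (6, 6) and cover nothing.
        print(f"  id={token_id:>7}  offsets=({start}, {end})  text={text[start:end]!r}")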
@davidb-cerebras
Author

@ArthurZucker Is it possible to fix this in tokenizers?

@ArthurZucker
Collaborator

Yep, you are right, I'll dive in a bit to see why we have this!

@davidb-cerebras
Author

Awesome thank you!

@maximilianmordig

@ArthurZucker Is there a workaround in the meantime?

@ArthurZucker
Collaborator

Sorry, not yet! I am fixing a bunch of stuff, maybe #1568?
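
For anyone blocked on this in the meantime, one rough interim approximation (not from this thread, and not an official fix) is to rebuild the character offsets by decoding each token and locating it in the original text with a running cursor. The approximate_offsets helper below is hypothetical and assumes each token decodes to a contiguous substring of simple ASCII input:

from transformers import AutoTokenizer

def approximate_offsets(tokenizer, text):
    # Hypothetical helper: rebuild (start, end) character offsets per token.
    # Assumes each token decodes to a contiguous substring of `text`; this holds
    # for plain ASCII text but can break when bytes split multi-byte characters.
    ids = tokenizer(text, add_special_tokens=False)["input_ids"]
    offsets = []
    cursor = 0
    for token_id in ids:
        piece = tokenizer.decode([token_id])
        start = text.find(piece, cursor) if piece else -1
        if start == -1:
            # Could not locate the decoded piece; emit an empty span as a fallback.
            offsets.append((cursor, cursor))
            continue
        end = start + len(piece)
        offsets.append((start, end))
        cursor = end
    return offsets

tok_llama3 = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
print(approximate_offsets(tok_llama3, "Sample input"))
# Expected for this input: [(0, 6), (6, 12)]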
