
[BUG] Fast tokenizer does not deal with AddedTokens properly (no problem in Transformers python tokenizer impl.) #1544

MilkClouds commented Jun 4, 2024

When I try to add tokens to the vocabulary, the Fast tokenizers show three issues; the Python (slow) tokenizer has none of them:

  • The space before an added token is deleted when the added token is not a special token.
  • When the added token is not a special token, writing added tokens in a row (without spaces) prevents the subsequent token from being tokenized as an added token.
  • A single space (ID 28705) between two added tokens is encoded as a double space (ID 259) when the added tokens are special tokens.

Source code to reproduce the issue

from transformers import (
    AutoProcessor,
    LlamaTokenizer,
    LlamaTokenizerFast,
)

processor = AutoProcessor.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    do_image_splitting=True,
)
print(processor.tokenizer)


def test_tokenizer(tokenizer):
    # print(f"Tokenizer: {tokenizer}")
    print("=======")
    test_texts = [
        "!@#",
        "!@# ",
        "!@# <ACTION_1>",
        "!@# <ACTION_1> ",
        "!@# <ACTION_1> <ACTION_2>",
        "!@# <ACTION_1><ACTION_2>",
    ]
    for text in test_texts:
        print(f"{text:30}", tokenizer(text))


# Slow (Python) tokenizer, added tokens registered as regular tokens
tokenizer = LlamaTokenizer.from_pretrained("HuggingFaceM4/idefics2-8b")
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)])
test_tokenizer(tokenizer)

# Slow (Python) tokenizer, added tokens registered as special tokens
tokenizer = LlamaTokenizer.from_pretrained("HuggingFaceM4/idefics2-8b")
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)], special_tokens=True)
test_tokenizer(tokenizer)

# Fast tokenizer, added tokens registered as regular tokens
tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b")
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)])
test_tokenizer(tokenizer)

# Fast tokenizer, added tokens registered as special tokens
tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b")
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)], special_tokens=True)
test_tokenizer(tokenizer)

Execution result

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
LlamaTokenizerFast(name_or_path='HuggingFaceM4/idefics2-8b', vocab_size=32000, model_max_length=1000000000000000019884624838656, is_fast=True, padding_side='left', truncation_side='right', special_tokens={'bos_token': '<s>', 'eos_token': '</s>', 'unk_token': '<unk>', 'pad_token': '<unk>', 'additional_special_tokens': ['<fake_token_around_image>', '<image>', '<end_of_utterance>']}, clean_up_tokenization_spaces=False),  added_tokens_decoder={
        0: AddedToken("<unk>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        1: AddedToken("<s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        2: AddedToken("</s>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        32000: AddedToken("<fake_token_around_image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        32001: AddedToken("<image>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
        32002: AddedToken("<end_of_utterance>", rstrip=False, lstrip=False, single_word=False, normalized=False, special=True),
}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=======
!@#                            {'input_ids': [1, 918, 28818, 28771], 'attention_mask': [1, 1, 1, 1]}
!@#                            {'input_ids': [1, 918, 28818, 28771, 28705], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 28705], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1> <ACTION_2>      {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 28705, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1><ACTION_2>       {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=======
!@#                            {'input_ids': [1, 918, 28818, 28771], 'attention_mask': [1, 1, 1, 1]}
!@#                            {'input_ids': [1, 918, 28818, 28771, 28705], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 28705], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1> <ACTION_2>      {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 28705, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1><ACTION_2>       {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=======
!@#                            {'input_ids': [1, 918, 28818, 28771], 'attention_mask': [1, 1, 1, 1]}
!@#                            {'input_ids': [1, 918, 28818, 28771, 28705], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 32004], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 32004, 28705], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1> <ACTION_2>      {'input_ids': [1, 918, 28818, 28771, 32004, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1><ACTION_2>       {'input_ids': [1, 918, 28818, 28771, 32004, 28789, 17615, 28730, 28750, 28767], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1, 1]}
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
=======
!@#                            {'input_ids': [1, 918, 28818, 28771], 'attention_mask': [1, 1, 1, 1]}
!@#                            {'input_ids': [1, 918, 28818, 28771, 28705], 'attention_mask': [1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004], 'attention_mask': [1, 1, 1, 1, 1, 1]}
!@# <ACTION_1>                 {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 259], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1> <ACTION_2>      {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 259, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1]}
!@# <ACTION_1><ACTION_2>       {'input_ids': [1, 918, 28818, 28771, 28705, 32004, 32005], 'attention_mask': [1, 1, 1, 1, 1, 1, 1]}

Additional Note

If I load the Fast tokenizer with the from_slow option, there is no problem:
tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b", from_slow=True)
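
For reference, a minimal sketch of this workaround as I use it locally, reusing the test_tokenizer helper from the script above (whether it covers every case is my assumption):

from transformers import LlamaTokenizerFast

# Rebuild the fast tokenizer from the slow (SentencePiece) files instead of
# loading the shipped tokenizer.json, then add the extra tokens as before.
tokenizer = LlamaTokenizerFast.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    from_slow=True,
)
tokenizer.add_tokens([f"<ACTION_{idx}>" for idx in range(18)])
test_tokenizer(tokenizer)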

MilkClouds (Author) commented

$ pip list | grep tokenizer
tokenizers                0.19.1
$ pip list | grep transformers
sentence-transformers     3.0.0
transformers              4.41.2

ArthurZucker (Collaborator) commented

Hey! I think most of these can be resolved if you set the legacy=False flag when initializing the tokenizer. I'll talk to the M4 team about this.

Basically, the normalizer prepends a space before each token and before each split! For more details, see huggingface/transformers#28881
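
Something along these lines, as a sketch (not tested against this checkpoint; as far as I understand, legacy only takes effect when the fast tokenizer is rebuilt from the slow one, so pairing it with from_slow=True here is an assumption):

from transformers import LlamaTokenizerFast

# Sketch of the suggestion: disable the legacy normalizer behaviour that
# prepends a space before each token/split. from_slow=True (an assumption)
# forces the fast tokenizer to be rebuilt with this setting.
tokenizer = LlamaTokenizerFast.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    legacy=False,
    from_slow=True,
)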


github-actions bot commented Jul 6, 2024

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

github-actions bot added the Stale label on Jul 6, 2024
github-actions bot closed this as not planned on Jul 11, 2024
MilkClouds (Author) commented Jul 12, 2024

processor.tokenizer.legacy is already False, but the same issue still occurs.

transformers version: 4.42.3
tokenizers version: 0.19.1

The output is exactly the same as when I first reported this issue.

ArthurZucker (Collaborator) commented

Hey! As you mention:

tokenizer = LlamaTokenizerFast.from_pretrained("HuggingFaceM4/idefics2-8b", from_slow=True)

the issue is that, unless you pass from_slow, I cannot update the tokenizer for you. I'll ping the team internally for sure, but we need to re-upload tokenizer.json!
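
Roughly what that re-upload would involve, as a sketch (the output directory is just a placeholder):

from transformers import LlamaTokenizerFast

# Rebuild the fast tokenizer from the slow files with the non-legacy
# behaviour, then save it to get an updated tokenizer.json that could be
# proposed on the Hub.
tokenizer = LlamaTokenizerFast.from_pretrained(
    "HuggingFaceM4/idefics2-8b",
    from_slow=True,
    legacy=False,
)
tokenizer.save_pretrained("idefics2-8b-tokenizer-fix")  # placeholder path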

ArthurZucker reopened this on Jul 12, 2024
ArthurZucker (Collaborator) commented

Can you open a PR on the hub and ping me here with the link? 🤗

github-actions bot removed the Stale label on Jul 13, 2024