Training HuggingFace tokenizer - ignore_merges #1537

ykoyfman · 2024-05-22T17:49:31Z

Looking through Llama3 changes, I see that "ignore_merges" was added as a property to support conversion from tiktoken models. Can a native HF tokenizer be trained using this property? It's not clear if this is possible with, say, train_new_from_iterator. CC @ArthurZucker - Thanks
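For context, here is a toy sketch of what the flag changes at encode time. It is not the library's implementation: `bpe_encode`, `vocab`, and `merges` are hypothetical names, and the example assumes a vocabulary that contains a token the merge rules cannot reach, as can happen after a tiktoken conversion.

```python
def bpe_encode(word, vocab, merges, ignore_merges=False):
    """Toy BPE encoder for one pre-tokenized word (illustration only)."""
    # With ignore_merges, a word that is already a vocabulary entry is
    # emitted directly instead of being rebuilt from the merge rules.
    if ignore_merges and word in vocab:
        return [word]
    # Earlier merges have lower rank and are applied first.
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while len(symbols) > 1:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(symbols, symbols[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


# Hypothetical data: "abc" is in the vocabulary, but no merge produces it.
vocab = {"a", "b", "c", "ab", "abc"}
merges = [("a", "b")]

without_flag = bpe_encode("abc", vocab, merges)
with_flag = bpe_encode("abc", vocab, merges, ignore_merges=True)
```

Here `without_flag` is `["ab", "c"]` and `with_flag` is `["abc"]`, which is why the setting matters for vocabularies that were not produced by the merge list alone.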

ArthurZucker · 2024-06-06T09:17:18Z

I think that it's not the case yet, but we should support it!
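Until training supports it, one possible workaround is to train a regular BPE tokenizer and then switch the flag on in the serialized file. The sketch below uses only the standard library and assumes the `tokenizer.json` layout in which the BPE model is stored under `"model"` with `"type": "BPE"` and an `"ignore_merges"` field; `enable_ignore_merges` is a hypothetical helper, not part of any library.

```python
import json


def enable_ignore_merges(tokenizer_json: str) -> str:
    """Return a serialized tokenizer with model.ignore_merges set to true.

    Assumes the JSON layout described above; nothing is retrained, only
    the encode-time flag changes.
    """
    config = json.loads(tokenizer_json)
    model = config.get("model", {})
    if model.get("type") != "BPE":
        raise ValueError("ignore_merges only applies to BPE models")
    model["ignore_merges"] = True
    return json.dumps(config, ensure_ascii=False)
```

The input string could come from `Tokenizer.to_str()` on the trained tokenizer and the result could be reloaded with `Tokenizer.from_str(...)`, assuming a `tokenizers` version that understands the field. Note that this only changes how text is encoded; the vocabulary and merges are still learned without the flag.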

ArthurZucker · 2024-07-16T11:49:50Z

(Anyone feel free to open a PR if you have time!)

ArthurZucker added the Feature Request and planned labels Jun 6, 2024

github-actions bot added the Stale label Jul 7, 2024

github-actions bot closed this as not planned Jul 13, 2024

huggingface deleted a comment from github-actions bot Jul 16, 2024
