BPE Trainer doesn't respect the vocab_size parameter when dataset size is increased #1514
I'm training a new tokenizer for Tamil, an Indic language, and I tried two different runs (a sketch of the training setup follows the list):
1. Test run with part of the data, ~0.3 GB. This gives a vocab file with exactly 2000 tokens (attached: ta_vocab_pretok_2000.json), and the merges are computed correctly.
2. Run with the entire dataset, ~15 GB. This gives a much larger vocab file with no merges. The vocab count is ~5800, ignoring the value of 2000 I passed to the trainer (attached: ta_vocab_pretok_2000_full_data.json).
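
For context, here is a minimal sketch of the training setup, assuming the Hugging Face `tokenizers` API with a `BpeTrainer` and whitespace pre-tokenization; the file paths and special tokens are placeholders, not the exact script I ran:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Build a BPE tokenizer with whitespace pre-tokenization
tokenizer = Tokenizer(BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = Whitespace()

# Ask the trainer for a 2000-token vocabulary
trainer = BpeTrainer(vocab_size=2000, special_tokens=["[UNK]"])

# Train on the Tamil corpus (path is a placeholder) and save the result
tokenizer.train(files=["ta_corpus.txt"], trainer=trainer)
tokenizer.save("ta_vocab_pretok_2000.json")
```

My expectation is that vocab_size caps the final vocabulary (initial alphabet plus learned merges) regardless of corpus size, which is what happens in the small run but not in the full-data run.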
Questions: