So the most common merge pairs were never used, and tokens like " the" and " a" were never learned. Our baselines used 175 GB of CulturaX English text; the bug appears after around 108 GB. We are very grateful to Sander Land for identifying that the baselines were missing common merge pairs.
TokShop will be at #COLM2026!
🗓️ October 9th, 2026
📍 San Francisco, USA
More details and a call for papers coming soon.
Our new paper reformulates tokenisation as a linear program (LP), which we solve to get SOTA tokenisers 😁 As a bonus, this LP tells us how close to optimal any tokeniser is! Check it out 👇
w/ J. Tempus, @philipwitti.bsky.social, @craigschmidt.com, D. Komm
Paper: arxiv.org/abs/2605.22821
The bug is Hugging Face tokenizers issue #2058 (github.com/huggingface/...): the library counts merge pairs in an i32 hash map, and once any pair's count crosses 2^31 − 1 = 2,147,483,647 the counter wraps to a negative value and the pair never gets selected as a merge.
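A toy sketch of the failure mode described above, in case it helps to see it concretely. This is not the library's code: `wrap_i32` and `add_count` are hypothetical helpers that imitate a wrapping signed 32-bit counter, and the pairs and counts are made up for illustration.

```python
I32_MAX = 2**31 - 1  # 2,147,483,647


def wrap_i32(value: int) -> int:
    """Reduce an integer to the signed 32-bit range, imitating wrapping arithmetic."""
    return (value + 2**31) % 2**32 - 2**31


def add_count(counts: dict, pair: tuple, n: int) -> None:
    """Add n occurrences of a pair to a counter that wraps like an i32 (hypothetical helper)."""
    counts[pair] = wrap_i32(counts.get(pair, 0) + n)


counts = {}
add_count(counts, ("t", "h"), I32_MAX)  # a very common pair, sitting exactly at the limit
add_count(counts, ("q", "z"), 5)        # a rare pair
add_count(counts, ("t", "h"), 1)        # one more occurrence pushes the count past the limit

# Choose the next merge the way a frequency-based trainer would: the highest count wins.
best = max(counts, key=counts.get)

print(counts[("t", "h")])  # the common pair's count is now negative
print(best)                # so the rare pair is picked instead
```

Once the count goes negative, the most frequent pair loses to every pair with a positive count, which matches the symptom of common merges going missing.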
We can't claim anything about ToaST's performance until we retrain the baselines and rerun the evaluation, and we expect that much of ToaST's apparent advantage came from the broken baselines. ToaST itself runs through our own code, and those numbers are correct.
There are two different ways that the Hugging Face WordPiece implementation can produce <UNK> tokens even with ByteLevel pretokenization. A nice blog post from Stéphan Tulkens explains how to fix one of them, in response to a question of mine.
stephantul.github.io/blog/better-...
Unfortunately, we're withdrawing our paper "Tokenization with Split Trees" from arXiv. All our baseline tokenizers — BPE, WordPiece, and Unigram — were trained incorrectly because of a bug in the Hugging Face tokenizers library, so every comparison to ToaST in the paper is invalid.
I’m at @colmweb.org this week in Montreal. Come see our BoundlessBPE paper in the Wed morning poster session. I'd love to talk to anyone else here, especially about tokenization. #COLM2025
The other is that there isn't a way to specify an initial vocabulary with all 256 bytes including the continuation character ##. See github.com/huggingface/.... So in short, if you use their WordPiece you might get <UNK> tokens.
arxiv.org/abs/2605.22705
arxiv.org/abs/2605.22821
Happy Linear Programming for Tokenization day! I was involved with two separate papers that hit arXiv yesterday, using LPs to find the vocabulary maximizing compression, depending on the kind of inference you want to use.