So the most common merge pairs are never used, and tokens like " the" and " a" were never learned. Our baselines used 175 GB of CulturaX English text; the bug occurs after around 108 GB. We are very grateful to Sander Land for identifying that the baselines were missing common merge pairs.
TokShop will be at #COLM2026! 🗓️ October 9th, 2026 📍 San Francisco, USA More details and a call for papers coming soon.
Our new paper reformulates tokenisation as a linear program (LP), which we solve to get SOTA tokenisers 😁 As a bonus, this LP tells us how close to optimal any tokeniser is! Check it out 👇 w/ J. Tempus, @philipwitti.bsky.social, @craigschmidt.com, D. Komm Paper: arxiv.org/abs/2605.22821
The bug is Hugging Face tokenizers issue #2058 (github.com/huggingface/...): the library stores merge-pair counts as i32 values in a hash map, and once any pair's count exceeds 2^31 − 1 = 2,147,483,647 the counter wraps to a negative value, so that pair is never selected as a merge.
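To make the failure mode concrete, here is a minimal Python sketch of a signed 32-bit counter wrapping. It is not the library's code: `wrap_i32`, `best_pair`, and the toy counts are hypothetical names and values chosen for illustration, and picking the highest-count pair is a simplified stand-in for BPE merge selection.

```python
I32_MAX = 2**31 - 1  # 2,147,483,647, the largest value an i32 can hold


def wrap_i32(n: int) -> int:
    """Reduce an integer to the two's-complement signed 32-bit range."""
    return (n + 2**31) % 2**32 - 2**31


def best_pair(counts: dict) -> tuple:
    """Return the pair with the highest count (simplified merge selection)."""
    return max(counts, key=counts.get)


# Toy counts: a very frequent pair sitting exactly at the i32 limit,
# next to a rare pair. Both pairs are made up for this example.
counts = {(" t", "he"): I32_MAX, ("q", "z"): 5}
before = best_pair(counts)

# One more occurrence pushes the frequent pair past the limit.
counts[(" t", "he")] = wrap_i32(counts[(" t", "he")] + 1)
after = best_pair(counts)

print(before, after, counts[(" t", "he")])
```

After the wrap, the frequent pair has a negative count and loses to the rare one. That matches the symptom described in these posts: the most common pairs stop being chosen as merges.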
We can't claim anything about ToaST's performance until we retrain the baselines and rerun the evaluation, and we expect a lot of ToaST's apparent advantage was really just broken baselines. ToaST itself runs through our own code, and those numbers are correct.
There are two different ways that the Hugging Face WordPiece implementation can produce <UNK> tokens even with ByteLevel pretokenization. A nice blog post from Stéphan Tulkens, written in response to a question of mine, explains how to fix one of them. stephantul.github.io/blog/better-...
Unfortunately, we're withdrawing our paper "Tokenization with Split Trees" from arXiv. All our baseline tokenizers — BPE, WordPiece, and Unigram — were trained incorrectly because of a bug in the Hugging Face tokenizers library, so every comparison to ToaST in the paper is invalid.
I’m at @colmweb.org this week in Montreal. Come see our BoundlessBPE paper in the Wed morning poster session. Love to talk to anyone else here, especially about tokenization. #COLM2025
The other is that there isn't a way to specify an initial vocabulary containing all 256 bytes, including their variants with the continuation prefix ##. See github.com/huggingface/.... So in short, if you use their WordPiece you might get <UNK> tokens.
arxiv.org/abs/2605.22705 arxiv.org/abs/2605.22821 Happy Linear Programming for Tokenization day! I was involved with two separate papers that hit arXiv yesterday, using LPs to find the vocabulary that maximizes compression, depending on the kind of inference you want to use.