@craigschmidt.com has a second paper using LPs for tokenisation coming out today as well! Check it out: bsky.app/profile/crai...

arxiv.org/abs/2605.22705 arxiv.org/abs/2605.22821 Happy Linear Programming for Tokenization day! I was involved with two separate papers that hit arXiv yesterday, using LPs to find the vocabulary maximizing compression, depending on the kind of inference you want to use.

Our new paper reformulates tokenisation as a linear program (LP), which we solve to get SOTA tokenisers 😁 As a bonus, this LP tells us how close to optimal any tokeniser is! Check it out 👇 w/ J. Tempus, @philipwitti.bsky.social, @craigschmidt.com, D. Komm Paper: arxiv.org/abs/2605.22821

Thrilled to share my first paper! 📄 We prove optimal tokenization is NP-hard on bounded alphabets (like bytes), and even on unary alphabets for direct tokenization! Big thanks @tpimentel.bsky.social, @philipwitti.bsky.social & Dennis Komm for the mentorship! Best birthday gift. 🎂 arxiv.org/abs/2511.15709

This was joint work with @vkastreva.bsky.social, @philipwitti.bsky.social, D. Komm! Violeta is a super smart student, who is definitely gonna do lots more interesting work :) It's her first paper, and it's also her birthday today 🥳 so follow her if you like this! Paper: arxiv.org/abs/2511.15709

More precisely, we show that: (i) for binary alphabets, not only is finding an optimal tokeniser NP-hard, but so is finding arbitrarily good approximations; (ii) for unary alphabets, finding an optimal direct tokeniser is NP-hard!

Tokenisers are a vital part of LLMs, but how hard is it to find an optimal one? 🤔 Considering arbitrarily large alphabets, prior work showed this is NP-hard. But what if we use bytes instead? Or unary strings like a, aa, aaa, ...? In our new paper, we show this is still hard, NP-hard!

Interested in provable guarantees and fundamental limitations of XAI? Join us at the "Theory of Explainable AI" workshop Dec 2 in Copenhagen! @ellis.eu @euripsconf.bsky.social Speakers: @jessicahullman.bsky.social @doloresromerom.bsky.social @tpimentel.bsky.social Call for Contributions: Oct 15

Interested in language models, brains, and concepts? Check out our COLM 2025 🔦 Spotlight paper! (And if you’re at COLM, come hear about it on Tuesday – sessions Spotlight 2 & Poster 2)!