arxiv.org/abs/2605.22705
arxiv.org/abs/2605.22821
Happy Linear Programming for Tokenization day! I was involved with two separate papers that hit arXiv yesterday. Both use LPs to find the compression-maximizing vocabulary for whichever kind of inference you want to use.
We introduce Tokenization with Split Trees (ToaST), a subword tokenization method that directly optimizes compression under a new recursive inference procedure. ToaST greedily splits each pretoken int...