Unfortunately, we're withdrawing our paper "Tokenization with Split Trees" from arXiv. All our baseline tokenizers — BPE, WordPiece, and Unigram — were trained incorrectly because of a bug in the Hugging Face tokenizers library, so every comparison to ToaST in the paper is invalid.
As a result of this bug, the most common merge pairs were never used, and tokens like " the" and " a" were never learned. Our baselines used 175 GB of CulturaX English text; the bug is triggered after around 108 GB. We are very grateful to Sander Land for identifying that the baselines were missing common merge pairs.