Unfortunately, we're withdrawing our paper "Tokenization with Split Trees" from arXiv. All our baseline tokenizers (BPE, WordPiece, and Unigram) were trained incorrectly because of a bug in the Hugging Face tokenizers library, so every comparison to ToaST in the paper is invalid.
As a result, the most common merge pairs were never used, and tokens like " the" and " a" were never learned. Our baselines were trained on 175 GB of CulturaX English text; the bug is triggered after around 108 GB. We are very grateful to Sander Land for identifying that the baselines were missing common merge pairs.
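For anyone who wants to sanity-check their own baselines, here is a minimal sketch of the kind of check that would flag this symptom: it looks for very common tokens in a trained vocabulary. The function name `missing_common_tokens`, the list of expected tokens, and the two toy vocabularies are all hypothetical and not taken from the paper or from the tokenizers library. It also assumes the vocabulary stores leading spaces literally; byte-level vocabularies that use a marker such as "Ġ" would need the expected tokens mapped first.

```python
def missing_common_tokens(vocab, expected=(" the", " a", " of", " and")):
    """Return the expected tokens that are absent from the vocabulary.

    `vocab` is any iterable of token strings. The default `expected`
    tuple is an illustrative assumption, not an authoritative list.
    """
    vocab_set = set(vocab)
    return [tok for tok in expected if tok not in vocab_set]


# Toy vocabularies for illustration only (not real tokenizer output).
healthy = ["t", "h", "e", " ", "a", " t", "he", " the", " a", " of", " and"]
broken = ["t", "h", "e", " ", "a", "he", " of", " and"]

print(missing_common_tokens(healthy))  # []
print(missing_common_tokens(broken))   # [' the', ' a']
```

A non-empty result for a tokenizer trained on a large English corpus would be a strong hint that something went wrong during training.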