Autoregressive models produce text by predicting the histogram of the next token. Random auxiliary variables determine the token to choose.
Parallel Token Prediction directly learns what external randomness maps to which token. This allows to predict many tokens at once.