🚀 By *learning* to compress the KV cache in Transformer LLMs, we can generate more tokens for the same compute budget.
This unlocks *inference-time hyper-scaling*:
For the same runtime or memory load, we can boost LLM accuracy by pushing reasoning even further!
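
To make the memory side of this concrete, here is a rough back-of-the-envelope sketch of how many tokens fit in a fixed KV cache budget with and without compression. The model dimensions, the 8 GiB budget, and the 4x compression ratio are illustrative assumptions, not numbers from this post, and the helper names are hypothetical.

```python
# Back-of-the-envelope KV cache arithmetic (illustrative assumptions only).

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes of KV cache one token occupies: keys + values across all layers."""
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem


def max_tokens(budget_bytes, per_token_bytes, compression_ratio=1.0):
    """Tokens that fit in the budget if the cache is shrunk by compression_ratio."""
    return int(budget_bytes * compression_ratio // per_token_bytes)


# Hypothetical model: 32 layers, 8 KV heads, head dim 128, fp16 cache.
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
budget = 8 * 1024**3  # assumed 8 GiB reserved for the KV cache

baseline_tokens = max_tokens(budget, per_token)
compressed_tokens = max_tokens(budget, per_token, compression_ratio=4.0)  # assumed 4x

print(f"KV cache per token: {per_token / 1024:.0f} KiB")
print(f"Tokens without compression: {baseline_tokens}")
print(f"Tokens with 4x compression: {compressed_tokens}")
```

Under these assumptions, the same memory holds four times as many tokens, which is the room that longer reasoning can use.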