
While effective for chess ♟️, Elo ratings struggle with LLM evaluation due to volatility and transitivity issues. A new post, written in collaboration with AI Singapore, explores why Elo falls short for AI leaderboards and how we can do better.