I am humbled to join this excellent team and work on delivering the highest-quality human-preference LLM evals! ⚔️⚔️⚔️
I've been following this project since 2023, when it first showed up in my Google Scholar notifications for papers that cite Elo, and I had fun experimenting with its data and contributing to its open source code before it was a company.
Then I spent another hour debugging the data for NaNs, nulls, and corruption until I realized that it actually was Simpson's paradox.
Just ran into Simpson's paradox in the wild for the first time lol. Was looking at some data and was like "that doesn't look right, all the means went up when all I did was assign groups differently, this is like Simpson's paradox or something lol"
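For anyone who wants to see how "all the means went up" can happen from regrouping alone, here is a toy sketch. The numbers, the group names, and the `regroup` helper are all made up for illustration and are not the data from the post. It moves one value from the high group to the low group and compares the group means and the overall mean before and after.

```python
from statistics import mean

# Made-up numbers for illustration only; not the data from the post.
group_a = [1, 2, 3, 4]
group_b = [5, 6, 7, 8, 9]


def regroup(a, b, value):
    """Return copies of a and b with `value` moved from b to a (hypothetical helper)."""
    new_b = list(b)
    new_b.remove(value)
    return list(a) + [value], new_b


# Reassign the smallest member of group_b to group_a. No value changes.
new_a, new_b = regroup(group_a, group_b, 5)

# Each tuple is (mean of group A, mean of group B, mean of everything pooled).
before = (mean(group_a), mean(group_b), mean(group_a + group_b))
after = (mean(new_a), mean(new_b), mean(new_a + new_b))

print("before:", before)
print("after: ", after)
```

The moved value sits above the old mean of the low group and below the old mean of the high group, so both group means rise while the pooled mean stays exactly where it was. Nothing is corrupted; only the group boundaries changed.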