We are a community of researchers developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations.
https://evalevalai.com/
EvalEval Coalition
Read the full announcement: evalevalai.com/infrastructu...
Shared Task: evalevalai.com/events/share...
Project Webpage: evalevalai.com/projects/eve...
#AIEvaluation #EvalEval
🤔 Consider the scenario:
LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU
Same model. Same benchmark name. Different prompts, settings, extraction methods.
💡 Which score is right? Both? Neither? We can't compare. 🤷
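
To see how the extraction method alone can move a score, here is a minimal sketch. It is not how HELM or lm-eval-harness actually parse answers; the outputs, gold labels, and both extractor functions are hypothetical, made up purely for illustration.

```python
import re

# Hypothetical raw responses to four multiple-choice questions (not real model data).
outputs = ["The answer is B.", "B", "(C) because ...", "I think A, or maybe D"]
gold = ["B", "B", "C", "D"]

def extract_strict(text):
    # Assumed rule 1: accept only a bare letter A-D as the entire response.
    m = re.fullmatch(r"\s*([A-D])\s*", text)
    return m.group(1) if m else None

def extract_lenient(text):
    # Assumed rule 2: accept the first standalone letter A-D anywhere in the response.
    m = re.search(r"\b([A-D])\b", text)
    return m.group(1) if m else None

def accuracy(extract):
    # Fraction of responses whose extracted letter equals the gold label.
    hits = sum(extract(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)

strict_acc = accuracy(extract_strict)
lenient_acc = accuracy(extract_lenient)
print(f"strict: {strict_acc:.2f}  lenient: {lenient_acc:.2f}")
```

Same responses, same gold labels, and the two rules still disagree on the score. Prompts and generation settings add further variation on top of this.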