We are a researcher community developing scientifically grounded research outputs and robust deployment infrastructure for broader impact evaluations. https://evalevalai.com/

EvalEval Coalition

Read the full announcement: evalevalai.com/infrastructu...
Shared Task: evalevalai.com/events/share...
Project Webpage: evalevalai.com/projects/eve...
#AIEvaluation #EvalEval

🤔 Consider the scenario:
LLaMA 65B scored 0.637 on HELM's MMLU
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU
Same model. Same benchmark name. Different prompts, settings, extraction methods.
💡 Which score is right? Both? Neither? We can't compare. 🤷
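
To make the "extraction methods" point concrete, here is a toy sketch in Python. Everything in it is an assumption for illustration: the completions and gold answers are made up, and `extract_strict` and `extract_lenient` are hypothetical functions, not the actual extraction logic of HELM or lm-eval-harness. It only shows that the same model outputs can yield very different accuracy depending on how the answer letter is pulled out.

```python
import re

# Made-up raw completions for four multiple-choice questions (not real model output).
completions = [
    "B",
    "The answer is C.",
    "A) because ...",
    "I think the answer is (D)",
]
gold = ["B", "C", "A", "D"]


def extract_strict(text):
    """Hypothetical strict rule: accept only a completion that is exactly one letter A-D."""
    text = text.strip()
    return text if text in {"A", "B", "C", "D"} else None


def extract_lenient(text):
    """Hypothetical lenient rule: take the first standalone capital A-D anywhere in the text."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


def accuracy(extract):
    """Fraction of completions whose extracted letter equals the gold letter."""
    correct = sum(extract(c) == g for c, g in zip(completions, gold))
    return correct / len(gold)


print("strict :", accuracy(extract_strict))
print("lenient:", accuracy(extract_lenient))
```

Same outputs, same gold labels, two scores. Prompt wording and decoding settings can move the number in a similar way, which is why a score reported without its evaluation setup is hard to compare.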