
🤔 Consider the scenario:

LLaMA 65B scored 0.637 on HELM's MMLU.
LLaMA 65B scored 0.488 on lm-eval-harness's MMLU.

Same model. Same benchmark name. Different prompts, settings, and extraction methods.

💡 Which score is right? Both? Neither? We can't compare. 🤷
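
To see how the answer-extraction step alone can move a score, here is a toy sketch in Python. Everything in it is an assumption made for illustration: the outputs and gold labels are invented, and `extract_strict` and `extract_lenient` are hypothetical rules, not the actual logic of HELM or lm-eval-harness.

```python
import re

# Invented model outputs and gold labels for five multiple-choice questions.
# These are NOT real benchmark data.
outputs = ["B", "The answer is C.", " A", "(D) because ...", "I think it's B"]
gold = ["B", "C", "A", "D", "B"]


def extract_strict(text):
    """Hypothetical rule: the output must be exactly one letter A-D."""
    return text if text in {"A", "B", "C", "D"} else None


def extract_lenient(text):
    """Hypothetical rule: take the first standalone capital letter A-D."""
    match = re.search(r"\b([ABCD])\b", text)
    return match.group(1) if match else None


def accuracy(extract):
    """Fraction of outputs whose extracted answer matches the gold label."""
    hits = sum(extract(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)


strict_score = accuracy(extract_strict)
lenient_score = accuracy(extract_lenient)
print(f"strict:  {strict_score:.3f}")
print(f"lenient: {lenient_score:.3f}")
```

The same five outputs get two different accuracies depending only on the extraction rule, so a score is hard to interpret unless the prompt, settings, and extraction method are reported with it.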