To learn more:
Website: agentcoma.github.io
Preprint: arxiv.org/abs/2508.19988

A big thanks to my brilliant coauthors Lihu Chen, Ana Brassard, @joestacey.bsky.social, @rahmanidashti.bsky.social and @marekrei.bsky.social!

Note: We welcome submissions to the #AgentCoMa leaderboard from researchers 🚀

AgentCoMa is an Agentic Commonsense and Math benchmark where each compositional task requires both commonsense and mathematical reasoning to be solved. The tasks are set in real-world scenarios:…

We have released #AgentCoMa, an agentic reasoning benchmark where each task requires a mix of commonsense and math to be solved 🧐 LLM agents performing real-world tasks should be able to combine these different types of reasoning, but are they fit for the job? 🤔 🧵⬇️

In contrast, we find that:
- LLMs perform relatively well on compositional tasks of similar difficulty when all steps require the same type of reasoning.
- Non-expert humans with no calculator or internet can solve the tasks in #AgentCoMa as accurately as the individual steps.