PhD student @ImperialCollege. Research Scientist Intern @Meta prev. @Cohere, @GoogleAI. Interested in generalisable learning and reasoning. She/her
lisaalaz.github.io
Lisa Alazraki
At #NeurIPS2025 today, @lisaalaz.bsky.social is presenting our joint paper on Reverse Engineering Human Preferences with Reinforcement Learning! Demonstrating undetectable attacks on LLM-as-a-judge benchmarks. Great collaboration with @cohereforai.bsky.social and a well-deserved NeurIPS spotlight!
To learn more:
Website: agentcoma.github.io
Preprint: arxiv.org/abs/2508.19988
A big thanks to my brilliant coauthors Lihu Chen, Ana Brassard, @joestacey.bsky.social, @rahmanidashti.bsky.social and @marekrei.bsky.social!
Note: We welcome submissions to the #AgentCoMa leaderboard from researchers
So why do LLMs perform poorly on the apparently simple tasks in #AgentCoMa?
We find that tasks combining different reasoning types are a relatively unseen pattern for LLMs, which leads the models to produce contextual hallucinations when presented with mixed-type compositional reasoning.
We test AgentCoMa on 61 contemporary LLMs of different sizes, including reasoning models (both SFT and RL-tuned). While the LLMs perform well on commonsense and math reasoning in isolation, they are far less effective at solving AgentCoMa tasks that require their composition!
We have released #AgentCoMa, an agentic reasoning benchmark where each task requires a mix of commonsense and math to be solved
LLM agents performing real-world tasks should be able to combine these different types of reasoning, but are they fit for the job?
🧵⬇️
We also observe that LLMs fail to activate all the relevant neurons when they attempt to solve the tasks in AgentCoMa. Instead, they mostly activate neurons relevant to only one reasoning type, likely as a result of single-type reasoning patterns reinforced during training.
In contrast, we find that:
- LLMs perform relatively well on compositional tasks of similar difficulty when all steps require the same type of reasoning.
- Non-expert humans with no calculator or internet can solve the tasks in #AgentCoMa as accurately as the individual steps.