Lisa Alazraki
PhD student @ImperialCollege. Research Scientist Intern @Meta, previously @Cohere and @GoogleAI. Interested in generalisable learning and reasoning. She/her. lisaalaz.github.io
At #NeurIPS2025 today, @lisaalaz.bsky.social is presenting our joint paper on Reverse Engineering Human Preferences with Reinforcement Learning! Demonstrating undetectable attacks on LLM-as-a-judge benchmarks. Great collaboration with @cohereforai.bsky.social and a well-deserved NeurIPS spotlight!
6mo
We have released #AgentCoMa, an agentic reasoning benchmark where each task requires a mix of commonsense and math to be solved 🧐 LLM agents performing real-world tasks should be able to combine these different types of reasoning, but are they fit for the job? 🤔 🧵⬇️
We test #AgentCoMa on 61 contemporary LLMs of different sizes, including reasoning models (both SFT and RL-tuned). While the LLMs perform well on commonsense and math reasoning in isolation, they are far less effective at solving AgentCoMa tasks that require their composition!
In contrast, we find that:
- LLMs perform relatively well on compositional tasks of similar difficulty when all steps require the same type of reasoning.
- Non-expert humans with no calculator or internet can solve the tasks in #AgentCoMa as accurately as the individual steps.
So why do LLMs perform poorly on the apparently simple tasks in #AgentCoMa? We find that tasks combining different reasoning types are a relatively unseen pattern for LLMs, which leads the models to produce contextual hallucinations when presented with mixed-type compositional reasoning.
We also observe that LLMs fail to activate all the relevant neurons when they attempt to solve the tasks in #AgentCoMa. Instead, they mostly activate neurons relevant to only one reasoning type, likely as a result of single-type reasoning patterns reinforced during training.
To learn more:
Website: agentcoma.github.io
Preprint: arxiv.org/abs/2508.19988
A big thanks to my brilliant coauthors Lihu Chen, Ana Brassard, @joestacey.bsky.social, @rahmanidashti.bsky.social and @marekrei.bsky.social! Note: We welcome submissions to the #AgentCoMa leaderboard from researchers 🚀
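The comparison the thread describes, accuracy on the individual steps versus accuracy on the composed task, can be sketched roughly as follows. This is only an illustration: the `TaskResult` record, the `compositionality_gap` helper and the toy results are hypothetical and are not taken from the paper or the benchmark's code.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """Hypothetical per-task record; field names are assumptions."""
    commonsense_correct: bool  # commonsense step answered correctly in isolation
    math_correct: bool         # math step answered correctly in isolation
    composed_correct: bool     # full two-step task answered correctly


def compositionality_gap(results):
    """Return (both-steps accuracy, composed accuracy, gap between them)."""
    if not results:
        raise ValueError("results must not be empty")
    n = len(results)
    # Fraction of tasks where both steps are solved when asked separately.
    both_steps = sum(r.commonsense_correct and r.math_correct for r in results) / n
    # Fraction of tasks solved when the two steps must be combined.
    composed = sum(r.composed_correct for r in results) / n
    return both_steps, composed, both_steps - composed


# Toy, made-up results purely to exercise the function.
results = [
    TaskResult(True, True, True),
    TaskResult(True, True, False),
    TaskResult(True, False, False),
    TaskResult(True, True, False),
]

both_steps, composed, gap = compositionality_gap(results)
print(f"both steps: {both_steps:.2f}, composed: {composed:.2f}, gap: {gap:.2f}")
```

A large positive gap would correspond to the pattern reported in the thread: a model that handles each reasoning type on its own but fails when the two must be combined.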
Posted by Marek Rei (NeurIPS announcement) and Lisa Alazraki (#AgentCoMa thread, 9mo).