Check out our new paper investigating phenomena (hallucination, refusal, and sycophancy) both externally and internally, and showing a high correlation between the two!
Check out our new paper, ManagerBench👔, which evaluates whether LLM agents prefer achieving their goal or avoiding human harm
ManagerBench was accepted to #ICLR2026🎉
Check it out⬇️
Adi Simhi
ManagerBench was accepted to ICLR! @iclr-conf.bsky.social #ICLR2026
LLMs are still either unsafe or completely harm-avoidant, even when the harm affects furniture 🛋️
Check out our benchmark, online or in Rio 🇧🇷
🤔What happens when LLM agents choose between achieving their goals and avoiding harm to humans in realistic management scenarios? Are LLMs pragmatic, or do they prefer to avoid human harm?
🚀 New paper out: ManagerBench: Evaluating the Safety-Pragmatism Trade-off in Autonomous LLMs🚀🧵
How does an LLM’s past influence its future?🤔
In new work, led by @adisimhi.bsky.social, together with @fbarez.bsky.social, @boknilev.bsky.social, and Shay Cohen, we find that conversational history creates a latent "geometric trap" that makes old habits, e.g. hallucinations, hard to break!