(2/5) Using matched triplets (ampicillin → dimicillin → dimiglimto), we isolate the effect of the affix.
Models, especially larger and medical-tuned ones, treat fake drugs as real far more often than they do nonce names.
Even in names like “tablecillin,” the affix still drives the prediction.
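To make the triplet idea concrete, here is a minimal sketch of how one matched triplet could be assembled. It assumes the affix is a suffix; `make_triplet` and its arguments are hypothetical names for illustration, not the paper's actual pipeline.

```python
def make_triplet(real_name, affix, fake_stem, nonce_ending):
    """Build a (real, fake, nonce) triplet.

    real  : an existing drug name that ends in the affix
    fake  : made-up stem + real affix (keeps the morphological cue)
    nonce : same made-up stem + an ending with no affix (removes the cue)

    Assumption: the affix is a suffix. All names here are illustrative.
    """
    if not real_name.endswith(affix):
        raise ValueError(f"{real_name!r} does not end with affix {affix!r}")
    fake = fake_stem + affix
    nonce = fake_stem + nonce_ending
    return real_name, fake, nonce


# Reproduces the example triplet from the thread.
print(make_triplet("ampicillin", "cillin", "dimi", "glimto"))
```

Because the fake and nonce names share a stem, any gap in how often a model accepts them can be attributed to the affix rather than to the rest of the string.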
Hello world 👋
My first paper at UT Austin!
We ask: what happens when medical “evidence” fed into an LLM is wrong? Should your AI stay faithful, or should it play it safe when the evidence is harmful?
We show that frontier LLMs accept counterfactual medical evidence at face value. 🧵
Setup (2/4)
We introduce MedCounterFact, a counterfactual medical QA dataset built on RCT-based evidence synthesis.
– Replace real interventions in evidence with nonce, mismatched medical, non-medical, or toxic terms
– Evaluate 9 frontier LLMs under evidence-grounded prompts
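The substitution step above can be sketched roughly as follows. This is only an illustration under assumptions: the evidence sentence, the replacement terms, and the `swap_intervention` helper are all made up here and are not taken from MedCounterFact.

```python
import re


def swap_intervention(evidence, real, replacement):
    """Replace every whole-word mention of `real` in `evidence`.

    Matching is case-insensitive. Raises if the intervention never
    appears, so a silent no-op cannot pass as a counterfactual.
    """
    pattern = re.compile(r"\b" + re.escape(real) + r"\b", flags=re.IGNORECASE)
    new_text, count = pattern.subn(replacement, evidence)
    if count == 0:
        raise ValueError(f"{real!r} not found in evidence")
    return new_text


# Hypothetical replacement terms, one per category named in the post.
replacements = {
    "nonce": "zorvex",
    "mismatched_medical": "insulin",
    "non_medical": "chocolate",
    "toxic": "bleach",
}

# Made-up evidence sentence, for illustration only.
evidence = "Patients receiving aspirin had fewer events than those on placebo."

for category, term in replacements.items():
    print(category, "->", swap_intervention(evidence, "aspirin", term))
```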
📄 Paper: arxiv.org/abs/2606.05616
💻 Github: github.com/KaijieMo-kj/...
w/ @kaijie-mo.bsky.social, Thomas Yang, @chantalsh.bsky.social, @qyao.bsky.social, William Rudman, @ramezkouzy.bsky.social, @kanishka.bsky.social, @byron.bsky.social, @jessyjli.bsky.social
(5/5)
(4/5) Where does the shortcut live?
Activation patching localizes it to early-mid layers (~2–10). For affix-class drugs, affixes alone reproduce most of the effect.
A single low-rank direction can flip fake-drug acceptance. Affix signals emerge early in training; holistic knowledge comes later.
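For readers unfamiliar with direction-based interventions, here is a generic linear-algebra sketch of the rank-1 case: removing or adding a single direction in a hidden-state vector. It is an assumption-laden toy in plain Python, not the authors' procedure; `ablate`, `steer`, and the example vectors are hypothetical.

```python
import math


def unit(v):
    """Return v scaled to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("direction must be non-zero")
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def ablate(hidden, direction):
    """Project out the component of `hidden` along `direction`."""
    d = unit(direction)
    coeff = dot(hidden, d)
    return [h - coeff * x for h, x in zip(hidden, d)]


def steer(hidden, direction, alpha):
    """Push `hidden` along `direction` by `alpha` (in units of the unit vector)."""
    d = unit(direction)
    return [h + alpha * x for h, x in zip(hidden, d)]


# Toy 2-d hidden state and direction (illustrative values only).
h = [3.0, 4.0]
d = [1.0, 0.0]
print(ablate(h, d))      # component along d removed
print(steer(h, d, 1.5))  # component along d increased
```

In a real model the same arithmetic would be applied to a layer's activations, with the direction estimated from data.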
“Dimicillin” isn’t real. We made it up. Yet many LLMs still call it an antibiotic.
Across 9 models and 653 drugs, we find that drug-name affixes alone can drive pharmacological reasoning. Models often rely on morphology over facts.
We trace this shortcut from behavior to mechanism. 🧵
(3/5) How much of a drug’s meaning is just its affix?
We decompose recognition into Affix, Stem, and Holistic signals. Many drugs are affix-driven, and models sometimes confuse drugs sharing the same affix.
Reliance varies by task and training exposure, with stronger shortcuts for rarer drugs.
Results (3/4)
– When given evidence, models adhere to it strongly and with high confidence, even for toxic or nonsensical interventions
– Implausibility awareness is transient; once evidence appears, models rarely flag problems
– Scaling, medical fine-tuning, and skeptical prompting offer little protection
The morphological form of a word can often give cues to its meaning, but purely relying on these mappings can lead to overgeneralization in high-stakes domains. In the medical domain, for instance, LL...
In high-stakes domains like medicine, it may be generally desirable for models to faithfully adhere to the context provided. But what happens if the context does not align with model priors or safety ...