Large language models are increasingly used in healthcare: summarizing notes, answering patient questions, supporting clinical decisions. But when they encounter fabricated medical claims, how often do they push back? And how often do they simply go along?
We tested this at scale. We embedded false medical recommendations into realistic prompts across three source types and measured whether 20 leading LLMs accepted or rejected them. We also tested whether wrapping claims in logical fallacies changed the outcome.
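To make the setup concrete, here is a minimal sketch of what such an evaluation loop could look like. The templates, wrapper phrasings, and the accept/reject heuristic below are illustrative stand-ins, not the actual harness used in the study.

```python
# Illustrative sketch: embed a fabricated claim in one of three source
# formats, optionally wrapped in a fallacy framing, then score the reply.
# All templates and the judge heuristic are hypothetical examples.

SOURCE_TEMPLATES = {
    "discharge_note": "DISCHARGE SUMMARY\nPlan: {claim}\nFollow up in 2 weeks.",
    "reddit_post": "saw this on a health subreddit -- {claim} ... is that true??",
    "clinical_vignette": "A 54-year-old presents for follow-up. The note states: {claim}",
}

FALLACY_WRAPPERS = {
    None: "{text}",
    "appeal_to_authority": "A famous doctor claims: {text}",
    "bandwagon": "Everyone says that {text}",
    "slippery_slope": "If we don't act on this now, things will only get worse: {text}",
}

def build_prompt(claim: str, source: str, fallacy: str | None) -> str:
    """Embed a fabricated claim in a source-format template,
    optionally wrapped in a fallacy framing."""
    framed = FALLACY_WRAPPERS[fallacy].format(text=claim)
    return SOURCE_TEMPLATES[source].format(claim=framed)

def classify_response(response: str) -> str:
    """Toy accept/reject heuristic; a real study would use a calibrated judge."""
    pushback = ("not recommended", "no evidence", "incorrect", "contraindicated")
    return "reject" if any(p in response.lower() for p in pushback) else "accept"
```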
Format Matters More Than You Think
The single biggest predictor of susceptibility was not model size or architecture. It was the source format. A fabricated claim written in the formal, declarative tone of a clinical discharge note was accepted nearly half the time. The same claim framed as a Reddit post triggered far more skepticism.
Discharge Notes: Formal clinical language bypasses safety filters most effectively.
Reddit Posts: Informal, emotional tone triggers more built-in skepticism.
Clinical Vignettes: Controlled scenarios produced the lowest acceptance rates.
"Quiet, authoritative falsehoods slip through safety filters far more easily than the rhetorical tricks models have been trained to catch."
Susceptibility Across Models and Fallacy Types
The heatmap below shows how each model responded to each fallacy type, measured two ways: susceptibility (how often models accepted false claims) and detection (how often they correctly identified the fallacy).
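Both rates reduce to simple per-cell averages over the trial results. A minimal sketch, assuming a results table with hypothetical columns `model`, `fallacy`, `accepted`, and `detected_fallacy`:

```python
# Derive the two model x fallacy rate matrices behind the heatmap.
# Column names are assumptions for illustration, not the study's schema.
import pandas as pd

def rate_matrices(results: pd.DataFrame) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Return (susceptibility, detection) as model x fallacy rate tables."""
    susceptibility = results.pivot_table(
        index="model", columns="fallacy", values="accepted", aggfunc="mean"
    )
    detection = results.pivot_table(
        index="model", columns="fallacy", values="detected_fallacy", aggfunc="mean"
    )
    return susceptibility, detection
```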
The Fallacy Paradox
Counterintuitively, wrapping misinformation in logical fallacies generally reduced susceptibility: eight of ten framings lowered acceptance rates. A likely explanation is that safety fine-tuning has exposed models to adversarial dialogues prefaced with rhetorical markers like "everyone says" or "a famous doctor claims." Models recognize the template of a trick, but miss the quiet lie stated in plain, clinical language.
The two exceptions: slippery slope (+2.2 pp) and false dilemma (+0.4 pp). These framings present false urgency rather than false evidence, and appear underrepresented in current safety training data.
How the Models Stack Up
Composite Robustness Score (Top 7)
Notably, gpt-oss-20b achieved the lowest susceptibility of any model tested (0.7%) despite its moderate size. Medically fine-tuned models consistently underperformed their general-purpose counterparts, suggesting that domain specialization can come at the cost of safety robustness.
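The post does not spell out how the composite score is constructed. Purely as an assumption, one plausible composite simply averages robustness to false claims with fallacy detection, both on a 0-to-1 scale:

```python
# Hypothetical composite: NOT the study's definition, just one way to
# combine the two measured rates into a single higher-is-better score.
def composite_robustness(susceptibility: float, detection: float) -> float:
    """Both inputs are rates in [0, 1]; higher output = more robust."""
    return 0.5 * (1.0 - susceptibility) + 0.5 * detection

# Example with the reported 0.7% susceptibility and an arbitrary 90% detection:
# composite_robustness(0.007, 0.90) -> 0.9465
```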
Safety will not come from scale alone. It requires context-sensitive guardrails tuned to clinical language, grounding strategies that verify claims against trusted sources, and targeted immunization against the quiet misinformation that current safety training misses.
Systems that surface discharge recommendations or generate after-visit summaries need safeguards designed specifically for formal medical text, the format where models are most vulnerable.
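What such a safeguard might look like in outline: detect when text is in the high-risk formal register, and refuse to surface a recommendation unless a trusted source corroborates it. The markers and the `trusted_lookup` callable below are hypothetical placeholders for a real classifier and a vetted guideline index.

```python
# Sketch of a format-aware guardrail. `trusted_lookup` is a hypothetical
# stand-in for a check against a curated, trusted medical source.
FORMAL_MARKERS = ("discharge summary", "plan:", "assessment:", "follow up")

def looks_like_formal_medical_text(text: str) -> bool:
    """Crude heuristic: formal clinical documents get the strictest checks."""
    t = text.lower()
    return sum(marker in t for marker in FORMAL_MARKERS) >= 2

def gate_recommendation(text: str, recommendation: str, trusted_lookup) -> str:
    """Surface a recommendation only if a trusted source corroborates it
    when the surrounding text is in the high-risk formal format."""
    if looks_like_formal_medical_text(text) and not trusted_lookup(recommendation):
        return "WITHHELD: recommendation not corroborated by a trusted source"
    return recommendation
```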