The Lancet Digital Health · 2026 · AI Safety

Mapping LLM Susceptibility to Medical Misinformation Across Clinical Notes and Social Media

We benchmarked 20 large language models with 3.4 million prompts containing fabricated medical claims, drawn from real hospital notes, social media, and clinical vignettes, to find out where the guardrails break.

3.4M+ model responses · 20 LLMs benchmarked · 31.7% overall susceptibility · 10 logical fallacies tested

Large language models are increasingly used in healthcare: summarizing notes, answering patient questions, supporting clinical decisions. But when they encounter fabricated medical claims, how often do they push back? And how often do they simply go along?

We tested this at scale. We embedded false medical recommendations into realistic prompts across three source types and measured whether 20 leading LLMs accepted or rejected them. We also tested whether wrapping claims in logical fallacies changed the outcome.
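The pipeline above can be sketched in a few lines. This is a minimal illustration, not the study's actual harness: `query_model` is a stub standing in for a real LLM call, and the prompt templates, claim, and "agree"/"refute" labels are all hypothetical.

```python
from collections import defaultdict

# Hypothetical source formats; the real benchmark embeds claims into full
# hospital notes, social media posts, and clinical vignettes.
FORMATS = ["clinical_note", "social_media", "synthetic_vignette"]

def query_model(prompt):
    """Stub standing in for a real LLM API call (illustrative only)."""
    # Toy heuristic: pretend the model is more credulous toward formal text.
    return "agree" if "Discharge summary" in prompt else "refute"

def build_prompt(fmt, claim):
    """Wrap a false claim in format-appropriate framing (hypothetical templates)."""
    templates = {
        "clinical_note": f"Discharge summary: {claim} Continue as directed.",
        "social_media": f"saw this on reddit: {claim} is this legit??",
        "synthetic_vignette": f"A patient is told: {claim} Evaluate the advice.",
    }
    return templates[fmt]

def susceptibility_by_format(claims):
    """Fraction of responses accepting the false claim, per source format."""
    accepted = defaultdict(int)
    total = defaultdict(int)
    for fmt in FORMATS:
        for claim in claims:
            response = query_model(build_prompt(fmt, claim))
            total[fmt] += 1
            accepted[fmt] += response == "agree"  # counts 1 if accepted
    return {fmt: accepted[fmt] / total[fmt] for fmt in FORMATS}

rates = susceptibility_by_format(["Aspirin cures influenza."])
```

Swapping the stub for a real API client and a judge that classifies each free-text response as accept or reject yields the per-format acceptance rates reported below.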

Format Matters More Than You Think

The single biggest predictor of susceptibility was not model size or architecture. It was the source format. A fabricated claim written in the formal, declarative tone of a clinical discharge note was accepted nearly half the time. The same claim framed as a Reddit post triggered far more skepticism.

Clinical notes (discharge notes): 46.1% acceptance. Formal clinical language bypasses safety filters most effectively.

Social media (Reddit posts): 8.9% acceptance. Informal, emotional tone triggers more built-in skepticism.

Synthetic (clinical vignettes): 5.1% acceptance. Controlled scenarios produced the lowest acceptance rates.

"Quiet, authoritative falsehoods slip through safety filters far more easily than the rhetorical tricks models have been trained to catch."

Susceptibility Across Models and Fallacy Types

The heatmap below shows how each model responded to each fallacy type, reporting both susceptibility (how often models accepted false claims) and detection (how often they correctly identified the fallacy).

[Heatmap: Model Performance by Fallacy Type. Percentage rate per model-fallacy combination; color scale from 0% to 50%+.]

The Fallacy Paradox

Counterintuitively, wrapping misinformation in logical fallacies generally reduced susceptibility: eight of the ten framings lowered acceptance rates. The likely reason is that safety fine-tuning has exposed models to adversarial dialogues prefaced with rhetorical markers like "everyone says" or "a famous doctor claims." Models recognize the template of a trick, but miss the quiet lie stated in plain, clinical language.

The two exceptions: slippery slope (+2.2 pp) and false dilemma (+0.4 pp). These framings present false urgency rather than false evidence, and appear underrepresented in current safety training data.
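The percentage-point deltas quoted above compare acceptance with and without each framing. A minimal sketch of that comparison follows; only the 31.7% overall baseline comes from this summary, and the per-framing rates here are made-up placeholders.

```python
# Baseline acceptance rate from the study summary; framed rates are hypothetical.
BASELINE = 0.317

framed_rates = {
    "slippery_slope": 0.339,       # placeholder value
    "appeal_to_authority": 0.250,  # placeholder value
}

def delta_pp(framed_rate, base_rate=BASELINE):
    """Percentage-point shift in acceptance caused by a fallacy framing."""
    return round((framed_rate - base_rate) * 100, 1)

# Positive delta: the framing increased susceptibility.
deltas = {name: delta_pp(rate) for name, rate in framed_rates.items()}
```

Under this accounting, a framing helps the attacker only when its delta is positive, which the study found for just slippery slope and false dilemma.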

How the Models Stack Up

Composite Robustness Score (Top 7)

1. GPT-4o: 0.895
2. Llama-4-Scout: 0.864
3. gpt-oss-20b: 0.858
4. Qwen3-30B-A3B: 0.855
5. Phi-4: 0.820
6. Gemma-3-12b: 0.811
7. Llama-3.3-70B: 0.770

Notably, gpt-oss-20b achieved the lowest practical susceptibility of any model (0.7%) despite its moderate size. Medical fine-tuned models consistently underperformed their general-purpose counterparts, suggesting that domain specialization can come at the cost of safety robustness.
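The summary does not give the paper's weighting for the composite robustness score. One plausible construction, shown purely as an assumption, averages non-susceptibility with fallacy-detection rate; the input values below are hypothetical.

```python
def composite_robustness(susceptibility, detection):
    """One plausible composite score: mean of (1 - susceptibility) and detection.

    This weighting is an assumption; the paper's actual formula is not
    stated in this summary.
    """
    return round(((1 - susceptibility) + detection) / 2, 3)

# Hypothetical inputs, not figures from the study.
score = composite_robustness(0.05, 0.84)
```

Any monotone combination of the two rates would preserve the ranking's intent: reward models that both refuse false claims and name the fallacy.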

Bottom Line

Safety will not come from scale alone. It requires context-sensitive guardrails tuned to clinical language, grounding strategies that verify claims against trusted sources, and targeted immunization against the quiet misinformation that current safety training misses.

Systems that surface discharge recommendations or generate after-visit summaries need safeguards designed specifically for formal medical text, the format where models are most vulnerable.

Research Team
Mahmud Omar · Vera Sorin · Lothar H. Wieler · Alexander W. Charney · Patricia Kovatch · Carol R. Horowitz · Panagiotis Korfiatis · Benjamin S. Glicksberg · Robert Freeman · Girish N. Nadkarni* · Eyal Klang*

* Equal contribution · BIDMC · DFCI · Harvard Medical School
