Adversarial Hallucination Attacks on Clinical LLMs

The Adversarial Framework

Clinical prompts can carry fabricated content, whether through deliberate manipulation or accidental errors. A misspelled lab value, an invented syndrome, or a fictitious radiological sign can slip into a query unnoticed. Because LLMs tend to treat every token as ground truth, a single planted detail can propagate into unsafe orders or misleading clinical advice.

This study systematically tested that vulnerability. Each of 300 clinical vignettes contained exactly one fabricated element: a fictitious laboratory test (e.g., "Serum Neurostatin"), a non-existent physical or radiological sign (e.g., "Cardiac Spiral Sign on echocardiography"), or an invented disease or syndrome (e.g., "Faulkenstein Syndrome"). Cases were presented in short (50-60 words) and long (90-100 words) formats, with identical medical content.

100

Fabricated Labs

Fictitious laboratory tests with invented reference ranges embedded in clinical notes

100

Fabricated Signs

Non-existent physical or radiological findings planted in exam descriptions

100

Invented Conditions

Made-up diseases or syndromes with fictional clinical descriptions

How Often Models Took the Bait

Under default settings with no safety prompt, the overall hallucination rate across all models was 65.9%. The mitigation prompt, which instructed models to use only validated information and flag uncertainty, brought the average down to 44.2%. Setting temperature to zero offered no meaningful improvement (66.5%), confirming that stochastic sampling is not the primary driver of adversarial hallucinations.

Case length mattered modestly. Short vignettes triggered hallucinations 67.6% of the time compared to 64.1% for longer ones (OR 1.22, p = 0.003), suggesting that less surrounding context may reduce the model's ability to identify planted anomalies.

Models hallucinated on planted clinical fabrications in up to 83% of outputs. Prompt engineering halved the error rate for the best-performing model, but no approach eliminated errors entirely.

Hallucination Rates by Model

Long format Short format

Higher values indicate higher hallucination rates. Hover for details.

Model Performance Ranking

Under default conditions, GPT-4o exhibited the lowest hallucination rate and achieved perfect agreement with two independent physicians on a 200-sample validation audit. At the other end, Distilled-DeepSeek-R1 hallucinated in over 80% of cases. The gap between these two models was substantial (OR 8.41, p = 0.0001), and other open-source models like Phi-4 (OR 7.12) and gemma-2-27b-it (OR 3.11) also performed significantly worse than GPT-4o.

#ModelHallucination BarRate

1GPT-4o

51.7%

2Llama-3.3-70B

58.5%

3Qwen-2.5-72B

65.2%

4gemma-2-27b-it

73.5%

5Phi-4

78.8%

6Distilled-DeepSeek-R1

81.3%

Can Prompt Engineering Fix This?

The mitigation prompt instructed models to rely only on clinically validated information and to acknowledge uncertainty rather than speculate. This reduced the overall hallucination rate from 65.9% to 44.2% (OR 0.27, p = 0.00002). GPT-4o responded most effectively, dropping from about 53% to roughly 21-25%. However, even with mitigation, no model fell below 20% consistently.

Temperature adjustments were largely ineffective. Setting temperature to zero produced a 66.5% hallucination rate, nearly identical to the default 65.9% (OR 1.05, p = 0.58). This suggests adversarial hallucinations are a structural issue, not a consequence of sampling randomness.

Temperature zero offered no benefit. Prompt engineering was the only intervention that meaningfully reduced adversarial hallucination rates, though it still left substantial residual risk.

Confronting Real-World Misinformation

Beyond fabricated clinical details, the study tested five widely circulated public health claims: the purported link between vaccines and autism, 5G in COVID-19, natural immunity versus vaccination, microwave ovens and cancer, and the laboratory origin of COVID-19. Models were prompted with standardized vignettes requiring JSON-formatted explanations.

In 43 of 45 runs across three models, responses were classified as non-hallucinated, meaning the models correctly identified the claims as unsubstantiated. Only two runs (both from GPT-4o on the natural immunity claim) produced hallucinations by endorsing the claim without adequately addressing the risks of forgoing vaccination. This suggests that while LLMs generally handle well-known misinformation, edge cases remain.

Why This Matters for Clinical AI

LLM hallucinations are not limited to inventing plausible-sounding text. In a clinical context, they can mean validating a fabricated lab result, describing the implications of a non-existent radiological sign, or providing treatment pathways for an invented syndrome. The study found that models often elaborated confidently on the planted falsehood, generating detailed but entirely fictitious clinical reasoning.

The marked contrast between Distilled-DeepSeek-R1 and its base LLaMA-3.3-70B checkpoint is particularly revealing. Despite identical parameter counts, distillation or RLHF pipelines may unintentionally amplify adversarial vulnerability, underscoring the need for optimization-aware safety testing. Additionally, only 1.2% of non-hallucinations under the base prompt were reclassified as hallucinations when mitigation was applied, confirming that the mitigation prompt does not introduce new errors.

Key Takeaway

Adversarial hallucinations are a structural vulnerability in clinical LLMs, not a sampling artifact

Six models hallucinated on planted clinical fabrications in 50 to 83 percent of cases. Prompt engineering was the only effective mitigation, cutting rates roughly in half, but no strategy eliminated errors. Temperature adjustments had no effect. These findings demonstrate that deploying LLMs for clinical decision support without robust input validation and human oversight carries substantial patient safety risk.

Authors

Mahmud Omar Vera Sorin Jeremy D. Collins David Reich Robert Freeman Nicholas Gavin Alexander W. Charney Lisa Stump Nicola Luigi Bragazzi Girish N. Nadkarni Eyal Klang

Mount Sinai Health System · Mayo Clinic · Hasso Plattner Institute · Ludwig-Maximilians-University

Read Full Paper → All Publications