Upgrading the Model, Not the Problem
OpenAI announced GPT-5 with "built-in thinking," new health evaluations, and explicit marketing toward medical use, and public communications suggested that individuals could go directly to GPT-5 for medical advice. That positioning makes it essential to re-test safety on fixed, clinically grounded pipelines whenever a model is upgraded.
This study re-ran the same validated pipelines previously used on GPT-4o. Five hundred emergency department vignettes were each tested across 32 versions (a control plus 31 sociodemographic labels). The model answered four clinical decision points: triage priority, further testing, treatment level, and mental-health assessment. Scoring used the same multiple-testing controls and physician ground truth from the original evaluation.
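A minimal sketch of such a counterfactual pipeline is below. It is not the authors' code: it assumes a caller-supplied query_model wrapper around whatever chat API is under test, and it lists only a hypothetical subset of the 31 sociodemographic labels plus a control.

```python
# Sketch of a counterfactual vignette pipeline (illustrative, not the study's code).
# query_model(prompt) -> str is assumed to be supplied by the caller.
from itertools import product

DECISION_POINTS = [
    "triage priority",
    "further testing",
    "treatment level",
    "mental-health assessment",
]

# Hypothetical subset; the study used one control plus 31 labels.
SOCIODEMOGRAPHIC_LABELS = [None, "Black transgender woman", "unhoused", "low income", "high income"]

def build_prompt(vignette: str, label: str | None, decision: str) -> str:
    """Prepend the sociodemographic label (if any) and ask one decision point."""
    context = f"Patient sociodemographic context: {label}. " if label else ""
    return f"{context}{vignette}\n\nQuestion: What is the appropriate {decision}?"

def run_pipeline(vignettes: list[str], query_model) -> list[dict]:
    """Query every vignette x label x decision combination and collect answers."""
    results = []
    for vignette, label, decision in product(vignettes, SOCIODEMOGRAPHIC_LABELS, DECISION_POINTS):
        answer = query_model(build_prompt(vignette, label, decision))
        results.append({"vignette": vignette, "label": label or "control",
                        "decision": decision, "answer": answer})
    return results
```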
Bias Patterns Persist
GPT-5 exhibited the same systematic patterns as GPT-4o. Treatment escalation varied by demographics: the model recommended higher levels of care (outpatient to observation to ward to ICU) for certain groups despite identical clinical content. The largest variations were in mental-health screening, where several historically marginalized groups were flagged in 100% of runs.
Black unhoused individuals, multiple LGBTQIA+ identities, and unhoused populations overall were recommended for screening every single time, regardless of clinical presentation. Triage urgency shifts were smaller but consistent, peaking at +7.4% for Black transgender women. Testing choice followed a socioeconomic gradient: low-income groups received less advanced testing (-7.0%), while high-income groups received more (+2.2%).
Intersectional identities consistently ranked highest for escalation and mental-health recommendations while receiving less advanced testing, indicating concentrated effects not explained by the clinical presentation.
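One way to quantify such shifts is to compare each labeled group's recommendation rate against the control arm and correct for multiple comparisons. The sketch below assumes binary per-run outcomes (for example, mental-health screening recommended or not) and uses a Benjamini-Hochberg correction purely for illustration; the study's exact statistical procedure is not restated here.

```python
# Illustrative scoring step: per-group rate deltas vs. control with FDR correction.
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

def rate_deltas(counts: dict[str, tuple[int, int]]) -> dict[str, dict]:
    """counts maps group -> (num_flagged, num_runs); a 'control' key must be present."""
    ctrl_flagged, ctrl_runs = counts["control"]
    groups, pvals, out = [], [], {}
    for group, (flagged, runs) in counts.items():
        if group == "control":
            continue
        delta = flagged / runs - ctrl_flagged / ctrl_runs  # e.g. +0.074 for triage shifts
        _, p = proportions_ztest([flagged, ctrl_flagged], [runs, ctrl_runs])
        groups.append(group)
        pvals.append(p)
        out[group] = {"delta": delta}
    reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")
    for g, p, r in zip(groups, p_adj, reject):
        out[g].update({"p_adj": float(p), "significant": bool(r)})
    return out
```

Groups flagged in 100% of runs make the z-test degenerate, which is itself worth reporting alongside the raw rate difference.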
Adversarial Hallucinations Got Worse
Under standard prompts, GPT-5 elaborated on planted fabrications in 65% of runs (95% CI 61.1-68.7%), compared to 53% for GPT-4o. The same mitigation prompt used in the earlier work reduced this to 7.67% (95% CI 5.16-11.24%), an odds ratio of roughly 22 between the unmitigated and mitigated conditions, demonstrating that guardrails are highly effective but do not eliminate risk entirely.
The operational truth is simple: without enforced mitigation, the model readily propagates false chart elements. With mitigation, residual error remains non-zero. Baseline risk appeared worse in GPT-5 than GPT-4o, while the mitigation prompt achieved a lower floor.
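The reported odds ratio can be sanity-checked from the two rates alone, assuming it compares the odds of elaborating on a fabrication without versus with the mitigation prompt:

```python
# Quick arithmetic check of the ~22 odds ratio from the reported rates.
def odds(p: float) -> float:
    return p / (1 - p)

baseline, mitigated = 0.65, 0.0767
odds_ratio = odds(baseline) / odds(mitigated)
print(f"OR ≈ {odds_ratio:.1f}")  # ≈ 22, consistent with the reported value
```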
What This Means for LLM Governance
Based on a typical emergency department volume of roughly 50,000 annual visits, these bias patterns could translate to approximately 1,200 additional mental-health referrals and 800 inappropriate treatment escalations per year. The adversarial results point the same way: any model update that alters clinical output distributions should trigger re-evaluation on standardized benchmarks.
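Those projections scale linearly with volume, so a department can plug in its own census. A quick sketch follows, with the per-visit excess rates back-computed from the figures above; they are implied by this article's numbers, not independently derived.

```python
# Scale the article's implied per-visit excess rates to a different annual volume.
def projected_impact(annual_visits: int,
                     excess_referral_rate: float = 1_200 / 50_000,   # 2.4% of visits
                     excess_escalation_rate: float = 800 / 50_000    # 1.6% of visits
                     ) -> tuple[float, float]:
    """Return (extra mental-health referrals, inappropriate escalations) per year."""
    return (annual_visits * excess_referral_rate,
            annual_visits * excess_escalation_rate)

# Example: a smaller ED seeing 20,000 visits per year.
print(projected_impact(20_000))  # ≈ (480.0, 320.0)
```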
Responsibility should be shared. Developers must disclose the scope of each update and run internal safety suites. Independent researchers should maintain fixed benchmarks that developers cannot optimize against. Domain-specific models such as MedGemma may exhibit different bias profiles, but fine-tuning does not inherently eliminate sociodemographic variation. The findings reinforce the need for continuous, automated, and model-agnostic safety benchmarking as a standing requirement for any LLM deployed in clinical contexts.
A newer model does not mean a safer model.
GPT-5 reproduced the same sociodemographic biases found in GPT-4o and showed higher baseline adversarial hallucination rates. Mitigation prompts help substantially but do not eliminate risk. Every model upgrade needs independent re-evaluation on fixed clinical benchmarks before deployment.