Upgrading the Model, Not the Problem
OpenAI announced GPT-5 with "built-in thinking," new health evaluations, and explicit marketing toward medical use, and public communications suggested that individuals could go directly to GPT-5 for medical advice. That positioning makes it essential to re-test safety on fixed, clinically grounded pipelines whenever a model is upgraded.
This study re-ran the same validated pipelines previously used on GPT-4o. Five hundred emergency department vignettes were each tested across 32 versions (a control plus 31 sociodemographic labels). The model answered four clinical decision points: triage priority, further testing, treatment level, and mental-health assessment. Scoring used the same multiple-testing controls and physician ground truth from the original evaluation.
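A minimal sketch of such a counterfactual pipeline is below. It is not the authors' code: it assumes a caller-supplied query_model wrapper around whatever chat API is under test, and it lists only a hypothetical subset of the 31 sociodemographic labels plus a control.

```python
# Sketch of a counterfactual vignette pipeline (illustrative, not the study's code).
# query_model(prompt) -> str is assumed to be supplied by the caller.
from itertools import product

DECISION_POINTS = [
    "triage priority",
    "further testing",
    "treatment level",
    "mental-health assessment",
]

# Hypothetical subset; the study used one control plus 31 labels.
SOCIODEMOGRAPHIC_LABELS = [None, "Black transgender woman", "unhoused", "low income", "high income"]

def build_prompt(vignette: str, label: str | None, decision: str) -> str:
    """Prepend the sociodemographic label (if any) and ask one decision point."""
    context = f"Patient sociodemographic context: {label}. " if label else ""
    return f"{context}{vignette}\n\nQuestion: What is the appropriate {decision}?"

def run_pipeline(vignettes: list[str], query_model) -> list[dict]:
    """Query every vignette x label x decision combination and collect answers."""
    results = []
    for vignette, label, decision in product(vignettes, SOCIODEMOGRAPHIC_LABELS, DECISION_POINTS):
        answer = query_model(build_prompt(vignette, label, decision))
        results.append({"vignette": vignette, "label": label or "control",
                        "decision": decision, "answer": answer})
    return results
```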
Bias Patterns Persist
GPT-5 exhibited the same systematic patterns as GPT-4o. Treatment escalation varied by demographics: the model recommended higher levels of care (outpatient to observation to ward to ICU) for certain groups despite identical clinical content. The largest variations were in mental-health screening, where several historically marginalized groups were flagged in 100% of runs.
Black unhoused individuals, multiple LGBTQIA+ identities, and unhoused populations overall were recommended for screening every single time, regardless of clinical presentation. Triage urgency shifts were smaller but consistent, peaking at +7.4% for Black transgender women. Testing choice followed a socioeconomic gradient: low-income groups received less advanced testing (-7.0%), while high-income groups received more (+2.2%).
Intersectional identities consistently ranked highest for escalation and mental-health recommendations while receiving less advanced testing, indicating concentrated effects not explained by the clinical presentation.
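One way to quantify such shifts is to compare each labeled group's recommendation rate against the control arm and correct for multiple comparisons. The sketch below assumes binary per-run outcomes (for example, mental-health screening recommended or not) and uses a Benjamini-Hochberg correction purely for illustration; the study's exact statistical procedure is not restated here.

```python
# Illustrative scoring step: per-group rate deltas vs. control with FDR correction.
from statsmodels.stats.proportion import proportions_ztest
from statsmodels.stats.multitest import multipletests

def rate_deltas(counts: dict[str, tuple[int, int]]) -> dict[str, dict]:
    """counts maps group -> (num_flagged, num_runs); a 'control' key must be present."""
    ctrl_flagged, ctrl_runs = counts["control"]
    groups, pvals, out = [], [], {}
    for group, (flagged, runs) in counts.items():
        if group == "control":
            continue
        delta = flagged / runs - ctrl_flagged / ctrl_runs  # e.g. +0.074 for triage shifts
        _, p = proportions_ztest([flagged, ctrl_flagged], [runs, ctrl_runs])
        groups.append(group)
        pvals.append(p)
        out[group] = {"delta": delta}
    reject, p_adj, _, _ = multipletests(pvals, method="fdr_bh")
    for g, p, r in zip(groups, p_adj, reject):
        out[g].update({"p_adj": float(p), "significant": bool(r)})
    return out
```

Groups flagged in 100% of runs make the z-test degenerate, which is itself worth reporting alongside the raw rate difference.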
Adversarial Hallucinations Got Worse
Under standard prompts, GPT-5 elaborated on planted fabrications in 65% of runs (95% CI 61.1-68.7%), compared to 53% for GPT-4o. The same mitigation prompt used in the earlier work reduced this to 7.67% (95% CI 5.16-11.24%), an odds ratio of roughly 22 between the unmitigated and mitigated conditions, demonstrating that guardrails are highly effective but do not eliminate risk entirely.
The operational truth is simple: without enforced mitigation, the model readily propagates false chart elements. With mitigation, residual error remains non-zero. Baseline risk appeared worse in GPT-5 than GPT-4o, while the mitigation prompt achieved a lower floor.
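The reported odds ratio can be sanity-checked from the two rates alone, assuming it compares the odds of elaborating on a fabrication without versus with the mitigation prompt:

```python
# Quick arithmetic check of the ~22 odds ratio from the reported rates.
def odds(p: float) -> float:
    return p / (1 - p)

baseline, mitigated = 0.65, 0.0767
odds_ratio = odds(baseline) / odds(mitigated)
print(f"OR ≈ {odds_ratio:.1f}")  # ≈ 22, consistent with the reported value
```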
What This Means for LLM Governance
Based on a typical emergency department volume of roughly 50,000 annual visits, these bias patterns could translate to approximately 1,200 additional mental-health referrals and 800 inappropriate treatment escalations per year. The adversarial results point the same way: any model update that alters clinical output distributions should trigger re-evaluation on standardized benchmarks.
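Those projections scale linearly with volume, so a department can plug in its own census. A quick sketch follows, with the per-visit excess rates back-computed from the figures above; they are implied by this article's numbers, not independently derived.

```python
# Scale the article's implied per-visit excess rates to a different annual volume.
def projected_impact(annual_visits: int,
                     excess_referral_rate: float = 1_200 / 50_000,   # 2.4% of visits
                     excess_escalation_rate: float = 800 / 50_000    # 1.6% of visits
                     ) -> tuple[float, float]:
    """Return (extra mental-health referrals, inappropriate escalations) per year."""
    return (annual_visits * excess_referral_rate,
            annual_visits * excess_escalation_rate)

# Example: a smaller ED seeing 20,000 visits per year.
print(projected_impact(20_000))  # ≈ (480.0, 320.0)
```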
Responsibility should be shared. Developers must disclose the scope of each update and run internal safety suites. Independent researchers should maintain fixed benchmarks that developers cannot optimize against. Domain-specific models such as MedGemma may exhibit different bias profiles, but fine-tuning does not inherently eliminate sociodemographic variation. The findings reinforce the need for continuous, automated, and model-agnostic safety benchmarking as a standing requirement for any LLM deployed in clinical contexts.
A newer model does not mean a safer model.
GPT-5 reproduced the same sociodemographic biases found in GPT-4o and showed higher baseline adversarial hallucination rates. Mitigation prompts help substantially but do not eliminate risk. Every model upgrade needs independent re-evaluation on fixed clinical benchmarks before deployment.