Nature Medicine · 2026 · Patient Safety

ChatGPT Health Performance in a Structured Test of Triage Recommendations

A stress test of OpenAI's consumer health tool across 60 clinical vignettes and 960 responses reveals an inverted U-shaped accuracy pattern: the system performs well on routine cases but fails at the extremes that matter most, missing emergencies and over-triaging non-urgent presentations.

960
Total Responses
21
Clinical Domains
51.6%
Emergencies Under-Triaged
64.8%
Non-Urgent Over-Triaged

Best in the Middle, Worst at the Edges

ChatGPT Health launched in January 2026 as OpenAI's consumer-facing health tool, reaching millions of users. It functions as a first-contact point for symptom guidance, without a clinician buffer. To test whether it fails safely, researchers submitted 60 clinician-authored vignettes spanning 21 medical domains under 16 factorial conditions each.
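The study's response count follows directly from its factorial design: each vignette was submitted under every combination of prompt conditions. A minimal sketch of that bookkeeping, assuming four binary factors (the factor labels below are illustrative assumptions, not the paper's exact design):

```python
from itertools import product

# Hypothetical factor labels -- the paper reports 16 prompt variants per
# vignette; the exact factors used here are illustrative assumptions.
factors = {
    "objective_findings": [True, False],
    "anchoring_cue": [True, False],
    "patient_race": ["A", "B"],
    "patient_gender": ["A", "B"],
}

variants = list(product(*factors.values()))
n_vignettes = 60

print(len(variants))                # 16 conditions per vignette
print(n_vignettes * len(variants))  # 960 total responses
```

Any 2×2×2×2 factorial structure yields the same arithmetic: 16 variants per vignette, 960 responses across 60 vignettes.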

Accuracy peaked for intermediate presentations: 93.0% for semi-urgent and 76.9% for urgent cases. But at the clinical extremes where errors carry the highest stakes, performance collapsed. Among true emergencies, 51.6% of cases were under-triaged. Among non-urgent presentations, 64.8% were over-triaged to scheduled physician visits.

93.0%
Semi-Urgent Accuracy
Best performance at intermediate acuity levels where clinical signals are clearest
52%
Emergencies Missed
Over half of true emergencies under-triaged to 24-48 hour evaluation
0/16
Crisis Guardrail Failures
Suicidal ideation with a specific method triggered no crisis intervention banner

Where It Breaks: Emergencies That Evolve

The four emergency vignettes tested asthma exacerbation and diabetic ketoacidosis (DKA), each with and without objective findings. Under-triage concentrated in asthma exacerbation, which accounted for 28 of 33 (84.8%) under-triaged emergency responses. The model identified the warning sign, noting elevated CO2, then rationalized it away, concluding that findings did not prove immediate respiratory failure.

In DKA, the model correctly identified early or mild DKA but recommended outpatient management, apparently conflating a condition that is by definition an emergency with simple hyperglycemia. A supplementary analysis of four textbook emergencies (stroke, anaphylaxis, meningitis, aortic dissection) showed 0% under-triage, suggesting the model identifies classic presentations but fails when emergency status depends on clinical progression.

The model recognized the warning signs in evolving emergencies, then talked itself out of escalating care. It identified rising CO2 in an asthma patient, then concluded it did not prove immediate respiratory failure.

The Inverted U: Accuracy by Acuity Level
Mis-triage rate and direction across the four triage levels. Clear cases only (n = 480 responses).
Triage scale: A = non-urgent (monitor at home), B = semi-urgent (see doctor within weeks), C = urgent (see doctor within 24-48h), D = emergency (go to ED). Data from Figure 1.

Anchoring and the Power of Framing

Of eight pre-specified hypothesis tests, only anchoring significantly affected triage. When family or friends minimized symptoms, triage shifted in edge cases (OR 11.7, 95% CI 3.7-36.6), with 52.5% of shifts moving toward less urgent care. This effect appeared only in ambiguous clinical scenarios, not in clear-cut cases.

Patient race and gender showed no significant effects on triage recommendations, though confidence intervals were wide. Adding objective findings (lab values, vital signs) improved overall accuracy from 54.6% to 77.9% (OR 9.4, P < 0.001). For non-urgent cases, objective findings prevented over-triage by 61 percentage points. But for emergencies, the pattern reversed: objective findings increased under-triage by 9.3 percentage points.
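Effects like these are typically reported as odds ratios with Wald confidence intervals derived from a 2×2 table. A minimal sketch of that calculation, using hypothetical counts (the numbers below are illustrative, not the study's data):

```python
import math

# Hypothetical 2x2 counts (illustrative only -- not the study's data):
# rows = anchoring cue present / absent, cols = triage shifted / not shifted.
a, b = 30, 10   # cue present: shifted / not shifted
c, d = 20, 60   # cue absent:  shifted / not shifted

odds_ratio = (a * d) / (b * c)

# Wald 95% CI on the log-odds scale.
se_log_or = math.sqrt(1/a + 1/b + 1/c + 1/d)
lo = math.exp(math.log(odds_ratio) - 1.96 * se_log_or)
hi = math.exp(math.log(odds_ratio) + 1.96 * se_log_or)

print(f"OR = {odds_ratio:.1f}, 95% CI {lo:.1f}-{hi:.1f}")
```

The wide interval reported for anchoring (3.7 to 36.6) is characteristic of this method when cell counts are small: the standard error of the log odds ratio grows as any cell shrinks.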

The Crisis Guardrail Problem

The most consequential safety failure involved suicidal ideation vignettes. In one scenario, a 27-year-old reported thoughts about taking a lot of pills. When objective findings were included, the crisis intervention banner linking to the 988 Suicide and Crisis Lifeline appeared in all 16 prompt variants. When objective findings were removed, leaving identical clinical severity, the banner appeared in none.

Across five additional suicidal ideation scenarios (224 total responses), the crisis interstitial fired in only four of 14 vignettes. The remaining ten produced no safety alert in any variant. The pattern was paradoxically inverted relative to clinical severity: among three scenarios with an identified method, only one of six triggered the banner. The guardrail fired more reliably when patients had not identified a means of self-harm.

Bottom Line

Consumer health AI that misses emergencies and fires crisis guardrails unpredictably is not ready for deployment at scale

ChatGPT Health performs well on routine presentations but fails where the stakes are highest. Over half of true emergencies were under-triaged, and crisis intervention banners activated inconsistently across suicidal ideation scenarios. Consumer-facing AI that functions as a front door for urgent medical decisions should demonstrate that it fails safely on emergencies before widespread public deployment.

Research Team
Ashwin Ramaswamy, Alvira Tyagi, Hannah Hugo, Joy Jiang, Pushkala Jayaraman, Mateen Jangda, Alexis E. Te, Steven A. Kaplan, Joshua Lampert, Robert Freeman, Nicholas Gavin, Ashutosh K. Tewari, Ankit Sakhuja, Bilal Naved, Alexander W. Charney, Mahmud Omar, Michael A. Gorin, Eyal Klang, Girish N. Nadkarni
Icahn School of Medicine at Mount Sinai · NYC Health + Hospitals · University of Miami Miller School of Medicine