Identical Cases, Different Decisions
Every case had the same chief complaint, vital signs, and clinical details. The only element that changed was a sociodemographic label: race, gender identity, sexual orientation, socioeconomic status, or an intersectional combination. Each of the 1,000 cases was rendered in 32 such variations, and every variation was run through nine models, producing over 1.7 million total responses.
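The design is a full cross product: every clinical vignette is paired with every demographic variation and sent to every model. Here is a minimal sketch of how such a grid might be assembled; the label list, prompt wording, and model names are illustrative assumptions, not the study's actual materials:

```python
from itertools import product

# Illustrative labels only: the study used 32 sociodemographic variations
# per case, including intersectional combinations and an unlabelled control.
LABELS = [
    "control",                      # no sociodemographic label
    "Black", "low income", "high income", "unhoused",
    "transgender woman", "Black transgender woman",
    "Black and unhoused",
    # ...remaining variations omitted in this sketch
]

MODELS = ["model_a", "model_b"]     # stand-ins for the nine models evaluated

def build_prompt(case: dict, label: str) -> str:
    """Hold every clinical fact fixed; vary only the sociodemographic label."""
    prefix = "" if label == "control" else f"The patient is {label}. "
    return (f"{prefix}Chief complaint: {case['complaint']}. "
            f"Vitals: {case['vitals']}.")

cases = [
    {"complaint": "acute chest pain", "vitals": "BP 150/90, HR 104"},
]

# One prompt per (case, label, model) cell of the grid.
grid = [(build_prompt(c, lab), m) for c, lab, m in product(cases, LABELS, MODELS)]
print(len(grid), "prompts in this toy grid")
```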
The models answered four clinical questions for each case: triage priority, further diagnostic testing, treatment approach (outpatient vs. inpatient), and whether a mental health assessment was needed. Across all four, demographic labels shifted recommendations in consistent, clinically unjustified directions.
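One way to picture the per-case output is a small record with one field per question. The field names and value sets below are assumptions for illustration, since the paper defines its own response categories:

```python
from dataclasses import dataclass

@dataclass
class CaseRecommendation:
    """One model's answers to the four clinical questions for one case variant."""
    triage_priority: str            # e.g. "immediate", "urgent", "non-urgent"
    diagnostic_testing: str         # e.g. "none", "basic", "advanced imaging (CT/MRI)"
    treatment_setting: str          # "outpatient" or "inpatient"
    mental_health_assessment: bool  # whether an evaluation was recommended
```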
Who Gets Flagged, Who Gets Tested
Mental health assessment showed the largest disparities. Cases labelled Black transgender women, Black transgender men, or Black and unhoused all drew mental health evaluation recommendations at rates above 79%; the control group sat far below. Two expert physicians found many of these referrals unwarranted, with LLM recommendation rates reaching approximately seven times the physician-derived baseline.
More broadly, cases labelled as belonging to LGBTQIA+ subgroups were recommended mental health evaluations at six to seven times the rate two board-certified physicians judged clinically appropriate.
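To make the magnitude concrete, the gap is a simple rate ratio against the physician-derived baseline. The numbers below are hypothetical, chosen only to be consistent with the reported 79%+ rates and the roughly sevenfold excess:

```python
llm_rate = 0.79             # observed referral rate for a flagged subgroup
physician_baseline = 0.11   # hypothetical physician-judged appropriate rate

ratio = llm_rate / physician_baseline
print(f"LLM referral rate is {ratio:.1f}x the physician baseline")  # ~7.2x
```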
For diagnostic testing, the pattern inverted along socioeconomic lines. High-income cases received significantly more recommendations for advanced imaging such as CT and MRI (P < 0.001), while low- and middle-income cases were more often limited to basic testing or to no testing at all. For treatment approach, cases labelled as unhoused or as Black and unhoused received the highest rates of inpatient recommendations.
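A significance claim like P < 0.001 for the imaging gap could come from a contingency-table test. Below is a sketch with fabricated counts (the study's actual test and data may differ), using SciPy's chi-square test of independence:

```python
from scipy.stats import chi2_contingency

# Hypothetical counts of advanced imaging vs. basic/no testing, by income label.
#                  advanced  basic/none
contingency = [
    [620, 380],   # high-income case variants
    [480, 520],   # low-/middle-income case variants
]

chi2, p, dof, expected = chi2_contingency(contingency)
print(f"chi2 = {chi2:.1f}, p = {p:.2e}")  # a gap this size yields p << 0.001
```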
Across All Nine Models
The biases were not isolated to a single architecture. Both proprietary and open-source models showed the same directional patterns. Variability scores, measuring how much each model's outputs shifted with demographic labels, ranged from 14% (GPT-4o) to 40% (Qwen2-7B).
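The paper's exact formula isn't given here, but one plausible reading of a variability score is the fraction of demographic variants whose recommendation differs from the same case's unlabelled control. A sketch under that assumption (the column names are mine):

```python
import pandas as pd

def variability_score(df: pd.DataFrame) -> float:
    """Fraction of labelled variants whose recommendation differs from the
    control (no-label) answer for the same case.

    Expects columns: case_id, label, recommendation, with exactly one
    'control' row per case_id. This is an assumed reading of the metric.
    """
    control = df[df["label"] == "control"].set_index("case_id")["recommendation"]
    variants = df[df["label"] != "control"]
    changed = (variants["recommendation"].to_numpy()
               != control.loc[variants["case_id"]].to_numpy())
    return float(changed.mean())

# Toy example: one case, control says outpatient, one of two variants shifts.
toy = pd.DataFrame({
    "case_id":        [1, 1, 1],
    "label":          ["control", "unhoused", "high income"],
    "recommendation": ["outpatient", "inpatient", "outpatient"],
})
print(variability_score(toy))  # 0.5
```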
Can Models Self-Correct?
When confronted with evidence of bias in their own outputs, models revised 66.7% of recommendations containing explicit bias (where the demographic label was directly cited as a reason). For implicit bias, where the label was not mentioned but the recommendation still shifted, only about 40% of cases were revised. Subtler forms of bias proved harder to correct, even with direct feedback.
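The explicit/implicit split suggests a simple operationalization: a shift is explicit if the model's rationale cites the demographic label, implicit if the recommendation moved without any mention of it. A rough heuristic sketch (the paper's actual coding procedure is likely more careful):

```python
def classify_shift(variant_rec: str, control_rec: str,
                   rationale: str, label: str) -> str:
    """Classify a recommendation shift as explicit or implicit bias."""
    if variant_rec == control_rec:
        return "no shift"
    if label.lower() in rationale.lower():
        return "explicit"   # label cited as a reason; 66.7% of these were revised
    return "implicit"       # silent shift; only ~40% of these were revised

print(classify_shift("inpatient", "outpatient",
                     "Given the patient is unhoused, admit for safety.",
                     "unhoused"))  # -> "explicit"
```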
LLM clinical recommendations shift with patient demographics, not clinical facts
Across 1.7 million responses from nine models, marginalized groups consistently received more urgent, more invasive, and more mental health-focused recommendations than clinically warranted. These patterns appeared in both proprietary and open-source models, exceeded physician-derived baselines severalfold, and persisted after statistical correction. Robust bias evaluation frameworks are needed before LLMs inform real clinical decisions.