Nature Medicine · 2025 · AI Bias & Equity

Sociodemographic Biases in Medical Decision Making by Large Language Models

Nine LLMs evaluated on 1,000 emergency department cases across 32 demographic variations produced systematically different clinical recommendations based on patient identity, not clinical need. LGBTQIA+ subgroups were flagged for mental health assessments six to seven times more often than physicians deemed appropriate.

1.7M+ model responses · 9 LLMs evaluated · 32 demographic variations · 6-7x mental health over-referral

Identical Cases, Different Decisions

Every case had the same chief complaint, vital signs, and clinical details. The only thing that changed was a sociodemographic label: race, gender identity, sexual orientation, socioeconomic status, or an intersectional combination. Each of the 1,000 cases was run through 32 variations across nine models, producing over 1.7 million total responses.

The models answered four clinical questions for each case: triage priority, further diagnostic testing, treatment approach (outpatient vs. inpatient), and whether a mental health assessment was needed. Across all four, demographic labels shifted recommendations in consistent, clinically unjustified directions.
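The variation grid described above can be sketched in a few lines. This is an illustrative reconstruction, not the study's actual harness: the case text, label list, and question wording are stand-ins for the paper's 1,000 vignettes, 32 sociodemographic variations, and four clinical questions.

```python
from itertools import product

# One clinical vignette; only the sociodemographic label changes between variants.
CASE = ("Chief complaint: chest pain for 2 hours. "
        "Vitals: HR 96, BP 142/88, SpO2 98% on room air.")

# Stand-ins for the paper's 32 demographic variations (plus a no-label control).
LABELS = [
    "a Black transgender woman",
    "a high-income patient",
    "an unhoused patient",
    "control (no label)",
]

# The four clinical questions posed for every case variant.
QUESTIONS = [
    "What triage priority do you assign?",
    "What further diagnostic testing, if any, do you recommend?",
    "Do you recommend outpatient or inpatient management?",
    "Is a mental health assessment indicated?",
]

def build_prompts(case, labels, questions):
    """Cross every demographic variant with every clinical question."""
    prompts = []
    for label, question in product(labels, questions):
        text = case if label == "control (no label)" else (
            f"The patient is {label}. " + case)
        prompts.append({"label": label, "question": question,
                        "prompt": text + "\n" + question})
    return prompts

prompts = build_prompts(CASE, LABELS, QUESTIONS)
print(len(prompts))  # 4 labels x 4 questions = 16 prompts for this one case
```

Scaled to the study's design, the same loop over 1,000 cases, 32 variations, and nine models yields the response corpus the paper analyzes.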

6-7x mental health flags: LGBTQIA+ patients were recommended mental health assessments far beyond clinical indication.
+6.5% testing gap: high-income cases received more recommendations for advanced imaging (CT, MRI).
+23.9% triage escalation: urgency recommendations for some marginalized groups exceeded the control baseline.

Who Gets Flagged, Who Gets Tested

Mental health assessment showed the largest disparities. Cases labelled as Black transgender women, Black transgender men, or Black and unhoused all exceeded 79% recommendation rates for mental health evaluation, while the control group sat far below. Two expert physicians judged many of these referrals unwarranted, with LLM rates reaching approximately seven times the physician-derived baseline.

Cases labelled as belonging to LGBTQIA+ subgroups were recommended mental health evaluations at rates six to seven times higher than what two board-certified physicians judged clinically appropriate.

For diagnostic testing, the pattern inverted along socioeconomic lines. High-income cases received significantly more recommendations for advanced imaging such as CT and MRI (P < 0.001). Low- and middle-income cases were more often limited to basic testing or none at all. In treatment approach, cases labelled as unhoused or as Black and unhoused received the highest rates of inpatient recommendations.

Disparities by Demographic Group
[Interactive chart: each demographic group shown above or below the control baseline. Mental Health %: percentage of model outputs recommending a mental health assessment. Invasiveness Score: mean score (0 to 2.33) across triage, testing, and treatment. Data from Figs 3, 4 and Table 1.]

Across All Nine Models

The biases were not isolated to a single architecture. Both proprietary and open-source models showed the same directional patterns. Variability scores, measuring how much each model's outputs shifted with demographic labels, ranged from 14% (GPT-4o) to 40% (Qwen2-7B).
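One plausible way to operationalize a variability score like this (the paper's exact formula may differ) is the share of a model's label-variant answers that diverge from the same case's control answer, averaged over cases:

```python
# Illustrative variability metric: fraction of non-control responses that
# differ from the control response for the same case. Data are toy values.
def variability_score(responses):
    """responses: list of (case_id, label, answer) tuples,
    with exactly one 'control' label per case."""
    control = {case: ans for case, lab, ans in responses if lab == "control"}
    diffs = total = 0
    for case, lab, ans in responses:
        if lab == "control":
            continue
        total += 1
        diffs += (ans != control[case])
    return diffs / total

toy = [
    (1, "control", "urgent"),  (1, "label_A", "urgent"),  (1, "label_B", "emergent"),
    (2, "control", "routine"), (2, "label_A", "routine"), (2, "label_B", "routine"),
]
print(variability_score(toy))  # 1 of 4 non-control answers differs -> 0.25
```

Under a metric like this, GPT-4o's 14% would mean roughly one in seven recommendations shifted when only the demographic label changed.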

#  Model              Variability Score
1  GPT-4o             14%
2  Llama-3.1-8B       17%
3  Phi-3.5-mini       19%
4  Llama-3.1-70B      21%
5  Gemma-2-27B        23%
6  Gemma-2-9B-it      25%
7  Phi-3-medium-128k  28%
8  Qwen2-72B          33%
9  Qwen2-7B           40%

Can Models Self-Correct?

When confronted with evidence of bias in their own outputs, models revised 66.7% of recommendations containing explicit bias, where the demographic label was directly cited as a reason. For implicit bias, where the label was never mentioned but the recommendation still shifted, only about 40% of cases were revised. Subtler forms of bias proved harder to correct, even with direct feedback.
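The explicit/implicit distinction above can be made concrete with a simple rule. This is our own illustrative heuristic, not the paper's annotation protocol: a shifted recommendation counts as explicit bias if the rationale cites the demographic label, implicit if the recommendation shifted but the label goes unmentioned.

```python
# Toy classifier for the explicit vs. implicit bias distinction (illustrative
# rule only; the study's labeling procedure is not specified here).
def classify_bias(label, control_answer, variant_answer, rationale):
    if variant_answer == control_answer:
        return "none"          # recommendation did not shift
    if label.lower() in rationale.lower():
        return "explicit"      # label cited as a reason for the shift
    return "implicit"          # shift present, label unmentioned

print(classify_bias("unhoused", "outpatient", "inpatient",
                    "Given the patient is unhoused, admit for observation."))
# -> explicit
print(classify_bias("unhoused", "outpatient", "inpatient",
                    "Admit for observation given the risk profile."))
# -> implicit
```

By this rule, the implicit cases are exactly the ones where the model's stated reasoning offers nothing to push back on, which is consistent with their lower revision rate.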

Bottom Line

LLM clinical recommendations shift with patient demographics, not clinical facts

Across 1.7 million responses from nine models, marginalized groups consistently received more urgent, more invasive, and more mental health-focused recommendations than clinically warranted. These patterns appeared in both proprietary and open-source models, exceeded physician baselines by multiples, and persisted after statistical correction. Robust bias evaluation frameworks are needed before LLMs inform real clinical decisions.

Research Team
Mahmud Omar, Shelly Soffer, Reem Agbareia, Nicola Luigi Bragazzi, Donald U. Apakama, Carol R. Horowitz, Alexander W. Charney, Robert Freeman, Benjamin Kummer, Benjamin S. Glicksberg, Girish N. Nadkarni, Eyal Klang
Icahn School of Medicine at Mount Sinai · Maccabi Healthcare Services · Hadassah Medical Center · LMU Munich