Pediatrics · 2026 · Pediatric AI Bias

Sociodemographic Variability in Pediatric Emergency Decisions by AI

An ensemble of 10 large language models answered the same pediatric emergency cases, each relabeled with 52 sociodemographic variants. Across 3.7 million model outputs, identical clinical scenarios produced markedly different recommendations depending on the label. The most surprising result: changing only the caregiver's label, with every detail of the child unchanged, shifted the child's care by nearly as much as relabeling the child.

3.7M+ model outputs · 10 LLMs tested · 1,000 pediatric cases · 52 demographic variants

From Adults to Children

Across our prior work, large language models swung clinical recommendations dramatically based on a patient's label. Ethnicity, identity, housing status: none of it clinically relevant to the case at hand. The obvious next question was whether the same patterns appear in pediatric emergencies, where decisions are urgent, caregivers carry weight, and the consequences last a lifetime.

This study evaluated an ensemble of 10 contemporary LLMs (both proprietary and open-source) on 500 validated synthetic vignettes and 500 deidentified real triage notes from Mount Sinai's pediatric emergency department. Each case was presented as a control, then replayed in 52 additional variants that inserted a single or intersectional sociodemographic modifier: 20 child descriptors and 32 caregiver descriptors covering race, ethnicity, immigration status, income, housing stability, gender identity, sexual orientation, and intersectional combinations.

For every case, models answered seven yes-or-no clinical questions: urgency, basic investigations, radiological imaging, hospital admission, suspicion of child maltreatment or abuse, social services involvement, and caregiver mental health assessment. Outputs were compared to a physician-derived ground truth built by two pediatric clinicians with high inter-rater agreement (Cohen's kappa 0.945).
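For readers unfamiliar with the agreement statistic, the following is a minimal sketch of how Cohen's kappa is computed for two raters. The function name and the ten example answers are hypothetical and are not study data; the study's own calculation may have been done differently.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(rater_a)
    # Observed agreement: share of items where both raters gave the same label.
    p_observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    freq_a, freq_b = Counter(rater_a), Counter(rater_b)
    p_chance = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if p_chance == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (p_observed - p_chance) / (1 - p_chance)

# Hypothetical yes/no answers (1 = yes) from two clinicians on ten items.
clinician_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
clinician_2 = [1, 1, 0, 0, 1, 0, 1, 0, 0, 0]
kappa = cohens_kappa(clinician_1, clinician_2)
print(round(kappa, 3))  # 0.8
```

In this toy example the raters agree on 9 of 10 items, but half of that agreement would be expected by chance, so kappa is 0.8 and not 0.9. A value of 0.945 therefore indicates agreement well beyond chance.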

1,000 pediatric cases: 500 validated synthetic vignettes plus 500 real Mount Sinai triage notes
7 clinical questions: urgency, labs, imaging, admission, maltreatment, social services, caregiver mental health
52 sociodemographic variants: 20 child descriptors and 32 caregiver descriptors covering identity, income, housing, and intersections
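The design counts above are consistent with the headline figure of roughly 3.7 million outputs. The short calculation below is a sketch under one assumption not spelled out in the text: that every model answered every question once for the control and once for each variant of every case.

```python
# Counts stated in the text.
N_CASES = 500 + 500            # synthetic vignettes + real triage notes
N_CHILD_VARIANTS = 20
N_CAREGIVER_VARIANTS = 32
N_MODELS = 10
N_QUESTIONS = 7                # the seven yes-or-no clinical questions

# Assumption: one control run plus one run per variant for each case,
# and one answer per question per model for each run.
versions_per_case = 1 + N_CHILD_VARIANTS + N_CAREGIVER_VARIANTS
prompts = N_CASES * versions_per_case
outputs = prompts * N_MODELS * N_QUESTIONS

print(versions_per_case)  # 53
print(prompts)            # 53000
print(outputs)            # 3710000
```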

Same Case, Different Care

Labels shifted clinical recommendations in consistent, clinically unjustified directions. The largest deviations clustered around socioeconomic adversity, particularly housing instability. Label a child as "Black unhoused" and the ensemble pushed urgent interventions by +10.5 percentage points, additional investigations by +14.1 pp, and suspicion of maltreatment by +26.6 pp, all compared with the same clinical case carrying no sociodemographic label (all adjusted P < .001).

Composite social and mental health recommendations rose by as much as +44.9 pp for unhoused children and +44.3 pp for Black unhoused children. Across all groups, model outputs consistently exceeded the physician ground truth, with the overall composite diverging by +17.7 pp (model 41.5%, physicians 23.9%) and the intervention composite by +15.5 pp.
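All of the shifts reported here are percentage-point (pp) differences between two "yes" rates, not relative percent changes. A minimal sketch of that calculation follows; the function names and the 100-answer example lists are hypothetical, while the 41.5% and 23.9% rates are the ones quoted above.

```python
def yes_rate(answers):
    """Percentage of 'yes' (True) answers in a non-empty list of booleans."""
    if not answers:
        raise ValueError("answers must not be empty")
    return 100.0 * sum(answers) / len(answers)

def pp_shift(variant_answers, control_answers):
    """Percentage-point shift of a labeled variant relative to its control."""
    return yes_rate(variant_answers) - yes_rate(control_answers)

# Hypothetical data: 20 of 100 'yes' for the control, 30 of 100 for a variant.
control = [True] * 20 + [False] * 80
variant = [True] * 30 + [False] * 70
print(pp_shift(variant, control))  # 10.0 (a +10 pp shift, i.e. a 50% relative rise)

# Model vs. physician composite, using the rounded rates quoted in the text.
reported_gap = round(41.5 - 23.9, 1)
print(reported_gap)  # 17.6
```

The rounded rates give 17.6 pp where the text reports +17.7 pp. The likely explanation, which is an assumption here, is that the published difference was computed from unrounded rates.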

The same standardized case labeled "Black unhoused" received +26.6 pp higher suspicion of maltreatment and +14.1 pp more investigations than the unlabeled control, with nothing in the clinical content to justify either shift.

High-income groups moved in the opposite direction. The ensemble produced consistently lower recommendations across most clinical keys, from roughly -3 to -6 pp in urgency and admission categories. Smaller but statistically significant variations appeared for ethnicity-alone groups (Black, Hispanic, Asian, Middle Eastern) and for gender-alone groups, ranging from +1 to +22 pp depending on the category.

Figure: Social Composite Shift From Control. Mean percentage-point deviation in the social/mental-health composite score (the mean of the maltreatment, social services, and mental-health items) for each labeled variant, compared with the same case carrying no sociodemographic label. Positive values mean more recommendations than control; negative values mean fewer. Data derived from Figure 4 and the Results section of the paper.
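The social composite is described as the mean of three items, so its shift from control can be sketched as below. The item keys and the example rates are hypothetical placeholders chosen for easy arithmetic, not values from the paper.

```python
from statistics import mean

# Hypothetical keys for the three items that make up the social composite.
SOCIAL_ITEMS = ("maltreatment", "social_services", "caregiver_mental_health")

def social_composite(yes_rates):
    """Mean 'yes' rate (in percent) across the three social/mental-health items."""
    return mean(yes_rates[item] for item in SOCIAL_ITEMS)

def composite_shift(variant_rates, control_rates):
    """Percentage-point shift in the social composite versus control."""
    return social_composite(variant_rates) - social_composite(control_rates)

# Hypothetical 'yes' rates in percent (not study data).
control_rates = {"maltreatment": 5.0, "social_services": 10.0,
                 "caregiver_mental_health": 15.0}
variant_rates = {"maltreatment": 20.0, "social_services": 40.0,
                 "caregiver_mental_health": 30.0}

print(composite_shift(variant_rates, control_rates))  # 20.0
```

Because the composite is an unweighted mean, a large shift on one item (for example social services) can dominate the composite even when the other two items move less.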

The Caregiver's Shadow

Here is the part that genuinely surprised us. Keep every detail of the child's presentation identical. Change only the caregiver's label. Recommendations for the child still shifted, and the magnitude was not much smaller.

Social composite scores rose +43.21 pp when the caregiver was labeled "unhoused" and +41.24 pp when labeled "Black unhoused," compared with +44.88 pp and +44.32 pp when the same labels were applied to the child. The parent's label bled into the child's care, with no clinical reason for it to.

Unhoused, child label: +44.9 pp. Social composite shift when the child is labeled unhoused.
Unhoused, caregiver label: +43.2 pp. The same shift, almost in full, when only the caregiver is labeled.
Black unhoused, child label: +44.3 pp. Intersectional shift when the child carries both labels.
Black unhoused, caregiver label: +41.2 pp. Comparable shift when only the caregiver carries both labels.
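To make "nearly the same magnitude" concrete, the short calculation below compares the caregiver-label shift with the child-label shift for the two labels reported above. The four shift values come from the text; the variable names and the ratio framing are ours.

```python
# Social composite shifts in percentage points, as reported in the text.
shifts = {
    "unhoused": {"child": 44.88, "caregiver": 43.21},
    "Black unhoused": {"child": 44.32, "caregiver": 41.24},
}

summary = {}
for label, s in shifts.items():
    gap_pp = round(s["child"] - s["caregiver"], 2)
    ratio_pct = round(100 * s["caregiver"] / s["child"], 1)
    summary[label] = (gap_pp, ratio_pct)
    print(f"{label}: gap {gap_pp} pp, caregiver shift is {ratio_pct}% of child shift")

# unhoused: gap 1.67 pp, caregiver shift is 96.3% of child shift
# Black unhoused: gap 3.08 pp, caregiver shift is 93.1% of child shift
```

In both cases the caregiver label alone reproduces more than 93% of the shift seen when the child carries the label.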

Across composites, child demographics still produced larger deviations than caregiver demographics overall, but the gap was smaller than expected. For maltreatment risk specifically, caregiver labels produced the largest shifts recorded for that question: Black unhoused caregiver +26.6 pp, white unhoused caregiver +22.5 pp, unhoused caregiver +19.4 pp. None of these were features of the child's case.

Where the Gaps Concentrate

Across the seven clinical keys, the largest divergences clustered in social services and mental health, especially for unhoused and Black unhoused groups. The smallest were in radiology, where shifts were significant but modest. High-income groups consistently received fewer recommendations across most categories.

Clinical Key      Top Divergent Group          Δ pp
Social Services   Unhoused child               +68.1
Mental Health     Unhoused child               +50.3
Maltreatment      Black unhoused caregiver     +26.6
Admission         Black unhoused child         +15.5
Labs              Black unhoused child         +14.1
Urgency           Black unhoused child         +10.5
Radiology         Black transgender caregiver  +8.3

Intersectionality involving Black race consistently intensified the differences. Where white unhoused children shifted by +44.0 pp on the social composite, Black unhoused children shifted by +44.3 pp. The gap was larger in some categories: Black low-income children diverged +22.4 pp versus +19.7 pp for white low-income children. The same direction held among caregivers, where Black unhoused caregivers (+41.2 pp) diverged more than white unhoused caregivers (+38.5 pp).
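The size of the intersectional effect is easier to judge as a difference between paired groups. The sketch below subtracts the white-group shift from the Black-group shift for the three comparisons quoted above; the values are from the text, and the dictionary keys are our own labels.

```python
# (Black-group shift, white-group shift) in percentage points, from the text.
pairs = {
    "unhoused child": (44.3, 44.0),
    "low-income child": (22.4, 19.7),
    "unhoused caregiver": (41.2, 38.5),
}

# Additional shift associated with the Black-race label in each pairing.
gaps = {name: round(black - white, 1) for name, (black, white) in pairs.items()}

for name, gap in gaps.items():
    print(f"{name}: +{gap} pp")

# unhoused child: +0.3 pp
# low-income child: +2.7 pp
# unhoused caregiver: +2.7 pp
```

The direction is the same in all three pairings, but the added shift is small for unhoused children (+0.3 pp) and larger for the low-income child and unhoused caregiver comparisons (+2.7 pp each).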

Bias, Vigilance, or Both

Some elevated responses for high-risk groups may be clinically appropriate. Children in unstable housing face documented health risks, and heightened vigilance for these children can be justified. But the magnitudes observed exceeded both physician judgment and the models' own unlabeled baseline by enough to raise concern about oversimplification, particularly for indicators like low income alone or recent immigrant status.

The findings can be read as bias, as appropriate sensitivity, or as something in between, and the study design cannot fully separate the three. What is clear is the consistency and magnitude. Across millions of runs, across both synthetic and real datasets, and across 10 different models, the same labels moved the same recommendations in the same directions. That is the pattern that needs scrutiny before these tools sit in triage workflows.

Possible remedies include guideline-anchored safeguards built into model prompts (for example, requiring the model to justify recommendations that deviate from clinical norms), region-specific fine-tuning, and smaller, context-aware models. None of these are settled solutions. But the evidence base is now large enough to make the audit-and-redesign cycle a standing requirement, not a one-time exercise.
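One way to picture a recurring audit is a check that flags any label whose shift from the unlabeled control exceeds a chosen tolerance. The sketch below is illustrative only: the function name and the 10 pp tolerance are hypothetical choices, not thresholds from the paper, and the input values are the top divergences per clinical key reported earlier in this summary.

```python
def audit_shifts(shifts_pp, tolerance_pp):
    """Return (label, shift) pairs whose absolute shift exceeds the tolerance,
    sorted from largest to smallest absolute shift."""
    flagged = [(k, v) for k, v in shifts_pp.items() if abs(v) > tolerance_pp]
    return sorted(flagged, key=lambda kv: abs(kv[1]), reverse=True)

# Top divergence per clinical key, in percentage points, from the text.
top_shifts = {
    "Social Services / unhoused child": 68.1,
    "Mental Health / unhoused child": 50.3,
    "Maltreatment / Black unhoused caregiver": 26.6,
    "Admission / Black unhoused child": 15.5,
    "Labs / Black unhoused child": 14.1,
    "Urgency / Black unhoused child": 10.5,
    "Radiology / Black transgender caregiver": 8.3,
}

# Hypothetical tolerance; a real audit would need a clinically justified value.
flagged = audit_shifts(top_shifts, tolerance_pp=10.0)
for label, shift in flagged:
    print(f"{label}: {shift:+.1f} pp")
```

With this illustrative tolerance, six of the seven entries are flagged and only the radiology shift falls below the line. Rerunning such a check after every model or prompt change is one possible reading of a standing audit cycle.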

Bottom Line

A child's label, or even a caregiver's label, can change the care the model recommends.

Across 3.7 million outputs from 10 LLMs on 1,000 pediatric emergency cases, sociodemographic labels systematically shifted clinical recommendations beyond the physician baseline. Intersectional Black-race labels intensified the effects. Caregiver labels moved the child's care by nearly the same magnitude as labels applied to the child. Clinicians using AI-assisted pediatric decision support should remain cautious with sociodemographic content in prompts, particularly when outputs deviate from guidelines or clinical judgment.

Research Team
Mahmud Omar, Reem Agbareia, Razi Abu Salah, Nicola Luigi Bragazzi, Donald U. Apakama, Benjamin S. Glicksberg, Bruce D. Gelb, Emma Holmes, Girish N. Nadkarni, Eyal Klang
Icahn School of Medicine at Mount Sinai · Hadassah Medical Center · LMU Munich · BIDMC · Harvard Medical School