From Adults to Children
Across our prior work, large language models swung clinical recommendations dramatically based on a patient's label. Ethnicity, identity, housing status: none of it clinically relevant to the case at hand. The obvious next question was whether the same patterns appear in pediatric emergencies, where decisions are urgent, caregivers carry weight, and the consequences last a lifetime.
This study evaluated an ensemble of 10 contemporary LLMs (both proprietary and open-source) on 500 validated synthetic vignettes and 500 deidentified real triage notes from Mount Sinai's pediatric emergency department. Each case was presented as a control, then replayed in 52 additional variants that inserted a single or intersectional sociodemographic modifier: 20 child descriptors and 32 caregiver descriptors covering race, ethnicity, immigration status, income, housing stability, gender identity, sexual orientation, and intersectional combinations.
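The control-plus-variants design can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: the `Variant` class, the `build_variants` helper, and the sentence templates used to insert a label are all hypothetical, and the real descriptor wording is not given here.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Variant:
    case_id: str
    target: Optional[str]  # "child", "caregiver", or None for the control
    label: Optional[str]   # sociodemographic descriptor, None for the control
    text: str              # prompt text shown to the model


def build_variants(case_id: str, vignette: str,
                   child_labels: List[str],
                   caregiver_labels: List[str]) -> List[Variant]:
    """Return the unlabeled control followed by one variant per descriptor.

    The clinical text is never altered; only a single descriptor sentence
    is prepended (hypothetical template, assumed for illustration).
    """
    variants = [Variant(case_id, None, None, vignette)]
    for label in child_labels:
        variants.append(Variant(
            case_id, "child", label,
            f"The child is described as {label}. {vignette}"))
    for label in caregiver_labels:
        variants.append(Variant(
            case_id, "caregiver", label,
            f"The caregiver is described as {label}. {vignette}"))
    return variants
```

With 20 child descriptors and 32 caregiver descriptors, each case yields 53 prompts: one control and 52 labeled variants that share identical clinical content.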
For every case, models answered seven yes-or-no clinical questions: urgency, basic investigations, radiological imaging, hospital admission, suspicion of child maltreatment or abuse, social services involvement, and caregiver mental health assessment. Outputs were compared to a physician-derived ground truth built by two pediatric clinicians with high inter-rater agreement (Cohen's kappa 0.945).
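Cohen's kappa corrects raw agreement between two raters for the agreement they would reach by chance. A minimal sketch for yes-or-no ratings, using made-up ratings and not the study's data; the function name `cohens_kappa` is our own:

```python
from typing import Sequence


def cohens_kappa(rater_a: Sequence[bool], rater_b: Sequence[bool]) -> float:
    """Cohen's kappa: (observed - expected) / (1 - expected)."""
    a, b = list(rater_a), list(rater_b)
    if len(a) != len(b) or not a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(a)

    # Share of items on which the two raters gave the same answer.
    observed = sum(x == y for x, y in zip(a, b)) / n

    # Agreement expected by chance, from each rater's marginal rates.
    categories = set(a) | set(b)
    expected = sum((a.count(c) / n) * (b.count(c) / n) for c in categories)

    if expected == 1.0:
        # Both raters used a single identical category throughout.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Toy example: the raters agree on 3 of 4 items.
toy_a = [True, True, False, False]
toy_b = [True, True, False, True]
toy_kappa = cohens_kappa(toy_a, toy_b)
```

A kappa of 1 means perfect agreement and 0 means agreement no better than chance, so 0.945 indicates the two clinicians rarely disagreed.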
Same Case, Different Care
Labels shifted clinical recommendations in consistent, clinically unjustified directions. The largest deviations clustered around socioeconomic adversity, particularly housing instability. Label a child as "Black unhoused" and the ensemble pushed urgent interventions by +10.5 percentage points, additional investigations by +14.1 pp, and suspicion of maltreatment by +26.6 pp, all compared with the same clinical case carrying no sociodemographic label (all adjusted P < .001).
Composite social and mental health recommendations rose by as much as +44.9 pp for unhoused children and +44.3 pp for Black unhoused children. Across all groups, model outputs consistently exceeded the physician ground truth, with the overall composite diverging by +17.7 pp (model 41.5%, physicians 23.9%) and the intervention composite by +15.5 pp.
High-income groups moved in the opposite direction. The ensemble produced consistently lower recommendations across most clinical keys, from roughly -3 to -6 pp in urgency and admission categories. Smaller but statistically significant variations appeared for ethnicity-alone groups (Black, Hispanic, Asian, Middle Eastern) and for gender-alone groups, ranging from +1 to +22 pp depending on the category.
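The percentage-point shifts in this section are differences between two "yes" rates on the same cases: the rate with a label minus the rate for the unlabeled control. A minimal sketch with toy data; the helper names are ours, and the study's adjusted statistics are not reproduced here:

```python
from typing import Sequence


def yes_rate(answers: Sequence[bool]) -> float:
    """Percentage of 'yes' answers in a list of yes/no outputs."""
    if not answers:
        raise ValueError("answers must be non-empty")
    return 100.0 * sum(answers) / len(answers)


def pp_shift(labeled: Sequence[bool], control: Sequence[bool]) -> float:
    """Shift in percentage points: labeled yes-rate minus control yes-rate."""
    return yes_rate(labeled) - yes_rate(control)


# Toy data: 100 cases, 20 flagged in the control, 30 flagged once labeled.
control_answers = [True] * 20 + [False] * 80
labeled_answers = [True] * 30 + [False] * 70
toy_shift = pp_shift(labeled_answers, control_answers)  # 30% - 20%
```

A shift is an absolute difference, not a relative one: in the toy data the rate moves from 20% to 30%, a shift of +10 pp even though the relative increase is 50%.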
The Caregiver's Shadow
Here is the part that genuinely surprised us. Keep every detail of the child's presentation identical. Change only the caregiver's label. Recommendations for the child still shifted, and the magnitude was not much smaller.
Social composite scores rose +43.21 pp when the caregiver was labeled "unhoused" and +41.24 pp when labeled "Black unhoused," compared with +44.88 pp and +44.32 pp when the same labels were applied to the child. The parent's label bled into the child's care, with no clinical reason for it to.
Across composites, child demographics still produced larger deviations than caregiver demographics overall. But the gap was smaller than expected. For maltreatment risk specifically, caregiver labels produced the largest shifts in the entire dataset: Black unhoused caregiver +26.6 pp, white unhoused caregiver +22.5 pp, unhoused caregiver +19.4 pp. None of these were features of the child's case.
Where the Gaps Concentrate
Across the seven clinical keys, the largest divergences clustered in social services and mental health, especially for unhoused and Black unhoused groups. The smallest were in radiology, where shifts were significant but modest. High-income groups consistently received fewer recommendations across most categories.
Intersectionality involving Black race consistently intensified the differences. Where white unhoused children shifted by +44.0 pp on the social composite, Black unhoused children shifted by +44.3 pp. The gap was larger in some categories: Black low-income children diverged +22.4 pp versus +19.7 pp for white low-income children. The same direction held among caregivers, where Black unhoused caregivers (+41.2 pp) diverged more than white unhoused caregivers (+38.5 pp).
Bias, Vigilance, or Both
Some elevated responses for high-risk groups may be clinically appropriate. Children in unstable housing face documented health risks. Heightened vigilance for these children can be justified. But the magnitudes observed exceeded both physician judgment and the models' own unlabeled baseline by enough to raise concern about oversimplification, particularly for indicators like low income alone or recent immigrant status.
The findings can be read as bias, as appropriate sensitivity, or as something in between, and the study design cannot fully separate the three. What is clear is the consistency and magnitude. Across roughly 3.7 million outputs, across both synthetic and real datasets, and across 10 different models, the same labels moved the same recommendations in the same directions. That is the pattern that needs scrutiny before these tools sit in triage workflows.
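The scale follows directly from the design. A back-of-the-envelope count, assuming every model answered all seven questions exactly once for every variant of every case (an assumption; repeats or exclusions would change the total):

```python
MODELS = 10                       # LLMs in the ensemble
CASES = 500 + 500                 # synthetic vignettes + real triage notes
VARIANTS_PER_CASE = 1 + 20 + 32   # control + child + caregiver descriptors
QUESTIONS = 7                     # yes-or-no clinical questions per prompt

prompts = MODELS * CASES * VARIANTS_PER_CASE
outputs = prompts * QUESTIONS

print(f"{prompts:,} prompts, {outputs:,} yes/no outputs")
```

Under that assumption the design gives 530,000 prompts and 3,710,000 individual yes-or-no outputs, in line with the 3.7 million outputs reported for the study.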
Possible remedies include guideline-anchored safeguards built into model prompts (for example, requiring the model to justify recommendations that deviate from clinical norms), region-specific fine-tuning, and smaller, context-aware models. None of these are settled solutions. But the evidence base is now large enough to make the audit-and-redesign cycle a standing requirement, not a one-time exercise.
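One way to picture an audit step is a counterfactual check: compare a model's answers for a labeled variant against its answers for the unlabeled control, and flag every question where the answer flipped. The sketch below is hypothetical and is not a method described in the study; the key names and both functions are our own illustration.

```python
from typing import Dict, List

# Hypothetical identifiers for the seven yes-or-no questions.
QUESTION_KEYS = (
    "urgency",
    "basic_investigations",
    "radiological_imaging",
    "hospital_admission",
    "maltreatment_suspicion",
    "social_services",
    "caregiver_mental_health",
)

Answers = Dict[str, bool]


def flipped_keys(control: Answers, variant: Answers) -> List[str]:
    """Questions whose answer changed when only the label changed."""
    return [k for k in QUESTION_KEYS if control[k] != variant[k]]


def flips_needing_justification(control: Answers, variant: Answers,
                                reference: Answers) -> List[str]:
    """Flips where the labeled answer also departs from the clinician reference."""
    return [k for k in flipped_keys(control, variant)
            if variant[k] != reference[k]]
```

Because the clinical content is identical across the control and its variants, any flagged flip is attributable to the label alone. Those are the recommendations a guideline-anchored safeguard would ask the model, or a human reviewer, to justify.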
A child's label, or even a caregiver's label, can change the care the model recommends.
Across 3.7 million outputs from 10 LLMs on 1,000 pediatric emergency cases, sociodemographic labels systematically shifted clinical recommendations beyond the physician baseline. Intersectional Black-race labels intensified the effects. Caregiver labels moved the child's care by nearly the same magnitude as labels applied to the child. Clinicians using AI-assisted pediatric decision support should remain cautious with sociodemographic content in prompts, particularly when outputs deviate from guidelines or clinical judgment.