Sociodemographic bias in LLM clinical trial screening

Two Trends, Opposite Directions

Large language models are now routinely tasked with cohort selection and eligibility screening for randomized controlled trials. Early systems perform well on benchmarks and can generate clinician-interpretable rationales for inclusion or exclusion. At the same time, a parallel literature documents systematic sociodemographic bias in LLM outputs on clinical decision support and patient-facing tasks.

The two trends point in opposite directions: rising performance and rising concern. If trial automation inherits bias, it could quietly distort who gains access to experimental therapies. We asked the natural question: do LLM-based screening decisions change with patient sociodemographic descriptors when the clinical content and trial criteria remain identical?

Same Trial, Different Identity

We sampled 58 US adult Phase II-III RCT protocols registered on ClinicalTrials.gov from 2023-2024, spanning oncology (n=13), infectious diseases (n=12), neurology and pain (n=7), and other specialties. For each protocol, we generated 15 standardized clinical vignettes: 5 clearly eligible, 5 clearly ineligible, and 5 borderline cases requiring clinical judgment. Every vignette was strictly clinical, gender-neutral, and stripped of sociodemographic, financial, geographic, and culturally coded details. Two board-certified physicians independently validated each one.

For every vignette we built a control version with no identity descriptor, then created 33 identity-perturbed variants differing only in label: gender, race or ethnicity, socioeconomic status indicators (high-income, low-income, unemployed, homeless), and sexual orientation. Each vignette-identity combination was evaluated by an ensemble of 9 LLMs (Gemma 3, Qwen2.5, Llama 3, Phi-4 variants), yielding 5,324,400 model-question evaluations. Claude 3.5 Sonnet generated the vignettes and was excluded from evaluation to avoid circularity.
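The evaluation count follows directly from the design, as the short check below shows. It assumes a full factorial crossing, meaning every vignette version was answered by every model on each of the 20 structured screening questions used in the study. The variable names are illustrative.

```python
# Design counts as reported in the text.
protocols = 58
vignettes_per_protocol = 15      # 5 eligible, 5 ineligible, 5 borderline
versions_per_vignette = 1 + 33   # identity-free control plus 33 identity variants
models = 9
questions = 20                   # structured screening questions per evaluation

# Assumption: a full crossing of vignette versions, models, and questions.
vignettes = protocols * vignettes_per_protocol
prompts = vignettes * versions_per_vignette
total_evaluations = prompts * models * questions

print(vignettes, prompts, total_evaluations)  # 870 29580 5324400
```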

Models answered 20 structured screening questions across 5 domains: eligibility assessment, risk-benefit perception, adherence and retention, resource sufficiency, and trust and attitude. The primary endpoint was eligibility. Linear mixed-effects models estimated adjusted mean differences against the matched control vignette per domain and per identity, with Benjamini-Hochberg-adjusted P-values. We pre-specified ±0.10 on the 1-5 response scale as a practical-triviality margin to avoid over-interpreting differences that are statistically detectable but practically negligible across more than 5 million evaluations.
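For readers unfamiliar with the Benjamini-Hochberg adjustment, the sketch below shows the standard step-up computation in plain Python. It is not the study's analysis code: the function name and the input P-values are hypothetical and chosen only to illustrate the mechanics.

```python
def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted

# Hypothetical raw p-values, for illustration only.
raw = [0.01, 0.04, 0.03, 0.5]
adj = benjamini_hochberg(raw)
print([round(p, 4) for p in adj])
```

With many contrasts per domain, this adjustment controls the expected share of false discoveries, which is a separate question from whether a detected difference is large enough to matter. The ±0.10 margin addresses the latter.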

Evaluations: 5.3M
Mixed-effects models across 5,324,400 model-question runs from 9 LLMs on 58 RCT protocols

Triviality Margin: ±0.10
Pre-specified band on the 1-5 scale (2.5% of range) to flag practically meaningful identity effects

Direction Consistency: 100%
Homelessness, Black homelessness, and unemployment shifted negatively across every one of the 9 models

Within the Rules, Stable

Eligibility judgments were largely stable across the 33 identity labels. Most adjusted contrasts fell within ±0.05 on the 1-5 scale. A transgender woman label shifted eligibility by -0.008 (95% CI, -0.04 to 0.02; P=1.00); a White male label by +0.036 (95% CI, 0.01 to 0.07; P=.024). Race and ethnicity labels alone were essentially zero once socioeconomic status was accounted for, with Black at +0.008, Asian at -0.004, and White at +0.024, none reaching the ±0.10 triviality threshold.

The single notable deviation in eligibility was homelessness, at -0.121 (95% CI, -0.15 to -0.09; P<.001). It is the only label whose point estimate falls outside the ±0.10 triviality margin for the primary endpoint.
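The margin check can be written out explicitly using the eligibility contrasts quoted in this article. The sketch below assumes the comparison is made on point estimates only, not on confidence bounds. The function and variable names are illustrative.

```python
TRIVIALITY_MARGIN = 0.10  # pre-specified, on the 1-5 response scale

# Eligibility contrasts vs. control, as quoted in this article.
eligibility_delta = {
    "transgender woman": -0.008,
    "White male": 0.036,
    "Black": 0.008,
    "Asian": -0.004,
    "White": 0.024,
    "unemployed": -0.045,
    "homeless": -0.121,
}

def exceeds_margin(delta, margin=TRIVIALITY_MARGIN):
    """True when the point estimate lies outside the +/- margin band."""
    return abs(delta) > margin

# Assumption: only point estimates are compared with the margin.
flagged = [label for label, delta in eligibility_delta.items()
           if exceeds_margin(delta)]
print(flagged)  # ['homeless']
```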

Within explicit eligibility criteria, LLMs reasoned consistently across identity labels. Outside them, in adherence, resources, and trust, the same models echoed societal hierarchies, especially around homelessness.

Beyond the Rules, Disparity

The picture changed sharply in the secondary domains. Resources showed the largest socioeconomic dispersion: homeless -0.715 (95% CI, -0.73 to -0.70; P<.001), Black homeless -0.689, low-income -0.207, and high-income +0.129. Adherence followed the same gradient: homeless -0.595, Black homeless -0.542, low-income -0.138, high-income +0.087. Trust and attitude showed homeless -0.337 and Black homeless -0.372, the largest negative deviation in that domain.

Risk-benefit, the most rule-bound of the secondary domains, behaved more like eligibility: effects centered near zero (range -0.086 to +0.023), with no label crossing the ±0.10 boundary. The pattern is intuitive once you see it. Domains that require explicit criteria, applied to fixed clinical content, stayed stable. Domains that invite inference about behavior or capacity moved with socioeconomic cues.

Between-model variability mirrored this gradient. Intraclass correlation was smallest for risk-benefit (ICC=0.04) and eligibility (ICC=0.12), and larger for resources (ICC=0.42) and trust/attitude (ICC=0.38). Yet the direction of the main SES effects was consistent across all nine models: homelessness and Black homelessness were negative in every model for adherence, resources, and trust. Unemployment was also consistently negative across domains, including eligibility (mean Δ -0.045; 100% negative direction).
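The article does not state how the ICC was estimated. One common reading, assumed here, is the variance-ratio definition for a random-intercept model: the share of total variance attributable to differences between models. The variance components below are hypothetical and chosen only to show the scale.

```python
def intraclass_correlation(between_model_var, residual_var):
    """Share of total variance attributable to differences between models."""
    total = between_model_var + residual_var
    if total <= 0:
        raise ValueError("variance components must sum to a positive value")
    return between_model_var / total

# Hypothetical variance components, for illustration only.
low = intraclass_correlation(0.04, 0.96)   # models mostly agree
high = intraclass_correlation(0.42, 0.58)  # models differ substantially
print(round(low, 2), round(high, 2))
```

Under this reading, a low ICC means the nine models gave similar scores for a domain, while a high ICC means the magnitude of scores depended heavily on which model was asked, even when the direction of the effect was shared.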

Identity-Level Effects Across Five Domains

Adjusted mean difference vs. the matched control vignette on a 1-5 scale. The shaded gold band marks the pre-specified ±0.10 triviality margin.

Legend: crimson bars mark identities scored below control (lower score); teal bars mark scores above control (higher score); the gold band is the ±0.10 triviality margin.

Values are adjusted mean differences vs. the matched identity-free control vignette, on the 1-5 scoring scale used by each domain. Point estimates are taken from the Results section and Figure 2 of the paper; values for identities not stated explicitly in the text are derived from the published forest plot and are labeled as estimates rather than reported point estimates.

Where the Largest Gaps Sit

Pulling the most divergent groups across all five domains makes the pattern obvious. Homelessness anchors the negative extreme in every secondary domain. High-income anchors the positive extreme in every secondary domain except risk-benefit. Eligibility, by contrast, has a single off-margin entry: the same homelessness label.

Domain          Most Divergent Group   Δ (1-5)
Eligibility     Homeless               -0.121
Adherence       Homeless               -0.595
Resources       Homeless               -0.715
Risk-Benefit    Homeless               -0.086
Trust/Attitude  Black Homeless         -0.372
Resources (+)   High-income            +0.129
Adherence (+)   High-income            +0.087

A Conditional Bias

The cleanest way to read the result is as a boundary. When prompts kept the models tethered to protocol language, identity labels barely moved their outputs. When prompts asked them to infer behavior, adherence, or trust, the same labels produced far larger shifts (for homelessness, roughly six times the eligibility shift on resource sufficiency), and in the same direction across all nine models.

Within explicit criteria: -0.121
Largest eligibility shift across the 33 identity labels (homeless). All other labels fell within the ±0.10 triviality margin.

Beyond explicit criteria: -0.715
Same homeless label, same 1-5 scale, on resource sufficiency. Roughly six times larger than the eligibility shift, in the same direction.

Adherence, homeless: -0.595
Negative across all nine models. The SES gradient repeats here: low-income -0.138, high-income +0.087.

Trust, Black homeless: -0.372
Largest negative deviation in the trust and attitude domain. Race-only labels were near zero; the gap appears at the intersection with SES.
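To make the size of the gap concrete, the sketch below divides each reported homeless shift in a secondary domain by the homeless eligibility shift. All values are the point estimates quoted in this article; the variable names are illustrative.

```python
eligibility_shift = -0.121  # homeless label, eligibility domain

# Homeless label, secondary domains (point estimates quoted in the article).
secondary_shifts = {
    "resources": -0.715,
    "adherence": -0.595,
    "trust/attitude": -0.337,
}

# Ratio of each secondary-domain shift to the eligibility shift.
ratios = {domain: shift / eligibility_shift
          for domain, shift in secondary_shifts.items()}

for domain, ratio in ratios.items():
    print(f"{domain}: {ratio:.1f}x the eligibility shift")
```

These ratios compare point estimates only and ignore the uncertainty around each one, so they are best read as a rough sense of scale.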

Because homelessness is generally not an explicit eligibility criterion in Phase II-III protocols, the -0.121 eligibility shift is consistent with subjective assumptions influencing rule-based determinations rather than with codified protocol logic. The implication is that identity-linked outputs should trigger support, navigation, and logistical resources, not exclusion.

The Safeguard Is Separation

Our scope sits next to recent equity work in clinical trial matching and medical question answering, which has shown group-dependent changes in ranking and QA performance when sociodemographic cues are inserted into prompts. Here we held protocols constant, stripped prompts of social determinants of health content, and examined eligibility judgments as the endpoint. The risk we identified does not live in the eligibility logic itself. It lives in its periphery: if soft judgments about adherence, resources, or trust feed operational decisions about recruitment or resource allocation, disparities can re-enter through pathways other than the inclusion criteria.

The appropriate safeguard is separation. Keep eligibility tethered to protocol language. Treat auxiliary domains as planning variables, not filters. The findings should be interpreted as model sensitivity to identity labels under standardized prompts, not as evidence about the correctness of model predictions or the real-world validity of sociodemographic correlations.
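One way to picture this separation in a screening pipeline is sketched below. It is a hypothetical design, not a system described in the paper: the `ScreeningResult` type, the `screen` function, the example criteria, and the 3.0 support threshold are all assumptions made for illustration.

```python
from dataclasses import dataclass, field

@dataclass
class ScreeningResult:
    eligible: bool
    support_flags: list = field(default_factory=list)

def screen(criteria_met, auxiliary_scores, support_threshold=3.0):
    """Decide eligibility from protocol criteria only.

    criteria_met: dict mapping each protocol criterion to True/False.
    auxiliary_scores: dict of 1-5 scores for soft domains such as
        adherence, resources, or trust. These never change `eligible`.
    support_threshold: hypothetical cut-off below which support is offered.
    """
    eligible = all(criteria_met.values())
    # Low auxiliary scores route to support planning, never to exclusion.
    flags = sorted(domain for domain, score in auxiliary_scores.items()
                   if score < support_threshold)
    return ScreeningResult(eligible=eligible, support_flags=flags)

result = screen(
    criteria_met={"age >= 18": True, "confirmed diagnosis": True},
    auxiliary_scores={"adherence": 2.4, "resources": 2.1, "trust": 3.6},
)
print(result.eligible, result.support_flags)  # True ['adherence', 'resources']
```

The design choice is that the eligibility field is computed without ever reading the auxiliary scores, so a low adherence or resources score can only add a support flag.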

Several limitations apply. We analyzed US adult Phase II-III trials from 2023-2024, which may not generalize to pediatric or international contexts. Socioeconomic descriptors were simplified proxies. Standardized vignettes cannot fully capture clinical nuance. Models were tested at single time points with default configurations, and newer iterations may behave differently. The ±0.10 triviality threshold is a pragmatic a priori choice on a 1-5 scale, and alternative thresholds are defensible. Because one LLM generated the vignettes and other LLMs answered the screening questions, residual generator effects remain possible despite physician validation. Future work should compare LLM-based screening with clinician assessments and evaluate model behavior in real-world trial screening workflows.

Bottom Line

Bias in LLM-assisted trial screening is conditional. The risk lives in the periphery, not in the rules themselves.

Across 5.3 million prompted runs from 9 LLMs on 58 Phase II-III RCT protocols and 33 identity labels, eligibility judgments stayed largely stable across identities. Once prompts moved beyond explicit criteria, homelessness, low-income, and unemployment shifted adherence, resources, and trust sharply downward across every model. Responsible deployment depends on preserving that boundary: keep eligibility tethered to protocol language and treat auxiliary domains as planning variables for resource navigation, not filters for exclusion.

Research Team

Shelly Soffer, Mahmud Omar, Orly Efros, Donald U. Apakama, Aya Mudrik, Robert Freeman, Girish N. Nadkarni, Eyal Klang

Rabin Medical Center · Gray School of Medicine, Tel Aviv University · Windreich Department of AI and Human Health, Icahn School of Medicine at Mount Sinai · Hasso Plattner Institute for Digital Health at Mount Sinai · Sheba Medical Center · Institute for Health Equity Research, Mount Sinai · Ben-Gurion University of the Negev

Drs S. Soffer, M. Omar, G. N. Nadkarni, and E. Klang contributed equally to this work.
