Two Trends, Opposite Directions
Large language models are now routinely tasked with cohort selection and eligibility screening for randomized controlled trials. Early systems perform well on benchmarks and can generate clinician-interpretable rationales for inclusion or exclusion. At the same time, a parallel literature documents systematic sociodemographic bias in LLM outputs on clinical decision support and patient-facing tasks.
The two trends point in opposite directions: rising performance and rising concern. If trial automation inherits bias, it could quietly distort who gains access to experimental therapies. We asked the natural question: do LLM-based screening decisions change with patient sociodemographic descriptors when the clinical content and trial criteria remain identical?
Same Trial, Different Identity
We sampled 58 US adult Phase II-III RCT protocols registered on ClinicalTrials.gov in 2023-2024, spanning oncology (n=13), infectious diseases (n=12), neurology and pain (n=7), and other specialties. For each protocol, we generated 15 standardized clinical vignettes: 5 clearly eligible, 5 clearly ineligible, and 5 borderline cases requiring clinical judgment. Every vignette was strictly clinical, gender-neutral, and stripped of sociodemographic, financial, geographic, and culturally coded details. Two board-certified physicians independently validated each one.
For every vignette we built a control version with no identity descriptor, then created 33 identity-perturbed variants differing only in label: gender, race or ethnicity, socioeconomic status indicators (high-income, low-income, unemployed, homeless), and sexual orientation. Each vignette-identity combination was evaluated by an ensemble of 9 LLMs (Gemma 3, Qwen2.5, Llama 3, Phi-4 variants), yielding 5,324,400 model-question evaluations. Claude 3.5 Sonnet generated the vignettes and was excluded from evaluation to avoid circularity.
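The evaluation total follows directly from the design counts. The short sketch below reproduces the arithmetic, assuming every model answered all 20 screening questions for every vignette version (control plus 33 identity variants); the variable names are illustrative.

```python
# Design counts as reported in the text; variable names are illustrative.
protocols = 58
vignettes_per_protocol = 15                     # 5 eligible + 5 ineligible + 5 borderline
identity_variants = 33
versions_per_vignette = 1 + identity_variants   # control + identity-perturbed variants
models = 9
questions = 20                                  # structured screening questions

vignettes = protocols * vignettes_per_protocol            # 870
prompts_per_model = vignettes * versions_per_vignette     # 29,580
total_evaluations = prompts_per_model * models * questions

print(f"{total_evaluations:,}")  # 5,324,400
```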
Models answered 20 structured screening questions across 5 domains: eligibility assessment, risk-benefit perception, adherence and retention, resource sufficiency, and trust and attitude. The primary endpoint was eligibility. Linear mixed-effects models estimated adjusted mean differences against the matched control vignette for each domain and identity label, with Benjamini-Hochberg-adjusted P-values. We pre-specified ±0.10 on the 1-5 response scale as a practical-triviality margin, so that differences that are statistically detectable across more than 5 million evaluations but practically negligible would not be over-interpreted.
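For readers unfamiliar with the Benjamini-Hochberg adjustment, the sketch below shows the standard step-up procedure in plain Python. It is an illustration of the general technique, not the software used in the study, and the example P-values are placeholders rather than study results.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, scaling by m / rank and
    # taking a running minimum so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Placeholder p-values for illustration only; NOT values from the study.
example_p = [0.001, 0.02, 0.03, 0.5]
adjusted = benjamini_hochberg(example_p)
print([round(p, 3) for p in adjusted])  # [0.004, 0.04, 0.04, 0.5]
```

With tens of contrasts per domain, this adjustment controls the expected share of false discoveries, while the ±0.10 margin addresses the separate question of whether a detectable difference is large enough to matter.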
Within the Rules, Stable
Eligibility judgments were largely stable across the 33 identity labels. Most adjusted contrasts fell within ±0.05 on the 1-5 scale. A transgender woman label shifted eligibility by -0.008 (95% CI, -0.04 to 0.02; P=1.00); a White male label by +0.036 (95% CI, 0.01-0.07; P=.024). Race and ethnicity labels alone were essentially zero once socioeconomic status was accounted for, with Black at +0.008, Asian at -0.004, and White at +0.024, none reaching the ±0.10 triviality threshold.
The single notable deviation in eligibility was homelessness, at -0.121 (95% CI, -0.15 to -0.09; P<.001). It is the only label whose point estimate falls outside the ±0.10 triviality band for the primary endpoint.
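The margin check can be reproduced from the eligibility contrasts quoted above. This is a minimal sketch that compares point estimates with the ±0.10 margin; treating the point estimate (rather than a confidence bound) as the decision quantity is an assumption, and only the labels quoted in the text are included.

```python
MARGIN = 0.10  # pre-specified practical-triviality margin on the 1-5 scale

# Adjusted eligibility contrasts versus control, as quoted in the text.
eligibility_deltas = {
    "transgender woman": -0.008,
    "White male": 0.036,
    "Black": 0.008,
    "Asian": -0.004,
    "White": 0.024,
    "homeless": -0.121,
}

# Labels whose point estimate lies outside the triviality band.
outside_margin = [
    label for label, delta in eligibility_deltas.items() if abs(delta) > MARGIN
]
print(outside_margin)  # ['homeless']
```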
Within explicit eligibility criteria, LLMs reasoned consistently across identity labels. Outside them, in adherence, resources, and trust, the same models echoed societal hierarchies, especially around homelessness.
Beyond the Rules, Disparity
The picture changed sharply in the secondary domains. Resources showed the largest socioeconomic dispersion: homeless -0.715 (95% CI, -0.73 to -0.70; P<.001), Black homeless -0.689, low-income -0.207, and high-income +0.129. Adherence followed the same gradient: homeless -0.595, Black homeless -0.542, low-income -0.138, high-income +0.087. Trust and attitude showed homeless -0.337 and Black homeless -0.372, the latter being the largest negative deviation in that domain.
Risk-benefit, the most rule-bound of the secondary domains, behaved more like eligibility: effects centered near zero (range -0.086 to +0.023), with no label crossing the ±0.10 boundary. The pattern is intuitive once you see it. Domains that require explicit criteria, applied to fixed clinical content, stayed stable. Domains that invite inference about behavior or capacity moved with socioeconomic cues.
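The contrast between rule-bound and inferential domains can be made concrete by computing the spread between the most positive and most negative contrasts in each domain. The sketch below uses only the values quoted in the text, so the adherence and resources spreads cover the four quoted labels rather than all 33; the risk-benefit entry uses the reported range endpoints.

```python
# Contrasts versus control as quoted in the text (not the full set of 33 labels).
reported = {
    "risk-benefit": [-0.086, 0.023],                  # reported range endpoints
    "adherence": [-0.595, -0.542, -0.138, 0.087],     # homeless, Black homeless, low-income, high-income
    "resources": [-0.715, -0.689, -0.207, 0.129],     # same label order
}

# Spread = most positive minus most negative contrast within each domain.
spreads = {
    domain: round(max(values) - min(values), 3)
    for domain, values in reported.items()
}
print(spreads)  # {'risk-benefit': 0.109, 'adherence': 0.682, 'resources': 0.844}
```

On these quoted values, the spread in resources is several times the spread in risk-benefit, which is the gradient the paragraph describes.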
Between-model variability mirrored this gradient. Intraclass correlation was smallest for risk-benefit (ICC=0.04) and eligibility (ICC=0.12), and larger for resources (ICC=0.42) and trust/attitude (ICC=0.38). Yet the direction of the main SES effects was consistent across all nine models: homelessness and Black homelessness were negative in every model for adherence, resources, and trust. Unemployment was also consistently negative across domains, including eligibility (mean Δ -0.045; 100% negative direction).
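Two summaries appear in this paragraph: an intraclass correlation and a directional-consistency percentage. The sketch below shows one common way to compute each, assuming the ICC is the share of variance attributable to the model-level random effect; the study's exact estimator is not stated here, and all inputs are placeholders rather than study values.

```python
def icc(var_between_models, var_residual):
    """Share of total variance attributable to differences between models."""
    return var_between_models / (var_between_models + var_residual)


def share_negative(per_model_deltas):
    """Fraction of models whose contrast versus control is below zero."""
    return sum(d < 0 for d in per_model_deltas) / len(per_model_deltas)


# Placeholder inputs for illustration only; NOT values from the study.
example_deltas = [-0.02, -0.05, -0.04, -0.06, -0.03, -0.07, -0.05, -0.04, -0.05]

print(share_negative(example_deltas))  # 1.0
print(icc(1.0, 3.0))                   # 0.25
```

A high ICC with a directional share of 1.0 would mean the models disagree on the size of an effect while agreeing on its sign, which is the combination reported for the socioeconomic labels.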
Where the Largest Gaps Sit
Pulling the most divergent groups across all five domains makes the pattern obvious. Homelessness anchors the negative extreme in every secondary domain. High-income anchors the positive extreme in every secondary domain except risk-benefit. Eligibility, by contrast, has a single off-margin entry: the same homelessness label.
A Conditional Bias
The cleanest way to read the result is as a boundary. When prompts kept the models tethered to protocol language, identity labels barely moved their outputs. When prompts asked them to infer behavior, adherence, or trust, the same labels produced shifts an order of magnitude larger, and in the same direction across all nine models.
Because homelessness is generally not an explicit eligibility criterion in Phase II-III protocols, the -0.121 eligibility shift is consistent with subjective assumptions influencing rule-based determinations rather than with codified protocol logic. The implication is that identity-linked outputs should trigger support, navigation, and logistical resources, not exclusion.
The Safeguard Is Separation
Our scope sits next to recent equity work in clinical trial matching and medical question answering, which has shown group-dependent changes in ranking and QA performance when sociodemographic cues are inserted into prompts. Here we held protocols constant, stripped prompts of social determinants of health content, and examined eligibility judgments as the endpoint. The risk we identified does not live in the eligibility logic itself. It lives in its periphery: if soft judgments about adherence, resources, or trust feed operational decisions about recruitment or resource allocation, disparities can re-enter through pathways other than the inclusion criteria.
The appropriate safeguard is separation. Keep eligibility tethered to protocol language. Treat auxiliary domains as planning variables, not filters. The findings should be interpreted as model sensitivity to identity labels under standardized prompts, not as evidence about the correctness of model predictions or the real-world validity of sociodemographic correlations.
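One way to picture this separation in a screening pipeline is sketched below. It is a hypothetical illustration, not an implementation from the study: the class name, field names, and both thresholds are assumptions. The point is structural: the eligibility decision reads only the protocol-based score, and auxiliary scores can add support flags but can never change eligibility.

```python
from dataclasses import dataclass, field

# Hypothetical thresholds on the 1-5 scale, chosen only for illustration.
ELIGIBILITY_CUTOFF = 3.0
SUPPORT_TRIGGER = 3.0


@dataclass
class ScreeningResult:
    eligibility_score: float                              # judged against protocol criteria only
    auxiliary_scores: dict = field(default_factory=dict)  # e.g. adherence, resources, trust


def route(result):
    """Decide eligibility from protocol criteria; use auxiliary scores only for planning."""
    eligible = result.eligibility_score >= ELIGIBILITY_CUTOFF
    support_needs = []
    if eligible:
        # Low auxiliary scores trigger navigation or logistical support, never exclusion.
        support_needs = sorted(
            domain
            for domain, score in result.auxiliary_scores.items()
            if score < SUPPORT_TRIGGER
        )
    return {"eligible": eligible, "support_needs": support_needs}


example = ScreeningResult(4.5, {"adherence": 2.0, "resources": 1.5, "trust": 4.0})
print(route(example))  # {'eligible': True, 'support_needs': ['adherence', 'resources']}
```

Because `route` never reads `auxiliary_scores` when computing `eligible`, a bias in those scores cannot translate into exclusion through this path.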
Several limitations apply. We analyzed US adult Phase II-III trials from 2023-2024, which may not generalize to pediatric or international contexts. Socioeconomic descriptors were simplified proxies. Standardized vignettes cannot fully capture clinical nuance. Models were tested at single time points with default configurations, and newer iterations may behave differently. The ±0.10 triviality threshold is a pragmatic a priori choice on a 1-5 scale, and alternative thresholds are defensible. Because one LLM generated the vignettes and other LLMs answered the screening questions, residual generator effects remain possible despite physician validation. Future work should compare LLM-based screening with clinician assessments and evaluate model behavior in real-world trial screening workflows.
Bias in LLM-assisted trial screening is conditional. The risk lives in the periphery, not in the rules themselves.
Across 5.3 million model-question evaluations from 9 LLMs on 58 Phase II-III RCT protocols and 33 identity labels, eligibility judgments stayed largely stable across identities. Once prompts moved beyond explicit criteria, homelessness, low income, and unemployment shifted adherence, resources, and trust sharply downward across every model. Responsible deployment depends on preserving that boundary: keep eligibility tethered to protocol language and treat auxiliary domains as planning variables for resource navigation, not filters for exclusion.