Same Pain, Different Prescriptions
Pain is among the most common reasons people visit an emergency department. LLMs are increasingly positioned to assist clinicians with treatment decisions, from suggesting analgesics to flagging risk factors. But when demographic labels are the only thing that changes in otherwise identical clinical scenarios, should the recommendations change too?
They do. Across 1,000 physician-validated acute-pain vignettes, split evenly between cancer and non-cancer settings, LLMs produced systematically different opioid prescribing, risk scores, and psychosocial flags based solely on demographic identifiers. The control group received no demographic labels. Every other group did, and the gaps were large.
Individuals identified as Black and unhoused had the highest odds of opioid recommendation (OR = 1.73; P < 0.001), followed by unhoused individuals without a racial identifier (OR = 1.64) and White unhoused individuals (OR = 1.61). Meanwhile, low-income (OR = 0.78) and middle-income (OR = 0.72) labels pushed the odds below baseline: these groups were less likely than the control to receive opioids.
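To make the odds ratios concrete, an OR can be converted into an implied subgroup rate via the odds scale. A minimal sketch in Python, plugging in the 37.7% non-cancer control baseline reported later in this piece (the study's per-subgroup rates come from its regression models, so these are illustrative back-of-envelope figures, and the function name is mine):

```python
def or_to_rate(control_rate: float, odds_ratio: float) -> float:
    """Convert a control-group rate plus an odds ratio into the
    implied subgroup rate (probability), via the odds scale."""
    control_odds = control_rate / (1.0 - control_rate)
    subgroup_odds = odds_ratio * control_odds
    return subgroup_odds / (1.0 + subgroup_odds)

baseline = 0.377  # non-cancer control opioid recommendation rate

# OR = 1.73 (Black and unhoused) implies roughly a 51% opioid rate
print(or_to_rate(baseline, 1.73))
# OR = 0.72 (middle income) implies roughly a 30% rate
print(or_to_rate(baseline, 0.72))
```

The non-linearity of the odds scale is why an OR of 1.73 does not mean "73% more prescriptions"; the gap in absolute percentage points depends on the baseline rate.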
The Paradox of High Risk, High Prescribing
What makes these patterns especially troubling is the internal inconsistency. The same models that flagged marginalized subgroups as higher risk for addiction and drug-seeking behavior simultaneously recommended stronger opioid regimens. Unhoused and LGBTQIA+ individuals received risk scores more than 30% above control, yet their opioid recommendation rates climbed in tandem.
Groups labelled as high risk for addiction and misuse were the same groups receiving the most aggressive opioid recommendations, sometimes exceeding 90% in cancer settings.
This disconnect extended to monitoring. Unemployed and low-income subgroups were assigned elevated risk scores but received fewer opioid recommendations, while monitoring intensified. The models appeared to treat demographic attributes as proxies for unobserved clinical factors, channeling social information into prescribing logic rather than adjusting to the actual pain presentation.
Mental Health and Anxiety Flags
Disparities were not limited to opioid prescribing. The need for anxiety treatment was highest among Black unhoused individuals (OR = 1.48), unhoused individuals without a racial identifier (OR = 1.48), and White unhoused individuals (OR = 1.26). Black transgender women (OR = 1.26) and low-income individuals (OR = 1.23) were also disproportionately flagged.
Perceived psychological stress followed a similar pattern. In non-cancer cases, LLMs indicated that stress affected pain most strongly for Black unhoused individuals (OR = 8.35), followed by those without a specified racial identity (OR = 7.45) and White unhoused individuals (OR = 5.51). These flags risk labelling genuine health needs as primarily psychological, diverting clinical attention when other interventions may be more appropriate.
What the Models Said About Their Own Decisions
In a targeted analysis of GPT-4o's reasoning, physicians independently coded whether demographic labels appeared in clinical explanations and whether those labels were cited as causal. In non-cancer vignettes, the model mentioned demographics 82% of the time for Black patients versus 38% for White patients. For explicitly causal mentions, the gap widened to 50% vs. 6%.
In cancer vignettes, mention rates rose across the board (90% for Black, 62% for White), but the causal-attribution gap persisted. Income variants showed a related asymmetry: low-income patients were mentioned in only 42% of non-cancer explanations yet cited causally 36% of the time, whereas high-income patients were mentioned in 80% of explanations but cited causally in only 18%. The model was more likely to treat disadvantaged demographics as explanatory factors for its clinical decisions.
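The mention-rate gaps above can themselves be summarized as odds ratios. A short illustrative computation using the standard odds-ratio formula applied to the reported percentages (the function name is mine, not the study's):

```python
def odds_ratio(p1: float, p2: float) -> float:
    """Odds ratio comparing rate p1 against rate p2."""
    return (p1 / (1.0 - p1)) / (p2 / (1.0 - p2))

# Demographic mentions in non-cancer explanations: 82% (Black) vs. 38% (White)
print(round(odds_ratio(0.82, 0.38), 1))   # ~7.4: mentioned far more often
# Explicitly causal mentions: 50% vs. 6%
print(round(odds_ratio(0.50, 0.06), 1))   # ~15.7: the causal gap is even wider
```

On the odds scale, the causal-attribution gap is roughly twice the size of the any-mention gap, which is the asymmetry the physician coders flagged.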
Prescribing Variation Across Models
All ten models showed socio-demographic gaps, but their magnitude varied. Opioid recommendation rates in non-cancer cases ranged from 36% (low income) to 41% (unhoused), and in cancer cases from 77% (non-binary) to 85% (Black unhoused).
Both open-source and closed-source models produced comparable control baselines (37.7% non-cancer opioid rate, 79.5% cancer rate). The direction of bias was consistent across model families: marginalized demographics skewed upward, while White and high-income variants trended at or below baseline. The cancer label amplified overall prescribing but did not uniformly reduce disparities.
LLM pain management recommendations systematically vary by demographic label, not by clinical need
Across 3.4 million responses, models prescribed more opioids to historically marginalized groups while simultaneously flagging them as higher risk. These patterns diverge from standard guidelines and point to model-driven bias rather than acceptable clinical variation. Rigorous bias evaluation and guideline-based guardrails are essential before LLMs inform real prescribing decisions.