npj Health Systems · 2026 · Multi-Agent Systems

Orchestrated Multi-Agents Sustain Accuracy Under Clinical-Scale Workloads Compared to a Single Agent

When dozens of clinical tasks hit an LLM at once, a single agent collapses. A lightweight orchestrator that delegates each task to a dedicated worker kept the best model's accuracy above 90% at every batch size, while single-agent accuracy dropped as low as 16.6%, using up to 65 times fewer tokens in the process.

90.6%
Multi-Agent Accuracy
65×
Fewer Tokens
4
Models Tested
16.6%
Single-Agent at 80 Tasks

One Agent Cannot Do It All

Large language models are already handling clinical tasks: retrieving literature, extracting data from notes, and computing drug doses. But real clinical environments do not serve tasks one at a time. A busy hospital system might need an LLM to handle 40 or 80 heterogeneous requests simultaneously. What happens to accuracy when the workload scales?

This study tested two architectures head to head. A single-agent design gives one model the entire batch, with access to all tools. An orchestrated multi-agent design uses a lightweight controller that reads each task, routes it to a specialist worker (a retriever, a calculator, or a note fetcher), and reassembles the results. Batch sizes ranged from 5 to 80 tasks across three clinical task types.
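The orchestrated topology can be sketched in a few lines. This is an illustrative outline only: the worker functions and the keyword-based router below are hypothetical stand-ins, since the paper's exact routing logic is not reproduced here.

```python
# Hypothetical sketch of the orchestrated multi-agent topology:
# a controller reads each task, routes it to a specialist worker,
# and reassembles the results. Worker bodies are placeholders.

def retrieval_worker(task: str) -> str:
    return f"retrieved literature for: {task}"

def extraction_worker(task: str) -> str:
    return f"extracted fields from: {task}"

def dosing_worker(task: str) -> str:
    return f"computed dose for: {task}"

WORKERS = {
    "retrieval": retrieval_worker,
    "extraction": extraction_worker,
    "dosing": dosing_worker,
}

def route(task: str) -> str:
    """The orchestrator only classifies and routes; it never
    reasons over clinical content itself."""
    lowered = task.lower()
    if "dose" in lowered or "mg/kg" in lowered:
        return "dosing"
    if "note" in lowered or "discharge" in lowered:
        return "extraction"
    return "retrieval"

def run_batch(tasks: list[str]) -> list[str]:
    # Each task gets its own worker call with a fresh, minimal
    # context: no worker ever sees another task's tokens.
    return [WORKERS[route(t)](t) for t in tasks]

results = run_batch([
    "find trials on pembrolizumab in NSCLC",
    "extract admission diagnosis from this discharge note",
    "compute the mg/kg dose for a 70 kg patient",
])
```

The design point is that the controller's output is a routing decision, not an answer; every clinical token is handled by exactly one short-context worker call.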

Corpus scale by task type:
- Retrieval: 234,650 PubMed oncology abstracts, searched via a FAISS index with k-NN retrieval
- Extraction: 331,793 EHR summaries with admission and discharge fields to extract
- Dosing: 20 medication-math templates covering weight, body surface area, and clearance
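The three dosing-template categories (weight, body surface area, clearance) correspond to standard clinical formulas. The sketch below uses the Mosteller BSA formula and the Cockcroft-Gault clearance estimate as representative examples; the paper's exact templates are not reproduced here.

```python
import math

# Representative medication-math formulas for the three template
# categories. These are standard clinical formulas, offered as
# examples of what such templates compute, not the paper's own code.

def weight_based_dose(mg_per_kg: float, weight_kg: float) -> float:
    """Simple weight-based dose in mg."""
    return mg_per_kg * weight_kg

def mosteller_bsa(height_cm: float, weight_kg: float) -> float:
    """Body surface area (m^2), Mosteller formula:
    sqrt(height_cm * weight_kg / 3600)."""
    return math.sqrt(height_cm * weight_kg / 3600)

def cockcroft_gault(age: int, weight_kg: float, scr_mg_dl: float,
                    female: bool) -> float:
    """Estimated creatinine clearance (mL/min), Cockcroft-Gault."""
    crcl = ((140 - age) * weight_kg) / (72 * scr_mg_dl)
    return crcl * 0.85 if female else crcl

print(weight_based_dose(5.0, 70.0))               # 350.0 mg
print(round(mosteller_bsa(170, 70), 2))           # 1.82 m^2
print(round(cockcroft_gault(60, 70, 1.0, False)))  # 78 mL/min
```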

Accuracy Diverges at Scale

At small batch sizes, both architectures performed well. With just five tasks, the single agent scored 73.1% and the multi-agent system scored 90.6%. But as batch size increased, the gap widened dramatically. By 80 concurrent tasks, single-agent accuracy had fallen to 16.6% while multi-agent accuracy held at 65.3%.

The pattern was consistent across all four models tested. GPT-4.1-mini showed the starkest contrast: its multi-agent accuracy stayed between 91.4% and 96% across all batch sizes, while its single-agent accuracy dropped from 96% to 33.9%. Llama-2-70B and Qwen-3-8B showed the same directional trend, though several large single-agent batches for these open-weight models failed entirely, returning no valid JSON.

All differences between topologies beyond ten tasks were statistically significant (Welch's t-test, FDR-adjusted p < 0.01).
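In outline, that comparison can be reproduced with a Welch t-test (unequal variances) per batch size, followed by Benjamini-Hochberg FDR adjustment across batch sizes. The accuracy vectors below are made-up placeholders, not the paper's data.

```python
# Welch's t-test per comparison, then Benjamini-Hochberg FDR
# adjustment across comparisons. Sample accuracies are synthetic
# placeholders for illustration.
from scipy.stats import ttest_ind

def benjamini_hochberg(pvals):
    """Benjamini-Hochberg adjusted p-values (step-up procedure)."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank_from_end, i in enumerate(reversed(order)):
        rank = m - rank_from_end          # 1-based rank of pvals[i]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

multi  = [0.92, 0.90, 0.94, 0.91, 0.93]   # per-replicate accuracy
single = [0.20, 0.15, 0.18, 0.17, 0.16]

t_stat, p = ttest_ind(multi, single, equal_var=False)  # Welch's test
adjusted = benjamini_hochberg([p, 0.03, 0.20])
```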

[Figure: Accuracy by batch size, multi-agent vs. single-agent]

Why Orchestration Works

The key mechanism is context insulation. In the single-agent setup, every task shares one context window. As more unrelated prompts pile in, attention is diluted across irrelevant tokens, a phenomenon consistent with the documented "lost-in-the-middle" effect. The model struggles to find the signal in an increasingly noisy prompt.

The orchestrator solves this by constraining each worker to a single tool and a short prompt. No worker ever sees tokens that are irrelevant to its task. The orchestrator itself only routes; it never reasons over clinical content. This keeps every intermediate call within the context range the model was tuned to handle.
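A toy calculation makes the insulation effect concrete. Token counts are proxied by word counts and the tasks are placeholders; the point is that the single-agent prompt grows linearly with batch size while each worker's prompt stays constant.

```python
# Toy illustration of context insulation. Word count stands in for
# token count; tasks are synthetic placeholders.

def count_tokens(text: str) -> int:
    return len(text.split())  # crude word-count proxy

tasks = [f"task {i}: summarize patient note number {i}" for i in range(80)]

# Single agent: one prompt containing the entire batch of 80 tasks.
single_prompt = "\n".join(tasks)

# Orchestrated: each worker sees exactly one task plus a short header,
# no matter how many tasks are in flight.
worker_prompts = ["You handle one task only.\n" + t for t in tasks]

print(count_tokens(single_prompt))                    # 560 words
print(max(count_tokens(p) for p in worker_prompts))   # 12 words
```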

Token Efficiency

Multi-agent runs were dramatically more token-efficient. For GPT-4.1-mini at 80 tasks, the multi-agent batch used roughly 60,000 tokens. The single agent consumed approximately 3.9 million, a 65-fold difference. This gap matters for cost and for practical deployment at scale.
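The 65-fold figure follows directly from the reported token counts; the price used below is a hypothetical placeholder to show how the ratio translates to cost, not a real rate.

```python
# Token ratio and illustrative cost from the reported counts at 80
# tasks. The per-token price is an assumed placeholder, not a quote.
single_agent_tokens = 3_900_000   # reported single-agent usage
multi_agent_tokens = 60_000       # reported multi-agent usage

ratio = single_agent_tokens / multi_agent_tokens
print(f"{ratio:.0f}x fewer tokens")   # 65x fewer tokens

price_per_million = 0.40  # assumed $/1M tokens (hypothetical)
cost_single = single_agent_tokens / 1e6 * price_per_million
cost_multi = multi_agent_tokens / 1e6 * price_per_million
```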

Token savings by model (single-agent tokens ÷ multi-agent tokens):
1. GPT-4.1 Mini: 65×
2. GPT-4.1 Nano: ~50×
3. Llama-2-70B: ~30×
4. Qwen-3-8B: ~20×

Bottom Line

Divide the work, keep the accuracy.

A lightweight orchestrator that routes tasks to dedicated workers sustains accuracy and controls cost as workload scales, while a single agent collapses under the same load. This is the first study to show that task-level delegation turns existing LLMs into reliable, token-efficient teams for complex clinical workflows.

Research Team
Eyal Klang · Mahmud Omar · Ganesh Raut · Reem Agbareia · Prem Timsina · Robert Freeman · Nicholas Gavin · Lisa Stump · Alexander W. Charney · Benjamin S. Glicksberg · Girish N. Nadkarni
Mount Sinai · Hadassah Medical Center · Hasso Plattner Institute · Icahn School of Medicine