One Agent Cannot Do It All
Large language models are already handling clinical tasks: retrieving literature, extracting data from notes, and computing drug doses. But real clinical environments do not serve tasks one at a time. A busy hospital system might need an LLM to handle 40 or 80 heterogeneous requests simultaneously. What happens to accuracy when the workload scales?
This study tested two architectures head to head. A single-agent design gives one model the entire batch, with access to all tools. An orchestrated multi-agent design uses a lightweight controller that reads each task, routes it to a specialist worker (a retriever, a calculator, or a note fetcher), and reassembles the results. Batch sizes ranged from 5 to 80 tasks across three clinical task types.
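The orchestrated topology can be sketched in a few lines. This is a minimal illustration, not the study's actual implementation: the keyword routing rules and worker stubs below are invented for clarity, and a real controller would itself be an LLM call.

```python
def route(task: str) -> str:
    """Lightweight controller: classify a task, never reason over its content.
    Keyword rules here stand in for the real (model-based) router."""
    if "dose" in task or "mg" in task:
        return "calculator"
    if "note" in task or "chart" in task:
        return "note_fetcher"
    return "retriever"

# Each specialist worker sees only its own task and a single tool.
# Stubbed here; in the study these are tool-equipped model calls.
WORKERS = {
    "retriever":    lambda t: f"[retriever] literature for: {t}",
    "calculator":   lambda t: f"[calculator] dose for: {t}",
    "note_fetcher": lambda t: f"[note_fetcher] note for: {t}",
}

def orchestrate(tasks):
    """Route each task to a specialist worker and reassemble the results."""
    return [WORKERS[route(t)](t) for t in tasks]
```

The single-agent baseline, by contrast, would concatenate every task into one prompt and hand the model all three tools at once.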
Accuracy Diverges at Scale
At small batch sizes, both architectures were near their best, though a gap was already visible: with just five tasks, the single agent scored 73.1% and the multi-agent system scored 90.6%. As batch size increased, the gap widened dramatically. By 80 concurrent tasks, single-agent accuracy had fallen to 16.6% while multi-agent accuracy held at 65.3%.
The pattern was consistent across all four models tested. GPT-4.1-mini showed the starkest contrast: its multi-agent accuracy stayed between 91.4% and 96% across all batch sizes, while the single agent dropped from 96% to 33.9%. Llama-2-70B and Qwen-3-8B showed the same directional trend, though several large single-agent batches for these open-weight models failed entirely, returning no valid JSON.
All differences between topologies beyond ten tasks were significant (Welch t-test, FDR-adjusted p < 0.01).
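The FDR adjustment named above is the Benjamini-Hochberg procedure, sketched here in plain Python; the per-comparison p-values would come from Welch's t-test (e.g. `scipy.stats.ttest_ind(a, b, equal_var=False)`), and the values in the usage example are placeholders, not the study's data.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg FDR-adjusted p-values, in input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank_from_end, i in enumerate(reversed(order)):
        rank = n - rank_from_end              # 1-based rank of pvals[i]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

A comparison at a given batch size is then called significant when its adjusted p-value falls below the 0.01 threshold the study reports.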
Why Orchestration Works
The key mechanism is context insulation. In the single-agent setup, every task shares one context window. As more unrelated prompts pile in, attention is diluted across irrelevant tokens, a phenomenon consistent with the documented "lost-in-the-middle" effect. The model struggles to find the signal in an increasingly noisy prompt.
The orchestrator solves this by constraining each worker to a single tool and a short prompt. No worker ever sees tokens that do not matter for its task. The orchestrator itself only routes; it never reasons over clinical content. This keeps every intermediate call within the range the model was tuned to handle.
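The scaling behavior behind context insulation can be made concrete with a toy model. The token counts below are assumptions for illustration, not measurements from the study: the point is only that the single agent's prompt grows linearly with batch size while each worker's prompt stays flat.

```python
TOKENS_PER_TASK = 150   # assumed average task length (illustrative)
TOOL_OVERHEAD = 50      # assumed per-worker system/tool prompt (illustrative)

def single_agent_prompt_tokens(batch_size: int) -> int:
    # Every task in the batch shares one context window,
    # so prompt size grows with the batch.
    return batch_size * TOKENS_PER_TASK

def worker_prompt_tokens() -> int:
    # A worker sees one task and one tool, regardless of batch size.
    return TOKENS_PER_TASK + TOOL_OVERHEAD
```

Under these assumptions, an 80-task batch puts 12,000 tokens of mostly irrelevant context in front of the single agent for any given task, while each worker still reads only 200.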
Token Efficiency
Multi-agent runs were dramatically more token-efficient. For GPT-4.1-mini at 80 tasks, the multi-agent batch used roughly 60,000 tokens. The single agent consumed approximately 3.9 million, a 65-fold difference. This gap matters for cost and for practical deployment at scale.
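The 65-fold figure follows directly from the approximate totals quoted above:

```python
# Reproducing the reported ratio from the study's approximate totals
# (GPT-4.1-mini, 80 concurrent tasks).
single_agent_tokens = 3_900_000   # ~3.9M tokens, single-agent batch
multi_agent_tokens = 60_000       # ~60k tokens, multi-agent batch
ratio = single_agent_tokens / multi_agent_tokens
assert ratio == 65.0
```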
Divide the work, keep the accuracy.
A lightweight orchestrator that routes tasks to dedicated workers sustains accuracy and controls cost as workload scales, while a single agent collapses under the same load. This is the first study to show that task-level delegation turns existing LLMs into reliable, token-efficient teams for complex clinical workflows.