One Agent Cannot Do It All
Large language models are already handling clinical tasks: retrieving literature, extracting data from notes, and computing drug doses. But real clinical environments do not serve tasks one at a time. A busy hospital system might need an LLM to handle 40 or 80 heterogeneous requests simultaneously. What happens to accuracy when the workload scales?
This study tested two architectures head to head. A single-agent design gives one model the entire batch, with access to all tools. An orchestrated multi-agent design uses a lightweight controller that reads each task, routes it to a specialist worker (a retriever, a calculator, or a note fetcher), and reassembles the results. Batch sizes ranged from 5 to 80 tasks across three clinical task types.
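The orchestrated topology can be sketched in a few lines. This is a minimal illustration, not the study's actual implementation: the keyword routing rules and worker stubs below are invented for clarity, and a real controller would itself be an LLM call.

```python
def route(task: str) -> str:
    """Lightweight controller: classify a task, never reason over its content.
    Keyword rules here stand in for the real (model-based) router."""
    if "dose" in task or "mg" in task:
        return "calculator"
    if "note" in task or "chart" in task:
        return "note_fetcher"
    return "retriever"

# Each specialist worker sees only its own task and a single tool.
# Stubbed here; in the study these are tool-equipped model calls.
WORKERS = {
    "retriever":    lambda t: f"[retriever] literature for: {t}",
    "calculator":   lambda t: f"[calculator] dose for: {t}",
    "note_fetcher": lambda t: f"[note_fetcher] note for: {t}",
}

def orchestrate(tasks):
    """Route each task to a specialist worker and reassemble the results."""
    return [WORKERS[route(t)](t) for t in tasks]
```

The single-agent baseline, by contrast, would concatenate every task into one prompt and hand the model all three tools at once.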
Accuracy Diverges at Scale
At small batch sizes, both architectures were near their best, though a gap was already visible: with just five tasks, the single agent scored 73.1% and the multi-agent system scored 90.6%. As batch size increased, the gap widened dramatically. By 80 concurrent tasks, single-agent accuracy had fallen to 16.6% while multi-agent accuracy held at 65.3%.
The pattern was consistent across all four models tested. GPT-4.1-mini showed the starkest contrast: its multi-agent accuracy stayed between 91.4% and 96% across all batch sizes, while the single agent dropped from 96% to 33.9%. Llama-2-70B and Qwen-3-8B showed the same directional trend, though several large single-agent batches for these open-weight models failed entirely, returning no valid JSON.
All differences between topologies beyond ten tasks were significant (Welch t-test, FDR-adjusted p < 0.01).
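The FDR adjustment named above is the Benjamini-Hochberg procedure, sketched here in plain Python; the per-comparison p-values would come from Welch's t-test (e.g. `scipy.stats.ttest_ind(a, b, equal_var=False)`), and the values in the usage example are placeholders, not the study's data.

```python
def benjamini_hochberg(pvals):
    """Return Benjamini-Hochberg FDR-adjusted p-values, in input order."""
    n = len(pvals)
    order = sorted(range(n), key=lambda i: pvals[i])  # indices, ascending p
    adjusted = [0.0] * n
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank_from_end, i in enumerate(reversed(order)):
        rank = n - rank_from_end              # 1-based rank of pvals[i]
        running_min = min(running_min, pvals[i] * n / rank)
        adjusted[i] = running_min
    return adjusted
```

A comparison at a given batch size is then called significant when its adjusted p-value falls below the 0.01 threshold the study reports.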
Why Orchestration Works
The key mechanism is context insulation. In the single-agent setup, every task shares one context window. As more unrelated prompts pile in, attention is diluted across irrelevant tokens, a phenomenon consistent with the documented "lost-in-the-middle" effect. The model struggles to find the signal in an increasingly noisy prompt.
The orchestrator solves this by constraining each worker to a single tool and a short prompt. No worker ever sees tokens that do not matter for its task. The orchestrator itself only routes; it never reasons over clinical content. This keeps every intermediate call within the range the model was tuned to handle.
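The scaling behavior behind context insulation can be made concrete with a toy model. The token counts below are assumptions for illustration, not measurements from the study: the point is only that the single agent's prompt grows linearly with batch size while each worker's prompt stays flat.

```python
TOKENS_PER_TASK = 150   # assumed average task length (illustrative)
TOOL_OVERHEAD = 50      # assumed per-worker system/tool prompt (illustrative)

def single_agent_prompt_tokens(batch_size: int) -> int:
    # Every task in the batch shares one context window,
    # so prompt size grows with the batch.
    return batch_size * TOKENS_PER_TASK

def worker_prompt_tokens() -> int:
    # A worker sees one task and one tool, regardless of batch size.
    return TOKENS_PER_TASK + TOOL_OVERHEAD
```

Under these assumptions, an 80-task batch puts 12,000 tokens of mostly irrelevant context in front of the single agent for any given task, while each worker still reads only 200.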
Token Efficiency
Multi-agent runs were dramatically more token-efficient. For GPT-4.1-mini at 80 tasks, the multi-agent batch used roughly 60,000 tokens. The single agent consumed approximately 3.9 million, a 65-fold difference. This gap matters for cost and for practical deployment at scale.
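The 65-fold figure follows directly from the approximate totals quoted above:

```python
# Reproducing the reported ratio from the study's approximate totals
# (GPT-4.1-mini, 80 concurrent tasks).
single_agent_tokens = 3_900_000   # ~3.9M tokens, single-agent batch
multi_agent_tokens = 60_000       # ~60k tokens, multi-agent batch
ratio = single_agent_tokens / multi_agent_tokens
assert ratio == 65.0
```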
Divide the work, keep the accuracy.
A lightweight orchestrator that routes tasks to dedicated workers sustains accuracy and controls cost as workload scales, while a single agent collapses under the same load. This is the first study to show that task-level delegation turns existing LLMs into reliable, token-efficient teams for complex clinical workflows.