Multi-agent: when one model is not enough

Multi-agent is in fashion. Every new paper adds another role: a planner, a critic, a reviewer, a retriever. The temptation to put five agents together for one task is very high. Almost always it is wrong.

Let me describe the four multi-agent patterns we use in production, when we have turned them on and when we have turned them off.

When you do NOT need multi-agent

First the rule of thumb: for linear tasks, with well-defined output and prompt sizes under 30-50k input tokens, a single model with a careful prompt beats every multi-agent setup, costs less, and is easier to debug. Most SME automation tasks live there.

Multi-agent becomes useful when: the output has natural stages that require "controlled loss of state", or you explicitly need an outside perspective (critique), or the tools and context windows do not fit inside a single model.

Pattern 1: Pipeline (cascade)

The simplest. Agent A produces output, agent B reads A's output and transforms it, agent C reads B's output. Each agent has a narrow, well-specified task.

When we use it. Drafting complex documents: the RESEARCH agent first finds the relevant case law, the DRAFTING agent then writes the draft using only what A found, the REVIEW agent finally fact-checks and flags inconsistencies.

Why it works. Each agent sees only the context it needs. No "reason over 30 thousand tokens at every step". Prompts stay sharp.

Tradeoff. Latency adds up. If A takes 8 seconds, B 10, C 12, you are at 30 seconds response time. Fine for batch tasks, sometimes not what you want for live UX.

Pitfall. "Amplified error". If A produces something slightly wrong, B does not notice and builds on top of it. Mitigation: teach B to push back on A when the input smells off. It is a patch, not a cure.

Pattern 2: Supervisor + Workers

An "orchestrator" agent decides what to do, and at runtime spins up specialised agents for specific pieces of the task. It is the tool-use pattern elevated to a system.

When we use it. Open-ended tasks, where you do not know the steps in advance. For example: "produce a Q1 performance report for this client". The orchestrator decides to call first a SQL agent to extract the data, then a VISUAL agent to draw charts, then a NARRATIVE agent to write the prose, then a FACT-CHECK agent to validate the numbers.

Why it works. Adaptable. The same orchestrator handles hundreds of task variants without explicit pre-programming.

Tradeoff. Complexity. You have to put in very good logging, otherwise debugging is impossible. And the orchestrator model has to be robust: we typically use a Claude Opus or GPT-4o, never small models, even though it costs more.

Pitfall. "Over-orchestration". The orchestrator decides to call 8 agents when 2 would have been enough. You burn budget. Mitigation: hard budget in the system prompt ("maximum 5 sub-agent calls, beyond that you must stop").

Pattern 3: Debate (two agents pushing back on each other)

Two agents (or two instances of the same model with different prompts) each produce their solution to the problem. A third "judge" agent picks the better one, or a fourth round asks the two to revise each other.

When we use it. Complex decisions where no ground truth exists and you need robustness. For example: reviewing an ambiguous contract clause — the PRO agent defends it, the CON agent attacks it, the JUDGE reads both and produces the final note to the partner.

Why it works. A pattern known from Anthropic, OpenAI, DeepMind work: debate reduces hallucinations because anyone who lies is more likely to be unmasked by another model looking for holes.

Tradeoff. Cost. You are at 3x. And high latency.

Pitfall. "Echo chamber". If the two "debaters" are both LLMs with the same policy, they tend to agree. Diversify: different temperatures, different prompts assigning roles, ideally different providers (Claude vs GPT).

Pattern 4: Reflection (agent reviewing itself)

A single agent, two steps: production → critique → revision. Same model, growing context. It is a light version of debate where the "judge" is the same one that produced.

When we use it. Outputs that need polish: summaries of long documents, delicate translations, advertising copywriting.

Why it works. Almost free (one extra round, same model). Improves polish by 15-25% on average in our tests.

Tradeoff. Diminishing returns after the first round of critique. Three, four rounds do not help more than one or two.

Pitfall. "Over-refinement". On simple tasks the critique finds non-existent problems and the model fixes them, making the output worse. Mitigation: teach the model that "if you find no substantive problems, leave the first draft as is".

The metric we use to decide

To decide whether a task should go with a single prompt, pipeline, supervisor, debate or reflection:

◆Single prompt when: predictable output, context under 30k tokens, latency matters.
◆Pipeline when: output has clear stages, latency is not critical, you want a cap on cost.
◆Supervisor when: open task, path not pre-determined, ok to spend.
◆Debate when: high-value decision, no ground truth, ok to pay 3x.
◆Reflection when: you want more polish with a small cost increase.

Almost never combine more than one pattern. Multi-agent + reflection + debate = burnt budget and impossible debugging.

What we have dismantled

Three multi-agent setups that seemed clever and we later uninstalled:

◆"Orchestrator + 4 specialists" for a task that turned out to be answerable with a single, carefully written 3,000-token prompt. We had added complexity out of fashion, not necessity.
◆"Debate" on tier-1 customer service answers. It was 3x more expensive for zero measurable improvement. Reverted to a single prompt.
◆Endless "Reflection" on code generation. The more rounds, the more defensive and over-engineered the code became. Capped at a single round.

The rule

Add an agent to the system only when an A/B test against the previous pattern proves it is worth the cost and the latency. Never for aesthetic reasons. Nothing is more expensive and fragile than a multi-agent architecture built because "it feels right".