5.3 Error Propagation in Multi-Agent Systems

5.3.1 When a Worker Fails, What Does the Boss Hear?

Back in Domain 1 you built multi-agent systems with a coordinator and subagents, and in Domain 2.2 you designed structured error responses for tools. Task Statement 5.3 brings those together: when a SUBAGENT fails, how should that failure travel back to the COORDINATOR so the coordinator can recover intelligently? The way an error propagates determines whether the whole system degrades gracefully or fails catastrophically.

Picture a research coordinator that has delegated a web search to a subagent, and the search times out. What the coordinator hears next decides everything. If the subagent reports back 'search failed: timeout while querying X; here's the partial data I did get; you could try source Y instead,' the coordinator can make a smart call: retry differently, use the partial data, or note a gap. If instead the subagent just dies, or worse, quietly returns 'success: no results,' the coordinator is flying blind. It's the difference between an employee who says 'the supplier's site was down, but I found two of the three quotes; want me to call them?' and one who either vanishes or falsely reports 'all done.'

This lesson is about propagating errors as STRUCTURED CONTEXT, and about two anti-patterns that destroy a multi-agent system's reliability: silently swallowing errors, and killing the whole workflow on one failure. Let's build up what good error propagation looks like.

A generic or silently-swallowed failure leaves the coordinator unable to recover; structured error context (type, attempt, partial results, alternatives) lets it make an intelligent recovery decision.

ℹ️

The one idea to hold onto

When a subagent fails, propagate STRUCTURED error context — failure type, what was attempted, partial results, and possible alternatives — so the coordinator can recover intelligently, rather than a generic status or a silent empty 'success.'

5.3.2 Structured Error Context: Four Elements

A good error report from a subagent to the coordinator carries four pieces of information. Together they give the coordinator everything it needs to decide what to do next.

1. FAILURE TYPE: what kind of failure (timeout, rate limit, permission, no-results), so the coordinator knows whether it's even worth retrying.
2. WHAT WAS ATTEMPTED: the query, parameters, and target, so the coordinator can decide whether to retry it differently or try another approach.
3. PARTIAL RESULTS: anything the subagent managed to gather BEFORE failing. Don't discard this; partial data may let the coordinator proceed without redoing work.
4. POTENTIAL ALTERNATIVES: approaches the subagent suggests, like a different source or a narrower query, giving the coordinator a head start on recovery.

Why all four? Because each removes a different blind spot. Without the failure type, the coordinator can't tell a retryable hiccup from a dead end. Without what-was-attempted, it might blindly repeat the same failing call. Without partial results, it throws away completed work. Without alternatives, it has to reinvent the recovery from scratch. A generic 'search unavailable' message strips away all four — it tells the coordinator nothing actionable, which is exactly why generic errors are an anti-pattern.
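As a minimal sketch, the four elements could be carried in one small record. The class name `SubagentError`, its field names, and the set of retryable types are all assumptions made for illustration; the lesson does not prescribe a schema.

```python
from dataclasses import dataclass, field
from typing import Any

# Assumption for this sketch: these failure types are worth retrying.
RETRYABLE_TYPES = frozenset({"timeout", "rate_limit"})


@dataclass
class SubagentError:
    """Hypothetical error report a subagent returns to the coordinator."""
    failure_type: str                                         # 1. what kind of failure
    attempted: dict[str, Any]                                 # 2. query, parameters, target
    partial_results: list[Any] = field(default_factory=list)  # 3. gathered before failing
    alternatives: list[str] = field(default_factory=list)     # 4. suggested next steps


def is_retryable(error: SubagentError) -> bool:
    """The failure type tells the coordinator whether a retry could even work."""
    return error.failure_type in RETRYABLE_TYPES


report = SubagentError(
    failure_type="timeout",
    attempted={"query": "supplier quotes", "target": "source X"},
    partial_results=["quote 1", "quote 2"],
    alternatives=["try source Y", "narrow the query"],
)
print(is_retryable(report), len(report.partial_results), report.alternatives[0])
```

A bare string such as 'search unavailable' has no field the coordinator could inspect, which is the practical meaning of "strips away all four."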

Element	Lets the coordinator…
Failure type	Decide if a retry could even work
What was attempted	Avoid blindly repeating the same failing call
Partial results	Proceed without redoing completed work
Potential alternatives	Start recovery with a concrete next step

The four elements of structured error context. A generic 'search unavailable' provides none of them, leaving the coordinator unable to make an informed decision.

⭐

5.3.2 — Key Concept

Structured error context has four elements: failure type, what was attempted (query/params/target), partial results gathered before failure, and potential alternative approaches. Generic statuses ('search unavailable') hide all four and prevent informed recovery.

5.3.3 The Two Anti-Patterns

Two ways of handling subagent failures are specifically wrong, and the exam tests both. The FIRST is SILENT SUPPRESSION: a subagent catches an error and returns something like {results: [], status: 'success'}. This is the most dangerous failure mode because it's INVISIBLE — the coordinator believes everything worked and produces a confidently incomplete result, with no signal that anything went wrong. It's the same family as the narrow-decomposition bug from Lesson 1.2, but caused by a hidden error instead of a bad split. Returning empty-as-success doesn't avoid the failure; it just hides it until it surfaces as a silently wrong final answer.
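To see why suppression is invisible, consider this small sketch. The function names and both fake backends are hypothetical; the point is only that the suppressed failure and a genuinely empty search produce identical output.

```python
def search_silently(query, backend):
    """Anti-pattern: catch the error and report empty-as-success."""
    try:
        return {"status": "success", "results": backend(query)}
    except TimeoutError:
        return {"status": "success", "results": []}


def search_honestly(query, backend):
    """Report the failure so the coordinator can tell it apart from 'nothing found'."""
    try:
        return {"status": "success", "results": backend(query)}
    except TimeoutError:
        return {
            "status": "error",
            "failure_type": "timeout",
            "attempted": {"query": query},
            "partial_results": [],
        }


def empty_backend(query):
    return []  # the source answered: there really is nothing


def failing_backend(query):
    raise TimeoutError("no response from source")


# The coordinator cannot distinguish these two outcomes.
same = search_silently("q", empty_backend) == search_silently("q", failing_backend)
# With honest reporting, the two outcomes differ.
differ = search_honestly("q", empty_backend) != search_honestly("q", failing_backend)
print(same, differ)
```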

The SECOND anti-pattern is WORKFLOW TERMINATION: one subagent fails, so the entire research pipeline aborts. This throws away all the work the other subagents successfully completed, and it's an overreaction — a single recoverable failure shouldn't kill a job that's 90% done. The coordinator should be given the chance to proceed with partial results, retry just the failed part, or note the gap — none of which is possible if the whole workflow has already terminated.
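The contrast can be sketched with subagents modeled as plain callables that raise on failure. That modeling, and the names `run_all` and `run_until_first_failure`, are assumptions for illustration rather than a real orchestration API.

```python
def run_until_first_failure(tasks):
    """Anti-pattern: the first exception aborts the whole pipeline."""
    return {name: task() for name, task in tasks.items()}


def run_all(tasks):
    """Run every subagent task; record failures instead of aborting."""
    completed, failed = {}, {}
    for name, task in tasks.items():
        try:
            completed[name] = task()
        except Exception as exc:
            failed[name] = {"failure_type": type(exc).__name__, "detail": str(exc)}
    return completed, failed


def pricing():
    return ["price A", "price B"]


def reviews():
    raise TimeoutError("review source timed out")


def specs():
    return ["spec sheet"]


tasks = {"pricing": pricing, "reviews": reviews, "specs": specs}
completed, failed = run_all(tasks)
print(sorted(completed), sorted(failed))
```

With `run_all`, the pricing and specs work survives and the coordinator still learns that reviews failed. With `run_until_first_failure`, the exception from `reviews` propagates and the pricing results already computed are lost.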

Both anti-patterns share a root error: they take the decision away from the COORDINATOR. Silent suppression hides the failure so the coordinator can't decide; termination decides 'abort everything' unilaterally. Good propagation does the opposite — it hands the coordinator a complete, structured picture and lets IT decide, because the coordinator is the only agent with the whole view (the hub-and-spoke principle from 1.2).
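One way to picture "letting the coordinator decide" is a small policy function over the structured report. The ordering of the checks, the action names, and the `retry_budget` parameter below are one hypothetical policy, not the required one.

```python
# Assumption: failure types for which repeating the same call may succeed.
TRANSIENT = frozenset({"timeout", "rate_limit"})


def decide(report: dict, retry_budget: int) -> tuple[str, str]:
    """Pick a recovery action from a structured failure report."""
    query = report["attempted"]["query"]
    if report["failure_type"] in TRANSIENT and retry_budget > 0:
        return ("retry_same", query)
    if report["alternatives"]:
        return ("try_alternative", report["alternatives"][0])
    if report["partial_results"]:
        return ("proceed_with_partial", f"{len(report['partial_results'])} partial results")
    return ("note_gap", query)


report = {
    "failure_type": "permission",
    "attempted": {"query": "supplier quotes"},
    "partial_results": ["quote 1", "quote 2"],
    "alternatives": ["try source Y"],
}
print(decide(report, retry_budget=1))
```

Each branch reads a different element of the report, so removing any element removes a recovery option.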

⚠️

5.3.3 — Key Concept

Two anti-patterns: silent suppression (returning empty results as 'success' — hides the gap, yields confidently incomplete output) and workflow termination (one failure aborts the whole pipeline — wastes completed work). Both wrongly take the recovery decision away from the coordinator.

5.3.4 Local Recovery and Coverage Annotations

Two practices complete the picture. LOCAL RECOVERY: a subagent should handle TRANSIENT failures itself before propagating anything. A momentary timeout or rate-limit is worth a local retry-with-backoff — there's no need to bother the coordinator with a hiccup the subagent can fix on its own. Only NON-recoverable errors (or ones that persist after local retries) get propagated up, and when they do, they carry the structured context plus whatever partial results were gathered. This keeps the coordinator focused on decisions only it can make. (This is the access-failure-vs-valid-empty distinction from 2.2, applied across agents: retry the access failure locally; a valid empty result isn't a failure at all.)
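A rough sketch of local recovery follows. The function names, the choice of `TimeoutError` as the transient case and `PermissionError` as the non-recoverable case, and the delay values are all assumptions; `sleep` is injectable so the example can run without waiting.

```python
import time


def _error(failure_type, query, attempts):
    """Build the structured report that gets propagated to the coordinator."""
    return {
        "status": "error",
        "failure_type": failure_type,
        "attempted": {"query": query, "attempts": attempts},
        "partial_results": [],
        "alternatives": [],
    }


def search_with_local_recovery(query, backend, max_attempts=3,
                               base_delay=0.5, sleep=time.sleep):
    """Retry transient timeouts locally; propagate only what persists."""
    for attempt in range(1, max_attempts + 1):
        try:
            # A valid empty list is still a success, not a failure.
            return {"status": "success", "results": backend(query)}
        except TimeoutError:
            if attempt < max_attempts:
                sleep(base_delay * 2 ** (attempt - 1))  # exponential backoff
        except PermissionError:
            # Non-recoverable: repeating the same call cannot help.
            return _error("permission", query, attempt)
    return _error("timeout", query, max_attempts)


def make_flaky(failures):
    """Hypothetical backend that times out `failures` times, then succeeds."""
    calls = {"n": 0}

    def backend(query):
        calls["n"] += 1
        if calls["n"] <= failures:
            raise TimeoutError("transient timeout")
        return [f"result for {query}"]

    return backend


delays = []
recovered = search_with_local_recovery("q", make_flaky(2), sleep=delays.append)
exhausted = search_with_local_recovery("q", make_flaky(5), sleep=lambda s: None)
print(recovered["status"], delays, exhausted["failure_type"])
```

In the first call the coordinator never hears about the two timeouts. In the second, the timeout persists through every local attempt, so it is propagated with the attempt count attached.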

COVERAGE ANNOTATIONS: when the synthesis stage produces the final output, it should annotate which findings are WELL-SUPPORTED versus which topic areas have GAPS because a source was unavailable. This makes the limitations of the result honest and visible — the reader knows 'we covered A and B thoroughly but couldn't reach data on C' rather than receiving a confident report that silently omits C. Coverage annotations are the antidote to silent suppression at the output level: they surface gaps instead of hiding them.
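A coverage annotation could be as simple as the sketch below. The function names and output shape are hypothetical, and treating any topic without a recorded failure as "well-supported" is a simplifying assumption; a real synthesis stage would likely apply a stricter test.

```python
def annotate_coverage(findings, failures):
    """Split topics into well-supported ones and ones with gaps."""
    report = {"well_supported": [], "gaps": []}
    for topic in sorted(findings):
        if topic not in failures:
            report["well_supported"].append(topic)
    for topic in sorted(failures):
        report["gaps"].append({
            "topic": topic,
            "reason": failures[topic]["failure_type"],
            "partial_count": len(findings.get(topic, [])),
        })
    return report


def render(report):
    """Turn the annotation into lines a reader of the final output can see."""
    lines = ["Well-supported: " + ", ".join(report["well_supported"])]
    for gap in report["gaps"]:
        lines.append(
            f"Gap: {gap['topic']} ({gap['reason']}; "
            f"{gap['partial_count']} partial findings)"
        )
    return "\n".join(lines)


findings = {"A": ["a1", "a2"], "B": ["b1"], "C": ["c1"]}
failures = {"C": {"failure_type": "timeout"}}
coverage = annotate_coverage(findings, failures)
print(render(coverage))
```

The gap for topic C appears in the output alongside the findings, so the reader sees the limitation instead of a report that silently omits it.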

ℹ️

5.3.4 — Key Concept

Subagents recover from TRANSIENT failures locally (retry with backoff) and propagate only non-recoverable errors up with structured context + partial results. Synthesis output should carry COVERAGE ANNOTATIONS — which findings are well-supported vs which topics have gaps from unavailable sources.

5.3.5 The Exam Traps

The 5.3 traps test whether you choose structured error context over a generic status, silent suppression, or workflow termination. They have the same shape as the 2.2 error-handling traps, now at the multi-agent level.

• Silent suppression. ✗ Returning empty results marked 'success' on a failure. ✓ Propagate structured error context with partial results.
• Generic status. ✗ Returning 'search unavailable' after exhausting retries. ✓ Include type, attempt, partial results, alternatives.
• Killing the workflow. ✗ Aborting the whole pipeline on one subagent failure. ✓ Let the coordinator proceed with partial results / retry the failed part.
• No local recovery. ✗ Propagating every transient hiccup to the coordinator. ✓ Retry transient failures locally; propagate only non-recoverable ones.

⚠️

5.3.5 — Exam Trap

For a subagent failure (e.g. a search timeout): ✓ return structured error context (type, attempt, partial results, alternatives) so the coordinator recovers. ✗ silently return empty-as-success, ✗ a generic 'unavailable' status, ✗ terminate the whole workflow. Recover transient errors locally; propagate the rest with context.

5.3.6 Put It Together: Propagate Errors Well

You now know structured error context's four elements, the two anti-patterns, local recovery, and coverage annotations. The exercise has you build error propagation and prove that the anti-patterns hide failures or waste completed work, while structured context enables recovery.

✨

5.3.6 — Build Exercise (45 min)

(1) In a coordinator + search/analysis pipeline, simulate a subagent timeout and have it return structured error context (failure type, attempted query, partial results, alternatives); confirm the coordinator can proceed with partial results and annotate a coverage gap. (2) Replace that with silent suppression (empty-as-success) and observe the coordinator produce a confidently incomplete report with no signal. (3) Replace it with workflow termination and observe completed work thrown away. (4) Add local recovery: have the subagent retry a transient failure itself and only propagate when it can't resolve it.

Error propagation keeps multi-agent systems reliable under failure. The next lesson, 5.4, tackles a different reliability threat — how an agent's quality DEGRADES during long codebase exploration, even when nothing 'fails.'

ℹ️

Where this shows up on the exam

5.3 questions describe a subagent failure and ask how it should propagate. The answer is structured error context (type/attempt/partial/alternatives) so the coordinator decides — never silent empty-as-success, a generic status, or terminating the whole workflow.

Key Takeaways

✓ When a subagent fails, propagate STRUCTURED error context so the coordinator, which has the whole view, can recover intelligently, rather than a generic status or silent 'success.'
✓ Structured error context has four elements: failure type, what was attempted (query/params/target), partial results gathered before failure, and potential alternative approaches.
✓ A generic 'search unavailable' status hides all four elements and prevents informed recovery; generic errors are an anti-pattern.
✓ Anti-pattern 1, silent suppression: returning empty results as 'success' hides the gap and yields confidently incomplete output (the invisible failure).
✓ Anti-pattern 2, workflow termination: aborting the whole pipeline on one failure wastes the work other subagents completed; let the coordinator proceed/retry instead.
✓ Subagents recover from TRANSIENT failures locally (retry with backoff) and propagate only non-recoverable errors up with partial results (the 2.2 access-vs-empty distinction, across agents).
✓ Synthesis output should carry COVERAGE ANNOTATIONS (which findings are well-supported vs which topics have gaps from unavailable sources), surfacing limitations honestly.

Check Your Understanding

Test what you learned in this lesson.

Q1. A web-search subagent times out while researching. How should this failure be reported to the coordinator to best enable recovery?

Q2. Why is returning {results: [], status: 'success'} on a subagent failure especially dangerous?

Q3. A subagent hits a transient timeout it could likely recover from with a quick retry. What's the best behavior?

Q4. How should a synthesis stage handle topics where a source was unavailable?
