Domain 2: Tool Design & MCP Integration (18%) · Lesson 9 of 30

2.2 Structured Error Responses

2.2.1 Why "Operation Failed" Is Useless to an Agent

Tools fail — networks time out, inputs are malformed, policies forbid an action. The question Task Statement 2.2 asks is: when a tool fails, what should it tell the agent? And the answer matters more than you'd think, because the agent has to DECIDE what to do next based purely on what the error says.

Imagine you ask an assistant to book a flight and they come back with just: "It didn't work." What now? Should they try again? Try a different airline? Give up and ask you? They — and you — have no idea, because 'it didn't work' contains no information to act on. Now imagine instead: "The airline's site timed out; this usually clears up in a minute — shall I retry?" That tells you exactly what happened and what to do. The difference between those two messages is the difference between an agent that recovers gracefully and one that flails.

That's the core insight: a generic error like "Operation failed" gives the model nothing to reason with, so it can't make a good recovery decision. A STRUCTURED error — one that says what kind of failure this is and whether retrying makes sense — lets the agent respond intelligently. This lesson is about designing those structured errors.

A generic error gives the agent nothing to act on. A structured error — category, retryable flag, readable description — lets it choose the right recovery.

ℹ️

The one idea to hold onto

A generic 'Operation failed' is useless to an agent because it can't decide what to do next. Structured error metadata — what kind of failure, and whether to retry — is what lets the agent recover intelligently.
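To make the contrast concrete, here is a minimal Python sketch of the two shapes a failing tool might return. The field names errorCategory, isRetryable, and description follow this lesson; the ToolError class and its to_payload method are hypothetical names chosen for illustration, not part of any SDK.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorCategory(str, Enum):
    """The four failure categories this lesson goes on to distinguish."""
    TRANSIENT = "transient"
    VALIDATION = "validation"
    BUSINESS = "business"
    PERMISSION = "permission"


@dataclass(frozen=True)
class ToolError:
    """Hypothetical container for structured error metadata."""
    errorCategory: ErrorCategory
    isRetryable: bool
    description: str

    def to_payload(self) -> dict:
        # Plain dict so it can be serialised and handed back to the agent.
        return {
            "errorCategory": self.errorCategory.value,
            "isRetryable": self.isRetryable,
            "description": self.description,
        }


# Generic: nothing for the agent to reason with.
generic = {"error": "Operation failed"}

# Structured: says what kind of failure this is and whether a retry makes sense.
structured = ToolError(
    errorCategory=ErrorCategory.TRANSIENT,
    isRetryable=True,
    description="The airline's site timed out; this usually clears up in a minute.",
).to_payload()
```

Given only the generic payload, the agent has to guess. Given the structured one, it can read isRetryable and act on it.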

2.2.2 The Four Kinds of Failure

Not all failures are the same, and the right response depends on the KIND. There are four categories worth distinguishing, and the key thing each tells you is whether retrying could possibly help.

Category	Example	Retry?	Right response
Transient	Timeout, rate limit, service briefly down	Yes	Retry with backoff
Validation	Malformed or invalid input	Only after fixing the input	Fix the input, then retry
Business	Policy violation (e.g. refund not allowed)	No	Explain to the user / escalate
Permission	Access denied	No	Escalate; retrying won't grant access

Four failure categories. Transient and validation errors are worth retrying (validation only after correcting the input); business and permission errors are not — retrying just wastes attempts.

Why does this matter so much? Because an agent that can't tell these apart wastes effort and frustrates users. If it retries a business error ('refund denied by policy') ten times, it never succeeds and looks broken. If it gives up on a transient timeout that a single retry would have fixed, it fails unnecessarily. Encoding the category — and an explicit isRetryable flag — lets the agent skip the pointless retries and pursue the ones that can work.

⭐

2.2.2 — Key Concept

Distinguish four failure categories — transient (retry), validation (retry after fixing input), business (don't retry; escalate), and permission (don't retry; escalate). Returning the category plus an isRetryable flag prevents the agent from wasting retries on errors that can never succeed.
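One way the agent side of this might look, as a rough sketch: a lookup from category to recovery action, plus a retry loop that only keeps going for transient errors. The flat result shape (isError, errorCategory, isRetryable) and the names RECOVERY, choose_recovery, and call_with_backoff are assumptions made for this example.

```python
import time

# Assumed mapping, taken from the four-category table in this lesson.
RECOVERY = {
    "transient": "retry_with_backoff",
    "validation": "fix_input_then_retry",
    "business": "explain_or_escalate",
    "permission": "escalate",
}
RETRY_ACTIONS = {"retry_with_backoff", "fix_input_then_retry"}


def choose_recovery(error: dict) -> str:
    """Pick a recovery action from the error's category and isRetryable flag."""
    action = RECOVERY.get(error.get("errorCategory"), "escalate")
    # The explicit flag wins: never retry an error marked non-retryable.
    if action in RETRY_ACTIONS and not error.get("isRetryable", False):
        return "escalate"
    return action


def call_with_backoff(tool, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call tool() and retry only while it keeps failing transiently."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(max_attempts):
        result = tool()
        if not result.get("isError"):
            return result  # success: nothing to recover from
        if choose_recovery(result) != "retry_with_backoff":
            return result  # retrying as-is cannot help; hand the error back
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)  # exponential backoff
    return result  # still failing after the last attempt
```

With this policy a 'refund denied by policy' error is returned after a single call, while a timeout gets up to max_attempts tries with growing delays.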

2.2.3 The Distinction the Exam Tests Most: Failure vs Empty

Now for the single most-tested idea in 2.2, and it's a subtle one. There's a world of difference between 'I couldn't run the search' and 'I ran the search and found nothing', but a careless tool reports both the same way, and that causes the agent to behave badly.

Picture a search tool. Case one: the search service was unreachable — an ACCESS FAILURE. The agent genuinely doesn't know the answer; retrying might help. Case two: the search ran perfectly and returned zero matches — a VALID EMPTY RESULT. 'No matches' IS the answer; retrying is pointless and will just return zero again. In MCP terms, the access failure sets isError: true (something went wrong), while the valid empty result sets isError: false with a result count of zero (nothing went wrong — there simply are no matches).

If a tool collapses these two into one ambiguous response, the agent can't tell them apart, and it does the wrong thing — most commonly, it RETRIES a query that genuinely has no results, over and over, as if the empty result were a failure. Designing the tool to clearly signal which case it is — failure versus legitimately empty — is what keeps the agent from chasing answers that aren't there.

The most-tested 2.2 distinction. An access failure (isError:true) may warrant a retry; a valid empty result (isError:false, count 0) is the answer itself — retrying it is pointless.

⭐

2.2.3 — Key Concept

Distinguish an access failure (couldn't execute — isError:true, may retry) from a valid empty result (executed successfully, zero matches — isError:false, the answer is 'none', do NOT retry). Collapsing them makes the agent retry queries that genuinely have no results.
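The two cases can be sketched as a small Python search tool. Only the isError flag and the zero count come from this lesson. The rest is assumed for illustration: the result is a plain dict rather than a real MCP SDK object, backend is a hypothetical callable that returns a list of matches or raises when the service is unreachable, and should_retry is a made-up helper.

```python
def search_tool(query, backend):
    """Run a search and report failure and emptiness as different things."""
    try:
        matches = backend(query)
    except (ConnectionError, TimeoutError) as exc:
        # Access failure: the search never ran, so the answer is unknown.
        return {
            "isError": True,
            "errorCategory": "transient",
            "isRetryable": True,
            "description": f"Search backend unreachable: {exc}",
        }
    # The search ran. Zero matches is a valid answer, not an error.
    return {"isError": False, "count": len(matches), "matches": matches}


def should_retry(result):
    """Retry only real failures that are marked retryable."""
    return bool(result.get("isError")) and bool(result.get("isRetryable"))
```

Because the empty case carries isError: false, a retry check like should_retry never fires on it, and the agent can report 'no matches' as the answer.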

2.2.4 Errors in a Multi-Agent System

Structured errors matter even more once you have the multi-agent systems from Domain 1, because now errors have to travel between agents without losing meaning. The pattern combines two ideas you've already met: local recovery and structured propagation.

First, LOCAL recovery: a subagent should handle transient failures itself, retrying that timeout locally before bothering the coordinator. There's no reason to escalate a hiccup the subagent can fix on its own.

Second, structured PROPAGATION: when a subagent hits a failure it genuinely can't resolve, it shouldn't just die or return an empty result pretending success. It should propagate UP to the coordinator a structured report (the failure category, what it was attempting, and any partial results it managed to gather before failing) so the coordinator can decide intelligently: retry with a different approach, proceed with partial data, or note the gap. (This connects directly to Domain 5.3, error propagation.)

The anti-pattern here is the silent failure: a subagent catches an error and returns {results: [], status: 'success'}, hiding the problem. The coordinator then thinks everything is fine and produces a confidently incomplete result — the same family of bug as the narrow-decomposition failure from Lesson 1.2, but caused by swallowed errors instead.

ℹ️

2.2.4 — Key Concept

In multi-agent systems, subagents recover from transient failures LOCALLY, and propagate only non-recoverable errors UP — with the category, what was attempted, and partial results. Never silently swallow an error by returning empty-as-success; that hides gaps from the coordinator.
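Here is a rough Python sketch of local recovery plus structured propagation. Everything named in it is hypothetical: run_subagent, the fetch callable (which returns text or raises TimeoutError or PermissionError), and the report fields partialResults, failures, and attempted.

```python
def run_subagent(sources, fetch, max_attempts=2):
    """Research each source; retry timeouts locally, report what stays unresolved."""
    partial, failures = [], []
    for source in sources:
        for attempt in range(1, max_attempts + 1):
            try:
                partial.append({"source": source, "text": fetch(source)})
                break  # this source is done
            except TimeoutError:
                # Transient: retry locally; report only if retries run out.
                if attempt == max_attempts:
                    failures.append({
                        "errorCategory": "transient",
                        "isRetryable": True,
                        "attempted": f"fetch {source}",
                        "description": f"Timed out after {max_attempts} attempts",
                    })
            except PermissionError as exc:
                # Not recoverable here: record it and stop retrying this source.
                failures.append({
                    "errorCategory": "permission",
                    "isRetryable": False,
                    "attempted": f"fetch {source}",
                    "description": str(exc),
                })
                break
    # Honest report: partial results travel up together with the failures.
    return {"isError": bool(failures), "partialResults": partial, "failures": failures}
```

The silent-failure version would return an empty result marked 'success' in the failing case. Here the coordinator sees both what was gathered and exactly which gap remains.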

2.2.5 The Exam Traps

The 2.2 traps cluster around two confusions: treating all errors the same, and confusing a failure with a legitimately empty result. Keep the four categories and the failure-vs-empty distinction sharp and they fall away.

• Retrying an empty result. ✗ Treating 'zero matches' as a failure and retrying. ✓ A valid empty result (isError:false, count 0) IS the answer, so don't retry.
• Generic error messages. ✗ Returning 'Operation failed' with no structure. ✓ Return category + isRetryable + a readable description so the agent can decide.
• Retrying business errors. ✗ Retrying a policy violation as if it were transient. ✓ Business and permission errors are not retryable; explain or escalate.
• Silent suppression. ✗ A subagent returning empty results marked 'success' on a timeout. ✓ Propagate a structured error with partial results so the coordinator sees the gap.

Symptom	Wrong behavior	Right behavior
Search returns no matches	Retry repeatedly	Accept 'none' as the answer
Refund denied by policy	Retry the refund	Explain to user / escalate (not retryable)
Subagent times out	Return empty as success	Propagate structured error + partial results

Most 2.2 mistakes are either retrying something that can't succeed or hiding a failure. The cure is structured errors and the failure-vs-empty distinction.

⚠️

2.2.5 — Exam Trap

✗ Retrying valid empty results. ✗ Generic 'Operation failed' messages. ✗ Treating business/permission errors as retryable. ✗ Silently returning empty-as-success in a subagent. ✓ Structured metadata (category, isRetryable, description), the failure-vs-empty distinction, local recovery, and honest propagation with partial results.

2.2.6 Put It Together: Design Recoverable Errors

You now know why generic errors fail an agent, the four failure categories, the all-important failure-vs-empty distinction, and how errors should flow through a multi-agent system. The exercise has you build tools that fail INFORMATIVELY and watch the agent's behavior change as a result.

✨

2.2.6 — Build Exercise (30 min)

(1) Give a tool structured error responses: errorCategory (transient/validation/business/permission), an isRetryable boolean, and a human-readable description.
(2) Test that the agent retries transient errors but explains business errors to the user instead of retrying.
(3) Make the tool clearly distinguish an access failure (isError:true) from a valid empty result (isError:false, count 0), and confirm the agent stops retrying 'no matches'.
(4) In a two-agent setup, have the subagent recover from a transient failure locally, then propagate a non-recoverable one upward with partial results, and prove that silently returning empty-as-success hides the gap from the coordinator.

Good descriptions get the right tool called (2.1); structured errors let the agent recover when a tool fails (2.2). The next lesson, 2.3, steps back to the fleet level: how many tools an agent should have, and how to control which tool it's allowed — or forced — to call.

ℹ️

Where this shows up on the exam

2.2 questions describe an agent mishandling a failure. If you can instantly classify the error (transient/validation/business/permission) and tell an access failure from a valid empty result, you'll pick the right recovery every time.

Key Takeaways

✓ A generic 'Operation failed' is useless to an agent because it can't decide what to do next; structured error metadata lets it recover intelligently.
✓ Four failure categories: transient (retry with backoff), validation (retry after fixing input), business (don't retry; explain or escalate), permission (don't retry; escalate). Return the category plus an isRetryable flag.
✓ The most-tested distinction: an access failure (couldn't execute: isError:true, may retry) vs a valid empty result (ran fine, zero matches: isError:false, the answer is 'none', do NOT retry).
✓ Collapsing failure and empty-result into one response makes the agent retry queries that genuinely have no matches, so design tools to signal which case applies.
✓ In multi-agent systems, subagents recover from transient failures locally and propagate only non-recoverable errors up, with the category, what was attempted, and partial results.
✓ The silent-failure anti-pattern (returning empty results marked 'success' on a real error) hides gaps from the coordinator and yields confidently incomplete output.
✓ Business and permission errors are never retryable; retrying them just wastes attempts and looks broken to users.

Check Your Understanding

Test what you learned in this lesson.

Q1. A web-search subagent times out while researching. How should this failure be reported to the coordinator for best recovery?

Q2. A search tool returns zero matches because the query genuinely has no results. How should the tool report this, and what should the agent do?

Q3. Which error category should NOT be retried?

Q4. Why is it dangerous for a subagent to catch an error and return {results: [], status: 'success'}?
