Error Recovery & Retry Patterns

Advanced

Implement structured error responses for MCP tools · Difficulty 3/5

error-recovery · retry · resilience · error-handling

Effective error recovery requires distinguishing error types and applying the right recovery strategy for each.

Retryable vs Non-Retryable Errors

Returning structured metadata with isRetryable prevents wasted retry attempts:

| Error Type | Retryable | Recovery Strategy |
|------------|-----------|-------------------|
| Timeout / service unavailable | Yes | Retry with exponential backoff |
| Rate limit exceeded | Yes | Wait for reset window, then retry |
| Invalid input format | No | Fix input parameters, then resubmit |
| Policy violation (e.g., refund > $500) | No | Inform user, escalate, or suggest alternative |
| Permission denied | No | Escalate to human or request credentials |
| File corruption | No | Report failure, do not retry |
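
The table above can be encoded as a lookup that a tool consults when it builds an error response. This is a minimal sketch: the `ErrorType` enum, the `RecoveryPolicy` class, and every payload key other than `isRetryable` are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorType(Enum):
    TIMEOUT = "timeout"
    RATE_LIMITED = "rate_limited"
    INVALID_INPUT = "invalid_input"
    POLICY_VIOLATION = "policy_violation"
    PERMISSION_DENIED = "permission_denied"
    FILE_CORRUPTION = "file_corruption"


@dataclass(frozen=True)
class RecoveryPolicy:
    is_retryable: bool
    strategy: str


# One entry per table row; names and strategy strings are illustrative.
POLICIES = {
    ErrorType.TIMEOUT: RecoveryPolicy(True, "retry with exponential backoff"),
    ErrorType.RATE_LIMITED: RecoveryPolicy(True, "wait for reset window, then retry"),
    ErrorType.INVALID_INPUT: RecoveryPolicy(False, "fix input parameters, then resubmit"),
    ErrorType.POLICY_VIOLATION: RecoveryPolicy(False, "inform user, escalate, or suggest alternative"),
    ErrorType.PERMISSION_DENIED: RecoveryPolicy(False, "escalate to human or request credentials"),
    ErrorType.FILE_CORRUPTION: RecoveryPolicy(False, "report failure, do not retry"),
}


def error_payload(error_type: ErrorType, message: str) -> dict:
    """Build structured metadata a tool could return alongside a failure."""
    policy = POLICIES[error_type]
    return {
        "errorType": error_type.value,
        "isRetryable": policy.is_retryable,
        "recoveryStrategy": policy.strategy,
        "message": message,
    }
```

Keeping the mapping in one place means the retry decision is made once, by the tool that knows the failure, rather than guessed by each caller.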

Business Rule Violations

For business errors, include:

isRetryable: false to prevent pointless retries

A customer-friendly explanation the agent can relay to the user

Suggested alternative actions (e.g., "escalate to manager" or "split into smaller amounts")
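
As a rough sketch, a business rule violation carrying those three pieces might be built as shown here. The wrapper with `isError` and a text `content` item is assumed to match the shape of an MCP tool result; the function name and the payload keys other than `isRetryable` are hypothetical.

```python
import json


def business_rule_error(code: str, explanation: str, alternatives: list[str]) -> dict:
    """Return a tool result describing a non-retryable business rule violation."""
    payload = {
        "errorCategory": "business_rule",   # hypothetical key
        "code": code,                       # hypothetical key
        "isRetryable": False,               # retrying cannot change the outcome
        "customerMessage": explanation,     # text the agent can relay to the user
        "suggestedActions": alternatives,   # options the agent can offer next
    }
    # Assumed result shape: an error flag plus one text content item.
    return {
        "isError": True,
        "content": [{"type": "text", "text": json.dumps(payload)}],
    }


result = business_rule_error(
    "REFUND_LIMIT_EXCEEDED",
    "Refunds over $500 need manager approval.",
    ["escalate to manager", "split into smaller amounts"],
)
```

The agent can parse the text item, skip any retry, and go straight to relaying the explanation or offering one of the suggested actions.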

Local vs Propagated Recovery

Subagents should handle transient failures locally:

Retry timeouts with backoff (2-3 attempts)

Try fallback data sources if primary is unavailable

Proceed with partial results and annotate gaps

Propagate to the coordinator only when:

Local recovery has been exhausted

The error requires a decision beyond the subagent's scope

When propagating, always include what was attempted, any partial results, and the failure that could not be resolved.
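
The local recovery steps and the propagation report can be sketched together. This is an illustration under stated assumptions: `TransientError`, `SubagentReport`, `run_with_retry`, and `gather` are hypothetical names, and the primary and fallback sources are plain callables that take a topic.

```python
import time
from dataclasses import dataclass, field


class TransientError(Exception):
    """Hypothetical marker for timeouts and similar retryable failures."""


@dataclass
class SubagentReport:
    attempted: list = field(default_factory=list)        # what was tried
    partial_results: dict = field(default_factory=dict)  # what succeeded
    gaps: dict = field(default_factory=dict)             # what could not be resolved


def run_with_retry(fetch, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call fetch(), retrying transient failures with exponential backoff."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fetch()
        except TransientError:
            if attempt == max_attempts:
                raise  # local retries exhausted
            sleep(base_delay * 2 ** (attempt - 1))


def gather(topics, primary, fallback, sleep=time.sleep):
    """Try primary, then fallback, per topic; annotate gaps instead of aborting."""
    report = SubagentReport()
    for topic in topics:
        last_error = ""
        for name, source in (("primary", primary), ("fallback", fallback)):
            report.attempted.append(f"{topic} via {name}")
            try:
                report.partial_results[topic] = run_with_retry(
                    lambda: source(topic), sleep=sleep
                )
                break
            except TransientError as exc:
                last_error = str(exc)
        else:
            # Both sources failed: record the gap and keep going.
            report.gaps[topic] = last_error
    return report
```

A subagent would return the report as-is when `gaps` is empty, and propagate it to the coordinator otherwise, so the coordinator sees the attempts, the partial results, and the unresolved failures in one object.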

Access Failures vs Valid Empty Results

This distinction is critical and commonly confused:

Access failure (database timeout): The query did not complete -- retry is appropriate

Valid empty result (0 matches found): The query succeeded -- do not retry

Treating an access failure as an empty result silently drops data that may exist. Treating an empty result as a failure wastes retries and may cause incorrect escalation.
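
One way to keep the two cases apart is to make completion an explicit field rather than inferring it from an empty list. A minimal sketch, with `QueryOutcome` and `next_action` as hypothetical names:

```python
from dataclasses import dataclass, field


@dataclass
class QueryOutcome:
    ok: bool                                   # True when the query ran to completion
    rows: list = field(default_factory=list)   # matches found; may be empty
    error: str = ""                            # set only when ok is False


def next_action(outcome: QueryOutcome) -> str:
    """Decide what the caller should do with a query outcome."""
    if not outcome.ok:
        return "retry"          # access failure: the answer is still unknown
    if not outcome.rows:
        return "accept_empty"   # the query succeeded and found nothing
    return "use_results"


timeout = QueryOutcome(ok=False, error="database timeout")
no_matches = QueryOutcome(ok=True, rows=[])
```

Here `next_action(timeout)` returns "retry" and `next_action(no_matches)` returns "accept_empty", even though both outcomes carry an empty `rows` list.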

Key Takeaways

✓ Structured isRetryable metadata prevents wasted retry attempts on non-retryable errors
✓ Business rule violations need isRetryable: false plus customer-friendly explanations
✓ Subagents should exhaust local recovery before propagating errors to the coordinator
✓ Never confuse access failures (retry) with valid empty results (accept)

Related Concepts

Tool Distribution & Least Privilege

Each agent should have only the tools needed for its specific role (4-5 tools, not 18)

MCP Server Configuration & Scoping

Use project-scoped .mcp.json for team tools, user-scoped ~/.claude.json for personal tools