Error Recovery & Retry Patterns

Advanced

Implement structured error responses for MCP tools · Difficulty 3/5

Tags: error-recovery, retry, resilience, error-handling

Effective error recovery requires distinguishing error types and applying the right recovery strategy for each.

Retryable vs Non-Retryable Errors

Returning structured error metadata with an isRetryable flag tells the caller whether retrying can succeed, which prevents wasted attempts:

| Error Type | Retryable | Recovery Strategy |
|------------|-----------|-------------------|
| Timeout / service unavailable | Yes | Retry with exponential backoff |
| Rate limit exceeded | Yes | Wait for reset window, then retry |
| Invalid input format | No | Fix input parameters, then retry |
| Policy violation (e.g., refund > $500) | No | Inform user, escalate, or suggest alternative |
| Permission denied | No | Escalate to human or request credentials |
| File corruption | No | Report failure, do not retry |
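
The table can be sketched as a small lookup plus a retry helper. This is a minimal illustration, not a prescribed MCP API: the category names, the `ToolError` class, the `retry_after_s` field, and `call_with_retry` are all hypothetical names chosen for this example.

```python
import time
from typing import Callable, Optional, TypeVar

T = TypeVar("T")

# Hypothetical category names mirroring the rows of the table above.
RETRYABLE = {
    "timeout": True,
    "service_unavailable": True,
    "rate_limited": True,
    "invalid_input": False,
    "policy_violation": False,
    "permission_denied": False,
    "file_corrupted": False,
}


class ToolError(Exception):
    """Structured error carrying the metadata a caller needs to decide on a retry."""

    def __init__(self, category: str, message: str,
                 retry_after_s: Optional[float] = None):
        super().__init__(message)
        self.category = category
        self.message = message
        # Unknown categories default to non-retryable (an assumption of this sketch).
        self.is_retryable = RETRYABLE.get(category, False)
        self.retry_after_s = retry_after_s

    def to_dict(self) -> dict:
        return {
            "errorCategory": self.category,
            "isRetryable": self.is_retryable,
            "message": self.message,
            "retryAfterSeconds": self.retry_after_s,
        }


def call_with_retry(fn: Callable[[], T], max_attempts: int = 3,
                    base_delay_s: float = 0.5,
                    sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry only errors flagged as retryable; re-raise everything else at once."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ToolError as err:
            if not err.is_retryable or attempt == max_attempts:
                raise
            if err.retry_after_s is not None:
                delay = err.retry_after_s  # rate limit: wait for the reset window
            else:
                delay = base_delay_s * 2 ** (attempt - 1)  # exponential backoff
            sleep(delay)
    raise RuntimeError("max_attempts must be at least 1")
```

Passing `sleep` as a parameter keeps the helper testable without real delays; production code would likely also add jitter to the backoff.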

Business Rule Violations

For business errors, include:

  • isRetryable: false to prevent pointless retries
  • A customer-friendly explanation the agent can relay to the user
  • Suggested alternative actions (e.g., "escalate to manager" or "split into smaller amounts")
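
As a rough sketch of such a response, the function below returns a structured payload for the refund example from the table. The field names (`isError`, `errorCategory`, `customerMessage`, `suggestedActions`), the `process_refund` function, and the $500 limit as a parameter are assumptions made for illustration.

```python
REFUND_LIMIT = 500.0  # policy threshold taken from the example in the table


def process_refund(amount: float, limit: float = REFUND_LIMIT) -> dict:
    """Hypothetical tool handler: succeed under the limit, else explain why not."""
    if amount > limit:
        return {
            "isError": True,
            "errorCategory": "policy_violation",
            "isRetryable": False,  # retrying the same request cannot succeed
            "customerMessage": (
                f"Refunds over ${limit:,.2f} need manager approval, "
                f"so ${amount:,.2f} cannot be processed automatically."
            ),
            "suggestedActions": [
                "escalate_to_manager",
                "split_into_smaller_amounts",
            ],
        }
    return {"isError": False, "refundedAmount": amount}
```

The agent can relay `customerMessage` verbatim and pick from `suggestedActions` instead of guessing at a workaround.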

Local vs Propagated Recovery

Subagents should handle transient failures locally:

  • Retry timeouts with backoff (2-3 attempts)
  • Try fallback data sources if primary is unavailable
  • Proceed with partial results and annotate gaps

Propagate to the coordinator only when:

  • Local recovery has been exhausted
  • The error requires a decision beyond the subagent's scope

Always include: what was attempted, partial results, and the unresolvable failure.
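
One way to picture this flow: the subagent retries each source with backoff, falls through to fallbacks, keeps whatever it managed to collect, and reports the rest upward. The sketch below assumes a hypothetical `TransientError` exception, a `run_subagent` function, and a report shape with `attempted`, `partialResults`, and `unresolved` keys; none of these come from a real framework.

```python
import time
from typing import Callable, Dict, List


class TransientError(Exception):
    """Hypothetical marker for timeouts and other transient failures."""


def run_subagent(
    tasks: Dict[str, List[Callable[[], str]]],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Each task maps to its sources in priority order: primary, then fallbacks."""
    attempted: List[str] = []
    partial_results: Dict[str, str] = {}
    unresolved: List[str] = []

    for task, sources in tasks.items():
        done = False
        for index, fetch in enumerate(sources):
            for attempt in range(1, max_attempts + 1):
                try:
                    partial_results[task] = fetch()
                    attempted.append(f"{task}: source {index} ok on attempt {attempt}")
                    done = True
                    break
                except TransientError as err:
                    attempted.append(
                        f"{task}: source {index} attempt {attempt} failed ({err})"
                    )
                    if attempt < max_attempts:
                        sleep(base_delay_s * 2 ** (attempt - 1))
            if done:
                break  # no need to consult fallbacks
        if not done:
            unresolved.append(task)  # local recovery exhausted for this task

    return {
        "status": "ok" if not unresolved else "needs_coordinator",
        "attempted": attempted,
        "partialResults": partial_results,
        "unresolved": unresolved,
    }
```

The coordinator receives a single report and can decide whether the partial results are enough, rather than seeing a bare exception.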

Access Failures vs Valid Empty Results

This distinction is critical and commonly confused:

  • Access failure (database timeout): the query did not complete, so retry is appropriate
  • Valid empty result (0 matches found): the query succeeded, so do not retry

Treating an access failure as an empty result means missing data. Treating an empty result as a failure wastes retries and may cause incorrect escalation.
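
A minimal sketch of keeping the two cases apart in a tool response follows. The `search_orders` wrapper and its field names are hypothetical; `TimeoutError` is Python's built-in exception, used here to stand in for a database timeout.

```python
from typing import Callable, List


def search_orders(run_query: Callable[[], List[dict]]) -> dict:
    """Wrap a query so a timeout and a zero-row result are never conflated."""
    try:
        rows = run_query()
    except TimeoutError as err:
        # Access failure: the query never completed, so the answer is unknown.
        return {
            "isError": True,
            "errorCategory": "access_failure",
            "isRetryable": True,
            "message": str(err),
        }
    # Success, even when rows is empty: zero matches is a real answer.
    return {"isError": False, "matchCount": len(rows), "results": rows}
```

Returning an explicit `matchCount` of 0 with `isError` set to false lets the caller accept the empty result instead of retrying or escalating.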

Key Takeaways

  • Structured isRetryable metadata prevents wasted retry attempts on non-retryable errors
  • Business rule violations need isRetryable: false plus customer-friendly explanations
  • Subagents should exhaust local recovery before propagating errors to the coordinator
  • Never confuse access failures (retry) with valid empty results (accept)