Domain 1: Agentic Architecture & Orchestration
27% of examDesign and implement agentic loops for autonomous task execution
Key Points
- The agentic loop lifecycle: send request, inspect stop_reason, execute tools, append results, repeat until end_turn.
- stop_reason is the sole authoritative signal for loop control -- not text parsing, not iteration counts.
- Tool results must be appended to conversation history so Claude can reason about the next action.
- Model-driven tool selection (Claude decides which tool based on context) is the default; pre-configured sequences are for strict compliance.
- Each iteration should include the full conversation context so Claude maintains coherent reasoning.
Decision Rules
When: stop_reason === 'tool_use'
→Execute the requested tool(s), append results to messages, and call Claude again.
When: stop_reason === 'end_turn'
→Terminate the loop and present the final response to the user.
When: You need a safety guardrail against runaway loops
→Add a max iteration count as a backstop, but keep stop_reason as the primary control signal.
✗ Anti-Patterns to Reject
- Parsing response text for phrases like 'I've completed' to determine loop termination instead of using stop_reason.
- Using an arbitrary iteration cap as the primary stopping mechanism rather than a safety backstop.
Orchestrate multi-agent systems with coordinator-subagent patterns
Key Points
- Hub-and-spoke: coordinator manages all inter-subagent communication, error handling, and information routing.
- Subagents operate with isolated context -- they do NOT inherit the coordinator's conversation history.
- The coordinator is responsible for task decomposition, delegation, result aggregation, and deciding which subagents to invoke.
- Overly narrow task decomposition by the coordinator leads to incomplete coverage of broad topics.
- Route all communication through the coordinator for observability, consistent error handling, and controlled information flow.
Decision Rules
When: Multiple specialized capabilities are needed (search, analysis, synthesis)
→Use coordinator-subagent pattern; coordinator delegates to specialized agents and aggregates results.
When: A subagent's output needs to reach another subagent
→Route through the coordinator -- never allow direct agent-to-agent communication.
When: Research output is missing entire topic areas
→Check the coordinator's task decomposition first -- it likely defined subtasks too narrowly.
✗ Anti-Patterns to Reject
- Allowing direct agent-to-agent communication that bypasses the coordinator, breaking observability and error handling.
- Having the coordinator always route through the full pipeline instead of dynamically selecting which subagents to invoke.
Configure subagent invocation, context passing, and spawning
Key Points
- The Task tool is the mechanism for spawning subagents; allowedTools must include 'Task' for the coordinator.
- Subagent context must be explicitly provided in the prompt -- subagents do NOT automatically inherit parent context.
- AgentDefinition configures descriptions, system prompts, and tool restrictions per subagent type.
- Use fork-based session management to explore divergent approaches from a shared analysis baseline.
- Spawn parallel subagents by emitting multiple Task tool calls in a single coordinator response.
Decision Rules
When: A subagent needs data from a prior agent's output
→Include the complete findings directly in the subagent's prompt via the coordinator.
When: You need parallel research across multiple source types
→Emit multiple Task tool calls in a single coordinator turn to spawn parallel subagents.
When: Coordinator prompts lead to rigid subagent behavior
→Specify research goals and quality criteria rather than step-by-step procedural instructions.
✗ Anti-Patterns to Reject
- Assuming subagents inherit the coordinator's context or share memory between invocations.
- Writing step-by-step procedural coordinator prompts instead of goal-oriented ones that allow subagent adaptability.
Implement multi-step workflows with enforcement and handoff patterns
Key Points
- Programmatic enforcement (hooks, prerequisite gates) provides deterministic guarantees; prompt instructions are probabilistic.
- When deterministic compliance is required (e.g., identity verification before financial ops), prompts alone have a non-zero failure rate.
- For multi-concern requests, decompose into distinct items, investigate each in parallel using shared context, then synthesize.
- Structured handoff summaries (customer ID, root cause, refund amount, recommended action) are essential for human escalation.
Decision Rules
When: A specific tool sequence is required for critical business logic (e.g., verify customer before refund)
→Use programmatic prerequisites that block downstream tools until prior steps complete.
When: Customer sends a multi-concern message
→Decompose into distinct concerns, investigate in parallel with shared context, then synthesize a unified resolution.
When: Agent escalates to a human who lacks access to the conversation transcript
→Compile a structured handoff summary with customer ID, root cause, amounts, and recommended action.
✗ Anti-Patterns to Reject
- Relying solely on prompt instructions to enforce required tool ordering for operations with financial consequences.
- Processing multiple customer concerns sequentially, re-fetching shared context for each one.
Apply Agent SDK hooks for tool call interception and data normalization
Key Points
- PostToolUse hooks intercept tool results for transformation BEFORE the model processes them.
- Hook patterns can also intercept outgoing tool calls to enforce compliance rules (e.g., block refunds above a threshold).
- Hooks provide deterministic guarantees; prompt instructions provide only probabilistic compliance.
- Use PostToolUse to normalize heterogeneous data formats: Unix timestamps, ISO 8601, numeric status codes.
Decision Rules
When: Tools return heterogeneous formats (Unix timestamps, ISO dates, numeric codes) and the agent misinterprets them
→Implement a PostToolUse hook to normalize all outputs before agent processing.
When: Business rules require guaranteed compliance (e.g., refunds > $500 must be escalated)
→Use a hook to intercept and block policy-violating tool calls, redirecting to the appropriate workflow.
When: Third-party MCP tools return data you cannot modify at the source
→Use PostToolUse hooks as a centralized normalization layer rather than prompt instructions.
✗ Anti-Patterns to Reject
- Adding format documentation to the system prompt instead of using hooks when deterministic normalization is required.
- Creating a separate normalize_data tool the agent must remember to call, instead of automatic hook-based transformation.
Design task decomposition strategies for complex workflows
Key Points
- Use fixed sequential pipelines (prompt chaining) for predictable multi-aspect reviews; dynamic decomposition for open-ended investigation.
- Splitting large reviews into per-file local analysis plus a separate cross-file integration pass avoids attention dilution.
- Adaptive investigation plans generate subtasks based on what is discovered at each step.
- For open-ended tasks, first map the structure, identify high-impact areas, then create a prioritized plan.
Decision Rules
When: A single-pass review of 14+ files produces inconsistent depth and contradictory findings
→Split into per-file analysis passes plus a separate cross-file integration pass.
When: The task is predictable with known steps (e.g., multi-aspect code review)
→Use prompt chaining: a fixed sequential pipeline.
When: The task is exploratory with unknown scope (e.g., 'add tests to a legacy codebase')
→Use dynamic decomposition: map first, identify high-impact areas, then create a prioritized adaptive plan.
✗ Anti-Patterns to Reject
- Reviewing all files in a large PR in a single pass, leading to attention dilution and contradictory feedback.
- Using a fixed pipeline for an open-ended investigation task where subtasks depend on intermediate findings.
Manage session state, resumption, and forking
Key Points
- Use --resume <session-name> to continue named investigation sessions across work sessions.
- fork_session creates independent branches from a shared analysis baseline for exploring divergent approaches.
- When resuming after code modifications, inform the agent about specific file changes for targeted re-analysis.
- Starting a new session with a structured summary is more reliable than resuming with stale tool results.
Decision Rules
When: Prior context is mostly valid and you want to continue an investigation
→Use --resume with the session name; inform Claude about any file changes since last session.
When: Prior tool results are stale (significant code changes since last session)
→Start a new session with an injected summary of prior findings instead of resuming.
When: You want to compare two refactoring approaches from the same analysis baseline
→Use fork_session to create parallel exploration branches.
✗ Anti-Patterns to Reject
- Resuming a session after significant code changes without informing the agent, leading to stale context reasoning.
- Re-exploring the entire codebase from scratch instead of informing a resumed session about targeted changes.
Domain 2: Tool Design & MCP Integration
18% of examDesign effective tool interfaces with clear descriptions and boundaries
Key Points
- Tool descriptions are the PRIMARY mechanism LLMs use for tool selection -- minimal descriptions lead to unreliable selection.
- Include input formats, example queries, edge cases, and boundaries explaining when to use a tool vs similar alternatives.
- Ambiguous or overlapping descriptions (e.g., analyze_content vs analyze_document) cause misrouting.
- Keyword-sensitive system prompt instructions can override well-written tool descriptions, creating unintended tool associations.
- Rename tools and update descriptions to eliminate functional overlap (e.g., analyze_content -> extract_web_results).
Decision Rules
When: Agent consistently selects the wrong tool among similar options
→Review and expand tool descriptions FIRST -- include input formats, example queries, and boundary explanations.
When: Two tools have near-identical names/descriptions causing misrouting
→Rename the tools and rewrite descriptions to clearly distinguish each tool's purpose.
When: Tool descriptions are clear but the agent still misroutes based on keywords like 'account'
→Review the system prompt for keyword-sensitive instructions that create unintended tool associations.
✗ Anti-Patterns to Reject
- Writing minimal descriptions like 'Retrieves customer information' without specifying inputs, outputs, or boundaries.
- Adding a routing layer or classifier as the first step instead of improving tool descriptions.
Implement structured error responses for MCP tools
Key Points
- Use the MCP isError flag to communicate tool failures back to the agent.
- Distinguish error categories: transient (timeouts), validation (bad input), business (policy violations), permission errors.
- Return structured metadata: errorCategory, isRetryable boolean, and human-readable descriptions.
- Uniform 'Operation failed' errors prevent the agent from making appropriate recovery decisions.
- Distinguish access failures (needing retries) from valid empty results (successful queries with no matches).
Decision Rules
When: A tool encounters a transient failure (timeout, service unavailable)
→Return isError: true with errorCategory: 'transient', isRetryable: true, and what was attempted.
When: A business rule is violated (e.g., refund exceeds policy limit)
→Return isError: true with errorCategory: 'business', isRetryable: false, and a customer-friendly explanation.
When: A query returns zero results but executed successfully
→Return a success response (isError: false) with empty results -- do NOT treat this as an error.
✗ Anti-Patterns to Reject
- Returning generic 'Operation failed' for all error types, preventing intelligent agent recovery decisions.
- Treating valid empty results (0 matches) the same as access failures (timeouts), causing unnecessary retries.
Distribute tools appropriately across agents and configure tool choice
Key Points
- Too many tools (e.g., 18 instead of 4-5) degrades tool selection reliability by increasing decision complexity.
- Agents with tools outside their specialization tend to misuse them (e.g., synthesis agent doing web searches).
- Apply principle of least privilege: give each agent only tools needed for its role, plus limited cross-role tools for high-frequency needs.
- tool_choice options: 'auto' (default), 'any' (must call a tool), forced selection ({'type': 'tool', 'name': '...'}).
Decision Rules
When: A specialized agent misuses tools outside its role (e.g., doc analysis agent doing web searches)
→Replace generic tools with purpose-specific constrained alternatives (e.g., fetch_url -> load_document).
When: 85% of a subagent's verification needs are simple fact-checks with 15% complex
→Give a scoped verify_fact tool for simple lookups; route complex cases through the coordinator.
When: You need to guarantee the model calls a specific tool first in a sequence
→Use tool_choice: {'type': 'tool', 'name': 'extract_metadata'} for the first turn, then switch to 'auto'.
✗ Anti-Patterns to Reject
- Giving all agents access to all tools, leading to cross-specialization misuse and unreliable selection.
- Giving the synthesis agent full web search tools when a scoped verify_fact tool handles 85% of its needs.
Integrate MCP servers into Claude Code and agent workflows
Key Points
- Project-scoped .mcp.json for shared team tooling; user-scoped ~/.claude.json for personal/experimental servers.
- Use environment variable expansion (${GITHUB_TOKEN}) in .mcp.json for credential management without committing secrets.
- Tools from all configured MCP servers are discovered at connection time and available simultaneously.
- MCP resources expose content catalogs (issue summaries, database schemas) to reduce exploratory tool calls.
- Prefer community MCP servers for standard integrations (Jira, GitHub); build custom servers only for team-specific workflows.
Decision Rules
When: Team needs shared MCP tooling with per-developer credentials
→Use project-scoped .mcp.json with ${ENV_VAR} expansion for tokens; document required vars in README.
When: A developer wants to experiment with a personal MCP server
→Configure it in user-scoped ~/.claude.json so it does not affect teammates.
When: A standard integration exists (GitHub, Jira) and you are considering a custom server
→Use the existing community MCP server; reserve custom implementations for team-specific workflows.
✗ Anti-Patterns to Reject
- Building custom MCP server wrappers when native env var expansion in .mcp.json already handles credential injection.
- Having each developer configure the MCP server in user scope instead of using a shared project-scoped .mcp.json.
Select and apply built-in tools (Read, Write, Edit, Bash, Grep, Glob) effectively
Key Points
- Grep for content search: finding function names, error messages, import statements within file contents.
- Glob for path pattern matching: finding files by name or extension (e.g., **/*.test.tsx).
- Read/Write for full file operations; Edit for targeted modifications using unique text matching.
- When Edit fails due to non-unique text matches, fall back to Read + Write for reliable file modifications.
- Build codebase understanding incrementally: Grep to find entry points, then Read to follow imports and trace flows.
Decision Rules
When: You need to find all callers of a specific function across the codebase
→Use Grep to search file contents for the function name.
When: You need to find all test files regardless of directory location
→Use Glob with pattern **/*.test.tsx to match by naming convention.
When: Edit fails because the anchor text appears multiple times in the file
→Use Read to load full contents, then Write the modified version as a fallback.
✗ Anti-Patterns to Reject
- Reading all files upfront to understand a codebase instead of incrementally tracing from entry points via Grep.
- Using Bash for file search/content operations when dedicated Grep and Glob tools are available.
Domain 3: Claude Code Configuration & Workflows
20% of examConfigure CLAUDE.md files with appropriate hierarchy, scoping, and modular organization
Key Points
- Hierarchy: user-level (~/.claude/CLAUDE.md), project-level (.claude/CLAUDE.md or root CLAUDE.md), directory-level (subdirectory CLAUDE.md).
- User-level settings apply only to that user and are NOT shared via version control.
- Use .claude/rules/ directory for topic-specific rule files as an alternative to a monolithic CLAUDE.md.
- Use @import syntax to reference external files and keep CLAUDE.md modular.
- New team members not receiving guidelines? Check if instructions are in user-level (~/) rather than project-level (.claude/).
Decision Rules
When: A guideline must apply to all team members (current and future)
→Place it in project-level .claude/CLAUDE.md or .claude/rules/, NOT in user-level ~/.claude/CLAUDE.md.
When: CLAUDE.md exceeds 400+ lines mixing multiple concerns
→Split into topic-specific files in .claude/rules/ (e.g., testing.md, api-conventions.md).
When: A new team member is not receiving project guidelines
→Verify the guideline exists in project-level config, not just in existing developers' user-level config.
✗ Anti-Patterns to Reject
- Putting team-wide guidelines in ~/.claude/CLAUDE.md (user-level) instead of project-level, so new members miss them.
- Using README.md files as instruction sources -- only CLAUDE.md and .claude/rules/ are recognized by Claude Code.
Create and configure custom slash commands and skills
Key Points
- Project-scoped commands in .claude/commands/ (shared via version control); user-scoped in ~/.claude/commands/ (personal).
- Skills in .claude/skills/ with SKILL.md support frontmatter: context: fork, allowed-tools, argument-hint.
- context: fork runs the skill in an isolated sub-agent context, preventing output from polluting the main conversation.
- Project skills take precedence over personal skills with the same name; use a different name for personal variants.
- Skills are on-demand (invoked via slash command); CLAUDE.md is always-loaded for universal standards.
Decision Rules
When: A skill produces verbose output that causes Claude to lose track of the original task
→Add context: fork to the skill's frontmatter to run in an isolated sub-agent context.
When: A developer wants a personal variant of a team skill without affecting teammates
→Create a personal skill in ~/.claude/skills/ with a DIFFERENT name (project skills shadow same-named personal ones).
When: Context is only useful for a specific workflow (e.g., endpoint generation) and not general work
→Create a skill with the exemplar code; invoke on-demand via slash command instead of putting it in CLAUDE.md.
✗ Anti-Patterns to Reject
- Creating a personal skill with the same name as a project skill -- the project version shadows it.
- Putting task-specific workflow guidance in CLAUDE.md (always loaded) instead of a skill (on-demand).
Apply path-specific rules for conditional convention loading
Key Points
- Use .claude/rules/ files with YAML frontmatter paths field containing glob patterns for conditional rule activation.
- Path-scoped rules load only when editing matching files, reducing irrelevant context and token usage.
- Glob patterns apply conventions by file type regardless of directory location (e.g., **/*.test.tsx for all test files).
- Path-specific rules are better than subdirectory CLAUDE.md files when conventions span multiple directories.
Decision Rules
When: Different coding conventions apply to different file types (React components vs API handlers vs tests)
→Create .claude/rules/ files with YAML frontmatter paths glob patterns for each file type.
When: Test files are spread throughout the codebase alongside source files
→Use path-specific rules with **/*.test.tsx glob rather than subdirectory CLAUDE.md files.
When: You want conventions to apply to terraform files in any directory
→Use paths: ['terraform/**/*'] in rule frontmatter instead of a terraform/CLAUDE.md file.
✗ Anti-Patterns to Reject
- Relying on Claude to infer which conventions apply by putting all rules in a single root CLAUDE.md.
- Using subdirectory CLAUDE.md files for cross-cutting concerns like test conventions that span multiple directories.
Determine when to use plan mode vs direct execution
Key Points
- Plan mode: complex tasks with multiple valid approaches, architectural decisions, multi-file changes, unfamiliar domains.
- Direct execution: simple, well-scoped changes with a clear implementation path (e.g., single-file bug fix).
- The Explore subagent isolates verbose discovery output and returns summaries, preserving main conversation context.
- Combine plan mode for investigation with direct execution for implementation (e.g., plan migration, then execute).
Decision Rules
When: Task involves ambiguous requirements with multiple valid integration approaches (e.g., adding Slack support)
→Enter plan mode to explore options and architectural implications before implementing.
When: Task is a well-understood change with clear scope (e.g., bug fix with a clear stack trace)
→Use direct execution -- no need for plan mode.
When: Discovery phase generates verbose output that fills the context window
→Use the Explore subagent to isolate verbose output and return a concise summary to the main conversation.
✗ Anti-Patterns to Reject
- Starting direct execution on an ambiguous architectural task without exploring trade-offs first.
- Using plan mode for a simple, well-scoped change that has an obvious implementation.
Apply iterative refinement techniques for progressive improvement
Key Points
- Concrete input/output examples are the most effective way to communicate transformations when prose is interpreted inconsistently.
- Test-driven iteration: write test suites first, then iterate by sharing test failures to guide improvement.
- The interview pattern: have Claude ask questions to surface design considerations before implementing in unfamiliar domains.
- Address multiple interacting issues in a single message when fixes interact; use sequential iteration for independent issues.
Decision Rules
When: Claude interprets prose requirements differently each iteration, producing inconsistent output structure
→Provide 2-3 concrete input/output examples showing the expected transformation.
When: You are implementing in an unfamiliar domain and want to surface edge cases
→Use the interview pattern: have Claude ask about design considerations before implementing.
When: Multiple bugs interact with each other
→Describe all interacting issues in a single message rather than fixing them sequentially.
✗ Anti-Patterns to Reject
- Continuing to refine prose descriptions when Claude consistently misinterprets them -- provide examples instead.
- Fixing interacting bugs one at a time, leading to regressions when each fix invalidates the others.
Integrate Claude Code into CI/CD pipelines
Key Points
- Use the -p (or --print) flag for non-interactive mode in automated pipelines -- prevents hanging on interactive input.
- Use --output-format json with --json-schema for enforced structured output in CI contexts.
- CLAUDE.md provides project context (testing standards, review criteria) to CI-invoked Claude Code.
- A second independent Claude instance reviewing code is more effective than self-review -- eliminates confirmation bias.
- Include prior review findings in context when re-running after new commits to avoid duplicate comments.
Decision Rules
When: Running Claude Code in an automated CI pipeline
→Use the -p flag for non-interactive mode; use --output-format json with --json-schema for structured output.
When: The same Claude session generated code and you need a review
→Use a second, independent Claude instance without access to the generator's reasoning context.
When: Re-running review after developer pushes fixes, and getting duplicate findings on already-fixed code
→Include prior review findings in context, instructing Claude to only report new or still-unaddressed issues.
✗ Anti-Patterns to Reject
- Running claude without -p flag in CI, causing the job to hang waiting for interactive input.
- Asking Claude to self-review its own generated code in the same session -- confirmation bias persists.
Domain 4: Prompt Engineering & Structured Output
20% of examDesign prompts with explicit criteria to improve precision and reduce false positives
Key Points
- Explicit criteria ('flag comments only when claimed behavior contradicts actual code') beat vague instructions ('check that comments are accurate').
- General instructions like 'be conservative' or 'only report high-confidence findings' fail to improve precision.
- High false positive rates in some categories undermine trust in ALL categories -- developers dismiss everything.
- Define explicit severity criteria with concrete code examples for each severity level to achieve consistent classification.
Decision Rules
When: Automated review produces high false positive rates that erode developer trust
→Temporarily disable high false-positive categories; keep only high-precision categories while improving prompts.
When: Severity ratings are inconsistent across similar issues
→Add explicit severity criteria with concrete code examples for each level, not general 'be conservative' instructions.
When: A prompt instruction is vague (e.g., 'check comments are accurate')
→Replace with explicit criteria defining exactly what constitutes a problem (e.g., 'flag only when claimed behavior contradicts code').
✗ Anti-Patterns to Reject
- Adding confidence scores alongside findings and expecting developers to self-triage -- they will not trust self-reported scores.
- Keeping high false-positive categories enabled while 'improving prompts over the coming weeks' -- trust erodes immediately.
Apply few-shot prompting to improve output consistency and quality
Key Points
- Few-shot examples are the most effective technique when detailed instructions alone produce inconsistent results.
- Target 2-4 examples at ambiguous scenarios showing reasoning for why one action was chosen over alternatives.
- Few-shot examples enable generalization to novel patterns, not just matching pre-specified cases.
- For extraction tasks, few-shot examples reduce hallucination by showing how to handle varied document structures.
Decision Rules
When: Detailed format instructions produce variable output quality (sometimes detailed, sometimes vague)
→Add 3-4 few-shot examples showing the exact desired format with issue, location, and specific fix.
When: Agent misroutes between tools on ambiguous requests
→Add 4-6 few-shot examples targeting ambiguous scenarios, each showing reasoning for the tool choice.
When: Agent handles individual concerns well (94%) but fails on multi-concern messages (58%)
→Add few-shot examples demonstrating correct reasoning and tool sequencing for multi-concern requests.
✗ Anti-Patterns to Reject
- Further refining abstract instructions when instructions have already failed -- examples are more reliable than rules.
- Grouping few-shot examples by tool instead of showing comparative reasoning across tools for ambiguous cases.
Enforce structured output using tool use and JSON schemas
Key Points
- tool_use with JSON schemas is the most reliable approach for guaranteed schema-compliant structured output.
- tool_choice: 'auto' (may return text), 'any' (must call a tool), forced selection (must call a specific tool).
- Strict JSON schemas via tool use eliminate syntax errors but do NOT prevent semantic errors (values in wrong fields, line items not summing).
- Design schema fields as optional (nullable) when source documents may not contain the information, preventing hallucinated values.
Decision Rules
When: You need guaranteed structured output with no JSON syntax errors
→Define an extraction tool with JSON schema as input parameters; extract data from the tool_use response.
When: Multiple extraction schemas exist and the document type is unknown
→Set tool_choice: 'any' to guarantee a tool call while letting the model choose which extraction schema.
When: Source documents may not contain all required fields
→Design those schema fields as optional (nullable) to prevent the model from fabricating values.
✗ Anti-Patterns to Reject
- Relying on prompt instructions to produce JSON instead of using tool_use for guaranteed schema compliance.
- Making all schema fields required when source documents may lack the data, causing the model to hallucinate values.
Implement validation, retry, and feedback loops for extraction quality
Key Points
- Retry-with-error-feedback: append specific validation errors to the prompt on retry to guide the model toward correction.
- Retries are ineffective when required information is simply absent from the source document (vs format or structural errors).
- Track which code constructs trigger findings (detected_pattern field) to enable systematic analysis of dismissal patterns.
- Semantic validation (values don't sum, wrong field placement) requires separate validation logic -- tool use only prevents syntax errors.
Decision Rules
When: Extraction output has format or structural errors (wrong nesting, bad date format)
→Retry with the original document, the failed extraction, and specific validation errors appended.
When: Required data simply does not exist in the source document
→Do NOT retry -- retries cannot conjure missing information. Accept null/empty or flag for human review.
When: Developers frequently dismiss automated findings and you want to improve accuracy
→Add detected_pattern fields to structured findings to track which constructs produce false positives.
✗ Anti-Patterns to Reject
- Retrying extraction when the source document does not contain the required information.
- Using generic retry prompts like 'try again' without including the specific validation errors that triggered the retry.
Design efficient batch processing strategies
Key Points
- Message Batches API: 50% cost savings, up to 24-hour processing window, no guaranteed latency SLA.
- Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation).
- The batch API does NOT support multi-turn tool calling within a single request -- breaks iterative workflows.
- Use custom_id fields for correlating batch request/response pairs and handling failures.
Decision Rules
When: Workflow is latency-sensitive and blocks developers (pre-merge checks)
→Use synchronous API calls, NOT batch processing.
When: Workflow is scheduled and latency-tolerant (overnight reports, weekly audits, nightly test generation)
→Use Message Batches API for 50% cost savings.
When: Workflow requires iterative tool calling (analyze file, request related files, continue analysis)
→Do NOT use batch processing -- it cannot execute tools mid-request and return results.
✗ Anti-Patterns to Reject
- Using batch processing for blocking pre-merge checks where developers are waiting for results.
- Attempting to use batch processing for iterative tool-calling workflows that require mid-request tool execution.
Design multi-instance and multi-pass review architectures
Key Points
- Self-review limitation: a model retains reasoning context from generation, making it less likely to question its own decisions.
- Independent review instances (without prior reasoning context) catch subtle issues that self-review and extended thinking miss.
- Multi-pass review: split into per-file local analysis passes plus cross-file integration passes to avoid attention dilution.
- Include reasoning and confidence assessments inline with each finding to speed up developer triage.
Decision Rules
When: Claude-generated code has subtle issues that only surface during human peer review
→Use a second, independent Claude instance to review without access to the generator's reasoning.
When: Single-pass review of many files produces inconsistent depth and contradictory feedback
→Split into per-file local passes plus a separate cross-file integration pass.
When: Developers spend too much time investigating each finding to decide if it is real
→Require Claude to include reasoning and confidence assessment inline with each finding.
✗ Anti-Patterns to Reject
- Asking Claude to self-review its own output in the same session -- confirmation bias means it rationalizes the same way.
- Using extended thinking as a substitute for independent review -- the same session context still biases the review.
Domain 5: Context Management & Reliability
15% of examManage conversation context to preserve critical information across long interactions
Key Points
- Progressive summarization loses precise details: amounts, percentages, dates get condensed into vague phrases.
- The 'lost in the middle' effect: models reliably process the beginning and end of long inputs but may omit middle sections.
- Tool results accumulate tokens disproportionate to their relevance (e.g., 40+ fields when only 5 are relevant).
- Place key findings summaries at the beginning of aggregated inputs; organize detailed results with explicit section headers.
Decision Rules
When: Customer references specific amounts ('the 15% discount I mentioned') that were summarized away
→Extract transactional facts (amounts, dates, order numbers) into a persistent 'case facts' block outside summarized history.
When: Synthesis agent omits critical findings from the middle of 75K+ token aggregated input
→Place a key findings summary at the beginning; organize the rest with explicit section headers.
When: Tool outputs return 40+ fields per lookup when only 5 are relevant
→Trim verbose tool outputs to only relevant fields before they accumulate in context.
✗ Anti-Patterns to Reject
- Relying on progressive summarization to preserve exact numerical values and dates from early in a conversation.
- Increasing the summarization threshold (e.g., 70% to 85%) instead of extracting critical facts into a persistent block.
Design effective escalation and ambiguity resolution patterns
Key Points
- Appropriate escalation triggers: customer explicitly requests human, policy exceptions/gaps, inability to make meaningful progress.
- Escalate immediately when customer explicitly demands a human -- do not first attempt investigation.
- Sentiment-based escalation and self-reported confidence scores are unreliable proxies for actual case complexity.
- When multiple customer matches are returned, ask for an additional identifier (email, phone, order number) rather than guessing.
Decision Rules
When: Policy is ambiguous or silent on the customer's specific request (e.g., competitor price matching)
→Escalate to a human for policy interpretation -- do not fabricate a policy.
When: get_customer returns multiple matches and the agent guesses wrong 15% of the time
→Instruct the agent to ask for an additional identifier before taking any customer-specific action.
When: The issue is straightforward but the customer explicitly asks for a human agent
→Escalate immediately -- honor the explicit request without attempting to resolve first.
✗ Anti-Patterns to Reject
- Using heuristics (most recent order, conversational context clues) to guess the right customer from multiple matches.
- Implementing sentiment analysis or self-reported confidence scores as escalation triggers.
Implement error propagation strategies across multi-agent systems
Key Points
- Structured error context (failure type, attempted query, partial results, alternative approaches) enables intelligent coordinator recovery.
- Distinguish access failures (timeouts needing retry decisions) from valid empty results (successful queries with no matches).
- Silently suppressing errors (returning empty as success) or terminating on single failures are both anti-patterns.
- Subagents should handle transient failures locally and only propagate errors they cannot resolve, with partial results.
Decision Rules
When: A subagent encounters a timeout (transient failure)
→Attempt local recovery; if it fails, propagate structured error context (failure type, what was attempted, partial results) to the coordinator.
When: A subagent encounters a corrupted file (permanent failure)
→Return the error with context to the coordinator -- do NOT retry (corruption is permanent).
When: Some source categories succeed while others fail in a multi-source research task
→Proceed with available data; annotate synthesis output with coverage gaps indicating which sources were unavailable.
✗ Anti-Patterns to Reject
- Returning empty results marked as 'success' when a timeout occurred, hiding the failure from the coordinator.
- Terminating the entire research workflow when one source fails, discarding all successful results.
Manage context effectively in large codebase exploration
Key Points
- Context degradation in extended sessions: models start referencing 'typical patterns' instead of specific classes discovered earlier.
- Scratchpad files persist key findings across context boundaries, countering degradation.
- Subagent delegation isolates verbose exploration output while the main agent coordinates high-level understanding.
- Structured state persistence: each agent exports state to a known location; the coordinator loads a manifest on resume.
Decision Rules
When: Discovery phase generates verbose output that fills the main context window
→Use the Explore subagent or context: fork to isolate verbose output; return a concise summary.
When: Extended exploration session shows signs of context degradation (vague references instead of specifics)
→Have agents maintain scratchpad files recording key findings; use /compact to reduce context usage.
When: Multi-phase task needs to persist findings across context boundaries
→Summarize key findings from one phase before spawning sub-agents for the next; inject summaries into initial context.
✗ Anti-Patterns to Reject
- Continuing all phases in the main conversation using /compact repeatedly -- lossy compression discards important details.
- Re-exploring the entire codebase from scratch instead of persisting findings in scratchpad files.
Design human review workflows and confidence calibration
Key Points
- Aggregate accuracy metrics (97% overall) may mask poor performance on specific document types or fields.
- Use stratified random sampling to measure error rates in high-confidence extractions and detect novel patterns.
- Field-level confidence scores should be calibrated using labeled validation sets for routing review attention.
- Validate accuracy by document type AND field segment before automating high-confidence extractions.
Decision Rules
When: Overall accuracy is 97% but you suspect some document types perform poorly
→Analyze accuracy by document type and field to identify hidden poor-performing segments.
When: You want to reduce human review overhead on high-confidence extractions
→Implement stratified random sampling of high-confidence outputs; only reduce review after validating by segment.
When: Model outputs field-level confidence scores but they do not correlate with actual accuracy
→Calibrate confidence thresholds using labeled validation sets rather than trusting raw model scores.
✗ Anti-Patterns to Reject
- Trusting aggregate accuracy metrics without breaking down performance by document type and field.
- Automating all high-confidence extractions without validating that confidence correlates with actual accuracy per segment.
Preserve information provenance and handle uncertainty in multi-source synthesis
Key Points
- Source attribution is lost during summarization if claim-source mappings are not preserved.
- Conflicting statistics from credible sources should be annotated with source attribution, not arbitrarily resolved.
- Require publication/collection dates in structured outputs to prevent temporal differences from being misinterpreted as contradictions.
- Render different content types appropriately: financial data as tables, news as prose, technical findings as structured lists.
Decision Rules
When: Two credible sources report conflicting statistics on a key metric
→Include both values with explicit source attribution; let the coordinator decide how to reconcile before synthesis.
When: Subagent outputs are compressed and downstream agents lose track of which claims came from where
→Require subagents to output structured claim-source mappings (source URLs, document names, excerpts).
When: Data from different time periods appears contradictory
→Require publication/collection dates in structured outputs to enable correct temporal interpretation.
✗ Anti-Patterns to Reject
- Applying source credibility heuristics to select one value over another -- this oversteps the subagent's role.
- Converting all content types to a uniform format (e.g., all prose) instead of rendering each type appropriately.