Domain 4: Prompt Engineering & Structured Output
20% of examDesign prompts with explicit criteria to improve precision and reduce false positives
Key Points
- Explicit criteria ('flag comments only when claimed behavior contradicts actual code') beat vague instructions ('check that comments are accurate').
- General instructions like 'be conservative' or 'only report high-confidence findings' fail to improve precision.
- High false positive rates in some categories undermine trust in ALL categories -- developers dismiss everything.
- Define explicit severity criteria with concrete code examples for each severity level to achieve consistent classification.
Decision Rules
When: Automated review produces high false positive rates that erode developer trust
→Temporarily disable high false-positive categories; keep only high-precision categories while improving prompts.
When: Severity ratings are inconsistent across similar issues
→Add explicit severity criteria with concrete code examples for each level, not general 'be conservative' instructions.
When: A prompt instruction is vague (e.g., 'check comments are accurate')
→Replace with explicit criteria defining exactly what constitutes a problem (e.g., 'flag only when claimed behavior contradicts code').
✗ Anti-Patterns to Reject
- Adding confidence scores alongside findings and expecting developers to self-triage -- they will not trust self-reported scores.
- Keeping high false-positive categories enabled while 'improving prompts over the coming weeks' -- trust erodes immediately.
Apply few-shot prompting to improve output consistency and quality
Key Points
- Few-shot examples are the most effective technique when detailed instructions alone produce inconsistent results.
- Target 2-4 examples at ambiguous scenarios showing reasoning for why one action was chosen over alternatives.
- Few-shot examples enable generalization to novel patterns, not just matching pre-specified cases.
- For extraction tasks, few-shot examples reduce hallucination by showing how to handle varied document structures.
Decision Rules
When: Detailed format instructions produce variable output quality (sometimes detailed, sometimes vague)
→Add 3-4 few-shot examples showing the exact desired format with issue, location, and specific fix.
When: Agent misroutes between tools on ambiguous requests
→Add 4-6 few-shot examples targeting ambiguous scenarios, each showing reasoning for the tool choice.
When: Agent handles individual concerns well (94%) but fails on multi-concern messages (58%)
→Add few-shot examples demonstrating correct reasoning and tool sequencing for multi-concern requests.
✗ Anti-Patterns to Reject
- Further refining abstract instructions when instructions have already failed -- examples are more reliable than rules.
- Grouping few-shot examples by tool instead of showing comparative reasoning across tools for ambiguous cases.
Enforce structured output using tool use and JSON schemas
Key Points
- tool_use with JSON schemas is the most reliable approach for guaranteed schema-compliant structured output.
- tool_choice: 'auto' (may return text), 'any' (must call a tool), forced selection (must call a specific tool).
- Strict JSON schemas via tool use eliminate syntax errors but do NOT prevent semantic errors (values in wrong fields, line items not summing).
- Design schema fields as optional (nullable) when source documents may not contain the information, preventing hallucinated values.
Decision Rules
When: You need guaranteed structured output with no JSON syntax errors
→Define an extraction tool with JSON schema as input parameters; extract data from the tool_use response.
When: Multiple extraction schemas exist and the document type is unknown
→Set tool_choice: 'any' to guarantee a tool call while letting the model choose which extraction schema.
When: Source documents may not contain all required fields
→Design those schema fields as optional (nullable) to prevent the model from fabricating values.
✗ Anti-Patterns to Reject
- Relying on prompt instructions to produce JSON instead of using tool_use for guaranteed schema compliance.
- Making all schema fields required when source documents may lack the data, causing the model to hallucinate values.
Implement validation, retry, and feedback loops for extraction quality
Key Points
- Retry-with-error-feedback: append specific validation errors to the prompt on retry to guide the model toward correction.
- Retries are ineffective when required information is simply absent from the source document (vs format or structural errors).
- Track which code constructs trigger findings (detected_pattern field) to enable systematic analysis of dismissal patterns.
- Semantic validation (values don't sum, wrong field placement) requires separate validation logic -- tool use only prevents syntax errors.
Decision Rules
When: Extraction output has format or structural errors (wrong nesting, bad date format)
→Retry with the original document, the failed extraction, and specific validation errors appended.
When: Required data simply does not exist in the source document
→Do NOT retry -- retries cannot conjure missing information. Accept null/empty or flag for human review.
When: Developers frequently dismiss automated findings and you want to improve accuracy
→Add detected_pattern fields to structured findings to track which constructs produce false positives.
✗ Anti-Patterns to Reject
- Retrying extraction when the source document does not contain the required information.
- Using generic retry prompts like 'try again' without including the specific validation errors that triggered the retry.
Design efficient batch processing strategies
Key Points
- Message Batches API: 50% cost savings, up to 24-hour processing window, no guaranteed latency SLA.
- Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation).
- The batch API does NOT support multi-turn tool calling within a single request -- breaks iterative workflows.
- Use custom_id fields for correlating batch request/response pairs and handling failures.
Decision Rules
When: Workflow is latency-sensitive and blocks developers (pre-merge checks)
→Use synchronous API calls, NOT batch processing.
When: Workflow is scheduled and latency-tolerant (overnight reports, weekly audits, nightly test generation)
→Use Message Batches API for 50% cost savings.
When: Workflow requires iterative tool calling (analyze file, request related files, continue analysis)
→Do NOT use batch processing -- it cannot execute tools mid-request and return results.
✗ Anti-Patterns to Reject
- Using batch processing for blocking pre-merge checks where developers are waiting for results.
- Attempting to use batch processing for iterative tool-calling workflows that require mid-request tool execution.
Design multi-instance and multi-pass review architectures
Key Points
- Self-review limitation: a model retains reasoning context from generation, making it less likely to question its own decisions.
- Independent review instances (without prior reasoning context) catch subtle issues that self-review and extended thinking miss.
- Multi-pass review: split into per-file local analysis passes plus cross-file integration passes to avoid attention dilution.
- Include reasoning and confidence assessments inline with each finding to speed up developer triage.
Decision Rules
When: Claude-generated code has subtle issues that only surface during human peer review
→Use a second, independent Claude instance to review without access to the generator's reasoning.
When: Single-pass review of many files produces inconsistent depth and contradictory feedback
→Split into per-file local passes plus a separate cross-file integration pass.
When: Developers spend too much time investigating each finding to decide if it is real
→Require Claude to include reasoning and confidence assessment inline with each finding.
✗ Anti-Patterns to Reject
- Asking Claude to self-review its own output in the same session -- confirmation bias means it rationalizes the same way.
- Using extended thinking as a substitute for independent review -- the same session context still biases the review.