Domain 4: Prompt Engineering & Structured Output
Weight: 20%
ts-4.1 Design prompts with explicit criteria to improve precision and reduce false positives
- Explicit criteria ('flag comments only when claimed behavior contradicts actual code') beat vague instructions ('check that comments are accurate').
- General instructions like 'be conservative' or 'only report high-confidence findings' fail to improve precision.
- High false positive rates in some categories undermine trust in ALL categories -- developers dismiss everything.
- Define explicit severity criteria with concrete code examples for each severity level to achieve consistent classification.
Decision Rules
When automated review produces high false positive rates that erode developer trust → Temporarily disable high false-positive categories; keep only high-precision categories while improving prompts.
When severity ratings are inconsistent across similar issues → Add explicit severity criteria with concrete code examples for each level, not general 'be conservative' instructions.
When a prompt instruction is vague (e.g., 'check comments are accurate') → Replace it with explicit criteria defining exactly what constitutes a problem (e.g., 'flag only when claimed behavior contradicts code').
Anti-Patterns
- Adding confidence scores alongside findings and expecting developers to self-triage -- they will not trust self-reported scores.
- Keeping high false-positive categories enabled while 'improving prompts over the coming weeks' -- trust erodes immediately.
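As a sketch of the difference, the snippet below contrasts a vague review instruction with explicit criteria plus per-level code examples. The criteria text, severity names, and examples are illustrative assumptions, not official prompt wording.

```python
# Hypothetical prompt fragments contrasting vague vs explicit review criteria.

VAGUE_INSTRUCTION = "Check that comments are accurate."  # fails: no definition of 'accurate'

EXPLICIT_CRITERIA = """\
Flag a comment ONLY when the behavior it claims contradicts the actual code.
Do NOT flag comments that are merely incomplete, informal, or stylistic.

Severity levels (assign exactly one per finding):
- critical: comment claims security or correctness behavior the code does not implement
  e.g. '# input is sanitized here' above code that performs no sanitization
- major: comment states the wrong return value, units, or side effects
  e.g. '# returns milliseconds' on a function that returns seconds
- minor: comment references a renamed or removed identifier
  e.g. '# see parse_config()' when the function is now load_config()
"""

def build_review_prompt(diff: str) -> str:
    """Assemble a review prompt with the explicit criteria ahead of the diff."""
    return f"{EXPLICIT_CRITERIA}\nReview this diff:\n\n{diff}"
```

Each severity level carries a concrete code example, which is what makes classification consistent across runs; a bare 'be conservative' gives the model nothing to anchor on.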
ts-4.2 Apply few-shot prompting to improve output consistency and quality
- Few-shot examples are the most effective technique when detailed instructions alone produce inconsistent results.
- Target 2-4 examples at ambiguous scenarios showing reasoning for why one action was chosen over alternatives.
- Few-shot examples enable generalization to novel patterns, not just matching pre-specified cases.
- For extraction tasks, few-shot examples reduce hallucination by showing how to handle varied document structures.
Decision Rules
When detailed format instructions produce variable output quality (sometimes detailed, sometimes vague) → Add 3-4 few-shot examples showing the exact desired format with issue, location, and specific fix.
When the agent misroutes between tools on ambiguous requests → Add 4-6 few-shot examples targeting ambiguous scenarios, each showing reasoning for the tool choice.
When the agent handles individual concerns well (94%) but fails on multi-concern messages (58%) → Add few-shot examples demonstrating correct reasoning and tool sequencing for multi-concern requests.
Anti-Patterns
- Further refining abstract instructions when instructions have already failed -- examples are more reliable than rules.
- Grouping few-shot examples by tool instead of showing comparative reasoning across tools for ambiguous cases.
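A minimal sketch of few-shot examples aimed at ambiguous routing cases, each showing the reasoning behind the tool choice. The tool names (`search_orders`, `create_ticket`) and request texts are hypothetical.

```python
# Few-shot examples for tool routing. Each example pairs an ambiguous
# request with reasoning for WHY one tool sequence was chosen over the
# alternative -- the comparative reasoning is what the model learns from.

FEW_SHOT_EXAMPLES = [
    {
        "request": "My order never arrived and I want my money back.",
        "reasoning": (
            "Two concerns: locate the shipment, then open a refund case. "
            "Call search_orders first because create_ticket needs an order id."
        ),
        "tools": ["search_orders", "create_ticket"],
    },
    {
        "request": "Can you check on order 4521?",
        "reasoning": (
            "A status question, not a complaint: search_orders alone is "
            "enough; opening a ticket here would be a misroute."
        ),
        "tools": ["search_orders"],
    },
]

def render_few_shot(examples: list) -> str:
    """Format examples as request -> reasoning -> tool sequence blocks."""
    blocks = []
    for ex in examples:
        blocks.append(
            f"Request: {ex['request']}\n"
            f"Reasoning: {ex['reasoning']}\n"
            f"Tools: {' -> '.join(ex['tools'])}"
        )
    return "\n\n".join(blocks)
```

Note the examples are organized by scenario ambiguity, not grouped per tool, matching the anti-pattern above.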
ts-4.3 Enforce structured output using tool use and JSON schemas
- tool_use with JSON schemas is the most reliable approach for guaranteed schema-compliant structured output.
- tool_choice: 'auto' (may return text), 'any' (must call a tool), forced selection (must call a specific tool).
- Strict JSON schemas via tool use eliminate syntax errors but do NOT prevent semantic errors (values in wrong fields, line items not summing).
- Design schema fields as optional (nullable) when source documents may not contain the information, preventing hallucinated values.
Decision Rules
When you need guaranteed structured output with no JSON syntax errors → Define an extraction tool with a JSON schema as its input parameters; extract data from the tool_use response.
When multiple extraction schemas exist and the document type is unknown → Set tool_choice: 'any' to guarantee a tool call while letting the model choose which extraction schema.
When source documents may not contain all required fields → Design those schema fields as optional (nullable) to prevent the model from fabricating values.
Anti-Patterns
- Relying on prompt instructions to produce JSON instead of using tool_use for guaranteed schema compliance.
- Making all schema fields required when source documents may lack the data, causing the model to hallucinate values.
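The pattern above can be sketched as follows: define an extraction tool whose `input_schema` is the desired output schema, then pull the structured data out of the `tool_use` block in the response. The invoice fields are illustrative assumptions; a real call would pass `tools=INVOICE_TOOLS` and `tool_choice={"type": "any"}` to the Messages API.

```python
# Tool definition whose input_schema doubles as the extraction schema.
INVOICE_TOOLS = [{
    "name": "record_invoice",
    "description": "Record structured fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "total": {"type": "number"},
            # Nullable on purpose: a source document may omit the due date,
            # and allowing null discourages the model from fabricating one.
            "due_date": {"type": ["string", "null"]},
        },
        "required": ["invoice_number", "total", "due_date"],
    },
}]

def extract_tool_input(response_content: list) -> dict:
    """Return the input of the first tool_use block in a response's content."""
    for block in response_content:
        if block.get("type") == "tool_use":
            return block["input"]
    raise ValueError("model returned no tool_use block")
```

The schema guarantees syntactic validity only; whether `total` actually matches the document still needs the semantic checks described under ts-4.4.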
ts-4.4 Implement validation, retry, and feedback loops for extraction quality
- Retry-with-error-feedback: append specific validation errors to the prompt on retry to guide the model toward correction.
- Retries are ineffective when required information is simply absent from the source document (vs format or structural errors).
- Track which code constructs trigger findings (detected_pattern field) to enable systematic analysis of dismissal patterns.
- Semantic validation (values don't sum, wrong field placement) requires separate validation logic -- tool use only prevents syntax errors.
Decision Rules
When extraction output has format or structural errors (wrong nesting, bad date format) → Retry with the original document, the failed extraction, and the specific validation errors appended.
When required data simply does not exist in the source document → Do NOT retry -- retries cannot conjure missing information. Accept null/empty or flag for human review.
When developers frequently dismiss automated findings and you want to improve accuracy → Add detected_pattern fields to structured findings to track which constructs produce false positives.
Anti-Patterns
- Retrying extraction when the source document does not contain the required information.
- Using generic retry prompts like 'try again' without including the specific validation errors that triggered the retry.
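A minimal sketch of the retry-with-error-feedback loop, assuming a semantic validator that checks line items sum to the total. `call_model` is an injected stand-in for a real extraction call so the loop itself is self-contained; the field names are hypothetical.

```python
def validate(extraction: dict) -> list:
    """Return a list of semantic error strings; empty means the extraction passes."""
    errors = []
    items = extraction.get("line_items", [])
    total = extraction.get("total")
    if items and total is not None:
        if abs(sum(i["amount"] for i in items) - total) > 0.01:
            errors.append("line_items do not sum to total")
    return errors

def extract_with_retries(document: str, call_model, max_retries: int = 2):
    """Retry extraction, feeding the specific validation errors back each time."""
    feedback = ""
    for _ in range(max_retries + 1):
        result = call_model(document, feedback)
        errors = validate(result)
        if not errors:
            return result
        # Append the SPECIFIC errors, never a generic "try again".
        feedback = (
            "Your previous extraction failed validation:\n- "
            + "\n- ".join(errors)
            + "\nCorrect these issues and re-extract."
        )
    return None  # exhausted retries: escalate to human review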
ts-4.5 Design efficient batch processing strategies
- Message Batches API: 50% cost savings, up to 24-hour processing window, no guaranteed latency SLA.
- Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation).
- The batch API does NOT support multi-turn tool calling within a single request -- breaks iterative workflows.
- Use custom_id fields for correlating batch request/response pairs and handling failures.
Decision Rules
When the workflow is latency-sensitive and blocks developers (pre-merge checks) → Use synchronous API calls, NOT batch processing.
When the workflow is scheduled and latency-tolerant (overnight reports, weekly audits, nightly test generation) → Use the Message Batches API for 50% cost savings.
When the workflow requires iterative tool calling (analyze file, request related files, continue analysis) → Do NOT use batch processing -- it cannot execute tools mid-request and return results.
Anti-Patterns
- Using batch processing for blocking pre-merge checks where developers are waiting for results.
- Attempting to use batch processing for iterative tool-calling workflows that require mid-request tool execution.
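A sketch of building batch requests with `custom_id` for correlation, following the Batches API's `{"custom_id": ..., "params": {...}}` request shape. The file names, model string, and prompt are illustrative; submission itself would go through the batches endpoint.

```python
def build_batch_requests(files: dict, model: str) -> list:
    """One batch request per source file, keyed by path for later correlation."""
    return [
        {
            "custom_id": path,  # unique key to match each result back to its input
            "params": {
                "model": model,
                "max_tokens": 1024,
                "messages": [{
                    "role": "user",
                    "content": f"Generate unit tests for this file:\n\n{source}",
                }],
            },
        }
        for path, source in files.items()
    ]

def correlate(results: list) -> tuple:
    """Split batch results into successes (keyed by custom_id) and failed ids."""
    ok, failed = {}, []
    for r in results:
        if r["result"]["type"] == "succeeded":
            ok[r["custom_id"]] = r["result"]["message"]
        else:
            failed.append(r["custom_id"])  # retry or escalate these separately
    return ok, failed
```

Because results can arrive out of order and individual requests can fail independently, `custom_id` is the only reliable way to reattach each result to its source file.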
ts-4.6 Design multi-instance and multi-pass review architectures
- Self-review limitation: a model retains reasoning context from generation, making it less likely to question its own decisions.
- Independent review instances (without prior reasoning context) catch subtle issues that self-review and extended thinking miss.
- Multi-pass review: split into per-file local analysis passes plus cross-file integration passes to avoid attention dilution.
- Include reasoning and confidence assessments inline with each finding to speed up developer triage.
Decision Rules
When Claude-generated code has subtle issues that only surface during human peer review → Use a second, independent Claude instance to review without access to the generator's reasoning.
When single-pass review of many files produces inconsistent depth and contradictory feedback → Split into per-file local passes plus a separate cross-file integration pass.
When developers spend too much time investigating each finding to decide if it is real → Require Claude to include reasoning and a confidence assessment inline with each finding.
Anti-Patterns
- Asking Claude to self-review its own output in the same session -- confirmation bias means it rationalizes the same way.
- Using extended thinking as a substitute for independent review -- the same session context still biases the review.
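The independence requirement can be sketched as follows: the reviewer instance receives only the final code in a brand-new conversation, never the generator's messages or reasoning. `review_model` is an injected callable standing in for a fresh API client session; the prompt wording is an assumption.

```python
def independent_review(code: str, review_model) -> str:
    """Review code in a fresh session with no access to the generator's context."""
    prompt = (
        "You are reviewing code written by someone else. You have no "
        "knowledge of why it was written this way; question every decision.\n"
        "For each finding include: issue, location, your reasoning, and a "
        "confidence assessment (high/medium/low).\n\n" + code
    )
    # Deliberately a brand-new conversation: no prior messages are carried
    # over, so the reviewer cannot rationalize the generator's choices.
    return review_model([{"role": "user", "content": prompt}])
```

The single-element message list is the whole point: carrying the generator's transcript into the reviewer's context would reintroduce the confirmation bias this architecture exists to avoid.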
Deep Dives
ts-4.1 Design prompts with explicit criteria to improve precision and reduce false positives
ts-4.2 Apply few-shot prompting to improve output consistency and quality
ts-4.3 Enforce structured output using tool use and JSON schemas
ts-4.4 Implement validation, retry, and feedback loops for extraction quality
ts-4.5 Design efficient batch processing strategies
ts-4.6 Design multi-instance and multi-pass review architectures