Domain 4: Prompt Engineering & Structured Output

20% of exam

ts-4.1

Design prompts with explicit criteria to improve precision and reduce false positives

Key Points

Explicit criteria ('flag comments only when claimed behavior contradicts actual code') beat vague instructions ('check that comments are accurate').
General instructions like 'be conservative' or 'only report high-confidence findings' fail to improve precision.
High false positive rates in some categories undermine trust in ALL categories -- developers dismiss everything.
Define explicit severity criteria with concrete code examples for each severity level to achieve consistent classification.

Decision Rules

When: Automated review produces high false positive rates that erode developer trust

→Temporarily disable high false-positive categories; keep only high-precision categories while improving prompts.

When: Severity ratings are inconsistent across similar issues

→Add explicit severity criteria with concrete code examples for each level, not general 'be conservative' instructions.

When: A prompt instruction is vague (e.g., 'check comments are accurate')

→Replace with explicit criteria defining exactly what constitutes a problem (e.g., 'flag only when claimed behavior contradicts code').

✗ Anti-Patterns to Reject

Adding confidence scores alongside findings and expecting developers to self-triage -- they will not trust self-reported scores.
Keeping high false-positive categories enabled while 'improving prompts over the coming weeks' -- trust erodes immediately.

ts-4.2

Apply few-shot prompting to improve output consistency and quality

Key Points

Few-shot examples are the most effective technique when detailed instructions alone produce inconsistent results.
Target 2-4 examples at ambiguous scenarios showing reasoning for why one action was chosen over alternatives.
Few-shot examples enable generalization to novel patterns, not just matching pre-specified cases.
For extraction tasks, few-shot examples reduce hallucination by showing how to handle varied document structures.

Decision Rules

When: Detailed format instructions produce variable output quality (sometimes detailed, sometimes vague)

→Add 3-4 few-shot examples showing the exact desired format with issue, location, and specific fix.

When: Agent misroutes between tools on ambiguous requests

→Add 4-6 few-shot examples targeting ambiguous scenarios, each showing reasoning for the tool choice.

When: Agent handles individual concerns well (94%) but fails on multi-concern messages (58%)

→Add few-shot examples demonstrating correct reasoning and tool sequencing for multi-concern requests.

✗ Anti-Patterns to Reject

Further refining abstract instructions when instructions have already failed -- examples are more reliable than rules.
Grouping few-shot examples by tool instead of showing comparative reasoning across tools for ambiguous cases.

ts-4.3

Enforce structured output using tool use and JSON schemas

Key Points

tool_use with JSON schemas is the most reliable approach for guaranteed schema-compliant structured output.
tool_choice: 'auto' (may return text), 'any' (must call a tool), forced selection (must call a specific tool).
Strict JSON schemas via tool use eliminate syntax errors but do NOT prevent semantic errors (values in wrong fields, line items not summing).
Design schema fields as optional (nullable) when source documents may not contain the information, preventing hallucinated values.

Decision Rules

When: You need guaranteed structured output with no JSON syntax errors

→Define an extraction tool with JSON schema as input parameters; extract data from the tool_use response.

When: Multiple extraction schemas exist and the document type is unknown

→Set tool_choice: 'any' to guarantee a tool call while letting the model choose which extraction schema.

When: Source documents may not contain all required fields

→Design those schema fields as optional (nullable) to prevent the model from fabricating values.

✗ Anti-Patterns to Reject

Relying on prompt instructions to produce JSON instead of using tool_use for guaranteed schema compliance.
Making all schema fields required when source documents may lack the data, causing the model to hallucinate values.

ts-4.4

Implement validation, retry, and feedback loops for extraction quality

Key Points

Retry-with-error-feedback: append specific validation errors to the prompt on retry to guide the model toward correction.
Retries are ineffective when required information is simply absent from the source document (vs format or structural errors).
Track which code constructs trigger findings (detected_pattern field) to enable systematic analysis of dismissal patterns.
Semantic validation (values don't sum, wrong field placement) requires separate validation logic -- tool use only prevents syntax errors.

Decision Rules

When: Extraction output has format or structural errors (wrong nesting, bad date format)

→Retry with the original document, the failed extraction, and specific validation errors appended.

When: Required data simply does not exist in the source document

→Do NOT retry -- retries cannot conjure missing information. Accept null/empty or flag for human review.

When: Developers frequently dismiss automated findings and you want to improve accuracy

→Add detected_pattern fields to structured findings to track which constructs produce false positives.

✗ Anti-Patterns to Reject

Retrying extraction when the source document does not contain the required information.
Using generic retry prompts like 'try again' without including the specific validation errors that triggered the retry.

ts-4.5

Design efficient batch processing strategies

Key Points

Message Batches API: 50% cost savings, up to 24-hour processing window, no guaranteed latency SLA.
Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation).
The batch API does NOT support multi-turn tool calling within a single request -- breaks iterative workflows.
Use custom_id fields for correlating batch request/response pairs and handling failures.

Decision Rules

When: Workflow is latency-sensitive and blocks developers (pre-merge checks)

→Use synchronous API calls, NOT batch processing.

When: Workflow is scheduled and latency-tolerant (overnight reports, weekly audits, nightly test generation)

→Use Message Batches API for 50% cost savings.

When: Workflow requires iterative tool calling (analyze file, request related files, continue analysis)

→Do NOT use batch processing -- it cannot execute tools mid-request and return results.

✗ Anti-Patterns to Reject

Using batch processing for blocking pre-merge checks where developers are waiting for results.
Attempting to use batch processing for iterative tool-calling workflows that require mid-request tool execution.

ts-4.6

Design multi-instance and multi-pass review architectures

Key Points

Self-review limitation: a model retains reasoning context from generation, making it less likely to question its own decisions.
Independent review instances (without prior reasoning context) catch subtle issues that self-review and extended thinking miss.
Multi-pass review: split into per-file local analysis passes plus cross-file integration passes to avoid attention dilution.
Include reasoning and confidence assessments inline with each finding to speed up developer triage.

Decision Rules

When: Claude-generated code has subtle issues that only surface during human peer review

→Use a second, independent Claude instance to review without access to the generator's reasoning.

When: Single-pass review of many files produces inconsistent depth and contradictory feedback

→Split into per-file local passes plus a separate cross-file integration pass.

When: Developers spend too much time investigating each finding to decide if it is real

→Require Claude to include reasoning and confidence assessment inline with each finding.

✗ Anti-Patterns to Reject

Asking Claude to self-review its own output in the same session -- confirmation bias means it rationalizes the same way.
Using extended thinking as a substitute for independent review -- the same session context still biases the review.

Code & Comparisons

ts-4.2

Apply few-shot prompting to improve output consistency and quality

ts-4.3

Enforce structured output using tool use and JSON schemas

ts-4.5