Prompt Engineering & Structured Output
Design prompts with explicit criteria, apply few-shot patterns, enforce structured output via JSON schemas, implement validation loops, design batch processing strategies, and architect multi-instance reviews.
Design prompts with explicit criteria to improve precision and reduce false positives
Writing specific, categorical prompt criteria that improve precision and reduce false positive rates.
Knowledge of:
- The importance of explicit criteria over vague instructions (e.g., "flag comments only when claimed behavior contradicts actual code behavior" vs "check that comments are accurate")
- How general instructions like "be conservative" or "only report high-confidence findings" fail to improve precision compared to specific categorical criteria
- The impact of false positive rates on developer trust: high false positive categories undermine confidence in accurate categories
Skills in:
- Writing specific review criteria that define which issues to report (bugs, security) versus skip (minor style, local patterns) rather than relying on confidence-based filtering
- Temporarily disabling high false-positive categories to restore developer trust while improving prompts for those categories
- Defining explicit severity criteria with concrete code examples for each severity level to achieve consistent classification
Explicit Criteria over Vague Instructions
Core: Replace vague goals with specific, categorical criteria the model can apply deterministically
Prompt Specificity & Precision
Core: Replace vague goals with specific, actionable criteria
Classification Consistency & False Positive Reduction
Core: Use absolute criteria with concrete examples for each classification level
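A minimal sketch of the difference in practice, assuming a hypothetical code-review task: the prompt names the categories to report and to skip, and anchors each severity level with a concrete example, instead of asking the model to "be conservative".

```python
# Hypothetical review prompt: categorical criteria instead of vague instructions.
# The categories, severity definitions, and inline examples are illustrative only.
REVIEW_PROMPT = """Review the diff below.

Report ONLY these categories:
- Bug: the code's behavior contradicts its comments, docs, or obvious intent
  (e.g., a comment says "retries 3 times" but the loop runs once).
- Security: unsanitized user input reaches SQL, shell, or file-path APIs.

Do NOT report:
- Style, naming, or formatting preferences.
- Patterns that match local conventions already used in the same file.

Severity definitions (classify every finding as exactly one):
- high: wrong results or exploitable (e.g., price * qty summed at the wrong loop level)
- medium: incorrect only on unusual inputs (e.g., empty list, negative id)
- low: correct but misleading (e.g., a comment describing a removed parameter)

Diff:
{diff}
"""
```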
Apply few-shot prompting to improve output consistency and quality
Using targeted few-shot examples to achieve consistent formatting, handle ambiguous cases, and reduce hallucination.
Knowledge of:
- Few-shot examples as the most effective technique for achieving consistently formatted, actionable output when detailed instructions alone produce inconsistent results
- The role of few-shot examples in demonstrating ambiguous-case handling (e.g., tool selection for ambiguous requests, branch-level test coverage gaps)
- How few-shot examples enable the model to generalize judgment to novel patterns rather than matching only pre-specified cases
- The effectiveness of few-shot examples for reducing hallucination in extraction tasks (e.g., handling informal measurements, varied document structures)
Skills in:
- Creating 2-4 targeted few-shot examples for ambiguous scenarios that show reasoning for why one action was chosen over plausible alternatives
- Including few-shot examples that demonstrate specific desired output format (issue, severity, suggested fix) to achieve consistency
- Providing few-shot examples distinguishing acceptable code patterns from genuine issues to reduce false positives while enabling generalization
- Using few-shot examples to demonstrate correct handling of varied document structures (inline citations vs bibliographies, methodology sections vs embedded details)
- Adding few-shot examples showing correct extraction from documents with varied formats to address empty/null extraction of required fields
Few-Shot Prompting Techniques
Core: Few-shot examples are more reliable than instructions for consistent formatting
Concrete Input-Output Examples
Core: Concrete examples eliminate ambiguity that prose descriptions create
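A short sketch of a few-shot block, assuming a hypothetical review format of issue, severity, and suggested fix. The first example pins the output format; the second shows the reasoning for not flagging an acceptable pattern, which is what lets the model generalize rather than pattern-match.

```python
# Hypothetical few-shot examples appended to a review prompt: one positive finding
# that fixes the output format, and one acceptable pattern that is deliberately not flagged.
FEW_SHOT_EXAMPLES = """Examples:

Input:
    def get_user(user_id):
        return db.query(f"SELECT * FROM users WHERE id = {user_id}")
Output:
    issue: SQL injection via f-string interpolation of user_id
    severity: high
    suggested_fix: use a parameterized query, e.g. db.query("... WHERE id = ?", (user_id,))

Input:
    MAX_RETRIES = 3  # constant defined once and documented in the module docstring
Output:
    issue: none
    reasoning: a documented module-level constant is an accepted local pattern, not a defect
"""
```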
Enforce structured output using tool use and JSON schemas
Using tool_use with JSON schemas for guaranteed structured output, understanding tool_choice options, and designing schemas.
Knowledge of:
- Tool use (tool_use) with JSON schemas as the most reliable approach for guaranteed schema-compliant structured output, eliminating JSON syntax errors
- The distinction between tool_choice: "auto" (model may return text instead of calling a tool), "any" (model must call a tool but can choose which), and forced tool selection (model must call a specific named tool)
- That strict JSON schemas via tool use eliminate syntax errors but do not prevent semantic errors (e.g., line items that don't sum to total, values in wrong fields)
- Schema design considerations: required vs optional fields, enum fields with "other" + detail string patterns for extensible categories
Skills in:
- Defining extraction tools whose JSON input schema describes the desired fields, and extracting structured data from the tool_use response
- Setting tool_choice: "any" to guarantee structured output when multiple extraction schemas exist and the document type is unknown
- Forcing a specific tool with tool_choice: {"type": "tool", "name": "extract_metadata"} to ensure a particular extraction runs before enrichment steps
- Designing schema fields as optional (nullable) when source documents may not contain the information, preventing the model from fabricating values to satisfy required fields
- Adding enum values like "unclear" for ambiguous cases and "other" + detail fields for extensible categorization
- Including format normalization rules in prompts alongside strict output schemas to handle inconsistent source formatting
Structured Output via Tool Use & JSON Schemas
Core: Tool use with JSON schemas eliminates syntax errors but not semantic errors
tool_choice Options & Forced Tool Selection
Core: tool_choice 'auto' may return text; 'any' guarantees a tool call; forced selection guarantees a specific tool
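A sketch of the pattern using the Python SDK, under these assumptions: the record_invoice tool, its fields, and the invoice use case are hypothetical, and the model id is a placeholder. The tool's input schema is the output contract; forcing the tool guarantees the extraction runs, while tool_choice 'any' would be the right setting if several extraction tools were registered and the document type were unknown.

```python
import anthropic

client = anthropic.Anthropic()

# Hypothetical extraction tool: its input_schema is the structured output we want back.
invoice_tool = {
    "name": "record_invoice",
    "description": "Record structured fields extracted from an invoice.",
    "input_schema": {
        "type": "object",
        "properties": {
            "vendor": {"type": "string"},
            "total": {"type": "number"},
            "currency": {"type": "string", "enum": ["USD", "EUR", "GBP", "other"]},
            "currency_detail": {"type": ["string", "null"],
                                "description": "Set only when currency is 'other'."},
            "due_date": {"type": ["string", "null"],
                         "description": "Null when the document does not state one."},
        },
        "required": ["vendor", "total", "currency"],
    },
}

def extract_invoice(document_text: str) -> dict:
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder: use whichever model you run
        max_tokens=1024,
        tools=[invoice_tool],
        # Force this specific tool so the extraction step always produces structured data.
        tool_choice={"type": "tool", "name": "record_invoice"},
        messages=[{"role": "user",
                   "content": f"Extract the invoice fields:\n\n{document_text}"}],
    )
    # The structured data is the tool call's input, already schema-compliant.
    return next(block.input for block in response.content if block.type == "tool_use")
```

Schema compliance here covers syntax only; a returned total that does not match the line items still needs the semantic validation described in the next objective.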
Implement validation, retry, and feedback loops for extraction quality
Designing retry-with-error-feedback loops, identifying when retries will succeed vs fail, and building systematic feedback mechanisms.
Knowledge of:
- Retry-with-error-feedback: appending specific validation errors to the prompt on retry to guide the model toward correction
- The limits of retry: retries are ineffective when the required information is simply absent from the source document (vs format or structural errors)
- Feedback loop design: tracking which code constructs trigger findings (detected_pattern field) to enable systematic analysis of dismissal patterns
- The difference between semantic validation errors (values don't sum, wrong field placement) and schema syntax errors (eliminated by tool use)
Skills in:
- Implementing follow-up requests that include the original document, the failed extraction, and specific validation errors for model self-correction
- Identifying when retries will be ineffective (e.g., information exists only in an external document not provided) versus when they will succeed (format mismatches, structural output errors)
- Adding detected_pattern fields to structured findings to enable analysis of false positive patterns when developers dismiss findings
- Designing self-correction validation flows: extracting "calculated_total" alongside "stated_total" to flag discrepancies, adding "conflict_detected" booleans for inconsistent source data
Retry-with-Error-Feedback Pattern
Core: Append specific validation errors to the retry prompt, not just 'try again'
Feedback Loop Design & Dismissal Pattern Analysis
Advanced: Add detected_pattern fields to enable systematic analysis of false positive patterns
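A sketch of the retry loop, reusing the hypothetical extraction tool from the previous objective; validate stands in for whatever semantic checks apply (line items that sum to the total, values in the right fields) and returns a list of error strings.

```python
def extract_with_feedback(client, tool, document_text, validate, max_retries=2):
    """Retry extraction, feeding the failed output and specific errors back to the model."""
    prompt = f"Extract the fields defined by the tool schema:\n\n{document_text}"
    extraction = None
    for _ in range(max_retries + 1):
        response = client.messages.create(
            model="claude-sonnet-4-5",  # placeholder model id
            max_tokens=1024,
            tools=[tool],
            tool_choice={"type": "tool", "name": tool["name"]},
            messages=[{"role": "user", "content": prompt}],
        )
        extraction = next(b.input for b in response.content if b.type == "tool_use")
        errors = validate(extraction)  # hypothetical semantic checks, e.g. totals that sum
        if not errors:
            return extraction
        # Feed back the document, the failed extraction, and the specific errors,
        # not just "try again".
        prompt = (
            f"Extract the fields defined by the tool schema:\n\n{document_text}\n\n"
            f"A previous attempt produced:\n{extraction}\n\n"
            "It failed these checks:\n- " + "\n- ".join(errors) +
            "\n\nCorrect these specific problems and re-extract."
        )
    return extraction  # still failing: the information is likely absent, so retries won't help
```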
Design efficient batch processing strategies
Matching API approach to workflow latency requirements, handling batch failures, and optimizing batch submission.
Knowledge of:
- The Message Batches API: 50% cost savings, up to 24-hour processing window, no guaranteed latency SLA
- Batch processing is appropriate for non-blocking, latency-tolerant workloads (overnight reports, weekly audits, nightly test generation) and inappropriate for blocking workflows (pre-merge checks)
- The batch API does not support multi-turn tool calling within a single request (cannot execute tools mid-request and return results)
- custom_id fields for correlating batch request/response pairs
Skills in:
- Matching API approach to workflow latency requirements: synchronous API for blocking pre-merge checks, batch API for overnight/weekly analysis
- Calculating batch submission frequency from SLA constraints (e.g., submitting every 4 hours so that up to 4 hours of queueing plus the 24-hour batch window still fits within a 30-hour SLA)
- Handling batch failures: resubmitting only failed documents (identified by custom_id) with appropriate modifications (e.g., chunking documents that exceeded context limits)
- Using prompt refinement on a sample set before batch-processing large volumes to maximize first-pass success rates and reduce iterative resubmission costs
Batch Processing Strategy & API Selection
Core: Batch API saves 50% but has up to 24-hour processing with no latency SLA
Batch Failure Handling & Constraints
Core: Resubmit only failed documents identified by custom_id, not the entire batch
Batch Cost Optimization Strategies
Advanced: 50% batch savings are reduced by resubmission costs; maximize first-pass success
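A sketch of an overnight batch job with the Message Batches API; the document ids, prompt, and model id are placeholders. The custom_id on each request is what makes selective resubmission of failures possible.

```python
import anthropic

client = anthropic.Anthropic()

documents = {"doc-001": "...", "doc-002": "..."}  # hypothetical corpus for a nightly job

batch = client.messages.batches.create(
    requests=[
        {
            "custom_id": doc_id,  # correlates each result with its source document
            "params": {
                "model": "claude-sonnet-4-5",  # placeholder model id
                "max_tokens": 1024,
                "messages": [{"role": "user", "content": f"Summarize:\n\n{text}"}],
            },
        }
        for doc_id, text in documents.items()
    ]
)

# Once the batch's processing_status is "ended", collect only the failures and
# resubmit those documents (chunked first if they exceeded context limits).
failed_ids = [
    result.custom_id
    for result in client.messages.batches.results(batch.id)
    if result.result.type != "succeeded"
]
```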
Design multi-instance and multi-pass review architectures
Using independent review instances and multi-pass strategies to catch issues that self-review misses.
Knowledge of:
- Self-review limitations: a model retains reasoning context from generation, making it less likely to question its own decisions in the same session
- Independent review instances (without prior reasoning context) are more effective at catching subtle issues than self-review instructions or extended thinking
- Multi-pass review: splitting large reviews into per-file local analysis passes plus cross-file integration passes to avoid attention dilution and contradictory findings
Skills in:
- Using a second independent Claude instance to review generated code without the generator's reasoning context
- Splitting large multi-file reviews into focused per-file passes for local issues plus separate integration passes for cross-file data flow analysis
- Running verification passes where the model self-reports confidence alongside each finding to enable calibrated review routing
Self-Critique Limitations & Independent Review
Core: Self-review in the same context suffers from confirmation bias; the model retains its generation reasoning
Multi-Pass Review Architecture
Core: Split large reviews into per-file local passes plus cross-file integration passes
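A sketch of an independent second-instance review; the review criteria, confidence field, and model id are illustrative assumptions. The essential property is that this call shares no conversation history with the session that generated the code, so the reviewer has none of the generator's reasoning to confirm.

```python
import anthropic

client = anthropic.Anthropic()

def independent_review(generated_code: str) -> str:
    """Review code in a fresh context, without the generator's reasoning."""
    response = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model id
        max_tokens=2048,
        messages=[{
            "role": "user",
            "content": (
                "Review the following code. Report only logic bugs, security issues, "
                "and incorrect error handling; skip style comments. For each finding, "
                "give issue, severity, confidence (high/medium/low), and a suggested fix.\n\n"
                + generated_code
            ),
        }],
    )
    return response.content[0].text
```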