Domain 5: Context Management & Reliability
15% of exam
Manage conversation context to preserve critical information across long interactions
Key Points
- Progressive summarization loses precise details: amounts, percentages, and dates get condensed into vague phrases.
- The 'lost in the middle' effect: models reliably process the beginning and end of long inputs but may omit middle sections.
- Tool results accumulate tokens disproportionate to their relevance (e.g., 40+ fields when only 5 are relevant).
- Place key findings summaries at the beginning of aggregated inputs; organize detailed results with explicit section headers.
Decision Rules
When: Customer references specific amounts ('the 15% discount I mentioned') that were summarized away
→Extract transactional facts (amounts, dates, order numbers) into a persistent 'case facts' block outside summarized history.
When: Synthesis agent omits critical findings from the middle of 75K+ token aggregated input
→Place a key findings summary at the beginning; organize the rest with explicit section headers.
When: Tool outputs return 40+ fields per lookup when only 5 are relevant
→Trim verbose tool outputs to only relevant fields before they accumulate in context.
✗ Anti-Patterns to Reject
- Relying on progressive summarization to preserve exact numerical values and dates from early in a conversation.
- Increasing the summarization threshold (e.g., 70% to 85%) instead of extracting critical facts into a persistent block.
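The 'case facts' block and the tool-output trimming rule above can be sketched as follows. This is a minimal illustration, not a prescribed implementation: the field names in `RELEVANT_FIELDS`, the `ConversationContext` class, and the section labels in `render` are all hypothetical.

```python
from dataclasses import dataclass, field

# Hypothetical whitelist: the handful of fields this agent actually needs.
RELEVANT_FIELDS = ("order_id", "status", "total", "discount_pct", "order_date")


def trim_tool_output(raw: dict, keep=RELEVANT_FIELDS) -> dict:
    """Keep only whitelisted fields before the result enters the context."""
    return {k: raw[k] for k in keep if k in raw}


@dataclass
class ConversationContext:
    case_facts: dict = field(default_factory=dict)    # exact values, never summarized
    summary: str = ""                                 # lossy, may be rewritten
    recent_turns: list = field(default_factory=list)  # verbatim recent messages

    def record_fact(self, key: str, value) -> None:
        """Store a transactional fact outside the summarized history."""
        self.case_facts[key] = value

    def render(self) -> str:
        """Build the prompt context with the case facts placed first."""
        facts = "\n".join(f"- {k}: {v}" for k, v in self.case_facts.items())
        turns = "\n".join(self.recent_turns)
        return (
            "## Case facts (verbatim, do not summarize)\n" + facts
            + "\n\n## Conversation summary\n" + self.summary
            + "\n\n## Recent turns\n" + turns
        )
```

Because `case_facts` is rendered separately from `summary`, rewriting the summary cannot erase a figure such as the 15% discount. Placing the facts first also keeps them out of the middle of a long input.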
Design effective escalation and ambiguity resolution patterns
Key Points
- Appropriate escalation triggers: the customer explicitly requests a human, policy exceptions/gaps, inability to make meaningful progress.
- Escalate immediately when the customer explicitly demands a human -- do not first attempt investigation.
- Sentiment-based escalation and self-reported confidence scores are unreliable proxies for actual case complexity.
- When multiple customer matches are returned, ask for an additional identifier (email, phone, order number) rather than guessing.
Decision Rules
When: Policy is ambiguous or silent on the customer's specific request (e.g., competitor price matching)
→Escalate to a human for policy interpretation -- do not fabricate a policy.
When: get_customer returns multiple matches and the agent guesses wrong 15% of the time
→Instruct the agent to ask for an additional identifier before taking any customer-specific action.
When: The issue is straightforward but the customer explicitly asks for a human agent
→Escalate immediately -- honor the explicit request without attempting to resolve first.
✗ Anti-Patterns to Reject
- Using heuristics (most recent order, conversational context clues) to guess the right customer from multiple matches.
- Implementing sentiment analysis or self-reported confidence scores as escalation triggers.
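The decision rules above can be read as a small ordering of checks. The sketch below is one possible encoding under stated assumptions: the `Action` enum, the `next_action` function, and its parameters are hypothetical names, and treating zero matches like multiple matches is an assumption of this example rather than something the rules specify.

```python
from enum import Enum


class Action(Enum):
    ESCALATE = "escalate"
    ASK_IDENTIFIER = "ask_identifier"
    PROCEED = "proceed"


def next_action(explicit_human_request: bool,
                policy_covers_request: bool,
                customer_matches: list) -> Action:
    """Pick the next step; sentiment and self-reported confidence are not inputs."""
    # An explicit request for a human wins, even if the issue looks simple.
    if explicit_human_request:
        return Action.ESCALATE
    # Policy is silent or ambiguous: a human interprets it, the agent does not invent one.
    if not policy_covers_request:
        return Action.ESCALATE
    # Anything other than exactly one match: ask for email, phone, or order number.
    # (Handling zero matches this way is an assumption of this sketch.)
    if len(customer_matches) != 1:
        return Action.ASK_IDENTIFIER
    return Action.PROCEED
```

The function deliberately takes no sentiment or confidence argument, which mirrors the anti-pattern list: neither signal is a reliable proxy for case complexity.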
Implement error propagation strategies across multi-agent systems
Key Points
- Structured error context (failure type, attempted query, partial results, alternative approaches) enables intelligent coordinator recovery.
- Distinguish access failures (timeouts needing retry decisions) from valid empty results (successful queries with no matches).
- Silently suppressing errors (returning empty as success) or terminating on single failures are both anti-patterns.
- Subagents should handle transient failures locally and only propagate errors they cannot resolve, with partial results.
Decision Rules
When: A subagent encounters a timeout (transient failure)
→Attempt local recovery; if it fails, propagate structured error context (failure type, what was attempted, partial results) to the coordinator.
When: A subagent encounters a corrupted file (permanent failure)
→Return the error with context to the coordinator -- do NOT retry (corruption is permanent).
When: Some source categories succeed while others fail in a multi-source research task
→Proceed with available data; annotate synthesis output with coverage gaps indicating which sources were unavailable.
✗ Anti-Patterns to Reject
- Returning empty results marked as 'success' when a timeout occurred, hiding the failure from the coordinator.
- Terminating the entire research workflow when one source fails, discarding all successful results.
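A minimal sketch of structured error propagation, assuming a subagent that wraps some search callable. `SubagentResult`, `CorruptedFileError`, `run_search`, `coverage_report`, and the strings in `alternatives` are hypothetical names chosen for illustration. The built-in `TimeoutError` stands in for a transient access failure.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class CorruptedFileError(Exception):
    """Hypothetical permanent failure: retrying cannot help."""


@dataclass
class SubagentResult:
    status: str                                   # "ok", "empty", or "error"
    results: list = field(default_factory=list)
    failure_type: Optional[str] = None
    attempted: Optional[str] = None
    # Anything gathered before the failure; left empty here because the
    # hypothetical search callable is all-or-nothing.
    partial_results: list = field(default_factory=list)
    alternatives: list = field(default_factory=list)


def run_search(search: Callable[[str], list], query: str,
               max_retries: int = 2) -> SubagentResult:
    """Retry transient failures locally; propagate only what cannot be resolved."""
    for _ in range(max_retries + 1):
        try:
            hits = search(query)
        except TimeoutError:
            continue  # transient: try again locally
        except CorruptedFileError:
            # Permanent: do not retry, report with context.
            return SubagentResult(status="error", failure_type="permanent",
                                  attempted=query,
                                  alternatives=["use a different source"])
        # A successful query with no matches is "empty", not an error.
        return SubagentResult(status="ok" if hits else "empty", results=hits)
    return SubagentResult(status="error", failure_type="transient_timeout",
                          attempted=query,
                          alternatives=["retry later", "use a different source"])


def coverage_report(by_source: dict) -> dict:
    """Coordinator side: keep successful sources, list the gaps explicitly."""
    return {
        "used": [s for s, r in by_source.items() if r.status != "error"],
        "gaps": [s for s, r in by_source.items() if r.status == "error"],
    }
```

The distinct `"empty"` and `"error"` statuses are what keep a timeout from being mistaken for a valid query with no matches. `coverage_report` lets the coordinator proceed with partial data while annotating which sources were unavailable.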
Manage context effectively in large codebase exploration
Key Points
- Context degradation in extended sessions: models start referencing 'typical patterns' instead of specific classes discovered earlier.
- Scratchpad files persist key findings across context boundaries, countering degradation.
- Subagent delegation isolates verbose exploration output while the main agent coordinates high-level understanding.
- Structured state persistence: each agent exports state to a known location; the coordinator loads a manifest on resume.
Decision Rules
When: Discovery phase generates verbose output that fills the main context window
→Use the Explore subagent or context: fork to isolate verbose output; return a concise summary.
When: Extended exploration session shows signs of context degradation (vague references instead of specifics)
→Have agents maintain scratchpad files recording key findings; use /compact to reduce context usage.
When: Multi-phase task needs to persist findings across context boundaries
→Summarize key findings from one phase before spawning sub-agents for the next; inject summaries into initial context.
✗ Anti-Patterns to Reject
- Continuing all phases in the main conversation using /compact repeatedly -- lossy compression discards important details.
- Re-exploring the entire codebase from scratch instead of persisting findings in scratchpad files.
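Structured state persistence might look like the sketch below: each agent writes its findings to a known directory and registers itself in a manifest that the coordinator reads on resume. The file layout (`manifest.json`, one JSON file per agent) and the function names are assumptions made for this example.

```python
import json
from pathlib import Path


def export_state(state_dir: Path, agent: str, findings: list) -> None:
    """Write one agent's findings and register the file in the manifest."""
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / f"{agent}.json").write_text(json.dumps(findings, indent=2))

    manifest_path = state_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {}
    manifest[agent] = f"{agent}.json"
    manifest_path.write_text(json.dumps(manifest, indent=2))


def load_all(state_dir: Path) -> dict:
    """Coordinator side: load every agent's findings listed in the manifest."""
    manifest = json.loads((state_dir / "manifest.json").read_text())
    return {
        agent: json.loads((state_dir / filename).read_text())
        for agent, filename in manifest.items()
    }
```

Recording specific names (classes, file paths) in these files, rather than prose summaries, is what lets a resumed session refer to the classes it actually discovered instead of 'typical patterns'.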
Design human review workflows and confidence calibration
Key Points
- Aggregate accuracy metrics (97% overall) may mask poor performance on specific document types or fields.
- Use stratified random sampling to measure error rates in high-confidence extractions and detect novel patterns.
- Field-level confidence scores should be calibrated using labeled validation sets for routing review attention.
- Validate accuracy by document type AND field segment before automating high-confidence extractions.
Decision Rules
When: Overall accuracy is 97% but you suspect some document types perform poorly
→Analyze accuracy by document type and field to identify hidden poor-performing segments.
When: You want to reduce human review overhead on high-confidence extractions
→Implement stratified random sampling of high-confidence outputs; only reduce review after validating by segment.
When: Model outputs field-level confidence scores but they do not correlate with actual accuracy
→Calibrate confidence thresholds using labeled validation sets rather than trusting raw model scores.
✗ Anti-Patterns to Reject
- Trusting aggregate accuracy metrics without breaking down performance by document type and field.
- Automating all high-confidence extractions without validating that confidence correlates with actual accuracy per segment.
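Breaking accuracy down by segment and drawing a stratified sample for review can be sketched as follows. The record shape (dicts with `doc_type`, `field`, and `correct` keys) and both function names are hypothetical.

```python
import random
from collections import defaultdict


def accuracy_by_segment(records) -> dict:
    """Accuracy per (doc_type, field) pair instead of one aggregate number."""
    totals = defaultdict(lambda: [0, 0])  # segment -> [correct, total]
    for r in records:
        segment = (r["doc_type"], r["field"])
        totals[segment][0] += int(r["correct"])
        totals[segment][1] += 1
    return {segment: c / n for segment, (c, n) in totals.items()}


def stratified_sample(records, per_segment: int, seed: int = 0) -> list:
    """Draw up to per_segment records from every segment for human review."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for r in records:
        groups[(r["doc_type"], r["field"])].append(r)
    sample = []
    for segment in sorted(groups):
        items = groups[segment]
        sample.extend(rng.sample(items, min(per_segment, len(items))))
    return sample
```

Sampling a fixed number per segment, rather than uniformly over all records, keeps rare document types in the review queue. Those are the segments a high aggregate score is most likely to hide.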
Preserve information provenance and handle uncertainty in multi-source synthesis
Key Points
- Source attribution is lost during summarization if claim-source mappings are not preserved.
- Conflicting statistics from credible sources should be annotated with source attribution, not arbitrarily resolved.
- Require publication/collection dates in structured outputs to prevent temporal differences from being misinterpreted as contradictions.
- Render different content types appropriately: financial data as tables, news as prose, technical findings as structured lists.
Decision Rules
When: Two credible sources report conflicting statistics on a key metric
→Include both values with explicit source attribution; let the coordinator decide how to reconcile before synthesis.
When: Subagent outputs are compressed and downstream agents lose track of which claims came from where
→Require subagents to output structured claim-source mappings (source URLs, document names, excerpts).
When: Data from different time periods appears contradictory
→Require publication/collection dates in structured outputs to enable correct temporal interpretation.
✗ Anti-Patterns to Reject
- Applying source credibility heuristics to select one value over another -- this oversteps the subagent's role.
- Converting all content types to a uniform format (e.g., all prose) instead of rendering each type appropriately.
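One way to keep claim-source mappings intact is to make the source and date required fields of every claim, and to flag conflicts rather than resolve them. The `Claim` fields and the `group_claims` function below are hypothetical, offered only as a sketch of the structured output a subagent could return.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    metric: str
    value: float
    source: str     # URL or document name
    excerpt: str    # supporting text from the source
    published: str  # ISO date string, e.g. "2023-06-30"


def group_claims(claims) -> dict:
    """Group claims by metric, keep every value, and flag conflicts."""
    by_metric = defaultdict(list)
    for claim in claims:
        by_metric[claim.metric].append(claim)

    report = {}
    for metric, items in by_metric.items():
        report[metric] = {
            # ISO dates sort chronologically as plain strings.
            "claims": sorted(items, key=lambda c: c.published),
            "conflict": len({c.value for c in items}) > 1,
            # Different dates suggest a temporal difference, not a contradiction.
            "differing_dates": len({c.published for c in items}) > 1,
        }
    return report
```

The function never picks a winner. It passes both values, their sources, and their dates to the coordinator, which is the party the decision rules assign the reconciliation to.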