Domain 5: Context Management & Reliability
Weight: 15%

ts-5.1 Manage conversation context to preserve critical information across long interactions
- Progressive summarization loses precise details: amounts, percentages, dates get condensed into vague phrases.
- The 'lost in the middle' effect: models reliably process the beginning and end of long inputs but may omit middle sections.
- Tool results accumulate tokens disproportionately to their relevance (e.g., 40+ fields returned when only 5 are relevant).
- Place key findings summaries at the beginning of aggregated inputs; organize detailed results with explicit section headers.
Decision Rules
When the customer references specific amounts ('the 15% discount I mentioned') that were summarized away → Extract transactional facts (amounts, dates, order numbers) into a persistent 'case facts' block outside the summarized history.
When the synthesis agent omits critical findings from the middle of a 75K+ token aggregated input → Place a key findings summary at the beginning; organize the rest with explicit section headers.
When tool outputs return 40+ fields per lookup but only 5 are relevant → Trim verbose tool outputs to only the relevant fields before they accumulate in context.
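The trimming rule above can be sketched as a simple post-processing step applied to tool results before they enter the conversation context. A minimal sketch, assuming an allowlist of relevant fields (`RELEVANT_FIELDS` and the raw lookup shape are illustrative, not from any real API):

```python
# Illustrative sketch: trim verbose tool output to the fields the agent
# actually needs before it accumulates in context.
RELEVANT_FIELDS = {"order_id", "status", "total", "ship_date", "email"}  # assumed

def trim_tool_output(raw: dict, keep: set[str] = RELEVANT_FIELDS) -> dict:
    """Drop irrelevant fields so tool results don't bloat the context window."""
    return {k: v for k, v in raw.items() if k in keep}

# A 40+ field lookup collapses to the handful of fields that matter:
raw_lookup = {f"field_{i}": i for i in range(40)} | {"order_id": "A-123", "status": "shipped"}
trimmed = trim_tool_output(raw_lookup)
# → {'order_id': 'A-123', 'status': 'shipped'}
```

The allowlist would normally be defined per tool, since relevance depends on what the agent is doing with the lookup.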
Anti-Patterns
- Relying on progressive summarization to preserve exact numerical values and dates from early in a conversation.
- Increasing the summarization threshold (e.g., 70% to 85%) instead of extracting critical facts into a persistent block.
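The 'case facts' extraction described above can be sketched as a pattern-matching pass that runs on each message before summarization; the extracted block is then re-injected into every prompt. The patterns (especially the order-number format) are illustrative assumptions:

```python
import re

# Sketch: pull transactional facts (percentages, amounts, dates, order numbers)
# into a persistent "case facts" block kept outside the summarized history.
FACT_PATTERNS = {
    "percentages": r"\b\d{1,3}%",
    "amounts": r"\$\d[\d,]*(?:\.\d{2})?",
    "dates": r"\b\d{4}-\d{2}-\d{2}\b",
    "order_numbers": r"\bORD-\d+\b",   # assumed order-number format
}

def extract_case_facts(message: str, facts: dict[str, set]) -> dict[str, set]:
    """Accumulate exact values that progressive summarization would blur."""
    for kind, pattern in FACT_PATTERNS.items():
        facts.setdefault(kind, set()).update(re.findall(pattern, message))
    return facts

facts: dict[str, set] = {}
extract_case_facts("I'd like the 15% discount on order ORD-991 from 2024-03-01", facts)
# facts now holds '15%', 'ORD-991', and '2024-03-01' regardless of how
# aggressively the surrounding conversation history is later summarized.
```

Because the block lives outside the summarized history, raising or lowering the summarization threshold no longer affects whether these exact values survive.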
ts-5.2 Design effective escalation and ambiguity resolution patterns
- Appropriate escalation triggers: customer explicitly requests human, policy exceptions/gaps, inability to make meaningful progress.
- Escalate immediately when customer explicitly demands a human -- do not first attempt investigation.
- Sentiment-based escalation and self-reported confidence scores are unreliable proxies for actual case complexity.
- When multiple customer matches are returned, ask for an additional identifier (email, phone, order number) rather than guessing.
Decision Rules
When policy is ambiguous or silent on the customer's specific request (e.g., competitor price matching) → Escalate to a human for policy interpretation -- do not fabricate a policy.
When get_customer returns multiple matches and the agent guesses wrong 15% of the time → Instruct the agent to ask for an additional identifier before taking any customer-specific action.
When the issue is straightforward but the customer explicitly asks for a human agent → Escalate immediately -- honor the explicit request without attempting to resolve first.
Anti-Patterns
- Using heuristics (most recent order, conversational context clues) to guess the right customer from multiple matches.
- Implementing sentiment analysis or self-reported confidence scores as escalation triggers.
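The multiple-match rule can be sketched as a small gate that runs before any customer-specific action; `resolve_customer` and the action schema are hypothetical names for illustration:

```python
# Sketch: never act on an ambiguous customer lookup -- ask for another
# identifier instead of guessing from heuristics like "most recent order".
def resolve_customer(matches: list[dict]) -> dict:
    if len(matches) == 1:
        return {"action": "proceed", "customer": matches[0]}
    if not matches:
        return {"action": "ask", "message": (
            "I couldn't find that account. Could you share the email "
            "or an order number on file?")}
    return {"action": "ask", "message": (
        "I found several matching accounts. Could you confirm the email, "
        "phone number, or a recent order number so I look at the right one?")}
```

The key design choice is that ambiguity produces a clarifying question, never a best guess: a 15% wrong-customer rate is far worse than one extra conversational turn.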
ts-5.3 Implement error propagation strategies across multi-agent systems
- Structured error context (failure type, attempted query, partial results, alternative approaches) enables intelligent coordinator recovery.
- Distinguish access failures (timeouts needing retry decisions) from valid empty results (successful queries with no matches).
- Silently suppressing errors (returning empty as success) or terminating on single failures are both anti-patterns.
- Subagents should handle transient failures locally and only propagate errors they cannot resolve, with partial results.
Decision Rules
When a subagent encounters a timeout (transient failure) → Attempt local recovery; if it fails, propagate structured error context (failure type, what was attempted, partial results) to the coordinator.
When a subagent encounters a corrupted file (permanent failure) → Return the error with context to the coordinator -- do NOT retry (corruption is permanent).
When some source categories succeed while others fail in a multi-source research task → Proceed with available data; annotate the synthesis output with coverage gaps indicating which sources were unavailable.
Anti-Patterns
- Returning empty results marked as 'success' when a timeout occurred, hiding the failure from the coordinator.
- Terminating the entire research workflow when one source fails, discarding all successful results.
ts-5.4 Manage context effectively in large codebase exploration
- Context degradation in extended sessions: models start referencing 'typical patterns' instead of specific classes discovered earlier.
- Scratchpad files persist key findings across context boundaries, countering degradation.
- Subagent delegation isolates verbose exploration output while the main agent coordinates high-level understanding.
- Structured state persistence: each agent exports state to a known location; the coordinator loads a manifest on resume.
Decision Rules
When the discovery phase generates verbose output that fills the main context window → Use the Explore subagent or context: fork to isolate verbose output; return a concise summary.
When an extended exploration session shows signs of context degradation (vague references instead of specifics) → Have agents maintain scratchpad files recording key findings; use /compact to reduce context usage.
When a multi-phase task needs to persist findings across context boundaries → Summarize key findings from one phase before spawning sub-agents for the next; inject the summaries into their initial context.
Anti-Patterns
- Continuing all phases in the main conversation using /compact repeatedly -- lossy compression discards important details.
- Re-exploring the entire codebase from scratch instead of persisting findings in scratchpad files.
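The structured state persistence described above can be sketched as a scratchpad-plus-manifest convention: each phase writes its findings to a known location and registers itself, so a resumed coordinator reloads state instead of re-exploring. The paths and manifest schema are illustrative assumptions:

```python
import json
from pathlib import Path

# Sketch: scratchpad files + a manifest so findings survive context boundaries.
STATE_DIR = Path("agent_state")  # assumed well-known location

def save_findings(phase: str, findings: dict) -> None:
    """Persist one phase's key findings and register it in the manifest."""
    STATE_DIR.mkdir(exist_ok=True)
    (STATE_DIR / f"{phase}.json").write_text(json.dumps(findings))
    manifest_path = STATE_DIR / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else []
    if phase not in manifest:
        manifest.append(phase)
    manifest_path.write_text(json.dumps(manifest))

def load_state() -> dict[str, dict]:
    """On resume, the coordinator loads the manifest and every phase's findings."""
    manifest_path = STATE_DIR / "manifest.json"
    if not manifest_path.exists():
        return {}
    phases = json.loads(manifest_path.read_text())
    return {p: json.loads((STATE_DIR / f"{p}.json").read_text()) for p in phases}
```

Unlike repeated /compact passes, nothing here is lossy: the specific class names and file paths discovered in an early phase come back verbatim when the next phase starts.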
ts-5.5 Design human review workflows and confidence calibration
- Aggregate accuracy metrics (97% overall) may mask poor performance on specific document types or fields.
- Use stratified random sampling to measure error rates in high-confidence extractions and detect novel patterns.
- Field-level confidence scores should be calibrated using labeled validation sets for routing review attention.
- Validate accuracy by document type AND field segment before automating high-confidence extractions.
Decision Rules
When overall accuracy is 97% but you suspect some document types perform poorly → Analyze accuracy by document type and field to identify hidden poor-performing segments.
When you want to reduce human review overhead on high-confidence extractions → Implement stratified random sampling of high-confidence outputs; only reduce review after validating by segment.
When the model outputs field-level confidence scores but they do not correlate with actual accuracy → Calibrate confidence thresholds using labeled validation sets rather than trusting raw model scores.
Anti-Patterns
- Trusting aggregate accuracy metrics without breaking down performance by document type and field.
- Automating all high-confidence extractions without validating that confidence correlates with actual accuracy per segment.
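The stratified-sampling and per-segment validation steps can be sketched as two small helpers. A minimal sketch, assuming a record schema with `doc_type`, `confidence`, and a human-reviewed `is_correct` label, and an assumed 0.9 high-confidence threshold:

```python
import random
from collections import defaultdict

# Sketch: stratified random sampling of high-confidence extractions for human
# review, grouped by document type, so a strong aggregate number (e.g. 97%)
# can't mask a weak segment. Record fields are illustrative assumptions.
def stratified_sample(records: list[dict], per_stratum: int, seed: int = 0) -> list[dict]:
    by_type: dict[str, list] = defaultdict(list)
    for r in records:
        if r["confidence"] >= 0.9:           # assumed high-confidence threshold
            by_type[r["doc_type"]].append(r)
    rng = random.Random(seed)
    sample: list[dict] = []
    for group in by_type.values():
        sample.extend(rng.sample(group, min(per_stratum, len(group))))
    return sample

def accuracy_by_segment(reviewed: list[dict]) -> dict[str, float]:
    """Break aggregate accuracy down by document type after human review."""
    totals: dict[str, int] = defaultdict(int)
    correct: dict[str, int] = defaultdict(int)
    for r in reviewed:
        totals[r["doc_type"]] += 1
        correct[r["doc_type"]] += r["is_correct"]
    return {t: correct[t] / totals[t] for t in totals}
```

Review is only reduced for a segment once its sampled high-confidence accuracy has been validated; the same sampled data doubles as a labeled set for calibrating the raw confidence scores.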
ts-5.6 Preserve information provenance and handle uncertainty in multi-source synthesis
- Source attribution is lost during summarization if claim-source mappings are not preserved.
- Conflicting statistics from credible sources should be annotated with source attribution, not arbitrarily resolved.
- Require publication/collection dates in structured outputs to prevent temporal differences from being misinterpreted as contradictions.
- Render different content types appropriately: financial data as tables, news as prose, technical findings as structured lists.
Decision Rules
When two credible sources report conflicting statistics on a key metric → Include both values with explicit source attribution; let the coordinator decide how to reconcile before synthesis.
When subagent outputs are compressed and downstream agents lose track of which claims came from where → Require subagents to output structured claim-source mappings (source URLs, document names, excerpts).
When data from different time periods appears contradictory → Require publication/collection dates in structured outputs to enable correct temporal interpretation.
Anti-Patterns
- Applying source credibility heuristics to select one value over another -- this oversteps the subagent's role.
- Converting all content types to a uniform format (e.g., all prose) instead of rendering each type appropriately.
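The claim-source mapping and conflict-annotation rules can be sketched as a structured output type a subagent emits instead of free prose. Field names are illustrative assumptions:

```python
from dataclasses import dataclass

# Sketch: structured claim-source mapping so attribution and dates survive
# downstream compression. Field names are assumptions, not a real schema.
@dataclass
class Claim:
    statement: str
    value: str
    source_url: str
    document: str
    excerpt: str
    published: str   # publication/collection date, ISO format

def report_metric(claims: list[Claim]) -> dict:
    """Surface conflicting values with attribution; never pick a winner."""
    values = {c.value for c in claims}
    return {
        "conflict": len(values) > 1,
        "values": [
            {"value": c.value, "source": c.source_url, "published": c.published}
            for c in claims
        ],
    }
```

Note that `report_metric` flags the conflict and carries both attributions forward; deciding which value to trust (or whether the dates explain the difference) is deliberately left to the coordinator.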