Domain 5: Context Management & Reliability

15% of exam

ts-5.1

Manage conversation context to preserve critical information across long interactions

Key Points

Progressive summarization loses precise details: amounts, percentages, dates get condensed into vague phrases.
The 'lost in the middle' effect: models reliably process the beginning and end of long inputs but may overlook or omit content from the middle.
Tool results consume tokens out of proportion to their relevance (e.g., 40+ fields returned when only 5 are relevant).
Place key findings summaries at the beginning of aggregated inputs; organize detailed results with explicit section headers.
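The placement advice in the last key point can be sketched as a small assembly step. This is a minimal illustration, not a prescribed format: the function name, the "KEY FINDINGS" label, and the "=== title ===" header style are all assumptions chosen for the example.

```python
def assemble_synthesis_input(key_findings: list[str], sections: dict[str, str]) -> str:
    """Put the key findings summary first, then each detailed result under its own header."""
    parts = ["KEY FINDINGS"]
    parts.extend(f"- {finding}" for finding in key_findings)
    for title, body in sections.items():
        parts.append("")                   # blank line between sections
        parts.append(f"=== {title} ===")   # explicit section header (hypothetical style)
        parts.append(body.strip())
    return "\n".join(parts)


aggregated = assemble_synthesis_input(
    ["Finding one", "Finding two"],
    {"Web search results": "detail A", "Document analysis": "detail B"},
)
```

The summary sits where the model attends most reliably, and the headers give the detailed middle sections something explicit to anchor on.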

Decision Rules

When: Customer references specific amounts ('the 15% discount I mentioned') that were summarized away

→Extract transactional facts (amounts, dates, order numbers) into a persistent 'case facts' block outside summarized history.

When: Synthesis agent omits critical findings from the middle of 75K+ token aggregated input

→Place a key findings summary at the beginning; organize the rest with explicit section headers.

When: Tool outputs return 40+ fields per lookup when only 5 are relevant

→Trim verbose tool outputs to only relevant fields before they accumulate in context.
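A minimal sketch of the first and third rules above, under stated assumptions: the field whitelist, the `CaseFacts` class, and the layout of the assembled context are hypothetical and only illustrate the idea of keeping exact facts outside the summarized history.

```python
from dataclasses import dataclass, field

# Hypothetical whitelist: the five fields this support flow actually needs.
RELEVANT_ORDER_FIELDS = ("order_id", "status", "total", "currency", "placed_on")


def trim_tool_output(raw: dict, keep=RELEVANT_ORDER_FIELDS) -> dict:
    """Keep only whitelisted fields so verbose lookups do not pile up in context."""
    return {key: raw[key] for key in keep if key in raw}


@dataclass
class CaseFacts:
    """Exact transactional facts kept outside the summarized history."""
    facts: dict = field(default_factory=dict)

    def record(self, name: str, value: str) -> None:
        self.facts[name] = value  # stored verbatim, never paraphrased

    def render(self) -> str:
        lines = [f"- {name}: {value}" for name, value in self.facts.items()]
        return "CASE FACTS (verbatim, do not summarize)\n" + "\n".join(lines)


def build_context(case_facts: CaseFacts, summary: str, recent_turns: list[str]) -> str:
    """Facts block first, then the lossy summary, then the latest turns."""
    return "\n\n".join([case_facts.render(), "SUMMARY\n" + summary, *recent_turns])
```

Because the facts block is rebuilt on every turn and never passes through the summarizer, a later reference to 'the 15% discount I mentioned' can still be resolved exactly.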

✗ Anti-Patterns to Reject

Relying on progressive summarization to preserve exact numerical values and dates from early in a conversation.
Increasing the summarization threshold (e.g., 70% to 85%) instead of extracting critical facts into a persistent block.

ts-5.2

Design effective escalation and ambiguity resolution patterns

Key Points

Appropriate escalation triggers: the customer explicitly requests a human, policy exceptions or gaps, inability to make meaningful progress.
Escalate immediately when the customer explicitly demands a human -- do not attempt an investigation first.
Sentiment-based escalation and self-reported confidence scores are unreliable proxies for actual case complexity.
When multiple customer matches are returned, ask for an additional identifier (email, phone, order number) rather than guessing.

Decision Rules

When: Policy is ambiguous or silent on the customer's specific request (e.g., competitor price matching)

→Escalate to a human for policy interpretation -- do not fabricate a policy.

When: get_customer returns multiple matches and the agent guesses wrong 15% of the time

→Instruct the agent to ask for an additional identifier before taking any customer-specific action.

When: The issue is straightforward but the customer explicitly asks for a human agent

→Escalate immediately -- honor the explicit request without attempting to resolve first.
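The rules above can be read as a small routing function. The sketch below is one possible encoding; the `Action` names, the parameter names, and the priority order among the checks after the explicit request are assumptions. Sentiment and self-reported confidence are deliberately absent from the inputs.

```python
from enum import Enum


class Action(Enum):
    ESCALATE = "escalate_to_human"
    ASK_IDENTIFIER = "ask_for_additional_identifier"
    CONTINUE = "continue_resolution"


def next_action(explicit_human_request: bool, policy_covers_request: bool,
                customer_matches: int, making_progress: bool = True) -> Action:
    """Apply the escalation triggers in a fixed (assumed) priority order."""
    if explicit_human_request:
        return Action.ESCALATE        # honor the request before any investigation
    if customer_matches > 1:
        return Action.ASK_IDENTIFIER  # never guess between multiple matches
    if not policy_covers_request:
        return Action.ESCALATE        # policy gap: do not fabricate a policy
    if not making_progress:
        return Action.ESCALATE        # stuck: hand off rather than loop
    return Action.CONTINUE
```

Note that a straightforward issue still escalates when `explicit_human_request` is true, which matches the third rule.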

✗ Anti-Patterns to Reject

Using heuristics (most recent order, conversational context clues) to guess the right customer from multiple matches.
Implementing sentiment analysis or self-reported confidence scores as escalation triggers.

ts-5.3

Implement error propagation strategies across multi-agent systems

Key Points

Structured error context (failure type, attempted query, partial results, alternative approaches) enables intelligent coordinator recovery.
Distinguish access failures (timeouts needing retry decisions) from valid empty results (successful queries with no matches).
Both silently suppressing errors (returning empty results as success) and terminating the whole workflow on a single failure are anti-patterns.
Subagents should handle transient failures locally and only propagate errors they cannot resolve, with partial results.

Decision Rules

When: A subagent encounters a timeout (transient failure)

→Attempt local recovery; if it fails, propagate structured error context (failure type, what was attempted, partial results) to the coordinator.

When: A subagent encounters a corrupted file (permanent failure)

→Return the error with context to the coordinator -- do NOT retry (corruption is permanent).

When: Some source categories succeed while others fail in a multi-source research task

→Proceed with available data; annotate synthesis output with coverage gaps indicating which sources were unavailable.
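As an illustration of the first two rules, the sketch below retries a timeout locally, refuses to retry a permanent failure, and keeps a valid empty result distinct from an error. The `SubagentResult` shape, the `CorruptedFileError` exception, and the suggested alternatives are hypothetical; `search` stands in for whatever tool the subagent calls.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


class CorruptedFileError(Exception):
    """Hypothetical permanent failure: retrying cannot help."""


@dataclass
class SubagentResult:
    status: str                                   # "ok", "empty", or "error"
    results: list = field(default_factory=list)   # may hold partial results on error
    failure_type: Optional[str] = None            # e.g. "timeout", "corrupted_file"
    attempted: Optional[str] = None               # the query that was tried
    alternatives: list = field(default_factory=list)


def run_search(search: Callable[[str], list], query: str, max_attempts: int = 3) -> SubagentResult:
    """Retry transient timeouts locally; propagate what cannot be resolved."""
    for _ in range(max_attempts):
        try:
            hits = search(query)
        except TimeoutError:
            continue  # transient: try again locally
        except CorruptedFileError:
            # permanent: report immediately with context, no retry
            return SubagentResult("error", failure_type="corrupted_file", attempted=query,
                                  alternatives=["request a fresh copy of the file"])
        # a successful query with no matches is "empty", not an error
        return SubagentResult("ok" if hits else "empty", results=hits, attempted=query)
    return SubagentResult("error", failure_type="timeout", attempted=query,
                          alternatives=["narrow the query", "try another source"])
```

The coordinator can then branch on `status` and `failure_type` instead of inferring what happened from an empty list.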

✗ Anti-Patterns to Reject

Returning empty results marked as 'success' when a timeout occurred, hiding the failure from the coordinator.
Terminating the entire research workflow when one source fails, discarding all successful results.
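On the coordinator side, avoiding both anti-patterns might look like the following sketch. It assumes each subagent reports a dict with `status`, `results`, and `failure_type` keys; those key names and the output shape are illustrative, not a fixed contract.

```python
def synthesize_with_gaps(source_results: dict[str, dict]) -> dict:
    """Keep every successful source and list the ones that were unavailable."""
    findings, gaps = {}, []
    for source, result in source_results.items():
        if result["status"] == "error":
            # surface the failure instead of hiding it or aborting the workflow
            gaps.append(f"{source}: unavailable ({result['failure_type']})")
        else:
            findings[source] = result["results"]  # an empty list is a valid answer
    return {"findings": findings, "coverage_gaps": gaps}


report = synthesize_with_gaps({
    "news": {"status": "ok", "results": ["item-1"], "failure_type": None},
    "patents": {"status": "empty", "results": [], "failure_type": None},
    "filings": {"status": "error", "results": [], "failure_type": "timeout"},
})
```

The successful and empty sources both survive, and the synthesis output carries an explicit note about the source that could not be reached.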

ts-5.4

Manage context effectively in large codebase exploration

Key Points

Context degradation in extended sessions: models start referencing 'typical patterns' instead of specific classes discovered earlier.
Scratchpad files persist key findings across context boundaries, countering degradation.
Subagent delegation isolates verbose exploration output while the main agent coordinates high-level understanding.
Structured state persistence: each agent exports state to a known location; the coordinator loads a manifest on resume.

Decision Rules

When: Discovery phase generates verbose output that fills the main context window

→Use the Explore subagent or 'context: fork' to isolate verbose output; return only a concise summary to the main context.

When: Extended exploration session shows signs of context degradation (vague references instead of specifics)

→Have agents maintain scratchpad files recording key findings; use /compact to reduce context usage.

When: Multi-phase task needs to persist findings across context boundaries

→Summarize key findings from one phase before spawning subagents for the next; inject those summaries into each subagent's initial context.
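One way to picture scratchpad files and manifest-based resume is the sketch below. The directory layout, the `manifest.json` name, and the JSON shape are assumptions made for the example, and the finding text is invented purely for illustration.

```python
import json
import tempfile
from pathlib import Path


def export_state(state_dir: Path, agent: str, findings: list[str]) -> None:
    """Write one agent's findings to a known location and register it in the manifest."""
    state_dir.mkdir(parents=True, exist_ok=True)
    (state_dir / f"{agent}.json").write_text(json.dumps({"agent": agent, "findings": findings}))
    manifest_path = state_dir / "manifest.json"
    manifest = json.loads(manifest_path.read_text()) if manifest_path.exists() else {"agents": []}
    if agent not in manifest["agents"]:
        manifest["agents"].append(agent)
    manifest_path.write_text(json.dumps(manifest))


def resume_summary(state_dir: Path) -> str:
    """Load the manifest and build a summary to inject into the next phase's initial context."""
    manifest = json.loads((state_dir / "manifest.json").read_text())
    lines = []
    for agent in manifest["agents"]:
        state = json.loads((state_dir / f"{agent}.json").read_text())
        lines.append(f"[{agent}]")
        lines.extend(f"- {finding}" for finding in state["findings"])
    return "\n".join(lines)


# Usage: a temporary directory stands in for the project's scratchpad location.
with tempfile.TemporaryDirectory() as tmp:
    state_dir = Path(tmp) / "state"
    export_state(state_dir, "explorer", ["PaymentRetryPolicy lives in billing/retry.py"])
    summary = resume_summary(state_dir)
```

Because the findings live on disk with specific names intact, a fresh context can start from them rather than from a lossy recollection of 'typical patterns'.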

✗ Anti-Patterns to Reject

Continuing all phases in the main conversation using /compact repeatedly -- lossy compression discards important details.
Re-exploring the entire codebase from scratch instead of persisting findings in scratchpad files.

ts-5.5

Design human review workflows and confidence calibration

Key Points

Aggregate accuracy metrics (97% overall) may mask poor performance on specific document types or fields.
Use stratified random sampling to measure error rates in high-confidence extractions and detect novel patterns.
Field-level confidence scores should be calibrated using labeled validation sets for routing review attention.
Validate accuracy by document type AND field segment before automating high-confidence extractions.

Decision Rules

When: Overall accuracy is 97% but you suspect some document types perform poorly

→Analyze accuracy by document type and field to identify hidden poor-performing segments.

When: You want to reduce human review overhead on high-confidence extractions

→Implement stratified random sampling of high-confidence outputs; only reduce review after validating by segment.

When: Model outputs field-level confidence scores but they do not correlate with actual accuracy

→Calibrate confidence thresholds using labeled validation sets rather than trusting raw model scores.
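The first two rules can be sketched with plain Python. The record keys (`doc_type`, `field`, `correct`, `confidence`), the 0.9 threshold, and the per-stratum sample size are assumptions for illustration only.

```python
import random
from collections import defaultdict


def accuracy_by_segment(records: list[dict]) -> dict:
    """Accuracy per (document type, field) pair instead of one aggregate number."""
    totals, correct = defaultdict(int), defaultdict(int)
    for rec in records:
        key = (rec["doc_type"], rec["field"])
        totals[key] += 1
        correct[key] += rec["correct"]  # True counts as 1, False as 0
    return {key: correct[key] / totals[key] for key in totals}


def stratified_sample(records: list[dict], per_stratum: int,
                      threshold: float = 0.9, seed: int = 0) -> list[dict]:
    """Draw up to the same number of high-confidence records from every document type."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for rec in records:
        if rec["confidence"] >= threshold:
            strata[rec["doc_type"]].append(rec)
    sample = []
    for doc_type in sorted(strata):
        pool = strata[doc_type]
        sample.extend(rng.sample(pool, min(per_stratum, len(pool))))
    return sample
```

With 97 correct invoice fields and 3 incorrect receipt fields, the aggregate is 97% while the receipt segment sits at 0%; sampling per document type keeps the rare type from being drowned out in review.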

✗ Anti-Patterns to Reject

Trusting aggregate accuracy metrics without breaking down performance by document type and field.
Automating all high-confidence extractions without validating that confidence correlates with actual accuracy per segment.
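A simple way to check whether confidence tracks accuracy is to pick the cutoff from labeled data rather than from the raw scores. The sketch below is one hedged approach, not a standard calibration method: it returns the lowest cutoff at which the labeled accuracy of everything at or above it meets a target. The function name and input shape are assumptions, and it would be run once per document type and field segment.

```python
def calibrate_threshold(validation: list[tuple[float, bool]], target_accuracy: float):
    """Return the lowest confidence cutoff whose labeled accuracy meets the target, or None."""
    candidates = sorted({confidence for confidence, _ in validation})
    for cutoff in candidates:
        kept = [is_correct for confidence, is_correct in validation if confidence >= cutoff]
        if sum(kept) / len(kept) >= target_accuracy:
            return cutoff
    return None  # no cutoff reaches the target: keep this segment under human review


validation_set = [(0.6, False), (0.7, True), (0.8, False), (0.9, True), (0.95, True)]
strict_cutoff = calibrate_threshold(validation_set, target_accuracy=1.0)
```

A `None` result is itself useful: it signals that confidence does not separate correct from incorrect extractions for that segment, so automation should wait.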

ts-5.6

Preserve information provenance and handle uncertainty in multi-source synthesis

Key Points

Source attribution is lost during summarization if claim-source mappings are not preserved.
Conflicting statistics from credible sources should be annotated with source attribution, not arbitrarily resolved.
Require publication/collection dates in structured outputs to prevent temporal differences from being misinterpreted as contradictions.
Render different content types appropriately: financial data as tables, news as prose, technical findings as structured lists.

Decision Rules

When: Two credible sources report conflicting statistics on a key metric

→Include both values with explicit source attribution; let the coordinator decide how to reconcile them before synthesis.

When: Subagent outputs are compressed and downstream agents lose track of which claims came from where

→Require subagents to output structured claim-source mappings (source URLs, document names, excerpts).

When: Data from different time periods appears contradictory

→Require publication/collection dates in structured outputs to enable correct temporal interpretation.
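The three rules above suggest a structured output along these lines. The `Claim` fields and the wording of the notes are assumptions for this sketch; the point is that every value travels with its source, excerpt, and date, and that differing values are annotated rather than resolved.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    metric: str
    value: str
    source: str        # URL or document name
    excerpt: str       # supporting text from the source
    collected_on: str  # publication or collection date, ISO format


def annotate_conflicts(claims: list[Claim]) -> dict[str, dict]:
    """Group claims by metric and flag disagreements without picking a winner."""
    by_metric = defaultdict(list)
    for claim in claims:
        by_metric[claim.metric].append(claim)
    report = {}
    for metric, group in by_metric.items():
        values = {c.value for c in group}
        dates = {c.collected_on for c in group}
        if len(values) == 1:
            note = "consistent"
        elif len(dates) > 1:
            note = "values differ across dates; possibly a temporal difference"
        else:
            note = "conflict; both values kept with attribution"
        report[metric] = {"note": note, "claims": group}  # all claims preserved
    return report
```

The subagent stops at annotation; deciding which value to use, if any, is left to the coordinator.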

✗ Anti-Patterns to Reject

Applying source credibility heuristics to select one value over another -- this oversteps the subagent's role.
Converting all content types to a uniform format (e.g., all prose) instead of rendering each type appropriately.