Context Token Management & Caching

Advanced · Difficulty 3/5

Manage conversation context to preserve critical information across long interactions

Tags: tokens, caching, optimization, data-reduction

Tool results accumulate in context and consume tokens disproportionately to their relevance. Prompt Caching and upstream data reduction are key strategies for managing token budgets.

Token Accumulation Problem

A single order lookup may return 40+ fields when only 5 are relevant to the customer's question. Over multiple tool calls in a session, those irrelevant fields consume a significant share of the context budget.
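
To make the problem concrete, here is a minimal sketch of trimming a verbose tool result down to a field whitelist before it enters context. The order payload and field names are hypothetical:

```python
# A hypothetical verbose tool result: dozens of fields, few relevant.
raw_order = {
    "order_id": "A-1001",
    "status": "shipped",
    "carrier": "UPS",
    "tracking_number": "1Z999AA10123456784",
    "estimated_delivery": "2024-06-12",
    "warehouse_id": "W-7",          # internal detail, irrelevant here
    "picker_shift_code": "N2",      # internal detail, irrelevant here
    # ...30+ more fields in a real payload
}

# Whitelist of fields that actually answer the customer's question.
RELEVANT_FIELDS = {"order_id", "status", "carrier",
                   "tracking_number", "estimated_delivery"}

def trim_tool_result(result: dict, keep: set[str]) -> dict:
    """Drop everything outside the whitelist before the result enters context."""
    return {k: v for k, v in result.items() if k in keep}

trimmed = trim_tool_result(raw_order, RELEVANT_FIELDS)  # 5 fields instead of 40+
```

Only the trimmed dict is appended to the conversation; the remaining fields never consume context tokens on later turns.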

Solutions

  • Trim tool outputs: Keep only the task-relevant fields from tool results before they accumulate in context (as in the sketch above)
  • Upstream data reduction: Modify upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains (see the sketch after this list)
  • Prompt Caching: Cache static system prompts to reduce costs and latency on repeated calls
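
One way to picture upstream data reduction is a compact schema the upstream agent returns instead of prose. The class names, fields, and relevance threshold below are illustrative, not a prescribed interface:

```python
from dataclasses import dataclass, field

# Hypothetical compact schema an upstream agent returns instead of
# verbose content and its full reasoning chain.
@dataclass
class Finding:
    fact: str          # one key fact, stated in a single sentence
    citation: str      # source identifier or URL
    relevance: float   # 0.0-1.0 score assigned by the upstream agent

@dataclass
class UpstreamResult:
    query: str
    findings: list[Finding] = field(default_factory=list)

def to_context(result: UpstreamResult, min_relevance: float = 0.7) -> str:
    """Render only high-relevance facts for the downstream agent's context."""
    return "\n".join(
        f"- {f.fact} [{f.citation}]"
        for f in result.findings
        if f.relevance >= min_relevance
    )
```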

Prompt Caching

  • Mark portions of your prompt as cacheable with cache-control markers on the relevant prompt blocks
  • Cache has a TTL that refreshes on each use
  • Best for: long system prompts, static documentation, few-shot example sets
  • Cache write has a slight premium; cache read provides significant discount
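
For illustration, one concrete realization is Anthropic's cache_control content blocks in the Python SDK. The model id and prompt text here are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...static instructions, documentation, few-shot examples..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; substitute your model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks the prefix up to this block as cacheable; the cache
            # TTL refreshes each time the prefix is reused.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is order A-1001?"}],
)
print(response.content[0].text)
```

Per-turn user messages stay outside the cached block, so the static prefix is read from cache at a discount while the dynamic tail is processed normally.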

Why Not Other Approaches?

  • Vector DB + retrieval: Over-engineered when synthesis needs comprehensive coverage, not selective retrieval
  • Intermediate summarization agent: Adds latency and another potential error point
  • The better fix: reduce data volume upstream instead of trying to handle large inputs downstream

Key Takeaways

  • Trim verbose tool outputs to only the relevant fields before they accumulate in context
  • Reduce data volume at the source rather than trying to handle large inputs downstream
  • Cache static prompt content to save costs on repeated calls