Context Token Management & Caching

Advanced · Difficulty 3/5

Manage conversation context to preserve critical information across long interactions

Tags: tokens, caching, optimization, data-reduction

Tool results accumulate in context and consume tokens disproportionately to their relevance. Prompt Caching and upstream data reduction are key strategies for managing token budgets.

Token Accumulation Problem

A single order lookup may return 40+ fields when only 5 are relevant to the customer's question. Over multiple tool calls in a session, those irrelevant fields consume a significant share of the context budget.
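
To make the problem concrete, here is a minimal sketch of trimming a verbose tool result down to a field whitelist before it enters context. The order payload and field names are hypothetical:

```python
# A hypothetical verbose tool result: dozens of fields, few relevant.
raw_order = {
    "order_id": "A-1001",
    "status": "shipped",
    "carrier": "UPS",
    "tracking_number": "1Z999AA10123456784",
    "estimated_delivery": "2024-06-12",
    "warehouse_id": "W-7",          # internal detail, irrelevant here
    "picker_shift_code": "N2",      # internal detail, irrelevant here
    # ...30+ more fields in a real payload
}

# Whitelist of fields that actually answer the customer's question.
RELEVANT_FIELDS = {"order_id", "status", "carrier",
                   "tracking_number", "estimated_delivery"}

def trim_tool_result(result: dict, keep: set[str]) -> dict:
    """Drop everything outside the whitelist before the result enters context."""
    return {k: v for k, v in result.items() if k in keep}

trimmed = trim_tool_result(raw_order, RELEVANT_FIELDS)  # 5 fields instead of 40+
```

Only the trimmed dict is appended to the conversation; the remaining fields never consume context tokens on later turns.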

Solutions

  • Trim tool outputs: Keep only the task-relevant fields from tool results before they accumulate in context (as in the sketch above)
  • Upstream data reduction: Modify upstream agents to return structured data (key facts, citations, relevance scores) instead of verbose content and reasoning chains (see the sketch after this list)
  • Prompt Caching: Cache static system prompts to reduce costs and latency on repeated calls
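
One way to picture upstream data reduction is a compact schema the upstream agent returns instead of prose. The class names, fields, and relevance threshold below are illustrative, not a prescribed interface:

```python
from dataclasses import dataclass, field

# Hypothetical compact schema an upstream agent returns instead of
# verbose content and its full reasoning chain.
@dataclass
class Finding:
    fact: str          # one key fact, stated in a single sentence
    citation: str      # source identifier or URL
    relevance: float   # 0.0-1.0 score assigned by the upstream agent

@dataclass
class UpstreamResult:
    query: str
    findings: list[Finding] = field(default_factory=list)

def to_context(result: UpstreamResult, min_relevance: float = 0.7) -> str:
    """Render only high-relevance facts for the downstream agent's context."""
    return "\n".join(
        f"- {f.fact} [{f.citation}]"
        for f in result.findings
        if f.relevance >= min_relevance
    )
```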

Prompt Caching

  • Mark portions of your prompt as cacheable with cache-control markers on the relevant prompt blocks
  • Cache has a TTL that refreshes on each use
  • Best for: long system prompts, static documentation, few-shot example sets
  • Cache write has a slight premium; cache read provides significant discount
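
For illustration, one concrete realization is Anthropic's cache_control content blocks in the Python SDK. The model id and prompt text here are placeholders:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_SYSTEM_PROMPT = "...static instructions, documentation, few-shot examples..."

response = client.messages.create(
    model="claude-sonnet-4-20250514",  # placeholder; substitute your model id
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            # Marks the prefix up to this block as cacheable; the cache
            # TTL refreshes each time the prefix is reused.
            "cache_control": {"type": "ephemeral"},
        }
    ],
    messages=[{"role": "user", "content": "Where is order A-1001?"}],
)
print(response.content[0].text)
```

Per-turn user messages stay outside the cached block, so the static prefix is read from cache at a discount while the dynamic tail is processed normally.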

Why Not Other Approaches?

  • Vector DB + retrieval: Over-engineered when synthesis needs comprehensive coverage, not selective retrieval
  • Intermediate summarization agent: Adds latency and another potential error point
  • The better fix: reduce data volume upstream instead of trying to handle large inputs downstream

Key Takeaways

  • Trim verbose tool outputs to only the relevant fields before they accumulate in context
  • Reduce data volume at the source rather than trying to handle large inputs downstream
  • Cache static prompt content to save costs on repeated calls