MCP + Code Execution: How AI Agents Slash Token Usage by 98% on Data-Heavy Tasks
How an agent spends its context window is one of the most consequential design decisions in production AI agent architecture. Two patterns drain it unnecessarily: loading every tool definition into context at startup, and letting intermediate data results accumulate in the model's attention window after they have already been processed. Both problems have architectural solutions.
The Token Overhead Problem
In a conventional MCP setup, every server's tool definitions — names, descriptions, JSON Schemas — load into context at the start of each session. A modest five-server configuration covering code repositories, messaging, analytics, CRM, and file storage can consume 55,000 to 150,000 tokens before the first user message is processed. On top of this, when an agent fetches data from those tools — paginated API results, database rows, log lines — those results flow into context and stay there, even after they are no longer needed for reasoning.
Discovery-First Tool Loading
An alternative approach treats MCP servers as a discoverable filesystem rather than a pre-loaded registry. Servers are organized in a directory tree where each server has its own subdirectory containing individual tool files:
servers/
├── github/
│   ├── createPullRequest.ts
│   └── index.ts
├── salesforce/
│   ├── updateRecord.ts
│   └── index.ts
└── analytics/
    ├── queryMetrics.ts
    └── index.ts

Instead of loading all definitions upfront, the agent navigates the filesystem to discover which servers exist, then reads only the specific tool file needed for the current step. An optional search_tools capability enables natural-language filtering — returning just tool names, descriptions, or full schemas depending on what the agent needs. Published benchmarks show this approach reduces definition overhead from 150,000 tokens to roughly 2,000 — a 98.7% reduction.
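To make the structure concrete, here is a minimal sketch of what one of those tool files might contain. The callMCPTool helper and its import path are assumptions, standing in for whatever proxy the execution environment provides to reach the underlying MCP server:

// servers/github/createPullRequest.ts
// Sketch of a tool file the agent reads only when this step needs it.
import { callMCPTool } from "../../client"; // assumed MCP proxy helper

interface CreatePullRequestInput {
  repo: string;   // "owner/name"
  title: string;
  head: string;   // source branch
  base: string;   // target branch
  body?: string;
}

export async function createPullRequest(
  input: CreatePullRequestInput
): Promise<{ url: string; number: number }> {
  // Delegates to the MCP server; only this one schema ever enters context.
  return callMCPTool("github__create_pull_request", input);
}

Because each tool lives in its own file, the agent pays the token cost only for the definitions it actually opens, not for every schema in every connected server.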
Filtering Data Before It Reaches the Model
For data-intensive operations, the agent writes and executes filtering code locally rather than pulling raw results into the model's context. Consider a spreadsheet with 10,000 rows representing orders in various fulfillment states. Instead of returning all 10,000 rows to the model, the agent executes a script that filters locally and returns only the 5 rows with status 'pending'. The model sees a clean, targeted result rather than a massive data dump it would have to reason over.
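As a sketch of that filtering step, assuming a hypothetical readSheet helper for loading the spreadsheet inside the execution sandbox, the script might look like this. Only the final console.log output is returned to the model:

// filterOrders.ts: runs in the sandbox, not in the model's context.
import { readSheet } from "./sheets"; // assumed spreadsheet-loading helper

interface Order {
  id: string;
  status: string;
  total: number;
}

async function main(): Promise<void> {
  // All 10,000 rows stay local to the execution environment.
  const orders: Order[] = await readSheet("orders.xlsx");

  // Only the handful of pending rows are printed back to the model.
  const pending = orders.filter((order) => order.status === "pending");
  console.log(JSON.stringify(pending, null, 2));
}

main();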
State Persistence for Multi-Step Workflows
When a workflow spans multiple agent sessions, intermediate results can be written to files rather than held in context. If a session ends mid-task, the next session reads the saved state and continues from the last checkpoint. Frequently reused operations can be formalized as skills — self-contained function files with a SKILL.md documentation entry — allowing agents to build a growing library of tested, reusable sub-routines without re-implementing logic across sessions.
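A sketch of that checkpointing pattern using Node's built-in fs module follows; the file path and the shape of the state object are illustrative:

// checkpoint.ts: persist intermediate state across agent sessions.
import { promises as fs } from "fs";

const STATE_FILE = "./workspace/state.json"; // illustrative location

interface WorkflowState {
  step: number;           // last completed step
  processedIds: string[]; // work already finished
}

export async function saveState(state: WorkflowState): Promise<void> {
  await fs.writeFile(STATE_FILE, JSON.stringify(state, null, 2));
}

export async function loadState(): Promise<WorkflowState | null> {
  try {
    return JSON.parse(await fs.readFile(STATE_FILE, "utf8"));
  } catch {
    return null; // no checkpoint yet: start from the beginning
  }
}

A new session calls loadState() first and, if a checkpoint exists, resumes at state.step + 1 instead of redoing completed work.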
Implementation Tradeoffs
Code execution environments require their own sandboxing, resource limits, and monitoring infrastructure. The operational cost of running a secure execution environment must be weighed against the token savings and latency improvements. For workflows where the model genuinely needs to reason over intermediate data — not just filter it — keeping results in context may be the right choice. The pattern is most valuable for aggregation tasks, large dataset filtering, and multi-step workflows with resumable state requirements.