Why Multi-Session AI Agents Fail — And the Harness Architecture That Fixes Each Problem
A single-session AI agent that generates a report or rewrites a function is straightforward to build. A multi-session agent that completes a week-long software project — across context window boundaries, with no persistent memory — requires deliberate architectural design. Without it, the agent will fail in one of four predictable ways, often silently.
What Is an Agent Harness?
An agent harness is the execution framework surrounding the model: state storage, session handoff protocols, task tracking structures, and environment setup scripts. The model itself handles reasoning and generation; the harness ensures that each new session starts with a complete, accurate picture of where the project stands and what needs to happen next. Without a harness, each session starts from scratch.
Four Failure Modes and Their Fixes
1. Over-Ambition (One-Shotting)
The agent attempts to build an entire application in a single session. The context window fills at 60% completion, the session ends, and the next session opens to a half-finished codebase with no documentation of intent or progress. Fix: an initializer agent runs once at project start. It creates a granular feature list — saved as a JSON file — with every feature marked as failing. This becomes the persistent roadmap every subsequent session reads before taking any action.
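The initializer's feature-list step can be sketched in a few lines. This is a minimal sketch, not the harness's actual implementation: the file name features.json and the helper name are assumptions, but the schema mirrors the example shown later in this article.

```python
import json
from pathlib import Path

def write_initial_feature_list(path: Path, features: list[dict]) -> None:
    """Initializer step: persist the roadmap with every feature marked failing."""
    doc = {"features": [
        {"id": f["id"], "description": f["description"],
         "priority": f["priority"], "passes": False}  # everything starts failing
        for f in features
    ]}
    path.write_text(json.dumps(doc, indent=2))

# Hypothetical project start: one feature, not yet implemented.
write_initial_feature_list(Path("features.json"), [
    {"id": "auth-login",
     "description": "User can sign in with email and password",
     "priority": 1},
])
```

Every subsequent session reads this file before acting, so the roadmap survives any number of context-window boundaries.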
2. False Completion
A later session scans the codebase, observes functioning components, and concludes the task is complete — leaving 40% of features unimplemented and untested. Fix: the JSON feature list is the source of truth, not the agent's impression of the code. A feature is marked passes: true only after explicit end-to-end testing. Agents are not permitted to remove features from the list to make progress look better.
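The source-of-truth rule can be expressed as a single gate: only an actual end-to-end test result may flip a feature to passing, and features are never deleted. A minimal sketch, with hypothetical names throughout:

```python
import json
from pathlib import Path
from typing import Callable

def mark_feature(path: Path, feature_id: str,
                 e2e_test: Callable[[], bool]) -> bool:
    """Flip passes to True only if the end-to-end test actually succeeds.

    Features are never removed; the list can only move toward passing.
    """
    doc = json.loads(path.read_text())
    feature = next(f for f in doc["features"] if f["id"] == feature_id)
    if e2e_test():  # explicit end-to-end check, not the agent's impression
        feature["passes"] = True
    path.write_text(json.dumps(doc, indent=2))
    return feature["passes"]

# Demo: a failing e2e test cannot mark the feature complete.
path = Path("features_demo.json")
path.write_text(json.dumps({"features": [
    {"id": "auth-login",
     "description": "User can sign in with email and password",
     "priority": 1, "passes": False},
]}, indent=2))
assert mark_feature(path, "auth-login", lambda: False) is False
```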
3. Environmental Chaos
An agent makes changes, hits errors, partially reverts, makes further changes, and ends the session without documenting what it did. The next session burns the first third of its context window just diagnosing the current state. Fix: every session must commit with a descriptive Git message before ending, and update claude-progress.txt with a human-readable status summary. Git history becomes a recoverable timeline; the progress file becomes a structured handoff.
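The end-of-session handoff can be sketched as one function run before the context window closes. This assumes git is on PATH; the progress-file format and identity settings are illustrative, not prescribed by the harness.

```python
import subprocess
import tempfile
from pathlib import Path

def end_session(repo: Path, summary: str, next_steps: str) -> None:
    """Before the session ends: write the handoff note, then commit everything."""
    (repo / "claude-progress.txt").write_text(
        f"STATUS: {summary}\nNEXT: {next_steps}\n"
    )
    subprocess.run(["git", "-C", str(repo), "add", "-A"], check=True)
    subprocess.run(["git", "-C", str(repo),
                    "-c", "user.email=agent@example.com",  # placeholder identity
                    "-c", "user.name=agent",
                    "commit", "-q", "-m", summary], check=True)

# Demo in a throwaway repository.
repo = Path(tempfile.mkdtemp())
subprocess.run(["git", "-C", str(repo), "init", "-q"], check=True)
end_session(repo,
            "auth-login: form submits, server validation pending",
            "wire server-side validation, then run e2e test")
```

The next session now has both a human-readable status file and a descriptive commit to orient itself.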
4. Testing Gaps
An agent marks a feature complete because its own API calls return success responses. The actual user-facing UI for that feature is broken. Fix: browser automation tools (such as Puppeteer via MCP) allow the agent to navigate the application as a user would, catching integration failures that unit tests and API assertions cannot see.
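This testing rule reduces to a completion gate: a feature counts as verified only when both the API-level check and a browser-level check pass. A minimal sketch — ui_check here is a plain callable standing in for a real browser run (e.g. Puppeteer driven over MCP):

```python
from typing import Callable

def feature_verified(api_check: Callable[[], bool],
                     ui_check: Callable[[], bool]) -> bool:
    """A feature passes only if the API AND the user-facing UI both work."""
    return api_check() and ui_check()

# API returns success but the UI is broken: must NOT count as complete.
assert feature_verified(lambda: True, lambda: False) is False
assert feature_verified(lambda: True, lambda: True) is True
```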
The Two-Agent Pattern
The harness architecture separates initialization from ongoing work:
- Initializer agent (first session only): creates init.sh, generates the JSON feature list, makes the initial Git commit, writes the first claude-progress.txt entry
- Coding agent (all subsequent sessions): reads progress file and Git log → runs basic end-to-end tests → selects highest-priority incomplete feature → implements it → tests → commits → updates feature JSON → updates progress file
An example feature list with two entries:

{
  "features": [
    {
      "id": "auth-login",
      "description": "User can sign in with email and password",
      "priority": 1,
      "passes": false
    },
    {
      "id": "dashboard-load",
      "description": "Dashboard renders last 10 activity items on load",
      "priority": 2,
      "passes": false
    }
  ]
}

Git as a Checkpoint System
In the harness model, Git serves a dual purpose: it is both a version control system and a session-to-session recovery mechanism. Each commit represents a known-good state. If a session ends with broken tests, the next session can run git log --oneline and git diff to understand exactly what changed and why the tests broke — without needing any other documentation. Version control becomes the agent's long-term memory.
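The recovery path can be demonstrated end to end: one session commits a known-good state, and the next session — with no memory — reconstructs what happened from history alone. A sketch in a throwaway repository, assuming git is on PATH:

```python
import subprocess
import tempfile
from pathlib import Path

def git(repo: Path, *args: str) -> str:
    """Run a git command in the given repo and return its stdout."""
    return subprocess.run(["git", "-C", str(repo), *args],
                          capture_output=True, text=True, check=True).stdout

repo = Path(tempfile.mkdtemp())
git(repo, "init", "-q")
(repo / "app.py").write_text("print('hello')\n")
git(repo, "add", "-A")
git(repo, "-c", "user.email=agent@example.com",  # placeholder identity
    "-c", "user.name=agent",
    "commit", "-q", "-m", "dashboard-load: renders activity items")

# Next session: no memory, only history.
history = git(repo, "log", "--oneline")
```

A `git diff` against the last known-good commit would complete the picture of exactly what changed.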
Exam Relevance: Agentic Architecture Domain
The agentic architecture domain (27% of the Claude SA exam) directly tests harness design knowledge. Expect questions on: when to use an initializer agent vs a single coding agent, what state to persist across context windows, how to prevent context exhaustion in long tasks, and how to structure multi-session handoffs. The feature list plus progress file plus Git commit pattern is a canonical exam-ready answer to questions about maintaining continuity across context window boundaries.
Preparing for the Claude SA Exam?
Explore 150+ exam concepts, 91 glossary terms, and full mock exams — all free.