
Kyle Pericak

"It works in my environment"


Claude Deep Research: Definitive guide to Claude coding agent architecture

Last verified: 2026-03-14

The most effective Claude agent systems in 2025-2026 share a counterintuitive trait: radical simplicity. Anthropic's own engineering teams, after building production multi-agent systems, consistently find that a single-threaded tool-call loop (gather context -> act -> verify -> repeat) outperforms elaborate multi-agent architectures for most coding tasks. The key differentiator is not architectural complexity but context engineering: what information enters the context window, when it enters, and how it is managed over long-running sessions. This report synthesizes Anthropic's published guidance, their internal engineering practices (including Claude Code's architecture), and practitioner patterns into actionable blueprints for building production-grade Claude agent systems on the raw SDK.

1. Agent architecture: start with one loop, scale with subagents

The foundational building block for every Claude agent system is the augmented LLM, a model enhanced with retrieval, tools, and memory. Anthropic draws a hard line between workflows (predefined code paths orchestrating LLM calls) and agents (LLMs dynamically directing their own processes). Their recommendation: exhaust workflow patterns before building autonomous agents.

The canonical agent loop

Claude Code's internal architecture, extensively reverse-engineered by the community and confirmed by Anthropic's engineering posts, reveals a single main thread with a flat message list:

import anthropic

client = anthropic.Anthropic()

def agent_loop(user_message: str, system_prompt: str, tools: list,
               tool_executors: dict, max_iterations: int = 25) -> str:
    messages = [{"role": "user", "content": user_message}]

    for _ in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=16384,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})

        tool_calls = [b for b in response.content if b.type == "tool_use"]
        if not tool_calls:
            # Fall back to "" if the response has no text block
            return next((b.text for b in response.content
                         if hasattr(b, "text")), "")

        tool_results = []
        for tc in tool_calls:
            try:
                result = tool_executors[tc.name](tc.input)
                tool_results.append({"type": "tool_result",
                                     "tool_use_id": tc.id,
                                     "content": result})
            except Exception as e:
                tool_results.append({"type": "tool_result",
                                     "tool_use_id": tc.id,
                                     "content": str(e),
                                     "is_error": True})
        messages.append({"role": "user", "content": tool_results})

    return "Max iterations reached."

This pattern, 50 lines handling 80% of agent use cases, is what Anthropic recommends as a starting point. The loop terminates when Claude produces a text response with no tool calls (stop_reason: "end_turn"). Claude can return multiple tool calls simultaneously for parallel execution.
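As a usage sketch, here is how a single tool plugs into that loop. The `bash` tool definition and `run_bash` executor below are illustrative, not SDK built-ins:

```python
import subprocess

bash_tool = {
    "name": "bash",
    "description": "Run a shell command and return stdout and stderr. "
                   "Use for builds, tests, and file inspection. Does NOT "
                   "persist environment variables between calls.",
    "input_schema": {
        "type": "object",
        "properties": {
            "command": {"type": "string",
                        "description": "The shell command to execute."}
        },
        "required": ["command"],
    },
}

def run_bash(tool_input: dict) -> str:
    """Executor keyed by tool name; returns a string for the tool_result."""
    proc = subprocess.run(tool_input["command"], shell=True,
                          capture_output=True, text=True, timeout=120)
    # Truncate output so one verbose command can't flood the context
    return (proc.stdout + proc.stderr)[:10_000]

# agent_loop("Run the test suite and summarize failures", SYSTEM_PROMPT,
#            tools=[bash_tool], tool_executors={"bash": run_bash})
```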

When to graduate to multi-agent

Anthropic's January 2026 guidance identifies exactly three scenarios where multi-agent consistently outperforms single-agent:

  1. Context protection: when subtasks generate >1,000 tokens of intermediate context irrelevant to subsequent steps. Subagents provide isolation: each runs in a clean context window and returns only a compressed summary.
  2. Parallelization: when searching across a large information space. Their Research system (Opus 4 lead + Sonnet 4 subagents) outperformed single-agent Opus 4 by 90.2% on internal benchmarks.
  3. Specialization: when an agent accumulates 20+ tools or requires conflicting system prompts (empathetic support agent vs. precise code reviewer).

The cost reality is stark: agents use ~4x more tokens than chat; multi-agent systems use ~15x more. Token usage alone explains 80% of performance variance in Anthropic's research evaluations. Multi-agent is an economic decision: reserve it for tasks where the value per completion justifies the cost.

Subagent hierarchy design

Claude Code enforces a single-level subagent hierarchy: the main agent can spawn subagents, but subagents cannot spawn their own. This prevents recursive explosion. Custom subagents are defined as Markdown files with YAML frontmatter:

---
name: code-reviewer
description: Reviews changes for bugs, regressions, and missing tests.
  MUST BE USED for any PR review or code audit task.
model: haiku
tools:
  - Read
  - Grep
  - Glob
---
You are a code review specialist. Analyze changes for:
- Potential bugs and regressions
- Missing test coverage
- Style consistency with project conventions

Output a structured review with severity ratings.

The description field drives auto-delegation: Claude decides whether to invoke a subagent based on this text. Action-oriented descriptions with trigger phrases ("MUST BE USED," "PROACTIVELY invoke for") improve delegation reliability. The model field enables per-subagent model routing: Haiku for read-heavy review, Sonnet for implementation, Opus for architecture decisions.

2. Five orchestration patterns and when each applies

Anthropic's December 2024 "Building Effective Agents" post defines five workflow patterns as building blocks. These compose into more complex systems.

| Pattern | Mechanism | Best for | Claude-specific notes |
|---|---|---|---|
| Prompt chaining | Sequential LLM calls; output of step N feeds step N+1 | Tasks with clear sequential subtasks and validation gates | Add "gate" checks between steps to catch errors early |
| Routing | Classifier directs input to specialized handlers | Distinct categories needing different tools/prompts | Use Haiku as the router (within 2-5% of Sonnet accuracy, 73% cheaper) |
| Parallelization | Simultaneous LLM calls: sectioning (split task) or voting (same task, multiple attempts) | Independent subtasks; content moderation; diverse solution generation | Claude supports parallel tool calls natively; asyncio.gather() for fan-out |
| Orchestrator-worker | Central LLM dynamically decomposes and delegates | Complex tasks where subtasks aren't known upfront | Primary pattern for Claude Research; lead agent needs detailed delegation prompts |
| Evaluator-optimizer | Generator + evaluator in iterative loop | Tasks with clear quality criteria and measurable refinement | PwC reports 7x accuracy improvement (10% to 70%) through structured validation loops |
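The routing pattern above can be sketched in a few lines. The route labels and prompt wording here are assumptions; only the Messages API call shape comes from the SDK:

```python
ROUTES = {"billing", "bug_report", "feature_request", "other"}

def parse_route(raw: str) -> str:
    """Normalize the classifier's reply to a known route label."""
    label = raw.strip().lower()
    return label if label in ROUTES else "other"

def route(client, user_message: str) -> str:
    """Haiku as the cheap classifier; the label picks a handler downstream."""
    response = client.messages.create(
        model="claude-haiku-4-5",
        max_tokens=16,
        system="Classify the message. Reply with exactly one of: "
               + ", ".join(sorted(ROUTES)),
        messages=[{"role": "user", "content": user_message}],
    )
    return parse_route(response.content[0].text)
```

The handler keyed by the returned label then runs with its own specialized prompt and tool set.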

Implementation without a framework

Anthropic explicitly recommends raw API/SDK first: "Many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error."

For the orchestrator-worker pattern with the raw SDK:

import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

# web_search_tool, file_read_tool, and the extract_text() helper are
# assumed to be defined elsewhere; extract_text joins the response's
# text blocks into one string.

async def orchestrate(query: str, subtasks: list[str]) -> str:
    # Fan-out: parallel subagent calls
    async def run_subagent(task: str) -> str:
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system="You are a focused research agent. Complete the "
                   "assigned task thoroughly.",
            messages=[{"role": "user", "content": task}],
            tools=[web_search_tool, file_read_tool],
        )
        return extract_text(response)

    results = await asyncio.gather(
        *[run_subagent(t) for t in subtasks]
    )

    # Synthesize: lead agent merges findings
    synthesis = await client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        system="Synthesize these research findings into a coherent "
               "analysis.",
        messages=[{"role": "user",
                   "content": f"Query: {query}\n\nFindings:\n" +
                              "\n---\n".join(results)}],
    )
    return extract_text(synthesis)

The Claude Agent SDK (renamed from Claude Code SDK, September 2025) provides the same agent loop with built-in tools (Read, Write, Edit, Bash, Glob, Grep) and automatic compaction. It's a middle ground between raw API and full frameworks:

import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions

async def main():
    async for message in query(
        prompt="Fix the authentication bug in auth.py",
        options=ClaudeAgentOptions(
            allowed_tools=["Read", "Edit", "Bash", "Grep"],
            model="claude-sonnet-4-6",
        ),
    ):
        print(message)

asyncio.run(main())

3. Model selection: the 80/10/10 rule

The current Claude lineup as of March 2026 creates a clear cost-performance hierarchy for agent systems:

| Model | Input $/MTok | Output $/MTok | Context | Latency | Key capability |
|---|---|---|---|---|---|
| Opus 4.6 | $5 | $25 | 200K (1M beta) | Slowest (fast mode: $30/$150) | Deepest reasoning, adaptive thinking, agent teams |
| Sonnet 4.6 | $3 | $15 | 200K (1M beta) | Responsive | Near-Opus quality (OSWorld 72.5% vs 72.7%), adaptive thinking |
| Haiku 4.5 | $1 | $5 | 200K | Sub-second | 90% of Sonnet performance at 1/3 cost |

The practical allocation: Sonnet 4.6 for 80-90% of agent tasks (coding, tool use, analysis); Haiku 4.5 for 5-20% (classification, routing, extraction, summarization, validation); Opus 4.6 for 5-10% (architecture decisions, large-scale refactoring, 1M context analysis, high-stakes review).

A production model router:

MODEL_MAP = {
    "route":      "claude-haiku-4-5-20250514",     # $1/$5
    "classify":   "claude-haiku-4-5-20250514",
    "extract":    "claude-haiku-4-5-20250514",
    "validate":   "claude-haiku-4-5-20250514",
    "summarize":  "claude-haiku-4-5-20250514",
    "code":       "claude-sonnet-4-6-20260205",     # $3/$15
    "analyze":    "claude-sonnet-4-6-20260205",
    "agent_loop": "claude-sonnet-4-6-20260205",
    "architect":  "claude-opus-4-6-20260205",       # $5/$25
    "refactor":   "claude-opus-4-6-20260205",
    "orchestrate":"claude-opus-4-6-20260205",
}
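Two small helpers (ours, not part of any SDK) make the map usable in practice: a lookup with a Sonnet fallback for unmapped task types, and tier escalation for retries after repeated failures:

```python
def pick_model(model_map: dict[str, str], task_type: str,
               default: str = "claude-sonnet-4-6-20260205") -> str:
    """Route a task type to a model ID, defaulting to Sonnet."""
    return model_map.get(task_type, default)

def escalate(model_id: str) -> str:
    """On repeated failure, retry one tier up (a common pattern,
    not an API feature). Opus stays at Opus."""
    order = ["claude-haiku-4-5", "claude-sonnet-4-6", "claude-opus-4-6"]
    for i, prefix in enumerate(order[:-1]):
        if model_id.startswith(prefix):
            return order[i + 1]
    return model_id
```

A guarded loop that fails twice on Haiku can then rerun the same step on Sonnet before giving up.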

Cost optimization levers

  • Prompt caching: Cache reads cost 10% of standard input (90% discount). Structure conversations for append-only patterns to maximize cache hits. A customer support system processing 100K requests/day saved 37% ($365K/year) by combining routing and caching.
  • Batch API: 50% discount on all models for async workloads. Ideal for evaluation runs, bulk code review, and CI/CD pipelines.
  • Tool Search Tool: Reduces tool definition tokens by 85% (from 77K to 8.7K in testing). Essential when running 10+ tools or multiple MCP servers.
  • Programmatic Tool Calling: Claude writes Python to orchestrate tools in a sandbox, reducing token usage by 37% on complex multi-step workflows.
  • Effort parameter (Opus 4.5+): Controls reasoning depth per request. Medium effort matches Sonnet 4.5 quality using 76% fewer tokens.
  • Extended thinking: thinking tokens bill as output tokens (no separate tier), but they add up. Start at the 1,024-token minimum budget and increase incrementally.
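As a sketch of the Batch API lever, here is one way to prepare bulk code-review requests. The review prompt and `build_batch_requests` helper are illustrative; the request shape (a `custom_id` plus `params`) follows the Message Batches API:

```python
def build_batch_requests(diffs: dict[str, str],
                         model: str = "claude-haiku-4-5") -> list[dict]:
    """One batch entry per diff; custom_id lets you match results later."""
    return [
        {"custom_id": f"review-{name}",
         "params": {
             "model": model,
             "max_tokens": 1024,
             "messages": [{"role": "user",
                           "content": f"Review this diff for bugs:\n{diff}"}],
         }}
        for name, diff in diffs.items()
    ]

# batch = client.messages.batches.create(requests=build_batch_requests(diffs))
# Poll until processing ends, then fetch results by custom_id.
```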

4. State and knowledge management across agent sessions

Conversation history as primary state

The simplest and most common state mechanism: all messages accumulate in the context window. Every tool result, assistant response, and user message forms an implicit state log. This works until context fills up.

Filesystem as the shared memory layer

Anthropic's production systems converge on files over databases for agent state:

  • CLAUDE.md: Persistent project context loaded at session start. Stores conventions, architecture decisions, build commands, learned preferences. Use multiple .claude/ files in subdirectories for scoped context.
  • Subagent output to filesystem: Avoid the "game of telephone" problem: subagents write results to files and pass back lightweight references. The orchestrator reads only what it needs.
  • Git as coordination protocol: Carlini's 16-agent C compiler used a shared bare git repo for synchronization. Agents push/pull; git conflicts force resolution. Progress is tracked via commit history.

<!-- .claude/CLAUDE.md -->
## Build & Test
- `npm run build` compiles TypeScript
- `npm test` runs Jest suite
- Always run tests after editing src/ files

## Architecture
- API routes in src/routes/, services in src/services/
- PostgreSQL via Prisma ORM; migrations in prisma/migrations/
- Auth uses JWT with refresh tokens in HttpOnly cookies

## Conventions
- Prefer async/await over .then() chains
- Error handling: wrap service calls in try/catch, return typed
  Result<T, Error>
- Never commit .env files; use .env.example for templates

Structured state for multi-agent coordination

For explicit state tracking across agent steps, use a structured state object:

import json
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentState:
    task_id: str
    plan: list[str] = field(default_factory=list)
    completed_steps: list[str] = field(default_factory=list)
    artifacts: dict[str, Any] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)

    def to_context(self) -> str:
        """Serialize state for injection into agent prompt."""
        return f"""<agent_state>
Plan: {json.dumps(self.plan)}
Completed: {json.dumps(self.completed_steps)}
Artifacts: {json.dumps(list(self.artifacts.keys()))}
Errors: {json.dumps(self.errors)}
</agent_state>"""

External memory for long-term persistence

For knowledge that must survive across sessions and agents, a hybrid architecture combining vector stores and structured storage performs best:

  • Vector stores (FAISS, pgvector, ChromaDB) for semantic similarity search: "what was the agent's approach to similar problems?"
  • Knowledge graphs for structured relationship queries: entity relationships, dependency chains, architecture decisions.
  • Mem0 reports 91% lower response times than full-context approaches while maintaining quality, by extracting and managing salient information from conversations.

The evolution path is RAG -> Agentic RAG -> Agent Memory. The critical difference: Agent Memory is read-write; the agent stores information for its own future retrieval, not just reads from a static corpus.
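A minimal read-write memory can be sketched without any vector store. The `FileMemory` class and its JSON layout below are assumptions for illustration, with naive substring retrieval standing in for embedding search; exposed as `memory_read`/`memory_write` tools, it gives the agent the read-write loop described above:

```python
import json
from pathlib import Path

class FileMemory:
    """File-backed key-value memory the agent can both query and extend."""

    def __init__(self, path: str = "agent_memory.json"):
        self.path = Path(path)

    def _load(self) -> dict:
        return json.loads(self.path.read_text()) if self.path.exists() else {}

    def write(self, key: str, value: str) -> str:
        data = self._load()
        data[key] = value
        self.path.write_text(json.dumps(data, indent=2))
        return f"stored: {key}"

    def read(self, query: str) -> str:
        """Naive substring match over keys; a real system would rank
        by embedding similarity instead."""
        data = self._load()
        hits = {k: v for k, v in data.items() if query.lower() in k.lower()}
        return json.dumps(hits) if hits else "no matches"
```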

5. Context window management is the core engineering challenge

With 200K tokens standard (1M in beta for Opus/Sonnet 4.6), context management determines agent reliability over long sessions. Claude Code's empirical degradation thresholds:

| Context utilization | Performance | Action |
|---|---|---|
| 0-50% | Optimal | No intervention needed |
| 50-75% | Good | Monitor actively |
| 75-90% | Noticeable degradation | Trigger compaction |
| 90%+ | Significant issues | Compact or clear immediately |

Server-side compaction

Anthropic's server-side compaction API (beta, compact_20260112) handles this automatically: when input tokens exceed a threshold, the API generates a summary, creates a compaction block, and continues with compressed context. Claude Code triggers auto-compaction at ~75-92% capacity, typically achieving 60-80% reduction (a 150K context compacts to 30-50K).

The head+tail strategy

When implementing custom compaction, split the budget: 20% head (task definition, system context) + 80% tail (most recent work). Drop middle messages. This preserves both the original objective and recent progress: the two things agents need most.
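A sketch of that split, approximating token counts by character length (a real implementation would use the API's token-counting support instead):

```python
def compact_head_tail(messages: list[dict], budget: int,
                      head_frac: float = 0.2) -> list[dict]:
    """Keep the earliest messages up to the head budget and the most
    recent messages up to the tail budget; drop the middle."""
    def size(m: dict) -> int:
        return len(str(m.get("content", "")))

    head_budget = int(budget * head_frac)
    tail_budget = budget - head_budget

    head, used = [], 0
    for m in messages:                      # fill the head (task definition)
        if used + size(m) > head_budget:
            break
        head.append(m)
        used += size(m)

    tail, used = [], 0
    for m in reversed(messages[len(head):]):  # fill the tail (recent work)
        if used + size(m) > tail_budget:
            break
        tail.append(m)
        used += size(m)
    return head + list(reversed(tail))
```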

Context isolation through subagents

The most elegant context management technique: delegate context-heavy subtasks to subagents. Each subagent runs in a fresh context window, does the heavy lifting (reading files, calling tools, accumulating intermediate results), and returns only a compressed summary. The orchestrator's context grows by that summary size, not the full subtask transcript.

Progressive disclosure over context stuffing

Research on the "lost in the middle" problem (Liu et al., 2023) shows LLMs retrieve information best from the start and end of context, failing on information buried in the middle. Anthropic recommends just-in-time context loading: maintain lightweight identifiers (file paths, stored queries) and load data dynamically via tools, rather than stuffing everything upfront.

<!-- Prompt structure that supports progressive disclosure -->
<critical_context>
  Architecture overview, current task, key constraints
</critical_context>

<available_references>
  - src/auth/jwt.ts: JWT token handling
  - src/middleware/auth.ts: Auth middleware
  - docs/auth-flow.md: Authentication flow documentation
  Use Read tool to load these when needed.
</available_references>

Prompt caching for cost-effective long sessions

Structure conversations to maximize cache hits. Anthropic's caching: 5-min cache writes at 1.25x input price, cache reads at 0.1x input price (90% discount). For agents with stable system prompts and tool definitions, this is often the single largest cost reducer. Manus identified cache hit rate as their most important production metric.
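A minimal sketch of structuring a call for cache hits: the `cached_system` helper is ours, but the content-block shape with a `cache_control` breakpoint follows the Messages API:

```python
def cached_system(system_text: str) -> list[dict]:
    """System prompt as a content block with an ephemeral cache breakpoint,
    so the stable prefix is written once and read at 0.1x thereafter."""
    return [{"type": "text",
             "text": system_text,
             "cache_control": {"type": "ephemeral"}}]

# response = client.messages.create(
#     model="claude-sonnet-4-6",
#     max_tokens=4096,
#     system=cached_system(STABLE_SYSTEM_PROMPT),
#     tools=tools,        # keep the tool list byte-stable across turns
#     messages=messages,  # append-only so the cached prefix keeps matching
# )
```

Anything above the breakpoint must be byte-identical across requests; a single reordered tool definition invalidates the cache.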

6. Subagents vs parallel bash-invoked agents

This is a deployment architecture decision with distinct tradeoffs.

Native subagents (tool-call delegation)

Each subagent spawns via a tool call, runs in its own context, and returns only its final response. Configuration via CLAUDE_CODE_SUBAGENT_MODEL allows per-subagent model selection. Max parallelism: configurable, default 3, max 8 in Claude Code.

Best for: session-bounded work, context compression, real-time delegation within a conversation, when the orchestrator needs to reason about results before proceeding.

Parallel bash/process agents

Fully independent OS processes, each Claude session runs in its own container or terminal. Communication via filesystem and git. Nicholas Carlini's 16-parallel-agent C compiler demonstrates the pattern:

#!/bin/bash
# Each agent in a separate Docker container
while true; do
    COMMIT=$(git rev-parse --short=6 HEAD)
    claude --dangerously-skip-permissions \
           -p "$(cat AGENT_PROMPT.md)" \
           --model claude-opus-4-6 &> "agent_logs/agent_${COMMIT}.log"
done

Synchronization happens via a shared bare git repo. There is no orchestration agent: each agent independently picks the "next most obvious problem." Cost for the full compiler project: ~$20,000 across ~2,000 sessions (2B input tokens, 140M output tokens).

Best for: long-running autonomous work exceeding single-session limits, CI/CD integration, tasks needing complete independence, scaling beyond API concurrency limits.

| Dimension | Native subagents | Parallel bash agents |
|---|---|---|
| Coordination | Built-in; parent delegates and receives results | Manual; file locks, git merges |
| Failure handling | Error returns as tool result; parent retries | Processes crash independently; loop auto-restarts |
| Context sharing | Parent passes context via prompt string | Shared via filesystem/git only |
| Max parallelism | 3-8 per session | Limited by API rate limits and infra |
| Cost control | Per-subagent model routing | Same API costs + infrastructure overhead |
| Kubernetes fit | Single pod with async API calls | Job/CronJob per agent; natural K8s fit |

For Kubernetes deployments, the bash-process model maps naturally to K8s Jobs: each agent is a container with the repo mounted via a PersistentVolume, coordinating through a shared git repo or Redis-backed task queue.

7. Tool use and MCP integration

Tool definition best practices

Tool descriptions are "by far the most important factor in tool performance" according to Anthropic's documentation. Each description should be 3-4+ sentences explaining what the tool does, when to use it, what each parameter means, and what it does not do:

{
  "name": "search_codebase",
  "description": "Search the codebase for files matching a pattern or
    containing specific text. Use this for finding implementations,
    references, and understanding code structure. Returns file paths
    and matching lines. Does NOT modify files or execute code. Prefer
    this over Bash grep for structured searches.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search term or regex pattern"
      },
      "file_pattern": {
        "type": "string",
        "description": "Glob pattern to filter files, e.g. '*.ts' or 'src/**/*.py'"
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum results to return. Default 20.",
        "default": 20
      }
    },
    "required": ["query"]
  },
  "input_examples": [
    {"query": "handleAuth", "file_pattern": "src/**/*.ts"},
    {"query": "TODO|FIXME", "max_results": 50}
  ],
  "strict": true
}

Key rules: use strict: true in production for guaranteed schema validation; use input_examples for complex parameter combinations; consolidate related operations into fewer tools with an action parameter rather than many narrow tools.
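A consolidated tool might look like the following sketch (the `manage_notes` name and its actions are hypothetical), replacing three narrow create/read/delete tools with one definition:

```python
notes_tool = {
    "name": "manage_notes",
    "description": "Create, read, or delete project notes. Use instead of "
                   "separate create/read/delete tools. 'create' requires "
                   "both title and body; 'read' and 'delete' require only "
                   "title. Does NOT touch files outside the notes store.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {"type": "string",
                       "enum": ["create", "read", "delete"],
                       "description": "Operation to perform."},
            "title": {"type": "string", "description": "Note title."},
            "body": {"type": "string",
                     "description": "Note body; required for 'create'."},
        },
        "required": ["action", "title"],
    },
    "strict": True,
}
```

One definition instead of three cuts context cost and gives the model a single decision point.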

At 10+ tools, definitions consume so many context tokens that they degrade performance. Anthropic observed 134K tokens consumed by tool definitions alone before optimization. The Tool Search Tool solves this:

response = client.beta.messages.create(
    betas=["advanced-tool-use-2025-11-20"],
    model="claude-sonnet-4-6",
    max_tokens=4096,
    tools=[
        {"type": "tool_search_tool_regex_20251119",
         "name": "tool_search_tool_regex"},
        # Your tools with defer_loading
        {"name": "complex_tool", "description": "...",
         "input_schema": {...}, "defer_loading": True},
    ],
    messages=[...],
)

Results: 85% reduction in token usage; Opus 4 accuracy improved from 49% to 74% on large tool libraries. For agents with multiple MCP servers, this is essential.

MCP integration

MCP (Model Context Protocol), now under the Linux Foundation, has 5,800+ servers and 97M+ monthly SDK downloads. The MCP connector lets you connect remote servers directly from the Messages API:

response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    mcp_servers=[
        {"type": "url",
         "url": "https://mcp.example.com/sse",
         "name": "project-tools",
         "authorization_token": "..."}
    ],
    tools=[{"type": "mcp_toolset",
            "mcp_server_name": "project-tools"}],
    messages=[...],
    extra_headers={"anthropic-beta": "mcp-client-2025-11-20"},
)

For local/CLI deployments, MCP servers run over stdio transport. For Kubernetes, use HTTP/SSE transport with MCP servers as sidecar containers or cluster services.

8. Prompting patterns that drive agent reliability

The "right altitude" principle

Anthropic's context engineering guidance identifies a Goldilocks zone between over-specified (brittle if-else logic in prompts) and under-specified (vague instructions assuming shared context). The goal is heuristics over hardcoded rules: specific enough to guide behavior, flexible enough to handle edge cases.

System prompt structure

<role>
You are a senior software engineer working on [project]. You write
clean, tested, production-quality code following the project conventions
below.
</role>

<conventions>
- TypeScript with strict mode; prefer async/await
- Error handling: typed Result<T, Error> pattern
- Tests required for all new public functions
</conventions>

<tool_guidance>
- Use search_codebase before modifying unfamiliar code
- Run tests after every file edit
- Use Read for specific files; use Grep for pattern search across the
  codebase
- NEVER use Bash for file operations (mkdir, rm, cp) - use Write/Edit
  tools
</tool_guidance>

<constraints>
- Do not modify files outside src/ without explicit approval
- If a change requires modifying more than 5 files, present a plan first
- If tests fail after 3 attempts, stop and report the issue
</constraints>

<output_format>
After completing work, provide: (1) summary of changes, (2) files
modified, (3) test results, (4) any remaining concerns.
</output_format>

Extended and adaptive thinking

Claude 4.6 models support adaptive thinking: the model dynamically calibrates reasoning depth based on task complexity. For agents, this eliminates the need to manually tune thinking budgets:

response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16384,
    thinking={"type": "adaptive"},
    messages=[...],
    tools=[...],
)

Interleaved thinking (beta header: interleaved-thinking-2025-05-14) enables thinking between tool calls: Claude reasons about each tool result before deciding the next step. This significantly improves multi-step agent performance.

Additionally, the think tool (distinct from extended thinking) gives agents an explicit scratchpad during response generation:

{
  "name": "think",
  "description": "Use this to reason about information before acting. "
                 "Does not obtain new information or change state.",
  "input_schema": {
    "type": "object",
    "properties": {
      "thought": {
        "type": "string",
        "description": "A thought to reason about."
      }
    },
    "required": ["thought"]
  }
}

This is especially valuable in policy-heavy environments and long tool-call chains where the agent needs to pause and assess before proceeding.

Delegation prompts for multi-agent systems

Each subagent needs four things from its parent: an objective, an output format, guidance on tools/sources, and clear task boundaries. Anthropic found that without detailed delegation prompts, agents duplicate work or leave gaps. Include explicit scaling rules:

## Scaling rules
- Simple fact-finding: 1 agent, 3-10 tool calls
- Direct comparisons: 2-4 subagents, 10-15 calls each
- Complex research: 10+ subagents, divided responsibilities
- Start with short, broad queries; progressively narrow focus
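A delegation prompt builder covering those four elements can be sketched as follows; the tag names and template wording are ours, not Anthropic's:

```python
def delegation_prompt(objective: str, output_format: str,
                      tool_guidance: str, boundaries: str,
                      max_tool_calls: int = 15) -> str:
    """Assemble a subagent prompt with objective, output format,
    tool guidance, and explicit boundaries plus a termination cue."""
    return f"""<objective>
{objective}
</objective>

<output_format>
{output_format}
</output_format>

<tool_guidance>
{tool_guidance}
</tool_guidance>

<boundaries>
{boundaries}
Stop after at most {max_tool_calls} tool calls and report what you have.
</boundaries>"""
```

The hard tool-call cap doubles as a termination cue, addressing one of the most common delegation failure modes.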

9. Evaluation and reliability engineering

Failure modes are specification problems, not infrastructure problems

Berkeley's MASFT taxonomy (2025), analyzing 150+ tasks across 5 multi-agent frameworks, found that 79% of failures originate from specification and system design issues, not from infrastructure or model limitations. The three failure categories:

  1. Specification failures: disobeying task specs, ambiguous instructions, improper task decomposition, missing termination cues
  2. Inter-agent misalignment: role drift, duplicated effort, incompatible output formats
  3. Task verification failures: agents never calling "done," premature termination, insufficient validation

Multi-agent systems show 41-86.7% failure rates in production. One documented case: a circular agent-to-agent message relay persisted for 9 days, consuming 60,000+ tokens.

External loop guardrails are non-negotiable

The system running the agent, not the agent itself, must guarantee termination. Key guardrails:

MAX_ITERATIONS = 25
MAX_TOKENS_BUDGET = 500_000
REPETITION_THRESHOLD = 3  # Same tool call 3x = abort

def guarded_agent_loop(task: str) -> AgentResult:
    # run_agent_step, extract_action, collect_results, and AgentResult
    # are application-specific; only the guardrail logic is shown here.
    total_tokens = 0
    recent_actions = []

    for i in range(MAX_ITERATIONS):
        response = run_agent_step(...)
        total_tokens += (response.usage.input_tokens +
                         response.usage.output_tokens)

        if total_tokens > MAX_TOKENS_BUDGET:
            return AgentResult(status="budget_exceeded",
                               partial=collect_results())

        action = extract_action(response)
        recent_actions.append(action)
        if (recent_actions[-REPETITION_THRESHOLD:].count(action)
                == REPETITION_THRESHOLD):
            return AgentResult(status="loop_detected",
                               partial=collect_results())

    return AgentResult(status="max_iterations",
                       partial=collect_results())

Evaluation-driven development

Anthropic recommends starting with 20-50 test cases drawn from real production failures, then scaling:

  • Code-based graders (fast, cheap, objective): do tests pass? Does the diff apply cleanly?
  • LLM-as-judge (flexible, scalable): single LLM call scoring against a rubric (0.0-1.0). Shows 0.7-0.9 Spearman correlation with human preferences. Known biases: length preference, self-model bias.
  • Human evaluation (gold standard): catches edge cases that automation misses, hallucinations on unusual queries, subtle source selection biases.

The evaluator-optimizer loop is the most reliable pattern for iterative quality improvement: Generator produces output -> Evaluator scores against criteria -> if FAIL, feedback loops back. An independent validation agent should have isolated prompts, separate context, and independent scoring criteria; if it shares too much context with the producer, it becomes "another participant in collective delusion."
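The loop itself is a few lines once the generator and evaluator are factored out as plain callables (standing in for, say, a Sonnet generator and a Haiku evaluator scoring 0.0-1.0 against a rubric; the `refine` helper is ours):

```python
from typing import Callable

def refine(generate: Callable[[str], str],
           evaluate: Callable[[str], tuple[float, str]],
           task: str, threshold: float = 0.8,
           max_rounds: int = 3) -> str:
    """Generate, score, and feed criticism back until the score clears
    the threshold or rounds run out."""
    prompt = task
    best = ""
    for _ in range(max_rounds):
        best = generate(prompt)
        score, feedback = evaluate(best)  # evaluator sees only the output
        if score >= threshold:
            return best
        # Loop criticism back without sharing the evaluator's context
        prompt = f"{task}\n\nPrevious attempt:\n{best}\n\nFix:\n{feedback}"
    return best
```

Keeping `evaluate` a separate call with its own prompt is what preserves the independence the paragraph above requires.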

Key benchmarks for coding agents

| Benchmark | What it tests | Current state of the art |
|---|---|---|
| SWE-bench Verified | 500 human-validated GitHub issues | Opus 4.6 (thinking): 79.2% |
| SWE-bench Pro | Harder multi-file enterprise tasks | Auggie (Opus 4.5): 51.8% |
| Terminal-Bench | CLI/terminal agent capabilities | Emerging standard |
| tau-bench | Consistency via the pass^k metric | Retail/airline booking domains |
The most actionable metric for production systems is pass@k: the probability of at least one correct solution in k attempts. Teams report that adding a review subagent yields only ~0.5% gain on SWE-bench pass@3, but significantly more impact on real-world code quality, where benchmarks don't capture the full picture.
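For reference, the standard unbiased pass@k estimator (Chen et al., 2021): given n samples of which c are correct, the probability that at least one of k drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased estimator: 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        return 1.0  # too few incorrect samples to fill a draw of k
    return 1.0 - comb(n - c, k) / comb(n, k)
```

Computing it directly from combinations (rather than 1 - (1 - c/n)^k) avoids bias when n is small.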

Conclusion: the architecture that actually ships

The highest-performing Claude agent systems in production share five principles that cut against the instinct to over-engineer:

First, context engineering dominates architecture. What enters the context window, when it's loaded, and how it's compressed matters more than the number of agents or the orchestration framework. Invest in CLAUDE.md files, progressive disclosure, prompt caching, and compaction strategies before adding multi-agent complexity.

Second, the right model for the right task is a 3x cost lever. Haiku 4.5 at $1/$5 handles routing, classification, and validation within 2-5% of Sonnet accuracy. Sonnet 4.6 at $3/$15 handles 80-90% of coding work at near-Opus quality. Opus 4.6 is reserved for architecture-level reasoning and 1M-context analysis. A production model router implementing this split saved one team $365K/year.

Third, tool design requires the same rigor as prompt engineering. Descriptions of 3-4+ sentences, strict: true for schema validation, input_examples for complex parameters, and the Tool Search Tool for 10+ tool libraries are all table stakes for production agents.

Fourth, specification quality drives reliability more than infrastructure. With 79% of multi-agent failures traced to specification issues, the most impactful investment is in clear task descriptions, explicit scaling rules, termination conditions, and tool scoping, not in framework selection or infrastructure sophistication.

Fifth, start with raw SDK, add layers only when measured outcomes improve. Anthropic's own recommendation, and the pattern their most successful customers follow, is to begin with the 50-line agent loop, add the Agent SDK when you need built-in tools, and adopt a framework only when you need stateful graph-based routing or built-in observability. The complexity you avoid is the complexity you'll never need to debug at 3 AM.

Related: wiki/research
Blog code last updated on 2026-03-15: c04b780f9a9b20e56525019354100252a1c20141