The most effective Claude agent systems in 2025-2026 share a counterintuitive trait: radical simplicity. Anthropic's own engineering teams, after building production multi-agent systems, consistently find that a single-threaded tool-call loop (gather context -> act -> verify -> repeat) outperforms elaborate multi-agent architectures for most coding tasks. The key differentiator is not architectural complexity but context engineering: what information enters the context window, when, and how it's managed over long-running sessions. This report synthesizes Anthropic's published guidance, their internal engineering practices (including Claude Code's architecture), and practitioner patterns into actionable blueprints for building production-grade Claude agent systems on the raw SDK.
The foundational building block for every Claude agent system is the augmented LLM, a model enhanced with retrieval, tools, and memory. Anthropic draws a hard line between workflows (predefined code paths orchestrating LLM calls) and agents (LLMs dynamically directing their own processes). Their recommendation: exhaust workflow patterns before building autonomous agents.
Claude Code's internal architecture, extensively reverse-engineered by the community and confirmed by Anthropic's engineering posts, reveals a single main thread with a flat message list:
```python
import anthropic

client = anthropic.Anthropic()

def agent_loop(user_message: str, system_prompt: str, tools: list,
               tool_executors: dict, max_iterations: int = 25) -> str:
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iterations):
        response = client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=16384,
            system=system_prompt,
            tools=tools,
            messages=messages,
        )
        messages.append({"role": "assistant", "content": response.content})
        tool_calls = [b for b in response.content if b.type == "tool_use"]
        if not tool_calls:
            return next(b.text for b in response.content
                        if hasattr(b, "text"))
        tool_results = []
        for tc in tool_calls:
            try:
                result = tool_executors[tc.name](tc.input)
                tool_results.append({"type": "tool_result",
                                     "tool_use_id": tc.id,
                                     "content": result})
            except Exception as e:
                tool_results.append({"type": "tool_result",
                                     "tool_use_id": tc.id,
                                     "content": str(e),
                                     "is_error": True})
        messages.append({"role": "user", "content": tool_results})
    return "Max iterations reached."
```
This pattern, roughly 50 lines handling 80% of agent use cases, is what Anthropic recommends as a starting point. The loop terminates when Claude produces a text response with no tool calls (`stop_reason: "end_turn"`). Claude can return multiple tool calls simultaneously for parallel execution.
Anthropic's January 2026 guidance identifies only a narrow set of scenarios where multi-agent consistently outperforms single-agent.
The cost reality is stark: agents use ~4x more tokens than chat; multi-agent systems use ~15x more. Token usage alone explains 80% of performance variance in Anthropic's research evaluations. Multi-agent is an economic decision: reserve it for tasks where the value per completion justifies the cost.
Claude Code enforces a single-level subagent hierarchy: the main agent can spawn subagents, but subagents cannot spawn their own. This prevents recursive explosion. Custom subagents are defined as Markdown files with YAML frontmatter:
```markdown
---
name: code-reviewer
description: Reviews changes for bugs, regressions, and missing tests.
  MUST BE USED for any PR review or code audit task.
model: haiku
tools:
  - Read
  - Grep
  - Glob
---

You are a code review specialist. Analyze changes for:
- Potential bugs and regressions
- Missing test coverage
- Style consistency with project conventions

Output a structured review with severity ratings.
```
The `description` field drives auto-delegation: Claude decides whether to invoke a subagent based on this text. Action-oriented descriptions with trigger phrases ("MUST BE USED," "PROACTIVELY invoke for") improve delegation reliability. The `model` field enables per-subagent model routing: Haiku for read-heavy review, Sonnet for implementation, Opus for architecture decisions.
Anthropic's December 2024 "Building Effective Agents" post defines five workflow patterns as building blocks. These compose into more complex systems.
| Pattern | Mechanism | Best for | Claude-specific notes |
|---|---|---|---|
| Prompt chaining | Sequential LLM calls; output of step N feeds step N+1 | Tasks with clear sequential subtasks and validation gates | Add "gate" checks between steps to catch errors early |
| Routing | Classifier directs input to specialized handlers | Distinct categories needing different tools/prompts | Use Haiku as the router (within 2-5% of Sonnet accuracy, 73% cheaper) |
| Parallelization | Simultaneous LLM calls, sectioning (split task) or voting (same task, multiple attempts) | Independent subtasks; content moderation; diverse solution generation | Claude supports parallel tool calls natively; asyncio.gather() for fan-out |
| Orchestrator-worker | Central LLM dynamically decomposes and delegates | Complex tasks where subtasks aren't known upfront | Primary pattern for Claude Research; lead agent needs detailed delegation prompts |
| Evaluator-optimizer | Generator + evaluator in iterative loop | Tasks with clear quality criteria and measurable refinement | PwC reports 7x accuracy improvement (10% to 70%) through structured validation loops |
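The "gate" checks recommended for prompt chaining can be ordinary validation code between LLM calls, not another model invocation. A minimal sketch, assuming step 1 returns a JSON outline that step 2 will draft from (the outline format and limits here are illustrative, not from Anthropic):

```python
import json

def gate_outline(raw: str) -> list[str]:
    """Validate step 1's outline before step 2 drafts sections from it."""
    outline = json.loads(raw)  # must be valid JSON at all
    if not isinstance(outline, list):
        raise ValueError("outline must be a JSON list")
    if not 3 <= len(outline) <= 10:
        raise ValueError("outline must have 3-10 sections")
    if not all(isinstance(s, str) and s.strip() for s in outline):
        raise ValueError("every section must be a non-empty string")
    return outline

# A passing output flows on to step 2; a failing one triggers a retry of step 1.
sections = gate_outline('["Intro", "Methods", "Results", "Conclusion"]')
```

Failing the gate should route back to step 1 with the error message appended, rather than letting a malformed intermediate propagate through the chain.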
Anthropic explicitly recommends raw API/SDK first: "Many patterns can be implemented in a few lines of code. If you do use a framework, ensure you understand the underlying code. Incorrect assumptions about what's under the hood are a common source of customer error."
For the orchestrator-worker pattern with the raw SDK:
```python
import asyncio
from anthropic import AsyncAnthropic

client = AsyncAnthropic()

def extract_text(response) -> str:
    """Concatenate the text blocks of a response."""
    return "".join(b.text for b in response.content if hasattr(b, "text"))

async def orchestrate(query: str, subtasks: list[str]) -> str:
    # Fan-out: parallel subagent calls
    async def run_subagent(task: str) -> str:
        response = await client.messages.create(
            model="claude-sonnet-4-6",
            max_tokens=4096,
            system="You are a focused research agent. Complete the "
                   "assigned task thoroughly.",
            messages=[{"role": "user", "content": task}],
            tools=[web_search_tool, file_read_tool],  # defined elsewhere
        )
        return extract_text(response)

    results = await asyncio.gather(*[run_subagent(t) for t in subtasks])

    # Synthesize: lead agent merges findings
    synthesis = await client.messages.create(
        model="claude-opus-4-6",
        max_tokens=8192,
        system="Synthesize these research findings into a coherent "
               "analysis.",
        messages=[{"role": "user",
                   "content": f"Query: {query}\n\nFindings:\n" +
                              "\n---\n".join(results)}],
    )
    return extract_text(synthesis)
```
The Claude Agent SDK (renamed from Claude Code SDK, September 2025) provides the same agent loop with built-in tools (Read, Write, Edit, Bash, Glob, Grep) and automatic compaction. It's a middle ground between raw API and full frameworks:
```python
from claude_agent_sdk import query, ClaudeAgentOptions

async for message in query(
    prompt="Fix the authentication bug in auth.py",
    options=ClaudeAgentOptions(
        allowed_tools=["Read", "Edit", "Bash", "Grep"],
        model="claude-sonnet-4-6",
    ),
):
    print(message)
```
The current Claude lineup as of March 2026 creates a clear cost-performance hierarchy for agent systems:
| Model | Input $/MTok | Output $/MTok | Context | Latency | Key capability |
|---|---|---|---|---|---|
| Opus 4.6 | $5 | $25 | 200K (1M beta) | Slowest (fast mode: $30/$150) | Deepest reasoning, adaptive thinking, agent teams |
| Sonnet 4.6 | $3 | $15 | 200K (1M beta) | Responsive | Near-Opus quality (OSWorld 72.5% vs 72.7%), adaptive thinking |
| Haiku 4.5 | $1 | $5 | 200K | Sub-second | 90% of Sonnet performance at 1/3 cost |
The practical allocation: Sonnet 4.6 for 80-90% of agent tasks (coding, tool use, analysis); Haiku 4.5 for 5-20% (classification, routing, extraction, summarization, validation); Opus 4.6 for 5-10% (architecture decisions, large-scale refactoring, 1M context analysis, high-stakes review).
A production model router:
```python
MODEL_MAP = {
    "route":       "claude-haiku-4-5-20250514",   # $1/$5
    "classify":    "claude-haiku-4-5-20250514",
    "extract":     "claude-haiku-4-5-20250514",
    "validate":    "claude-haiku-4-5-20250514",
    "summarize":   "claude-haiku-4-5-20250514",
    "code":        "claude-sonnet-4-6-20260205",  # $3/$15
    "analyze":     "claude-sonnet-4-6-20260205",
    "agent_loop":  "claude-sonnet-4-6-20260205",
    "architect":   "claude-opus-4-6-20260205",    # $5/$25
    "refactor":    "claude-opus-4-6-20260205",
    "orchestrate": "claude-opus-4-6-20260205",
}
```
The simplest and most common state mechanism: all messages accumulate in the context window. Every tool result, assistant response, and user message forms an implicit state log. This works until context fills up.
Anthropic's production systems converge on files over databases for agent state, with `.claude/` files in subdirectories providing scoped context:

```markdown
<!-- .claude/CLAUDE.md -->

## Build & Test
- `npm run build` compiles TypeScript
- `npm test` runs Jest suite
- Always run tests after editing src/ files

## Architecture
- API routes in src/routes/, services in src/services/
- PostgreSQL via Prisma ORM; migrations in prisma/migrations/
- Auth uses JWT with refresh tokens in HttpOnly cookies

## Conventions
- Prefer async/await over .then() chains
- Error handling: wrap service calls in try/catch, return typed Result<T, Error>
- Never commit .env files; use .env.example for templates
```
For explicit state tracking across agent steps, use a structured state object:
```python
import json
from dataclasses import dataclass, field
from typing import Any

@dataclass
class AgentState:
    task_id: str
    plan: list[str] = field(default_factory=list)
    completed_steps: list[str] = field(default_factory=list)
    artifacts: dict[str, Any] = field(default_factory=dict)
    errors: list[str] = field(default_factory=list)

    def to_context(self) -> str:
        """Serialize state for injection into agent prompt."""
        return f"""<agent_state>
Plan: {json.dumps(self.plan)}
Completed: {json.dumps(self.completed_steps)}
Artifacts: {json.dumps(list(self.artifacts.keys()))}
Errors: {json.dumps(self.errors)}
</agent_state>"""
```
For knowledge that must survive across sessions and agents, a hybrid architecture combining vector stores and structured storage performs best. The evolution path is RAG -> Agentic RAG -> Agent Memory. The critical difference: Agent Memory is read-write; the agent stores information for its own future retrieval rather than only reading from a static corpus.
With 200K tokens standard (1M in beta for Opus/Sonnet 4.6), context management determines agent reliability over long sessions. Claude Code's empirical degradation thresholds:
| Context utilization | Performance | Action |
|---|---|---|
| 0-50% | Optimal | No intervention needed |
| 50-75% | Good | Monitor actively |
| 75-90% | Noticeable degradation | Trigger compaction |
| 90%+ | Significant issues | Compact or clear immediately |
Anthropic's server-side compaction API (beta, `compact_20260112`) handles this automatically: when input tokens exceed a threshold, the API generates a summary, creates a compaction block, and continues with compressed context. Claude Code triggers auto-compaction at ~75-92% capacity, typically achieving 60-80% reduction (a 150K context compacts to 30-50K).
When implementing custom compaction, split the budget: 20% head (task definition, system context) + 80% tail (most recent work). Drop middle messages. This preserves both the original objective and recent progress: the two things agents need most.
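A minimal sketch of this head/tail split, approximating token counts by character length (a production version would use real token counts and keep tool_use/tool_result pairs together):

```python
def estimate_tokens(message: dict) -> int:
    """Crude approximation: ~4 characters per token."""
    return max(1, len(str(message.get("content", ""))) // 4)

def compact(messages: list[dict], budget: int) -> list[dict]:
    """Keep the head (task definition) and tail (recent work); drop the middle."""
    if sum(estimate_tokens(m) for m in messages) <= budget:
        return messages
    head_budget, tail_budget = int(budget * 0.2), int(budget * 0.8)
    head, used = [], 0
    for m in messages:                        # oldest first
        cost = estimate_tokens(m)
        if used + cost > head_budget:
            break
        head.append(m)
        used += cost
    tail, used = [], 0
    for m in reversed(messages[len(head):]):  # newest first
        cost = estimate_tokens(m)
        if used + cost > tail_budget:
            break
        tail.append(m)
        used += cost
    dropped = len(messages) - len(head) - len(tail)
    marker = ([{"role": "user",
                "content": f"[{dropped} earlier messages compacted]"}]
              if dropped else [])
    return head + marker + list(reversed(tail))
```

The marker message tells the model that history was elided, which reduces hallucinated references to dropped context.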
The most elegant context management technique: delegate context-heavy subtasks to subagents. Each subagent runs in a fresh context window, does the heavy lifting (reading files, calling tools, accumulating intermediate results), and returns only a compressed summary. The orchestrator's context grows by that summary size, not the full subtask transcript.
Research on the "lost in the middle" problem (Liu et al., 2023) shows LLMs retrieve information best from the start and end of context, failing on information buried in the middle. Anthropic recommends just-in-time context loading: maintain lightweight identifiers (file paths, stored queries) and load data dynamically via tools, rather than stuffing everything upfront.
```xml
<!-- Prompt structure that supports progressive disclosure -->
<critical_context>
Architecture overview, current task, key constraints
</critical_context>

<available_references>
- src/auth/jwt.ts: JWT token handling
- src/middleware/auth.ts: Auth middleware
- docs/auth-flow.md: Authentication flow documentation
Use Read tool to load these when needed.
</available_references>
```
Structure conversations to maximize cache hits. Anthropic's caching: 5-min cache writes at 1.25x input price, cache reads at 0.1x input price (90% discount). For agents with stable system prompts and tool definitions, this is often the single largest cost reducer. Manus identified cache hit rate as their most important production metric.
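In practice this means placing stable content first and marking a cache breakpoint after it, so only the changing messages miss the cache. A sketch of assembling such a request, assuming the standard `cache_control` format of the Messages API (the prompt text is illustrative):

```python
# Stable prefix: cached after the block carrying cache_control.
STABLE_SYSTEM = [
    {
        "type": "text",
        "text": "You are a coding agent for the acme-api repo. "
                "Follow the conventions in CLAUDE.md.",
        # Everything up to and including this block is cached; identical
        # follow-up requests read this prefix at the discounted rate.
        "cache_control": {"type": "ephemeral"},
    }
]

def build_request(messages: list[dict], tools: list[dict]) -> dict:
    """Assemble kwargs for client.messages.create() with a cacheable prefix.

    Keep system and tools byte-identical across calls; any change
    invalidates the cached prefix.
    """
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 8192,
        "system": STABLE_SYSTEM,
        "tools": tools,        # tool definitions precede system in the prefix
        "messages": messages,  # only this part varies per request
    }

request = build_request(
    messages=[{"role": "user", "content": "Add input validation to signup."}],
    tools=[],
)
```

Because the cache key covers tools, then system, then messages in order, reordering tool definitions between calls silently destroys the hit rate.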
This is a deployment architecture decision with distinct tradeoffs.
Each subagent spawns via a tool call, runs in its own context, and returns only its final response. Configuration via `CLAUDE_CODE_SUBAGENT_MODEL` allows per-subagent model selection. Max parallelism is configurable: default 3, max 8 in Claude Code.
Best for: session-bounded work, context compression, real-time delegation within a conversation, when the orchestrator needs to reason about results before proceeding.
Fully independent OS processes: each Claude session runs in its own container or terminal. Communication via filesystem and git. Nicholas Carlini's 16-parallel-agent C compiler demonstrates the pattern:
```bash
#!/bin/bash
# Each agent in a separate Docker container
while true; do
  COMMIT=$(git rev-parse --short=6 HEAD)
  claude --dangerously-skip-permissions \
    -p "$(cat AGENT_PROMPT.md)" \
    --model claude-opus-4-6 &> "agent_logs/agent_${COMMIT}.log"
done
```
Synchronization happens via a shared bare git repo. There is no orchestration agent: each agent independently picks the "next most obvious problem." Cost for the full compiler project: ~$20,000 across ~2,000 sessions (2B input tokens, 140M output tokens).
Best for: long-running autonomous work exceeding single-session limits, CI/CD integration, tasks needing complete independence, scaling beyond API concurrency limits.
| Dimension | Native subagents | Parallel bash agents |
|---|---|---|
| Coordination | Built-in; parent delegates and receives results | Manual; file locks, git merges |
| Failure handling | Error returns as tool result; parent retries | Process crash independent; loop auto-restarts |
| Context sharing | Parent passes context via prompt string | Shared via filesystem/git only |
| Max parallelism | 3-8 per session | Limited by API rate limits and infra |
| Cost control | Per-subagent model routing | Same API costs + infrastructure overhead |
| Kubernetes fit | Single pod with async API calls | Job/CronJob per agent; natural K8s fit |
For Kubernetes deployments, the bash-process model maps naturally to K8s Jobs: each agent is a container with the repo mounted via a PersistentVolume, coordinating through a shared git repo or Redis-backed task queue.
Tool descriptions are "by far the most important factor in tool performance" according to Anthropic's documentation. Each description should be 3-4+ sentences explaining what the tool does, when to use it, what each parameter means, and what it does not do:
```json
{
  "name": "search_codebase",
  "description": "Search the codebase for files matching a pattern or containing specific text. Use this for finding implementations, references, and understanding code structure. Returns file paths and matching lines. Does NOT modify files or execute code. Prefer this over Bash grep for structured searches.",
  "input_schema": {
    "type": "object",
    "properties": {
      "query": {
        "type": "string",
        "description": "Search term or regex pattern"
      },
      "file_pattern": {
        "type": "string",
        "description": "Glob pattern to filter files, e.g. '*.ts' or 'src/**/*.py'"
      },
      "max_results": {
        "type": "integer",
        "description": "Maximum results to return. Default 20.",
        "default": 20
      }
    },
    "required": ["query"]
  },
  "input_examples": [
    {"query": "handleAuth", "file_pattern": "src/**/*.ts"},
    {"query": "TODO|FIXME", "max_results": 50}
  ],
  "strict": true
}
```
Key rules: use `strict: true` in production for guaranteed schema validation; use `input_examples` for complex parameter combinations; and consolidate related operations into fewer tools with an `action` parameter rather than many narrow tools.
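Consolidation might look like a single tool with an action enum in place of three narrow tools (the tool name and operations here are hypothetical):

```python
# One "notes" tool replacing create_note / update_note / delete_note.
NOTES_TOOL = {
    "name": "notes",
    "description": "Create, update, or delete a note in the project "
                   "notebook. Use 'create' for new notes, 'update' to "
                   "replace an existing note's body, and 'delete' to "
                   "remove one. Does NOT search notes.",
    "input_schema": {
        "type": "object",
        "properties": {
            "action": {
                "type": "string",
                "enum": ["create", "update", "delete"],
                "description": "Operation to perform.",
            },
            "note_id": {
                "type": "string",
                "description": "Required for update and delete.",
            },
            "body": {
                "type": "string",
                "description": "Note text for create and update.",
            },
        },
        "required": ["action"],
    },
    "strict": True,
}
```

One definition instead of three cuts context cost and gives the model a single decision point, at the price of slightly more per-action validation in the executor.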
At 10+ tools, definitions consume so many context tokens that they degrade performance. Anthropic observed 134K tokens consumed by tool definitions alone before optimization. The Tool Search Tool solves this:
```python
response = client.beta.messages.create(
    betas=["advanced-tool-use-2025-11-20"],
    model="claude-sonnet-4-6",
    max_tokens=4096,
    tools=[
        {"type": "tool_search_tool_regex_20251119",
         "name": "tool_search_tool_regex"},
        # Your tools with defer_loading
        {"name": "complex_tool", "description": "...",
         "input_schema": {...}, "defer_loading": True},
    ],
    messages=[...],
)
```
Results: 85% reduction in token usage; Opus 4 accuracy improved from 49% to 74% on large tool libraries. For agents with multiple MCP servers, this is essential.
MCP (Model Context Protocol), now under the Linux Foundation, has 5,800+ servers and 97M+ monthly SDK downloads. The MCP connector lets you connect remote servers directly from the Messages API:
```python
response = client.beta.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4096,
    mcp_servers=[
        {"type": "url",
         "url": "https://mcp.example.com/sse",
         "name": "project-tools",
         "authorization_token": "..."}
    ],
    tools=[{"type": "mcp_toolset",
            "mcp_server_name": "project-tools"}],
    messages=[...],
    extra_headers={"anthropic-beta": "mcp-client-2025-11-20"},
)
```
For local/CLI deployments, MCP servers run over stdio transport. For Kubernetes, use HTTP/SSE transport with MCP servers as sidecar containers or cluster services.
Anthropic's context engineering guidance identifies a Goldilocks zone between over-specified (brittle if-else logic in prompts) and under-specified (vague instructions assuming shared context). The goal is heuristics over hardcoded rules: specific enough to guide behavior, flexible enough to handle edge cases.
```xml
<role>
You are a senior software engineer working on [project]. You write clean,
tested, production-quality code following the project conventions below.
</role>

<conventions>
- TypeScript with strict mode; prefer async/await
- Error handling: typed Result<T, Error> pattern
- Tests required for all new public functions
</conventions>

<tool_guidance>
- Use search_codebase before modifying unfamiliar code
- Run tests after every file edit
- Use Read for specific files; use Grep for pattern search across the codebase
- NEVER use Bash for file operations (mkdir, rm, cp) - use Write/Edit tools
</tool_guidance>

<constraints>
- Do not modify files outside src/ without explicit approval
- If a change requires modifying more than 5 files, present a plan first
- If tests fail after 3 attempts, stop and report the issue
</constraints>

<output_format>
After completing work, provide: (1) summary of changes, (2) files modified,
(3) test results, (4) any remaining concerns.
</output_format>
```
Claude 4.6 models support adaptive thinking: the model dynamically calibrates reasoning depth based on task complexity. For agents, this eliminates the need to manually tune thinking budgets:
```python
response = client.messages.create(
    model="claude-opus-4-6",
    max_tokens=16384,
    thinking={"type": "adaptive"},
    messages=[...],
    tools=[...],
)
```
Interleaved thinking (beta header: `interleaved-thinking-2025-05-14`) enables thinking between tool calls: Claude reasons about each tool result before deciding the next step. This significantly improves multi-step agent performance.
Additionally, the think tool (distinct from extended thinking) gives agents an explicit scratchpad during response generation:
```json
{
  "name": "think",
  "description": "Use this to reason about information before acting. Does not obtain new information or change state.",
  "input_schema": {
    "type": "object",
    "properties": {
      "thought": {
        "type": "string",
        "description": "A thought to reason about."
      }
    },
    "required": ["thought"]
  }
}
```
This is especially valuable in policy-heavy environments and long tool-call chains where the agent needs to pause and assess before proceeding.
Each subagent needs four things from its parent: an objective, an output format, guidance on tools/sources, and clear task boundaries. Anthropic found that without detailed delegation prompts, agents duplicate work or leave gaps. Include explicit scaling rules:
```markdown
## Scaling rules
- Simple fact-finding: 1 agent, 3-10 tool calls
- Direct comparisons: 2-4 subagents, 10-15 calls each
- Complex research: 10+ subagents, divided responsibilities
- Start with short, broad queries; progressively narrow focus
```
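The four delegation elements can also be enforced mechanically, so no subagent launches without all of them. A hypothetical helper (the tag names and example values are illustrative):

```python
def delegation_prompt(objective: str, output_format: str,
                      tools: list[str], boundaries: list[str]) -> str:
    """Assemble a subagent prompt covering objective, output format,
    tool guidance, and task boundaries."""
    return "\n".join([
        f"<objective>{objective}</objective>",
        f"<output_format>{output_format}</output_format>",
        "<tool_guidance>Use only: " + ", ".join(tools) + "</tool_guidance>",
        "<boundaries>",
        *[f"- {b}" for b in boundaries],
        "</boundaries>",
    ])

prompt = delegation_prompt(
    objective="Survey authentication libraries for Node.js",
    output_format="Bulleted list: library, maturity, license",
    tools=["web_search"],
    boundaries=["Do not evaluate Python libraries",
                "Stop after 10 tool calls"],
)
```

Making the boundaries explicit per subagent is what prevents the duplicated-work and coverage-gap failures described above.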
Berkeley's MASFT taxonomy (2025), analyzing 150+ tasks across 5 multi-agent frameworks, found that 79% of failures originate from specification and system design issues, not from infrastructure or model limitations. The three failure categories: flawed specification and system design, inter-agent misalignment, and inadequate task verification and termination.
Multi-agent systems show 41-86.7% failure rates in production. One documented case: a circular agent-to-agent message relay persisted for 9 days, consuming 60,000+ tokens.
The system running the agent, not the agent itself, must guarantee termination. Key guardrails:
```python
MAX_ITERATIONS = 25
MAX_TOKENS_BUDGET = 500_000
REPETITION_THRESHOLD = 3  # Same tool call 3x = abort

def guarded_agent_loop(task: str) -> AgentResult:
    total_tokens = 0
    recent_actions = []
    for _ in range(MAX_ITERATIONS):
        response = run_agent_step(...)  # one model call + tool execution
        total_tokens += (response.usage.input_tokens +
                         response.usage.output_tokens)
        if total_tokens > MAX_TOKENS_BUDGET:
            return AgentResult(status="budget_exceeded",
                               partial=collect_results())
        action = extract_action(response)
        recent_actions.append(action)
        if (recent_actions[-REPETITION_THRESHOLD:].count(action)
                == REPETITION_THRESHOLD):
            return AgentResult(status="loop_detected",
                               partial=collect_results())
    return AgentResult(status="max_iterations",
                       partial=collect_results())
```
Anthropic recommends starting with 20-50 test cases drawn from real production failures, then scaling the suite as new failure modes surface.
The evaluator-optimizer loop is the most reliable pattern for iterative quality improvement: the generator produces output -> the evaluator scores it against criteria -> on FAIL, feedback loops back. An independent validation agent should have isolated prompts, separate context, and independent scoring criteria; if it shares too much context with the producer, it becomes "another participant in collective delusion."
| Benchmark | What it tests | Current state of the art |
|---|---|---|
| SWE-bench Verified | 500 human-validated GitHub issues | Opus 4.6 (thinking): 79.2% |
| SWE-bench Pro | Harder multi-file enterprise tasks | Auggie (Opus 4.5): 51.8% |
| Terminal-Bench | CLI/terminal agent capabilities | Emerging standard |
| τ-bench | Consistency via pass^k metric | Retail/airline booking domains |
The most actionable metric for production systems is pass@k: the probability of at least one correct solution in k attempts. Teams report that adding a review subagent yields only ~0.5% gain on SWE-bench pass@3 but significantly more impact on real-world code quality, where benchmarks don't capture the full picture.
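pass@k can be computed from n attempts, c of them correct, with the standard unbiased estimator (Chen et al., 2021):

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k samples drawn (without
    replacement) from n attempts, c of which are correct, passes."""
    if n - c < k:
        return 1.0  # too few failures to fill a k-sample with all failures
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 3 correct solutions out of 10 attempts, pass@1 is 0.3; the estimator generalizes this to arbitrary k without the bias of simply resampling.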
The highest-performing Claude agent systems in production share five principles that cut against the instinct to over-engineer:
First, context engineering dominates architecture. What enters the context window, when it's loaded, and how it's compressed matters more than the number of agents or the orchestration framework. Invest in CLAUDE.md files, progressive disclosure, prompt caching, and compaction strategies before adding multi-agent complexity.
Second, the right model for the right task is a 3x cost lever. Haiku 4.5 at $1/$5 handles routing, classification, and validation within 2-5% of Sonnet accuracy. Sonnet 4.6 at $3/$15 handles 80-90% of coding work at near-Opus quality. Opus 4.6 is reserved for architecture-level reasoning and 1M-context analysis. A production model router implementing this split saved one team $365K/year.
Third, tool design requires the same rigor as prompt engineering. Descriptions of 3-4+ sentences, `strict: true` for schema validation, `input_examples` for complex parameters, and the Tool Search Tool for 10+ tool libraries are all table stakes for production agents.
Fourth, specification quality drives reliability more than infrastructure. With 79% of multi-agent failures traced to specification issues, the most impactful investment is in clear task descriptions, explicit scaling rules, termination conditions, and tool scoping, not in framework selection or infrastructure sophistication.
Fifth, start with raw SDK, add layers only when measured outcomes improve. Anthropic's own recommendation, and the pattern their most successful customers follow, is to begin with the 50-line agent loop, add the Agent SDK when you need built-in tools, and adopt a framework only when you need stateful graph-based routing or built-in observability. The complexity you avoid is the complexity you'll never need to debug at 3 AM.