Anthropic Guidance on Building Agents

Bullet-point findings extracted from 7 Anthropic publications. These are first-party recommendations from the team that builds Claude, not third-party research. Used as inputs alongside Deep Research reports for agent definition improvements.

Building Effective Agents

Source: anthropic.com/research/building-effective-agents

Start with the simplest pattern possible. Add complexity only when measurable improvements justify the cost.
Five composable workflow patterns: chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer.
Orchestrator-worker fits coding tasks best because subtasks can't be predefined. A central LLM dynamically decomposes and delegates.
Evaluator-optimizer: generator + evaluator in an iterative loop. PwC reports 7x accuracy improvement (10% to 70%).
Framework abstraction is an anti-pattern. Many patterns need only a few lines of direct API code. Frameworks obscure prompts and make debugging harder.
Tool design is a primary failure mode. SWE-bench teams found switching from relative to absolute file paths eliminated a whole class of errors.
Treat tool documentation with the same rigor as system prompts: parameter descriptions, example usage, edge cases, and explicit boundaries between similar tools.
Code is well-suited for agents because automated tests provide objective verification.

Effective Harnesses for Long-Running Agents

Source: anthropic.com/engineering/effective-harnesses-for-long-running-agents

Long-running agents lose context across sessions. Each new window starts with no memory of prior work.
Two-agent architecture: an Initializer creates environment scripts and progress files; a Coding agent reads those artifacts and works one feature at a time.
Agents try to complete everything at once and exhaust context mid-feature. Constraining to one feature per session prevents this.
Use JSON (not Markdown) for feature lists and test status to prevent agents from misinterpreting or corrupting them.
Start each session with verification: read progress, select next feature, run integration test before writing new code.
Explicit instruction: never remove or edit tests. Agents will delete failing tests to mark features complete.
Browser automation (Playwright/Puppeteer) dramatically improved bug detection over unit tests alone. Agents without it declared completion on half-working features.
Git commit history provides verifiable work logs and checkpoints restorable between sessions.

Claude Code Best Practices

Source: code.claude.com/docs/en/best-practices (Note: original URL redirected; findings from live equivalent pages)

Be explicit about actions. "Can you suggest changes?" causes Claude to suggest only. "Change this function" causes it to act.
Parallel tool calling: instruct Claude to call independent tools simultaneously for reduced latency.
If your harness compacts context, tell Claude in the system prompt so it doesn't prematurely wrap up work.
Multi-context window strategy: first window writes tests and setup scripts; subsequent windows iterate on a todo list.
Opus tends to overengineer. Explicitly instruct: "Only make changes directly requested. Keep solutions simple."
Models may hard-code values to pass tests rather than implement correct logic. Prompt: "Write a general-purpose solution."
Instruct Claude to never speculate about code it hasn't read.
Use adaptive thinking for agentic workloads instead of manual thinking budget tuning.
Plan mode (--permission-mode plan) for safe codebase exploration before making changes.
Git worktrees (--worktree) for parallel sessions with isolated branches.

Writing Tools for Agents

Source: anthropic.com/engineering/writing-tools-for-agents

Tools bridge deterministic and non-deterministic systems. Agents may skip tools, hallucinate them, or use them in unexpected ways. Design for this uncertainty.
Fewer, well-designed tools beat many overlapping ones. Consolidate related operations.
Namespace tool names with consistent prefixes to reduce agent confusion.
Return high-signal data: human-readable names over UUIDs.
Allow agents to request output verbosity via a response_format parameter to manage their own context budget.
Treat tool descriptions like onboarding documentation. Make implicit knowledge explicit. Precise description adjustments produced state-of-the-art SWE-bench results.
Measure beyond accuracy: track call volume, redundant calls, error patterns, token consumption.
Error messages must guide better tool use, not return raw tracebacks.
Evaluation-driven development: prototype, test with 50+ realistic tasks grounded in real workflows, analyze reasoning not just accuracy, iterate.

Effective Context Engineering for AI Agents

Source: anthropic.com/engineering/effective-context-engineering-for-ai-agents

Context rot is real. Model accuracy declines as context grows. Treat context as a finite resource regardless of window size.
System prompt altitude: avoid overly rigid (hardcoded conditional logic) and overly vague. Use XML tags or Markdown headers to structure prompts. Start minimal, iterate based on failure modes.
If engineers can't decide which tool applies, agents won't either. Overlapping tools waste tokens and cause confusion.
Just-in-time context loading: store lightweight identifiers (file paths, queries, links) rather than full objects. Agents retrieve context dynamically during execution.
Hybrid retrieval: combine upfront loading (CLAUDE.md at start) with autonomous exploration (grep/glob at runtime).
Compaction: summarize conversation history before limits hit. Maximize recall first, then iterate for precision.
Agentic memory: maintain persistent external files consulted across context resets for multi-hour task progress and architectural decisions.
Sub-agent architecture for context efficiency: specialized agents handle focused tasks with clean windows, return condensed 1,000-2,000 token summaries.
Anti-patterns: assuming LLMs handle long contexts efficiently, pre-loading all potential data upfront, overly aggressive compaction causing information loss.

Multi-Agent Research System

Source: anthropic.com/engineering/multi-agent-research-system

Multi-agent (Opus lead + Sonnet subagents) outperformed single-agent Opus by 90.2% on research tasks. Parallelism reduces research time by up to 90%.
Token economics: multi-agent uses ~15x more tokens than chat; single agents use ~4x. High-value tasks required to justify cost.
Think like your agents: build simulations with exact prompts and tools to observe failure modes step-by-step.
Teach delegation: give subagents clear objectives, output formats, tool guidance, and task boundaries. Vague instructions cause duplicate work and missed information.
Scale effort to complexity: simple fact-finding uses 1 agent with 3-10 calls; complex research uses 10+ subagents.
Start wide, then narrow: begin with broad queries, evaluate, progressively narrow. Mirrors expert human research.
Parallel execution targets: 3-5 subagents in parallel, 3+ parallel tool calls per subagent.
Anti-patterns: spawning excessive subagents for simple queries, overly verbose search queries, pursuing nonexistent information indefinitely.

Measuring Agent Autonomy

Source: anthropic.com/research/measuring-agent-autonomy

Claude Code's longest sessions nearly doubled from under 25 minutes to over 45 minutes between October 2025 and January 2026.
Deployment overhang: METR estimates Claude can handle 5-hour tasks at 50% success. Actual median session is 45 seconds. Users underutilize the model's capability.
New users auto-approve ~20% of the time; experienced users reach 40%+. But experienced users also interrupt more often (5% to 9% of turns), shifting from approving everything to strategic monitoring and intervention.
Agents self-limit more than users do: agents ask for clarification more than twice as often on complex tasks compared to human interruptions.
Mandating exhaustive pre-approval workflows is counterproductive. Experienced users develop better instincts through monitoring.
80% of tool calls have safeguards; 73% involve human oversight; only 0.8% are irreversible actions.

Kyle Pericak