BlogWikiAbout

Kyle Pericak

"It works in my environment"

Bot-Wiki/Research/Anthropic Guidance on Building Agents

Anthropic Guidance on Building Agents

Last verified: 2026-03-15

Bullet-point findings extracted from 7 Anthropic publications. These are first-party recommendations from the team that builds Claude, not third-party research. Used as inputs alongside Deep Research reports for agent definition improvements.

Building Effective Agents

Source: anthropic.com/research/building-effective-agents

  • Start with the simplest pattern possible. Add complexity only when measurable improvements justify the cost.
  • Five composable workflow patterns: chaining, routing, parallelization, orchestrator-worker, evaluator-optimizer.
  • Orchestrator-worker fits coding tasks best because subtasks can't be predefined. A central LLM dynamically decomposes and delegates.
  • Evaluator-optimizer: generator + evaluator in an iterative loop. PwC reports 7x accuracy improvement (10% to 70%).
  • Framework abstraction is an anti-pattern. Many patterns need only a few lines of direct API code. Frameworks obscure prompts and make debugging harder.
  • Tool design is a primary failure mode. SWE-bench teams found switching from relative to absolute file paths eliminated a whole class of errors.
  • Treat tool documentation with the same rigor as system prompts: parameter descriptions, example usage, edge cases, and explicit boundaries between similar tools.
  • Code is well-suited for agents because automated tests provide objective verification.

Effective Harnesses for Long-Running Agents

Source: anthropic.com/engineering/effective-harnesses-for-long-running-agents

  • Long-running agents lose context across sessions. Each new window starts with no memory of prior work.
  • Two-agent architecture: an Initializer creates environment scripts and progress files; a Coding agent reads those artifacts and works one feature at a time.
  • Agents try to complete everything at once and exhaust context mid-feature. Constraining to one feature per session prevents this.
  • Use JSON (not Markdown) for feature lists and test status to prevent agents from misinterpreting or corrupting them.
  • Start each session with verification: read progress, select next feature, run integration test before writing new code.
  • Explicit instruction: never remove or edit tests. Agents will delete failing tests to mark features complete.
  • Browser automation (Playwright/Puppeteer) dramatically improved bug detection over unit tests alone. Agents without it declared completion on half-working features.
  • Git commit history provides verifiable work logs and checkpoints restorable between sessions.

Claude Code Best Practices

Source: code.claude.com/docs/en/best-practices (Note: original URL redirected; findings from live equivalent pages)

  • Be explicit about actions. "Can you suggest changes?" causes Claude to suggest only. "Change this function" causes it to act.
  • Parallel tool calling: instruct Claude to call independent tools simultaneously for reduced latency.
  • If your harness compacts context, tell Claude in the system prompt so it doesn't prematurely wrap up work.
  • Multi-context window strategy: first window writes tests and setup scripts; subsequent windows iterate on a todo list.
  • Opus tends to overengineer. Explicitly instruct: "Only make changes directly requested. Keep solutions simple."
  • Models may hard-code values to pass tests rather than implement correct logic. Prompt: "Write a general-purpose solution."
  • Instruct Claude to never speculate about code it hasn't read.
  • Use adaptive thinking for agentic workloads instead of manual thinking budget tuning.
  • Plan mode (--permission-mode plan) for safe codebase exploration before making changes.
  • Git worktrees (--worktree) for parallel sessions with isolated branches.

Writing Tools for Agents

Source: anthropic.com/engineering/writing-tools-for-agents

  • Tools bridge deterministic and non-deterministic systems. Agents may skip tools, hallucinate them, or use them in unexpected ways. Design for this uncertainty.
  • Fewer, well-designed tools beat many overlapping ones. Consolidate related operations.
  • Namespace tool names with consistent prefixes to reduce agent confusion.
  • Return high-signal data: human-readable names over UUIDs.
  • Allow agents to request output verbosity via a response_format parameter to manage their own context budget.
  • Treat tool descriptions like onboarding documentation. Make implicit knowledge explicit. Precise description adjustments produced state-of-the-art SWE-bench results.
  • Measure beyond accuracy: track call volume, redundant calls, error patterns, token consumption.
  • Error messages must guide better tool use, not return raw tracebacks.
  • Evaluation-driven development: prototype, test with 50+ realistic tasks grounded in real workflows, analyze reasoning not just accuracy, iterate.

Effective Context Engineering for AI Agents

Source: anthropic.com/engineering/effective-context-engineering-for-ai-agents

  • Context rot is real. Model accuracy declines as context grows. Treat context as a finite resource regardless of window size.
  • System prompt altitude: avoid overly rigid (hardcoded conditional logic) and overly vague. Use XML tags or Markdown headers to structure prompts. Start minimal, iterate based on failure modes.
  • If engineers can't decide which tool applies, agents won't either. Overlapping tools waste tokens and cause confusion.
  • Just-in-time context loading: store lightweight identifiers (file paths, queries, links) rather than full objects. Agents retrieve context dynamically during execution.
  • Hybrid retrieval: combine upfront loading (CLAUDE.md at start) with autonomous exploration (grep/glob at runtime).
  • Compaction: summarize conversation history before limits hit. Maximize recall first, then iterate for precision.
  • Agentic memory: maintain persistent external files consulted across context resets for multi-hour task progress and architectural decisions.
  • Sub-agent architecture for context efficiency: specialized agents handle focused tasks with clean windows, return condensed 1,000-2,000 token summaries.
  • Anti-patterns: assuming LLMs handle long contexts efficiently, pre-loading all potential data upfront, overly aggressive compaction causing information loss.

Multi-Agent Research System

Source: anthropic.com/engineering/multi-agent-research-system

  • Multi-agent (Opus lead + Sonnet subagents) outperformed single-agent Opus by 90.2% on research tasks. Parallelism reduces research time by up to 90%.
  • Token economics: multi-agent uses ~15x more tokens than chat; single agents use ~4x. High-value tasks required to justify cost.
  • Think like your agents: build simulations with exact prompts and tools to observe failure modes step-by-step.
  • Teach delegation: give subagents clear objectives, output formats, tool guidance, and task boundaries. Vague instructions cause duplicate work and missed information.
  • Scale effort to complexity: simple fact-finding uses 1 agent with 3-10 calls; complex research uses 10+ subagents.
  • Start wide, then narrow: begin with broad queries, evaluate, progressively narrow. Mirrors expert human research.
  • Parallel execution targets: 3-5 subagents in parallel, 3+ parallel tool calls per subagent.
  • Anti-patterns: spawning excessive subagents for simple queries, overly verbose search queries, pursuing nonexistent information indefinitely.

Measuring Agent Autonomy

Source: anthropic.com/research/measuring-agent-autonomy

  • Claude Code's longest sessions nearly doubled from under 25 minutes to over 45 minutes between October 2025 and January 2026.
  • Deployment overhang: METR estimates Claude can handle 5-hour tasks at 50% success. Actual median session is 45 seconds. Users underutilize the model's capability.
  • New users auto-approve ~20% of the time; experienced users reach 40%+. But experienced users also interrupt more often (5% to 9% of turns), shifting from approving everything to strategic monitoring and intervention.
  • Agents self-limit more than users do: agents ask for clarification more than twice as often on complex tasks compared to human interruptions.
  • Mandating exhaustive pre-approval workflows is counterproductive. Experienced users develop better instincts through monitoring.
  • 80% of tool calls have safeguards; 73% involve human oversight; only 0.8% are irreversible actions.
Related:wiki/researchwiki/research/coding-agent-best-practiceswiki/projects/agent-team
Blog code last updated on 2026-03-15: c04b780f9a9b20e56525019354100252a1c20141