Executive summary

Modern coding agents work best as agentic workflows rather than monolithic prompts: a bounded loop that (a) plans, (b) uses tools to gather evidence, (c) edits code, (d) runs verification, and (e) retries or escalates until a stop condition is met. This basic loop is explicitly supported by Claude's Agent SDK (the same loop used by Claude Code), including budgets (max_turns, max_budget_usd), parallel tool execution rules, and automatic compaction for long sessions.
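
The control flow of such a bounded loop can be sketched in a few lines. This is a minimal illustration, not the Agent SDK's implementation: `run_bounded_loop`, `LoopResult`, and the `step` callback (standing in for one plan/act/edit/verify cycle) are hypothetical names, and only the `max_turns` and `max_budget_usd` budget names are borrowed from the text above.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class LoopResult:
    status: str        # "success", "max_turns", or "max_budget"
    turns: int
    spent_usd: float


def run_bounded_loop(
    step: Callable[[int], tuple[bool, float]],
    max_turns: int,
    max_budget_usd: float,
) -> LoopResult:
    """Call step(turn) until it verifies or a budget is exhausted.

    `step` is a hypothetical callback for one plan/act/edit/verify cycle;
    it returns (verified, cost_in_usd).
    """
    spent = 0.0
    for turn in range(1, max_turns + 1):
        verified, cost = step(turn)
        spent += cost
        if verified:
            return LoopResult("success", turn, spent)
        if spent >= max_budget_usd:
            return LoopResult("max_budget", turn, spent)
    return LoopResult("max_turns", max_turns, spent)


# Toy step: verification passes on the third turn; each turn costs $0.10.
result = run_bounded_loop(lambda turn: (turn == 3, 0.10), max_turns=5, max_budget_usd=1.0)
print(result.status, result.turns)
```

The point of the sketch is that every exit path is explicit, so the caller can decide whether to resume, escalate to a stronger model, or hand off to a human.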

For Claude model selection, the current best practice for cost-capability trade-offs is a routing architecture that uses Opus 4.6 for high-stakes reasoning and final synthesis, Sonnet 4.6 for most implementation work, and Haiku 4.5 for cheap, low-latency "inner-loop" tasks (searching, triage, summarisation, lint-like review). Claude's own docs explicitly position Opus as best for complex tasks and agents; Sonnet as the best speed-intelligence balance; and Haiku as the fastest model, with lower per-token prices.

For structuring multi-step work, subagents are the most consistently effective primitive because they isolate context and keep intermediate tool noise out of the main thread; Claude Code and the Agent SDK both support subagents and model/tool scoping (including explicit routing like model="haiku"|"sonnet"|"opus"). They are usually preferable to "parallel bash scripts that call multiple agents", unless you need process-level isolation, heterogeneous runtimes, or independent failure domains.

State and knowledge should be treated as tiers: short-term working context (prompt + current files), durable workflow state (checkpoints/sessions), and long-term knowledge (memory stores / RAG). Claude provides first-party primitives for all three: sessions and file checkpointing for workflow continuity, a client-side Memory tool for persistent cross-session storage, and server-side compaction plus context editing and prompt caching for long-running conversations and cost control.

The "standards stack" for 2024-2026 agent systems is converging around:

MCP (Model Context Protocol) for tool/data connectivity (JSON-RPC 2.0; host/client/server roles)
A2A (Agent2Agent) for agent-to-agent interoperability (agent discovery + messaging + long-running tasks)
OpenTelemetry GenAI semantic conventions for vendor-neutral tracing/metrics of model and agent operations
Repo-level instruction files as a governance layer (Claude's CLAUDE.md; Codex-style AGENTS.md)

Defining coding agents and choosing between Opus, Sonnet, and Haiku

What a "coding agent" is in current practice

In 2022-2026 research and production systems, a coding agent is typically a controller around an LLM that alternates between reasoning and acting (tool use), rather than a single prompt completion. The ReAct paradigm formalised this interleaving of reasoning traces and actions to reduce hallucination and improve task trajectories.

Coding-agent-specific research further emphasises that interface design (the "agent-computer interface") materially changes performance: SWE-agent shows that shaping file navigation/editing/testing actions improves autonomous software-engineering outcomes.

Recommended Claude model roles

Current Claude model guidance and pricing support a three-tier routing approach:

Opus 4.6: lead planner + final arbiter for complex reasoning/high-risk edits; extended/adaptive thinking; up to 128K output tokens; $5/MTok input and $25/MTok output (standard).
Sonnet 4.6: default implementation model for most code changes; fast; extended + adaptive thinking; $3/MTok input and $15/MTok output.
Haiku 4.5: fast "worker" for search/summarise/classify/review and other cheap steps; extended thinking supported but no adaptive thinking; $1/MTok input and $5/MTok output.
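
To make the price gap concrete, the per-MTok figures listed above can be turned into a small cost estimator. This is a sketch under stated assumptions: the dictionary keys are illustrative labels rather than API model IDs, the token volumes are made up, and the calculation ignores caching, batch discounts, and long-context premiums.

```python
# USD per million tokens (input, output), copied from the list above.
# The keys are illustrative labels, not API model identifiers.
PRICES_PER_MTOK = {
    "opus-4.6": (5.0, 25.0),
    "sonnet-4.6": (3.0, 15.0),
    "haiku-4.5": (1.0, 5.0),
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int) -> float:
    """Estimate the standard (uncached, non-batch) cost of one workload."""
    in_price, out_price = PRICES_PER_MTOK[model]
    return (input_tokens * in_price + output_tokens * out_price) / 1_000_000


# Hypothetical workload: 2M input tokens and 200K output tokens.
for model in PRICES_PER_MTOK:
    print(model, estimate_cost_usd(model, 2_000_000, 200_000))
```

For this hypothetical workload the estimate is $15 on Opus, $9 on Sonnet, and $3 on Haiku, which is why pushing exploration and triage down to the cheapest tier matters.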

A concrete best practice is to route based on risk and irreversibility:

Haiku: reversible, low-risk steps (codebase exploration, summarisation, log triage, change detection). Claude Code's built-in Explore subagent is explicitly Haiku and read-only to keep exploration out of main context.
Sonnet: routine code edits + test-running loops.
Opus: architectural decisions, multi-module refactors, security-sensitive diffs, final review and commit message synthesis.
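
One way to encode this routing rule is a small pure function that the orchestrator calls before dispatching a step. The function name, the three risk labels, and the decision to send every irreversible step to Opus are assumptions made for illustration; the returned strings mirror the model aliases used elsewhere in this document.

```python
def pick_model(risk: str, reversible: bool) -> str:
    """Return a model tier for a step; risk is "low", "medium", or "high"."""
    if risk not in {"low", "medium", "high"}:
        raise ValueError(f"unknown risk level: {risk!r}")
    # Anything high-risk or hard to undo goes to the strongest model.
    if risk == "high" or not reversible:
        return "opus"
    # Routine, reversible edits and test loops.
    if risk == "medium":
        return "sonnet"
    # Cheap, read-only work such as exploration and log triage.
    return "haiku"


print(pick_model("low", reversible=True))
print(pick_model("medium", reversible=True))
print(pick_model("medium", reversible=False))
```

Keeping the rule in code rather than in a prompt makes it testable and auditable; how a step is assigned a risk label is left to the caller.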

Subagents versus parallel bash scripts for multi-agent work

Practical comparison

Claude provides multiple ways to "parallelise": subagents (single session), agent teams (multiple sessions), and plain scripting/CLI orchestration. Subagents are explicitly designed to preserve the main context and constrain tool access, and Claude Code recommends agent teams only when parallel workers must communicate with each other.

Comparison table: subagents vs parallel scripts vs Claude Code agent teams

Dimension	Subagents (within one session)	Parallel scripts (bash launching separate agents)	Agent teams (multi-session, coordinated)
Context isolation	Strong: each subagent has its own context; only final result returns	Strong: separate processes/sessions	Strong: each teammate has own context
Communication	Typically back to caller only	Whatever you build (files, pipes, queues)	Direct teammate-to-teammate messaging + shared task list
Tool/permission scoping	First-class (per subagent tools/permissions/model)	OS-level isolation; app-level scoping is DIY	Per session + platform controls; more moving parts
Coordination overhead	Low	Medium-high (merging, ordering, conflicts)	High (but built in); "significantly more tokens"
When it wins	Focused delegation where only the result matters; reduce context bloat	Heterogeneous runtimes; strict isolation; CI fan-out	Collaborative exploration/review requiring discussion
Key limitations	Subagents can't spawn subagents; requires clear descriptions	Harder to maintain deterministic state; error handling is DIY	Experimental; disabled by default; known limitations

Claude Code's own comparison highlights the core trade-off: subagents cost fewer tokens and report only to the main agent, whereas agent teams enable direct inter-agent communication at the price of more tokens and coordination overhead.

Recommendation pattern

Use subagents by default, and only "graduate" to parallel scripts or agent teams when one of these is true:

You need process-level isolation (untrusted repos; risky tools; differing dependency stacks) -> parallel scripts/containers.
You need independent concurrency with minimal coupling across many tasks (e.g., scanning 200 repos nightly) -> parallel scripts plus a workflow engine.
You need collaborative reasoning (debate, competing hypotheses, cross-layer coordination) -> agent teams.

Orchestration standards, frameworks, and reliability patterns

The emerging standards stack

MCP is an open protocol for integrating LLM applications with tools/data; it specifies host/client/server roles and uses JSON-RPC 2.0.
Claude's MCP connector lets you call remote MCP servers directly from the Messages API (beta header required), supporting tool calls over HTTP/SSE transports, with OAuth bearer token support.
A2A aims to standardise agent discovery and communication, and is documented both by major vendors and the Linux Foundation ecosystem.
OpenTelemetry GenAI semantic conventions provide standard span and attribute conventions for GenAI inference and agent operations, enabling vendor-neutral observability pipelines.
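
Because MCP rides on JSON-RPC 2.0, a tool invocation is just a small JSON envelope. The sketch below builds one by hand to show the shape; `tools/call` is the MCP method for invoking a tool, while the `jsonrpc_request` helper, the `run_tests` tool name, and its arguments are hypothetical. In practice an MCP client library would construct and send this for you.

```python
import json
from itertools import count

_ids = count(1)  # JSON-RPC requests carry an id so responses can be matched


def jsonrpc_request(method: str, params: dict) -> str:
    """Serialise a JSON-RPC 2.0 request with a fresh integer id."""
    envelope = {"jsonrpc": "2.0", "id": next(_ids), "method": method, "params": params}
    return json.dumps(envelope)


# "run_tests" and its arguments are hypothetical; a real server advertises
# its own tools and input schemas.
print(jsonrpc_request("tools/call", {"name": "run_tests", "arguments": {"path": "tests/"}}))
```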

Orchestration frameworks and where they fit

A robust "coding agent" stack usually needs two layers:

1) Agent runtime (tool loop, permissions, subagents, sessions)
2) Workflow runtime (durable retries, scheduling, idempotency, fan-out/fan-in, backpressure)

Comparison table: orchestration frameworks and runtimes

Category	Example	Strengths for coding agents	Gaps / cautions
Claude-native agent runtime	Claude Agent SDK	Built-in agent loop, tool execution, permissions and hooks; subagents; sessions; automatic compaction; checkpointing; cost tracking; designed for code-edit + bash workflows	Still needs external job orchestration for large fleets/scheduling; sandboxing and credential isolation are your responsibility
Graph-based agent orchestration	LangGraph	Checkpointed persistence and durable execution; pause/resume; human-in-loop; "threads" and state snapshots useful for debugging/time-travel	Requires you to define nodes/state carefully; idempotency and determinism matter for "durable execution" claims
Multi-agent conversation framework	AutoGen	Research-backed multi-agent conversation patterns; async/event-driven scaling emphasised by v0.4 redesign	More DIY around security hardening and deterministic workflow semantics
Enterprise agent framework	Microsoft Agent Framework	Focus on open standards (MCP, OpenAPI), durability, approvals, security, OpenTelemetry-out-of-box; explicit "Workflow" abstraction with checkpoint/pause/resume	Tightly aligned to Azure ecosystem for many deployments
Vendor agent SDK	OpenAI Agents SDK	Tool use, handoffs to specialised agents, streaming, tracing support; pairs with repo instruction conventions (`AGENTS.md`)	Different primitives than Claude; keep provider abstraction boundaries clean
Durable workflow engine	Temporal	Standardised retries/timeouts and durable execution concept; fits long-running background "agent jobs"	You still pick an agent runtime; handle side effects carefully via activities/tasks
Data/work scheduler	Airflow / Prefect	Retries, caching, concurrency semantics; good for scheduled agent runs and batch evaluation pipelines	Not specialised for interactive tool loops; you still implement agent-level state

Messaging and control-flow patterns that consistently work

Research and production frameworks converge on a handful of patterns:

Plan-and-execute / orchestrator-worker: a lead agent decomposes tasks and dispatches to workers; aligns with multi-agent patterns described in AutoGen materials and Claude agent teams.
ReAct loop: interleave reasoning and tool use to reduce hallucinations and improve exception handling.
Reflect-and-retry: use explicit feedback signals or reflections to improve next attempts (Reflexion).
Search over reasoning paths (Tree of Thoughts): useful for high-stakes design decisions, but expensive; best reserved for Opus-gated "decision points".

Retries and error handling for Claude API and SDK loops

A production-grade Claude-based coding agent should distinguish:

Provider transient errors: 529 overloaded_error, 500 api_error -> retry with exponential backoff + jitter; keep idempotent tool semantics.
Rate limits: 429 rate_limit_error + retry-after header; inspect anthropic-ratelimit-* headers for remaining quota and reset time.
Agent loop budget exhaustion: the Agent SDK returns explicit result subtypes (error_max_turns, error_max_budget_usd) so you can resume/hand off.
Structured output repair loops: Agent SDK can stop with error_max_structured_output_retries when schema validation fails too many times; treat this as a prompt/schema bug or a routing signal (switch to Opus, simplify schema).
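
The first two categories above reduce to a small delay policy. The following is a minimal sketch, assuming the caller has already extracted the HTTP status and any retry-after value in seconds; `retry_delay`, the base and cap values, and the choice of "full jitter" are illustrative assumptions, and official client libraries may already apply their own retry logic.

```python
import random
from typing import Optional

# Status codes the text above treats as retryable.
RETRYABLE_STATUS = {429, 500, 529}


def retry_delay(
    status: int,
    attempt: int,
    retry_after: Optional[float] = None,
    base: float = 1.0,
    cap: float = 60.0,
) -> Optional[float]:
    """Seconds to wait before retry number `attempt` (0-based), or None to give up."""
    if status not in RETRYABLE_STATUS:
        return None  # e.g. 400/401: retrying will not help
    if status == 429 and retry_after is not None:
        return retry_after  # the server told us how long to wait
    # Full jitter: uniform in [0, min(cap, base * 2**attempt)].
    return random.uniform(0.0, min(cap, base * 2 ** attempt))


print(retry_delay(429, 0, retry_after=7.0))
print(retry_delay(400, 0))
```

Budget exhaustion and structured-output failures are deliberately absent from this function: they are signals to resume, re-route, or fix the schema, not to sleep and retry.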

State, knowledge, and context management

State tiers and recommended primitives

A robust coding agent separates these layers:

Workflow state: "what have we done in this run?"
Agent SDK sessions persist conversation state across turns (and can be resumed/forked), and file checkpointing supports rewinding file changes made via Write/Edit/NotebookEdit.
Key limitation: checkpointing does not capture file edits done via shell commands (Bash), so high-integrity workflows should prefer SDK edit tools for tracked changes.
Short-term working memory: "what's in the current context window?"
Manage via subagents, tool-search, compaction/context-editing, and selective tool outputs.
Long-term memory / knowledge: "what should persist across runs?"
Claude's Memory tool is explicitly designed to store/retrieve information across conversations via a client-side /memories directory, enabling just-in-time retrieval. Restrict operations to /memories for security.
For broader organisational knowledge, use RAG and/or curated instruction files (CLAUDE.md, AGENTS.md) as stable anchors.
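
Since the Memory tool is client-side, confining it to /memories is the integrator's job. A minimal path check might look like the sketch below; `resolve_memory_path` is a hypothetical helper, and it covers only lexical traversal (such as `..` segments and absolute paths). A real handler would also need to consider symlinks and encoded path segments.

```python
import posixpath

MEMORY_ROOT = "/memories"


def resolve_memory_path(requested: str) -> str:
    """Normalise a requested path and reject anything outside /memories."""
    # join() lets an absolute `requested` replace the root, so the prefix
    # check below also catches paths like "/etc/passwd".
    normalised = posixpath.normpath(posixpath.join(MEMORY_ROOT, requested))
    if normalised != MEMORY_ROOT and not normalised.startswith(MEMORY_ROOT + "/"):
        raise PermissionError(f"path escapes {MEMORY_ROOT}: {requested!r}")
    return normalised


print(resolve_memory_path("notes/build.md"))
```

Calling `resolve_memory_path("../etc/passwd")` raises `PermissionError`, which the tool handler can turn into an error result for the model.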

RAG, grounding, and consistency

Research from 2023-2025 consistently indicates that "more context" is not automatically better: long-context LLMs can degrade when fed too many retrieved passages, and retrieval should be selective and relevance-aware.

Practical best practices:

Selective retrieval and critique: Self-RAG shows gains from retrieving on-demand and reflecting/criticising retrieved passages, rather than always stuffing k passages.
Tiered memory: MemGPT formalises a useful engineering metaphor: treat context as fast memory and external stores as slow memory, swapping in/out via "interrupts" and retrieval policies.
Reflection summaries as durable artefacts: Generative Agents demonstrates storing experiences, synthesising reflections, and retrieving them dynamically. While not coding-specific, the memory/reflection pattern transfers well to "project memory" and "decision logs".

For output consistency and safe tool use, prefer mechanisms over prompting: structured outputs (JSON schema output) and strict tool use (strict: true) are designed to guarantee valid JSON responses and validated tool parameters, with grammar compilation cached for 24 hours.

Claude-native context management toolbox

Claude provides several complementary levers:

Prompt caching: caches prompt prefixes (tools/system/messages up to cache breakpoints). It stores KV cache + cryptographic hashes, not raw text, and supports automatic caching and explicit breakpoints (up to 4). 5-minute TTL is default; 1-hour TTL is available at additional cost.
Cache-aware rate limits: for most models, cache_read_input_tokens don't count toward ITPM, improving effective throughput with high cache hit rates.
Compaction: server-side summarisation that emits a compaction block and drops earlier history; supported on Opus 4.6 and Sonnet 4.6 (beta header required).
Context editing: selective clearing strategies (tool result clearing and thinking block clearing) to keep context focused; context editing is beta and not eligible for ZDR.
1M token context: supported (beta) for Opus 4.6 and Sonnet 4.6 (and some Sonnet variants), requires a beta header and eligibility (usage tier 4/custom). Requests >200K tokens switch to premium pricing (2x input, 1.5x output) and dedicated rate limits.
Tool Search Tool + Programmatic Tool Calling: reduces tool-definition overhead and keeps intermediate tool results out of the main context by moving orchestration/data processing into a code execution environment; Anthropic reports large context savings and accuracy improvements for large tool libraries.
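
As an illustration of the first lever, an explicit cache breakpoint is a `cache_control` marker on a content block. The sketch below only builds the request arguments, so it runs without network access; `build_cached_request` is a hypothetical helper, and the model name and token limit are placeholder choices.

```python
def build_cached_request(system_text: str, user_text: str) -> dict:
    """Build Messages API keyword arguments with one explicit cache breakpoint."""
    return {
        "model": "claude-sonnet-4-6",
        "max_tokens": 1024,
        "system": [
            {
                "type": "text",
                "text": system_text,
                # The prefix up to and including this block is cacheable.
                "cache_control": {"type": "ephemeral"},
            }
        ],
        # The user turn changes per request, so it sits after the breakpoint.
        "messages": [{"role": "user", "content": user_text}],
    }


request = build_cached_request("Long, stable project instructions...", "Summarise auth.ts")
print(request["system"][0]["cache_control"])
```

With the Python client this would be sent as `client.messages.create(**request)`. The design choice is to keep stable material (tools, system prompt, project instructions) ahead of the breakpoint and volatile material after it.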

Comparison table: state and knowledge management approaches

Approach	Best for	Strengths	Failure modes / trade-offs
Subagents	Keeping main context clean	Fresh context per subtask; only final summary returns; supports model/tool scoping	Requires good descriptions; subagents don't inherit parent history; can't recursively spawn
Sessions + file checkpointing	Long-running code-edit loops	Resume/fork; rewind tracked file edits to prior states	Bash edits aren't tracked; rewind doesn't rewind conversation
Memory tool	Persistent cross-session project memory	Client-side control; just-in-time retrieval; works with ZDR arrangements	You must implement storage + enforce `/memories` confinement
Prompt caching	Reducing repeated prefix cost/latency	Large cost savings when prompts share stable prefixes	Needs careful breakpoint placement; concurrency caveat (cache entry becomes available after first response begins)
Compaction / context editing	Very long sessions	Automatic summarisation/clearing to stay within context	Summaries can omit details; beta constraints, ZDR eligibility differs
RAG + vector store	Org knowledge, doc grounding	Freshness, traceability, targeted context	Retrieval noise can hurt; long-context + RAG can degrade if you stuff too much

Testing, observability, and debugging practices

Evaluation-first development

Claude's testing guidance is aligned with standard evaluation-driven engineering: define specific, measurable, achievable, relevant success criteria, then build evaluations around them.

Claude Console includes an Evaluation tool for prompt testing with variable-driven test sets (including CSV import and generated test cases).

Deterministic validation hooks for coding agents

For coding agents, the most reliable "truth signals" remain:

Compiler/typecheck output
Unit/integration test results
Lint/format checks
Static/dynamic security scans
Build output / package constraints

Operationally, Claude's Agent SDK loop makes these checks cheap to integrate because Bash and file tools are first-class, and you can cap spend with max_budget_usd and turns with max_turns.
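
Outside the agent loop, the same checks can be run as a deterministic gate. The sketch below is one possible shape, assuming each check is a command whose exit status signals pass or fail; `run_checks` is a hypothetical helper, and the demo commands are stand-ins for real typecheck, test, and lint invocations.

```python
import subprocess
import sys


def run_checks(checks: dict[str, list[str]], timeout_s: float = 600.0) -> dict[str, bool]:
    """Run each named command and report whether it exited with status 0."""
    results = {}
    for name, argv in checks.items():
        try:
            proc = subprocess.run(argv, capture_output=True, text=True, timeout=timeout_s)
            results[name] = proc.returncode == 0
        except (subprocess.TimeoutExpired, FileNotFoundError):
            results[name] = False  # a hung or missing tool counts as a failure
    return results


# Stand-in commands so the sketch runs anywhere; replace them with your
# project's real typecheck/test/lint commands.
demo = run_checks({
    "passes": [sys.executable, "-c", "raise SystemExit(0)"],
    "fails": [sys.executable, "-c", "raise SystemExit(1)"],
})
print(demo)
```

Treating a timeout or a missing binary as a failure keeps the gate conservative: the agent's change is accepted only when every check positively passes.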

Guardrails against hallucinations and prompt leakage

Claude's guardrail docs emphasise:

Make the system auditable with quotes/citations and verification checks
Allow explicit "I don't know" responses
Restrict models to provided documents when grounding matters

For prompt leakage, Claude recommends starting with monitoring and post-processing (screening outputs), and warns that "leak-proofing" prompts can degrade performance due to complexity.

Observability stack and standardisation

Adopt OpenTelemetry for traces/metrics/logs end-to-end, using GenAI semantic conventions for spans.
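
To show what those conventions look like on a span, the sketch below assembles a span name and attribute set as plain Python data. The attribute keys follow the GenAI semantic conventions as I understand them at the time of writing; the conventions are still evolving, so check names against the current specification. `genai_span` is a hypothetical helper, and a real pipeline would pass these values to an OpenTelemetry tracer rather than return them.

```python
def genai_span(operation: str, model: str, input_tokens: int, output_tokens: int) -> tuple[str, dict]:
    """Return (span_name, attributes) for one model call."""
    # Convention: span name is "<operation> <model>", e.g. "chat claude-sonnet-4-6".
    name = f"{operation} {model}"
    attributes = {
        "gen_ai.operation.name": operation,
        "gen_ai.request.model": model,
        "gen_ai.usage.input_tokens": input_tokens,
        "gen_ai.usage.output_tokens": output_tokens,
    }
    return name, attributes


span_name, span_attrs = genai_span("chat", "claude-sonnet-4-6", 1200, 300)
print(span_name)
print(span_attrs)
```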

Claude also supports organisation-level analytics via the Claude Code Analytics Admin API, including tool acceptance rates, model-level costs, and productivity metrics; it explicitly positions this as bridging basic dashboards and more complex OpenTelemetry integrations.

Comparison table: observability and debugging tooling

Tooling area	Recommended baseline	Why it matters for agents
Distributed tracing	OpenTelemetry GenAI conventions	Standard fields across vendors/frameworks; correlates tool calls, model calls, retries
Agent-run introspection	Claude Agent SDK message stream + result subtypes	Distinguish provider errors vs budget exhaustion vs schema failures
Change control	File checkpointing + git patch review gates	Enables safe exploration and rollback paths
Prompt/output reliability	Structured outputs + strict tool use	Turns "parsing" into compile-time grammar constraints; cached grammars reduce repeat latency
Organisational rollouts	Claude Code Analytics API + usage/cost API	Track adoption, cost, model mix, tool acceptance to tune policies

Deployment, scaling, and cost-efficiency

Secure-by-default deployment shape

Claude's hosting guidance for the Agent SDK recommends container-based sandboxing to isolate processes, control resources and network access, and support ephemeral filesystems. It also provides baseline resource guidance (e.g., ~1 GiB RAM, ~5 GiB disk, 1 CPU per instance as a starting point) and notes SDK runtime requirements for Python/Node.

Claude's secure deployment guidance strongly recommends a proxy pattern that injects credentials outside the agent's security boundary so the agent never sees secrets; the proxy can also enforce allowlists and log requests for auditing.
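
The core of that proxy pattern fits in one function: check the destination against an allowlist, discard any credential the agent supplied, and attach the real one. The sketch below shows only the header logic, not a working HTTP proxy; the host name, the token value, and the `forward_headers` helper are all hypothetical.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist and secret store; both live outside the agent's sandbox.
ALLOWED_HOSTS = {"api.github.com"}
SECRETS = {"api.github.com": "token-held-by-proxy"}


def forward_headers(url: str, agent_headers: dict[str, str]) -> dict[str, str]:
    """Return the headers the proxy would send upstream, or raise if the host is blocked."""
    host = urlsplit(url).hostname
    if host not in ALLOWED_HOSTS:
        raise PermissionError(f"host not on allowlist: {host!r}")
    # Drop any credential the agent tried to supply, then inject the real one.
    headers = {k: v for k, v in agent_headers.items() if k.lower() != "authorization"}
    headers["Authorization"] = f"Bearer {SECRETS[host]}"
    return headers


print(sorted(forward_headers("https://api.github.com/repos", {"Accept": "application/json"})))
```

Because the secret is added after the request leaves the sandbox, a prompt-injected agent can at worst misuse an allowlisted endpoint; it cannot read or exfiltrate the token itself.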

Scalability patterns

For scaling beyond a handful of agents, treat agent runs as jobs and adopt queue-based backpressure:

Interactive: long-lived agent instances with streaming UI; stricter permission modes and human approvals.
Batch/offline: schedule runs via workflow engines (Temporal/Prefect/Airflow), store outputs + traces, and gate merges via CI.
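
Queue-based backpressure can be illustrated in-process with a bounded asyncio queue: producers block when the queue is full, and a fixed worker pool caps how many agent runs are in flight. This is a toy sketch; `run_jobs` is a hypothetical name, the sleep stands in for an agent run, and a production system would use a durable queue or workflow engine instead.

```python
import asyncio


async def run_jobs(jobs: list[str], max_concurrent: int) -> tuple[list[str], int]:
    """Process jobs with a fixed worker pool; return results and peak concurrency."""
    queue: asyncio.Queue = asyncio.Queue(maxsize=max_concurrent)  # bounded queue
    results: list[str] = []
    in_flight = 0
    peak = 0

    async def worker() -> None:
        nonlocal in_flight, peak
        while True:
            job = await queue.get()
            in_flight += 1
            peak = max(peak, in_flight)
            await asyncio.sleep(0.01)  # stand-in for one agent run
            results.append(f"done:{job}")
            in_flight -= 1
            queue.task_done()

    workers = [asyncio.create_task(worker()) for _ in range(max_concurrent)]
    for job in jobs:
        await queue.put(job)  # blocks while the queue is full (backpressure)
    await queue.join()        # wait until every job has been processed
    for w in workers:
        w.cancel()
    await asyncio.gather(*workers, return_exceptions=True)
    return results, peak


results, peak = asyncio.run(run_jobs([f"repo-{i}" for i in range(10)], max_concurrent=3))
print(len(results), peak)
```

The same shape carries over to a real deployment: the queue becomes a message broker or workflow task queue, and `max_concurrent` becomes the worker pool size tuned to your rate limits.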

Claude's Messages Batch API offers 50% pricing for asynchronous workloads, supports Messages features (vision/tool use/multi-turn/betas), and notes practical constraints (processing demand, potential expiry after 24 hours, and possible spend-limit overshoot).

Cost levers that matter most in practice

From highest leverage to lowest:

Prompt caching: cache-hit reads priced at a fraction of base input, with explicit multipliers documented; supports multiple breakpoints and has concurrency caveats.
Batch processing: 50% discount for non-latency-sensitive workloads.
Model routing: Haiku for cheap steps, Sonnet for default edits, Opus only where needed.
Tool Search Tool / deferred tool loading: avoid 10K-100K tokens of tool schema overhead in MCP-heavy setups; Anthropic reports large savings and accuracy gains.
Programmatic Tool Calling: keep intermediate tool results out of model context and reduce round-trips; Anthropic reports token reductions on complex workflows.
Token counting: preflight token usage and route accordingly (especially before expensive long-context calls).

Governance, safety, and access control

Access control and permissions as first-class design constraints

Claude Agent SDK permission evaluation is explicit and ordered: hooks first, then deny rules, then allow rules, then a canUseTool callback; deny rules override even bypass modes.
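
The ordering described above can be modelled as a short decision function. This is a simplified sketch of the idea, not the SDK's actual implementation: `is_allowed` is a hypothetical name, rules are matched with shell-style globs for brevity (the real rule syntax may differ), and hooks are modelled as callables that return True, False, or None to defer.

```python
from fnmatch import fnmatchcase
from typing import Callable, Optional

Hook = Callable[[str], Optional[bool]]  # True/False decides; None defers


def is_allowed(
    tool_call: str,
    hooks: list[Hook],
    deny: list[str],
    allow: list[str],
    can_use_tool: Callable[[str], bool],
) -> bool:
    """Evaluate hooks, then deny rules, then allow rules, then the fallback callback."""
    for hook in hooks:
        decision = hook(tool_call)
        if decision is not None:
            return decision
    if any(fnmatchcase(tool_call, pattern) for pattern in deny):
        return False  # deny is checked before allow, so it wins on overlap
    if any(fnmatchcase(tool_call, pattern) for pattern in allow):
        return True
    return can_use_tool(tool_call)


# "Bash(rm:-rf)" matches both lists below; the deny rule takes precedence.
print(is_allowed("Bash(rm:-rf)", [], ["Bash(rm:*)"], ["Bash(*)"], lambda call: True))
```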

Operational best practice: start from deny-by-default in production agents, then progressively allowlist tools and even sub-commands (e.g., Bash(npm:*)) as confidence grows. Require approvals for high-impact actions (writes, network calls, credentialed operations), and use hooks to enforce policy.

Threat model alignment and best-practice references

Use well-recognised security frameworks to structure governance:

OWASP Top 10 for LLM Applications (prompt injection, insecure output handling, model DoS, supply chain, etc.) as an application security checklist.
NIST AI RMF for organisational risk management structure and controls.
Claude-specific research on prompt injection defences for browsing agents, and Anthropic's framework for safe agents.

Standards and governance drift control

The 2025-2026 ecosystem is explicitly pushing open, neutral standards bodies for agent infrastructure (e.g., AAIF under the Linux Foundation, with contributions including MCP and AGENTS.md). This trend matters for governance because it reduces vendor lock-in and encourages consistent policy surfaces across tools.

Recommended architectures, example flows, and implementation roadmap

Recommended architecture for a production coding agent with Claude models

A practical "default" architecture that scales from laptop to Kubernetes:

Controller (Opus): owns task plan, risk policy, final synthesis, and escalation decisions.
Workers (Sonnet/Haiku subagents): do exploration, edits, tests, reviews, and summarisation in isolated contexts.
Durable state: sessions + checkpointing for safe rollback; memory tool + RAG for cross-run context.
Tool layer: local tools (read/edit/bash), MCP servers for external systems; tool search to keep schemas small.
Governance: permission rules + hooks + secret-injection proxy; audit logging via OpenTelemetry.

Mermaid orchestration flow

flowchart TD
  U[User or CI Job] --> O[Opus 4.6: Orchestrator]
  O -->|decompose| Q{Task type}
  Q -->|Explore / summarise| H[Haiku 4.5 subagent]
  Q -->|Implement / refactor| S[Sonnet 4.6 subagent]
  Q -->|High-risk review| R[Opus 4.6 reviewer subagent]

  H --> TB[Tool layer]
  S --> TB
  R --> TB

  TB --> FS[Read/Edit/Bash tools]
  TB --> MCP[MCP tools via servers]
  MCP --> TS[Tool Search / deferred loading]

  O --> MEM[Memory tool /memories]
  O --> CKPT[File checkpointing]

  O --> CI[Run tests / lint / build]
  CI -->|pass| DONE[Commit / PR / report]
  CI -->|fail| O

Mermaid timeline for a typical "fix failing tests" run

sequenceDiagram
  participant Lead as Lead (Opus)
  participant Run as Test Runner (Sonnet)
  participant Exp as Explorer (Haiku)
  participant Tools as Tools (Bash/Read/Edit)

  Lead->>Tools: checkpoint + gather context
  par Explore codebase
    Lead->>Exp: find relevant modules & summarise
    Exp->>Tools: Read/Grep/Glob
    Exp->>Lead: summary + candidate root causes
  and Run tests
    Lead->>Run: run tests, capture failures
    Run->>Tools: Bash "npm test"
    Run->>Lead: failing tests + logs
  end
  Lead->>Tools: Edit/Write changes (tracked)
  Lead->>Tools: Bash "npm test" rerun
  alt tests pass
    Lead->>Lead: synthesise explanation + PR plan
  else tests fail
    Lead->>Lead: adjust plan, retry or escalate effort/model
  end

Example code: Claude Agent SDK with Opus lead and Sonnet/Haiku subagents

import asyncio
from claude_agent_sdk import query, ClaudeAgentOptions, AgentDefinition

async def main():
    opts = ClaudeAgentOptions(
        # Allow subagent invocation ("Agent") and core dev tools
        allowed_tools=["Read", "Edit", "Write", "Bash", "Grep", "Glob", "Agent"],
        # Route work by subagent description + explicit model overrides
        agents={
            # Cheap, read-only exploration
            "explorer": AgentDefinition(
                description="Read-only codebase exploration and summarisation.",
                prompt="Explore the repo to find the relevant files and summarise findings.",
                tools=["Read", "Grep", "Glob"],
                model="haiku",
            ),
            # Default implementation worker
            "implementer": AgentDefinition(
                description="Implements code changes and refactors safely; runs tests as needed.",
                prompt="Implement requested changes. Prefer small diffs. Run tests after edits.",
                tools=["Read", "Edit", "Write", "Bash", "Grep"],
                model="sonnet",
            ),
            # High-stakes, read-only review
            "security-reviewer": AgentDefinition(
                description="High-rigor security review; use for auth/crypto/secrets risks.",
                prompt="Review for security vulnerabilities. Be strict; cite exact code lines.",
                tools=["Read", "Grep", "Glob"],
                model="opus",
            ),
        },
        # Production defaults: cap both turns and spend
        max_turns=20,
        max_budget_usd=5.00,
    )

    async for msg in query(
        prompt="Fix failing tests in auth.ts. Use explorer then implementer; run tests.",
        options=opts,
    ):
        # Print the final result when a message carries one
        if hasattr(msg, "result"):
            print(msg.result)

asyncio.run(main())

This pattern is directly supported by the Agent SDK's subagent model routing ('sonnet'|'opus'|'haiku'|'inherit') and its requirement that the Agent tool is allowed for subagent invocation, with context isolation as the main value.

Example code: Messages API with parallel tool use controls

import anthropic

# Reads ANTHROPIC_API_KEY from the environment
client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=4000,
    # False leaves parallel tool use enabled; set True to disable it
    tool_choice={"type": "auto", "disable_parallel_tool_use": False},
    tools=[
        {
            "name": "get_repo_files",
            "description": "List files",
            "input_schema": {"type": "object", "properties": {}},
        },
        {
            "name": "run_tests",
            "description": "Run tests",
            "input_schema": {"type": "object", "properties": {}},
        },
    ],
    messages=[{"role": "user", "content": "List relevant files and run the unit tests."}],
)

Claude docs explicitly describe parallel tool use, and how to disable it via disable_parallel_tool_use.

Prioritised checklist of best practices

High priority:

Use subagents to isolate context (separate "explore", "implement", "review"), and route models: Haiku for exploration, Sonnet for implementation, Opus for high-stakes review/synthesis.
Enforce least privilege with allow/deny rules + hooks; treat deny rules as non-bypassable.
Implement automatic rollback with file checkpointing; avoid untracked Bash edits for critical modifications.
Standardise outputs through structured outputs + strict tool use where machine consumption matters.
Build evaluation sets and measure success; use Claude's evaluation tooling for prompt iteration.

Medium priority:

Use prompt caching (and careful breakpoints) for stable prefixes; monitor cache hit rates and exploit cache-aware rate limits.
Use Batch API for non-latency-sensitive workloads (nightly scans, large-scale refactors/evals).
Adopt MCP + tool search for large tool catalogues; avoid shipping 50K-100K tokens of tool schemas every turn.

Lower priority (but valuable at scale):

Introduce a durable workflow engine (Temporal/Prefect/Airflow) when you need scheduling, strong retries, and fleet-scale backpressure.
Standardise observability via OpenTelemetry GenAI conventions; integrate org monitoring with analytics APIs.

Short implementation roadmap

Week 1-2: Adopt CLAUDE.md (project instructions) and define three subagents: explorer (Haiku read-only), implementer (Sonnet), reviewer (Opus). Add budgets (max_turns, max_budget_usd) and checkpointing to make failures cheap and reversible.

Week 3-4: Add structured outputs for agent "plans" and "change manifests"; enforce strict tool schemas for high-value tools. Add prompt caching + token counting preflights; establish a cost dashboard and alert thresholds.

Month 2: Add Memory tool for cross-run project continuity (decision logs, build commands, recurring pitfalls). Integrate MCP servers for your SDLC systems; enable tool search/deferred loading if schemas get large.

Month 3: Harden isolation: container sandboxing with credential-injection proxy; formalise policy hooks; implement OpenTelemetry traces. If running large fleets: migrate job orchestration to a durable workflow engine and schedule batch runs via Batch API for cost efficiency.

References

How the agent loop works - Claude API Docs https://platform.claude.com/docs/en/agent-sdk/agent-loop
Models overview - Claude API Docs https://platform.claude.com/docs/en/about-claude/models/overview
Create custom subagents - Claude Code Docs https://code.claude.com/docs/en/sub-agents
Memory tool - Claude API Docs https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool
Specification - Model Context Protocol https://modelcontextprotocol.io/specification/2025-11-25
Announcing the Agent2Agent Protocol (A2A) https://developers.googleblog.com/en/a2a-a-new-era-of-agent-interoperability/
Semantic conventions for generative client AI spans https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/
Claude Code overview - Claude Code Docs https://code.claude.com/docs/en/overview
ReAct: Synergizing Reasoning and Acting in Language Models https://arxiv.org/abs/2210.03629
SWE-agent - Computer Science > Software Engineering https://arxiv.org/abs/2405.15793
Pricing - Claude API Docs https://platform.claude.com/docs/en/about-claude/pricing
Effective context engineering for AI agents - Anthropic https://www.anthropic.com/engineering/effective-context-engineering-for-ai-agents
Orchestrate teams of Claude Code sessions - Claude Code Docs https://code.claude.com/docs/en/agent-teams
Orchestrating Subagents & Claude Skills - Reddit https://www.reddit.com/r/vibecoding/comments/1pg1y75/orchestrating_subagents_claude_skills_much_better/
MCP connector - Claude API Docs https://platform.claude.com/docs/en/agents-and-tools/mcp-connector
Hosting the Agent SDK - Claude API Docs https://platform.claude.com/docs/en/agent-sdk/hosting
Persistence - Docs by LangChain https://docs.langchain.com/oss/python/langgraph/persistence
Durable execution - Docs by LangChain https://docs.langchain.com/oss/python/langgraph/durable-execution
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation - Microsoft Research https://www.microsoft.com/en-us/research/publication/autogen-enabling-next-gen-llm-applications-via-multi-agent-conversation-framework/
Introducing Microsoft Agent Framework - Microsoft Foundry Blog https://devblogs.microsoft.com/foundry/introducing-microsoft-agent-framework-the-open-source-engine-for-agentic-ai-apps/
Agents SDK - OpenAI API https://developers.openai.com/api/docs/guides/agents-sdk/
Retry logic in Workflows: Best practices for failure handling https://temporal.io/blog/failure-handling-in-practice
Tasks - Airflow 3.1.8 Documentation https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/tasks.html
AutoGen WEF 2025 leave-behind (PDF) - Microsoft Research https://www.microsoft.com/en-us/research/wp-content/uploads/2025/01/WEF-2025_Leave_Behind_AutoGen.pdf
ReAct: Synergizing Reasoning and Acting in Language Models https://arxiv.org/abs/2210.03629
Reflexion: Language Agents with Verbal Reinforcement Learning https://arxiv.org/abs/2303.11366
Tree of Thoughts: Deliberate Problem Solving with Large Language Models https://arxiv.org/abs/2305.10601
Errors - Claude API Docs https://platform.claude.com/docs/en/api/errors
Work with sessions - Claude API Docs https://platform.claude.com/docs/en/agent-sdk/sessions
Rewind file changes with checkpointing - Claude API Docs https://platform.claude.com/docs/en/agent-sdk/file-checkpointing
Memory tool - Claude API Docs https://platform.claude.com/docs/en/agents-and-tools/tool-use/memory-tool
Claude Code overview - Claude Code Docs https://code.claude.com/docs/en/overview
Long-Context LLMs Meet RAG https://proceedings.iclr.cc/paper_files/paper/2025/file/5df5b1f121c915d8bdd00db6aac20827-Paper-Conference.pdf
Self-RAG: Learning to Retrieve, Generate, and Critique through Self-Reflection https://arxiv.org/abs/2310.11511
MemGPT: Towards LLMs as Operating Systems https://arxiv.org/abs/2310.08560
Generative Agents: Interactive Simulacra of Human Behavior https://arxiv.org/abs/2304.03442
Structured outputs - Claude API Docs https://platform.claude.com/docs/en/build-with-claude/structured-outputs
Prompt caching - Claude API Docs https://platform.claude.com/docs/en/build-with-claude/prompt-caching
Rate limits - Claude API Docs https://platform.claude.com/docs/en/api/rate-limits
Compaction - Claude API Docs https://platform.claude.com/docs/en/build-with-claude/compaction
Context editing - Claude API Docs https://platform.claude.com/docs/en/build-with-claude/context-editing
Context windows - Claude API Docs https://platform.claude.com/docs/en/build-with-claude/context-windows
Introducing advanced tool use on the Claude Developer Platform - Anthropic https://www.anthropic.com/engineering/advanced-tool-use
Subagents in the SDK - Claude API Docs https://platform.claude.com/docs/en/agent-sdk/subagents
Define success criteria and build evaluations - Claude API Docs https://platform.claude.com/docs/en/test-and-evaluate/develop-tests
Using the Evaluation Tool - Claude API Docs https://platform.claude.com/docs/en/test-and-evaluate/eval-tool
Reduce hallucinations - Claude API Docs https://platform.claude.com/docs/en/test-and-evaluate/strengthen-guardrails/reduce-hallucinations
Reduce prompt leak - Claude API Docs https://platform.claude.com/docs/en/test-and-evaluate/strengthen-guardrails/reduce-prompt-leak
Claude Code Analytics API - Claude API Docs https://platform.claude.com/docs/en/build-with-claude/claude-code-analytics-api
Securely deploying AI agents - Claude API Docs https://platform.claude.com/docs/en/agent-sdk/secure-deployment
Batch processing - Claude API Docs https://platform.claude.com/docs/en/build-with-claude/batch-processing
Token counting - Claude API Docs https://platform.claude.com/docs/en/build-with-claude/token-counting
Configure permissions - Claude API Docs https://platform.claude.com/docs/en/agent-sdk/permissions
OWASP Top 10 for Large Language Model Applications https://owasp.org/www-project-top-10-for-large-language-model-applications/
Artificial Intelligence Risk Management Framework (AI RMF 1.0) https://nvlpubs.nist.gov/nistpubs/ai/nist.ai.100-1.pdf
Mitigating the risk of prompt injections in browser use https://anthropic.com/research/prompt-injection-defenses
OpenAI co-founds the Agentic AI Foundation under the Linux Foundation https://openai.com/index/agentic-foundation/
Connect to external tools with MCP - Claude API Docs https://platform.claude.com/docs/en/agent-sdk/mcp
How to implement tool use - Claude API Docs https://platform.claude.com/docs/en/agents-and-tools/tool-use/implement-tool-use
Semantic conventions for generative client AI spans https://opentelemetry.io/docs/specs/semconv/gen-ai/gen-ai-spans/

Kyle Pericak