
Kyle Pericak



Claude Deep Research: Design Docs for AI Coding Agents

Last verified: 2026-03-16

Design docs rewritten for AI coding agents

The best technical design documents are evolving from human approval artifacts into machine-executable plans. As AI coding agents like Claude Code, Cursor, and Kiro become primary implementers, the structure, granularity, and maintenance of design docs directly determine implementation quality. A new paradigm called spec-driven development (SDD) has emerged across GitHub Spec Kit, AWS Kiro, and practitioner workflows, converging on a shared pattern: requirements → design → dependency-ordered tasks → agent execution. The evidence base is practitioner-heavy and tool-documentation-rich, but thin on controlled studies — most guidance derives from production experience at companies shipping with AI agents daily, not academic research.

This report synthesizes published templates from Google, Uber, GitLab, and others; tool documentation from Spec Kit, Kiro, and Copilot Workspace; and practitioner accounts from engineering blogs and community discussions across all six requested dimensions.


What best-in-class templates actually contain

Design doc structures across top companies share a common skeleton but diverge in emphasis. Google's template — the most widely cited — centers on five sections: Context and Scope, Goals and Non-goals, The Actual Design, Alternatives Considered (which Google calls "one of the most important sections"), and Cross-cutting Concerns. Google's docs run 10–20 pages for large projects, with "mini design docs" of 1–3 pages for incremental work. The cultural marker is distinctive: "Where is the design doc?" is the first question engineers ask about unfamiliar systems.

Uber evolved through three generations — DUCK (early), RFC (growth-stage), and ERD (current) — driven by noise problems. Hundreds of RFCs weekly forced a tiered criticality system: the most critical proposals get formal weekly reviews by senior engineers, while less critical ones have lighter processes. Uber's templates are domain-specific: services specs include SLA sections, mobile specs include third-party library considerations, payments specs include regional legal checkpoints.

GitLab stores design docs as version-controlled files updated via merge requests — the only major company treating design docs as Git-native artifacts. Their template includes structured metadata (status, authors, coach, DRIs, owning stage) followed by Summary, Motivation, Goals, Non-Goals, Proposal, Design and Implementation Details, and Alternative Solutions. Stripe takes the opposite approach: no standard template at all, instead using "sample documents" (real completed docs) as exemplars. Stripe's writing culture runs from CEO to IC — Patrick Collison structures emails like research papers — and documentation quality is embedded in career ladders and performance reviews.

The community has produced several widely-cited templates. Lambros Petrou's RFC template is comprehensive (13 sections from Problem & Context through FAQ to Appendix). The MADR (Markdown Architectural Decision Records) format adds YAML frontmatter for machine-readable metadata. The Rust RFC template (Summary, Motivation, Detailed design, Drawbacks, Alternatives, Unresolved questions) has inspired most open-source RFC processes.
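To make the MADR pattern concrete, here is a minimal record in that style — YAML frontmatter for machine-readable metadata, followed by the explicit alternatives sections. The decision, option names, and frontmatter values are illustrative, not taken from any real project:

```markdown
---
status: accepted
date: 2026-03-01
decision-makers: [platform-team]
---
# Use PostgreSQL for the background job queue

## Context and Problem Statement
We need durable task storage without introducing new infrastructure.

## Considered Options
* PostgreSQL queue using SKIP LOCKED
* Redis streams
* Managed message queue (SQS)

## Decision Outcome
Chosen option: PostgreSQL, because it reuses existing operational expertise
and keeps the deployment footprint unchanged.

## Pros and Cons of the Options
* PostgreSQL — Good: no new infrastructure. Bad: lower throughput ceiling.
* Redis streams — Good: high throughput. Bad: a new persistence story to operate.
* SQS — Good: fully managed. Bad: couples the system to one cloud provider.
```

The frontmatter is what makes the record queryable by tooling (or by an agent) without parsing the prose.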

How these differ from PRDs: The PRD captures what to build and why (owned by product managers, written in business language, covering user stories, success metrics, and constraints). The design doc captures how to build it and what trade-offs were made (owned by engineers, written in technical language, covering architecture, APIs, data models, and alternatives). The minimum viable design doc needs just five sections: Context, Goals/Non-goals, Proposed Design with one diagram, Key Trade-offs, and Open Questions. A comprehensive RFC adds detailed technical design, cross-cutting concerns, timeline, dependencies, testing strategy, and FAQ.
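A skeleton for that minimum viable design doc might look like the following sketch (section prompts are illustrative):

```markdown
# <Feature> Design Doc

## Context
What exists today, and why a change is needed now.

## Goals / Non-goals
- Goal: ...
- Non-goal: ... (explicitly out of scope)

## Proposed Design
One diagram plus a few paragraphs on components, data flow, and interfaces.

## Key Trade-offs
What was given up, and which alternatives were rejected and why.

## Open Questions
Unresolved decisions that reviewers should weigh in on.
```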


How AI tools are automating the PRD-to-design-doc handoff

Three tools now implement the full requirements-to-implementation pipeline, and they've converged on remarkably similar architectures.

GitHub Spec Kit (open-source, September 2025) implements a strictly linear four-phase workflow via slash commands. A constitution.md captures non-negotiable project principles. /specify generates a detailed feature spec from natural-language goals. /plan produces a technical implementation plan. /tasks breaks the plan into dependency-ordered, testable work items marked with [P] for parallelizable tasks. /analyze validates consistency across all artifacts. The key design principle is that specs are decoupled from code — enabling multi-variant implementations from the same spec. Practitioner reception is mixed: some report 5x productivity gains, while others found agents claiming completion when "most of the functionality is missing and there are zero tests." Critics worry it "brings back the worst parts of Waterfall under a shinier name."

AWS Kiro (public preview July 2025, powered by Claude Sonnet) implements a three-phase spec workflow within a full IDE. Requirements are generated as user stories with EARS (Easy Approach to Requirements Syntax) notation acceptance criteria. Design docs include data flow diagrams, TypeScript interfaces, database schemas, and Mermaid diagrams — critically, these are generated after analyzing the existing codebase. Tasks are sequenced by dependencies with traceability back to requirement numbers. Kiro's distinctive feature is bidirectional sync: developers can update specs from code changes or update code from spec changes. The "Run All Tasks" feature uses property-based tests and subagents to validate each step. Practitioner consensus: "Cursor wins for fast iteration; Kiro wins for complex features needing upfront design clarity."

GitHub Copilot Workspace (technical preview, April 2024) starts from GitHub Issues and generates a current-state specification, proposed-state specification, file-level change plan, and code diffs. Every layer is editable — change the spec and the plan regenerates; change the plan and the code regenerates. This steerability pattern is now being folded into the main Copilot product as a "Plan agent" in VS Code.

BMAD Method takes a different approach: agent personas simulating an entire agile team (Analyst, PM, Architect, Scrum Master, Developer, QA) passing structured artifacts between roles. Documentation becomes the source of truth — "code becomes merely a downstream derivative."

All four tools converge on the same insight: the transition from PRD to implementation requires an intermediate representation — structured enough for agents to execute, flexible enough for humans to steer. What carries forward from the PRD: user needs, success criteria, non-functional requirements, scope boundaries. What gets transformed: "users should be able to filter by dietary preferences" becomes GET /api/v1/meals?diet=vegan&allergens=nuts with <200ms latency. What gets left behind: market analysis, competitive landscape, go-to-market strategy, stakeholder politics.


Structuring design docs so agents can actually use them

The most actionable finding in this research concerns how design docs should be formatted for AI agent consumption. The evidence converges on several principles with varying confidence levels.

Keep instruction files radically short. Research cited by HumanLayer (a YC company) shows frontier thinking LLMs can reliably follow ~150–200 instructions. Claude Code's system prompt consumes ~50, leaving ~100–150 for CLAUDE.md, rules files, and user messages. As instruction count increases, adherence degrades uniformly — not just for later instructions, but for all of them. HumanLayer's own production CLAUDE.md is under 60 lines. The general consensus: under 300 lines, shorter is better.

Use progressive disclosure. Rather than stuffing everything into a root CLAUDE.md, create an agent_docs/ directory with separate files (architecture.md, conventions.md, testing.md) and instruct the agent to read relevant ones before starting specific work. This mirrors how Claude Code's own skills architecture works. Builder.io's CTO abandoned Cursor for Claude Code partly because this progressive disclosure pattern handled an 18,000-line React component that "no AI agent has ever successfully updated except Claude Code."

Markdown is the universal format; YAML is for metadata only. Every major tool — Claude Code, Cursor, Kiro, Copilot, Codex — uses plain Markdown for agent instructions. YAML appears exclusively in frontmatter for scoping metadata (glob patterns, status flags, dependency lists). Practitioners report that structured Markdown with clear headings creates natural "attention cues" for LLMs. Anthropic's own guidance recommends organizing prompts into distinct sections and occasionally adding emphasis with "IMPORTANT" or "YOU MUST" to improve adherence.

The spec-first workflow is the emerging standard. Multiple independent sources converge: use Plan Mode to research and generate a spec, save it as SPEC.md, have the human review and annotate it, then start a fresh session for implementation with clean context. Boris Tane's workflow (9 months of Claude Code production use) makes this concrete: Research → plan.md → 1–6 annotation cycles where the human marks up the plan → todo list → implementation. His key insight: "I want implementation to be boring. The creative work happened in the annotation cycles."

Include executable verification commands so agents can self-check. Builder.io recommends file-scoped commands over project-wide ones: npm run tsc --noEmit path/to/file.tsx rather than npm run build, since "agents frequently execute full project-wide build commands unnecessarily."

Specify boundaries explicitly. Practitioners consistently report that telling agents what not to touch is as important as telling them what to do. But negative-only instructions fail — "Never use --foo-bar" causes agents to get stuck, while "Never use --foo-bar; prefer --baz instead" works. One practitioner with a 300k-line legacy codebase reported "amazing success" with an agents.md file that "prevents hallucinations and coerces AI to use existing files and APIs instead of inventing them."

⚠️ Evidence flag: A March 2026 study found that LLM-generated AGENTS.md files can lower performance, while human-written ones improve it. Auto-generated config files (/init outputs) should be treated as starting points, not finished products.


Why trade-off documentation is now a technical requirement, not a nice-to-have

The "amnesiac agent" problem is the strongest evidence that trade-off documentation has shifted from a process nicety to a functional requirement. Mahdi Yusuf's January 2026 account is the most cited practitioner report: "Your coding agent is not malicious. It's amnesiac. Every session starts fresh. The agent sees your code but not the eighteen months of decisions that shaped it." His concrete example: an agent consolidated three microservices into one because the separation "added unnecessary complexity" — but that separation existed because the services scaled differently under load.

The root cause is well-understood. LLMs have working memory, not long-term memory. Academic research confirms that when presented with missing critical information, LLMs rarely pause to request it — they force solutions by making assumptions. This means agents will independently re-derive and re-propose approaches that the team already considered and rejected, unless those rejections are documented in a format the agent can read.

Architecture Decision Records (ADRs) are the most structured format for capturing trade-offs. Michael Nygard's original 2011 template (Context, Decision, Status, Consequences) deliberately omitted an "Alternatives Considered" section — a gap that MADR 4.0 fills with explicit "Considered Options" and per-option pros/cons sections with YAML frontmatter for machine-readable metadata. Google's design docs treat the Alternatives Considered section as "one of the most important" — if a design doc doesn't discuss trade-offs, Google engineers argue the doc shouldn't exist.

The emerging best practice is to treat ADRs as agent memory, not human documentation. Yusuf's protocol: add machine-readable metadata (status, subsystem, supersedes, related) to each ADR; create an agent-navigable graph so an agent touching the auth subsystem can walk back five decisions; require agents to read relevant ADRs before starting work and draft new ADRs when proposing significant changes. A purpose-built GitHub repository (tom-gerken/ADR) implements a "Decision Hierarchy" with ADRs for why, a Memory Bank for current state, and Code for how — with an explicit reading order for agents starting fresh sessions.

Claude Code users are already requesting better ADR support. GitHub issue #15222 proposes DECISIONS.md with [REJECTED] markers. Issue #13853 from a platform team requests automatic ADR loading because "CLAUDE.md becomes unwieldy with multiple architectural decisions."

⚠️ Evidence flag: There are no controlled studies comparing AI agent behavior with versus without ADRs in context. Claims about ADRs as "agent memory" are practitioner hypotheses, not empirically validated. The entire practice is emerging (2025–2026) with no longitudinal data.


The two-layer architecture and optimal task granularity

McKinsey's QuantumBlack published the most structured framework for AI-assisted implementation phasing in February 2026. Their two-layer architecture separates orchestration from execution:

  • Layer 1 (Orchestration): A deterministic, rule-based workflow engine that enforces phase transitions, manages dependencies, tracks artifact state, and triggers agents. Agents never decide what phase they're in or what comes next. McKinsey's team found that "early on, we experimented with letting agents orchestrate themselves... On larger codebases, agents routinely skipped steps, created circular dependencies, or got stuck in analysis loops."

  • Layer 2 (Execution): Specialized agents operating within bounded instructions, where every output goes through two-stage evaluation: deterministic checks (linters, structural validation) followed by a critic agent that validates against definitions of done. Agents get 3–5 iteration attempts before failing and rolling back for human intervention.

The workflow progresses through four sequential phases on a single feature branch: Requirements (structured requirement.md with frontmatter) → Architecture (technical decisions with rationale) → Technical Tasks (TASK-001, TASK-002 with parent requirement, files to modify, acceptance criteria, dependencies) → Implementation (coding with automated evals). The eval system checks for a valid DAG with no circular dependencies.

McKinsey's framing of the economics is provocative: "Waterfall got a bad reputation not because sequential phases are inherently wrong, but because the economics didn't work... Agents change the economics. When an agent can execute the full cycle in hours, not months, you can afford the structure."

On task granularity, practitioner consensus converges on a sweet spot despite no rigorous empirical data. Boris Tane's probability analysis illustrates why broad tasks fail: at 80% accuracy per decision across 20 decisions, there's only a ~1% chance of an all-correct implementation (0.8^20). SWE-bench Pro data provides the closest empirical signal: success rates drop from 70% on simple tasks to 23% on enterprise-complexity multi-file edits. Tasks that are too narrow create coordination overhead and strip agents of holistic insight — Amazon Science notes that task decomposition "can come at the cost of the novelty and creativity that larger models display."

The practical sweet spot, per multiple independent sources: a single coherent change that can be tested in isolation — typically a model plus its migration, a service plus its tests, or an API endpoint plus validation logic. SFEIR Institute quantifies this as "atomic subtasks of 5–10 minutes." Addy Osmani recommends "implement one function, fix one bug, add one feature at a time." Each task should specify the parent requirement, files to create or modify, acceptance criteria, and dependencies on other tasks.

Dependencies are represented in three ways across tools: linear sequences (simplest), directed acyclic graphs with parallel markers (Spec Kit's [P] notation), and hierarchical nesting (requirements containing tasks containing sub-tasks). McKinsey's task templates include explicit dependencies: [TASK-001, TASK-002] fields. The AGENTS.md cross-platform standard (launched late 2025 by Google, OpenAI, Factory, Sourcegraph, and Cursor) provides a vendor-neutral layer for these conventions.
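The DAG validation that McKinsey's eval system performs can be sketched with the Python standard library — the task IDs below are illustrative, and `graphlib.TopologicalSorter` stands in for whatever the real orchestration layer uses:

```python
# Validate that task dependencies form a DAG (no circular dependencies)
# and derive one legal execution order. Each entry maps a task to the
# tasks it depends on.
from graphlib import TopologicalSorter, CycleError

tasks = {
    "TASK-001": [],                        # no dependencies
    "TASK-002": ["TASK-001"],
    "TASK-003": ["TASK-001"],              # independent of TASK-002: parallelizable
    "TASK-004": ["TASK-002", "TASK-003"],
}

try:
    order = list(TopologicalSorter(tasks).static_order())
    print("valid DAG, execution order:", order)
except CycleError as err:
    print("rejected: circular dependency among", err.args[1])
```

A deterministic check like this runs before any agent is triggered, which is exactly the orchestration-layer role described above: agents execute tasks, but never decide their order.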


How AI changes — and amplifies — design doc failure modes

Traditional design doc failures are well-documented: staleness (documents abandoned by the second sprint), bikeshedding (disproportionate review time on trivial details), bureaucratic theater (docs written for approval rather than guidance), and over-specification (writing "code in English" before understanding query patterns). Lucas F. Costa's February 2026 critique captures the core failure: "Ask yourself how many design docs at your company get updated after implementation starts. If the doc were genuinely useful as a design tool, you'd update it as you learn. Nobody does."

AI introduces four specific new failure modes, each with moderate-to-strong practitioner evidence:

Context drift is the most widely reported. Agents start sessions strong, but as the context window fills, the original plan "falls out of the LLM's brain." Roger Wong documents the pattern: "You start a session and output is sharp. Forty minutes in, it's forgotten your constraints and is hallucinating component names." The mitigation — moving state from ephemeral chat into persistent Markdown files that agents re-read — is the core insight behind the entire spec-driven development movement.

Architecture drift at scale is the second failure mode. Henrik Jernevad explains the mechanism: "AI agents optimize for locally plausible tokens, not global consistency. Without tight constraints, large steps force the model to invent structure from generic patterns rather than project-specific conventions." The 2025 DORA report confirms the meta-pattern: "AI's primary role is that of an amplifier. It magnifies the strengths of high-performing organizations and the dysfunctions of struggling ones." Code quality issues compound when AI generates more code faster — cognitive complexity increases 39% in agent-assisted repos per one study, with change failure rates up 30%.

Spec misinterpretation is documented directly by Birgitta Böckeler (Martin Fowler's team) testing Spec Kit: "the agent ignored the notes that these were descriptions of existing classes, it just took them as a new specification and generated them all over again, creating duplicates." Chris Force reports the inverse: structural errors like "making a modification to a signature for one component without realizing that other components were also dependent."

The staleness acceleration paradox is the most nuanced finding. AI simultaneously makes documentation easier to create (IBM reports a 59% reduction in documentation time) and harder to keep current (more code generated faster means faster drift). Böckeler notes that Spec Kit's spec maintenance strategy over time "is left vague or totally open." The net assessment: teams with good documentation hygiene get better tools; teams without it accumulate documentation debt faster than ever.

⚠️ Evidence flag: The specific failure mode of "an AI agent faithfully following a stale document that humans have mentally superseded" is a logical prediction that practitioners haven't yet widely reported as distinct, likely because most AI coding workflows are still short-lived enough that this hasn't surfaced at scale. Long-term spec maintenance under AI-assisted development has no longitudinal data.


Conclusion

The field is converging fast on structure but remains empirically thin. Three findings stand out as high-confidence and actionable: first, design docs for AI agents need explicit trade-off documentation because agents will re-propose rejected approaches without it — the "amnesiac agent" problem is real and well-documented. Second, a deterministic orchestration layer must govern phase transitions rather than letting agents self-sequence — McKinsey, Spec Kit, and Kiro all learned this independently. Third, task granularity has a practical sweet spot at the single-testable-change level, and both broader and narrower decompositions degrade outcomes.

The most surprising insight is that spec-driven development may rehabilitate sequential phases that agile spent two decades dismantling. When agents execute the full requirements-to-implementation cycle in hours rather than months, the economics of upfront specification change fundamentally. Whether this represents genuine progress or a repackaged waterfall with better marketing remains the field's most contested open question — and the one with the least data to resolve it.

Related: wiki/research/design-docs-for-agents · wiki/research/ai-augmented-prds/claude · wiki/design-docs