
Kyle Pericak

"It works in my environment"


Claude Deep Research: AI-Augmented PRDs and the AI-Native SDLC

Last verified: 2026-03-16

AI-augmented PRDs and the AI-native SDLC

The specification is becoming the product's most important artifact. As AI coding agents move from autocomplete to autonomous implementation, the quality of what you write before code matters more than ever. Across modern product organizations, a clear pattern has emerged: teams that invest in structured, machine-readable specifications with explicit acceptance criteria and bounded scope get dramatically better results from AI agents. Teams that skip this step—"vibe coding"—hit a wall around 15–20 components. This report documents the state of practice across six areas, drawing primarily from published templates, engineering blogs, tool documentation, and academic papers, flagging where evidence is thin.


1. What best-in-class PRD templates actually look like

Research across 12 published templates from Figma, Intercom, Asana, Basecamp, Amazon, Atlassian, Uber, Miro, Meta, and prominent product leaders (Lenny Rachitsky, Kevin Yien, Shreyas Doshi) reveals a striking convergence on structure. The universal sections, ranked by frequency of appearance across all templates: Problem Statement (~100%), Goals/Success Metrics (~95%), Non-Goals/Out-of-Scope (~90%), Solution/Approach (~85%), User Stories or Use Cases (~75%), Open Questions/Risks (~70%), and Timeline/Milestones (~65%).

Three architectural patterns define modern templates. First, every elite template separates problem understanding from solution design with an explicit gate between them. Figma's PRD (published by VP Product Yuhki Yamashita on Coda) uses two alignment phases—"Problem Alignment" then "Solution Alignment"—before a "Launch Readiness" phase. Kevin Yien's template (ex-Square, now Mutiny) enforces this with mandatory stakeholder sign-offs at each stage. Asana's spec template includes a literal checkpoint: "Review Project Brief before continuing." Second, explicit non-goals appear in nearly every template. Kevin Yien's philosophy: "Think of this like drawing the perimeter of the solution space—draw the boundaries so the team can focus on how to fill it in." Third, modern PRDs are living, collaborative documents embedded in tools like Notion, Coda, and FigJam—not static Word docs. Figma's PRD embeds live Figma files that auto-update as designs evolve.

The range from minimum viable to comprehensive PRD is well-documented. Intercom's "Intermission" template is brutally constrained to one A4 page and explicitly states "Do not add the solution here." Lenny Rachitsky's widely adopted 1-pager has six sections (Problem, Solution, Success Metrics, Scope, Timeline, Open Questions). At the other extreme, Uber's template includes impact matrices across multiple products and legal/privacy approval fields. Carlin Yuen (Meta) targets 6–8 pages and distinguishes between "Product PRDs" (opportunity-level) and "Feature PRDs" (detailed requirements). Basecamp's Shape Up "Pitch" introduces a unique concept—appetite (a time budget set before scoping): "We're giving this 2 weeks, not 2 months."

The contrast with waterfall-era PRDs is stark. Traditional PRDs spanned 20–30+ pages, were static, prescriptive, and treated as contracts. Modern PRDs are 1–8 pages, outcome-focused, and iteratively developed. As Shreyas Doshi puts it: "Great PMs iteratively write their PRDs so engineering and design tasks are rarely blocked on them." Amazon's PR/FAQ format inverts the process entirely—you write the press release first, then work backward to requirements, a method used to launch AWS, Kindle, and Prime Video.

Evidence note: Figma, Lenny Rachitsky, Kevin Yien, Intercom, Asana, and Atlassian templates are directly accessible via published URLs (Google Docs, Coda, Notion, Confluence). Stripe has no published template—only quotes about their writing culture. Linear has no published PRD template; their approach is described as organic with only one PM for 50+ employees. GitLab uses structured issue templates rather than formal PRDs, documented in their public handbook.


2. How teams are actually using AI to write PRDs

Five distinct workflow patterns have emerged for AI-augmented PRD creation, each serving different team sizes and contexts. An estimated 65% of product professionals now integrate AI into their workflows, with reported time savings of 6–9 hours per week on PRD work, though these figures come from vendor-adjacent surveys and should be treated with appropriate skepticism.

Pattern 1: AI-as-First-Draft is the most common approach across all team sizes. A PM drafts bullet points about the problem, goal, and solution, feeds them with a template into ChatGPT or Claude, receives a structured draft that's ~85% complete, then edits for 10–30 minutes. ChatPRD (claiming 100,000+ PM users) and Notion AI are the primary dedicated tools. This eliminates the blank-page problem and turns a 3-hour task into 30 minutes. The risk: AI produces "overly long PRDs that said nothing" (Aakash Gupta), with generic metrics like "student engagement rate" instead of product-specific leading indicators.

Pattern 2: Claude Projects + PRD Templates is favored by technical PMs at startups. You create a dedicated Claude Project with custom system instructions and upload 2–3 example PRDs as "Project Knowledge." For each new feature, rough notes produce a PRD matching the team's voice and style. Some teams use sub-agent personas (engineer reviewer, executive reviewer, user-researcher reviewer) to stress-test drafts from multiple angles. Total time: ~20–30 minutes. Context window limits and stale project knowledge are the primary failure modes.

Pattern 3: PRD-Driven AI Coding is the bridge between product and engineering. Teams save a prd.md file in the project repository with problem statements, constraints, acceptance criteria, and allowed file-change lists. AI coding agents (Claude Code, Cursor) implement one task at a time against this specification. A task generator breaks the PRD into numbered sub-tasks. The PRD serves as "the anchor" preventing context drift. ERNI's real-world test with 43 PRDs for a CRM project revealed the limits: Claude Code implemented integrations without checking backend APIs, missed required fields, and updated tests on incorrect assumptions. For small, well-framed tasks, this pattern is excellent. For complex systems, human oversight remains essential.
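The task-generator step above reduces to splitting a prd.md into numbered sub-tasks an agent can be pointed at one at a time. A minimal sketch, assuming the PRD keeps its tasks as markdown bullets under a `## Tasks` heading — the file layout and heading names are illustrative, not a fixed convention:

```python
import re

def generate_subtasks(prd_text: str) -> list[str]:
    """Extract the '## Tasks' section of a prd.md and number its bullets.

    Assumes tasks are markdown bullets ('- ...') under a '## Tasks'
    heading; both conventions are illustrative, not a standard.
    """
    match = re.search(r"## Tasks\n(.*?)(?=\n## |\Z)", prd_text, re.DOTALL)
    if not match:
        return []
    bullets = re.findall(r"^- (.+)$", match.group(1), re.MULTILINE)
    # Number sub-tasks so an agent can be handed exactly one at a time.
    return [f"{i}. {task}" for i, task in enumerate(bullets, start=1)]

prd = """# Export to CSV

## Problem
Users cannot export their reports.

## Tasks
- Add an /export endpoint returning text/csv
- Add a download button to the report page
- Cover the endpoint with an integration test
"""

for line in generate_subtasks(prd):
    print(line)
```

Keeping the PRD in the repository means this numbering is stable across sessions — the "anchor" role the pattern describes.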

Pattern 4: Spec-Driven Development platforms represent the most structured approach. Three major implementations have launched. GitHub Spec Kit (open-sourced September 2025) provides a six-phase workflow: Constitution → Specify → Plan → Tasks → Implement → Analyze. Scott Logic's CTO tested it and found "a sea of markdown documents, long agent run-times and unexpected friction," calling it more a set of useful principles than a practical end-to-end process. AWS Kiro generates requirements in EARS notation with Given/When/Then acceptance criteria, creates design documents, and sequences implementation tasks. One developer started with ~10 requirements and ended with 50+ well-defined requirements after Kiro identified contradictions between user stories. GitHub Copilot Workspace (technical preview, later rebuilt as Copilot Coding Agent) starts from a GitHub issue and generates a spec, plan, and implementation diff—every step editable by the human.

Pattern 5: Lean PRD for Vibe Coders targets solo developers and indie hackers. The formula: Problem → User & Job → Non-goals → Success metric (7 days) → Scope v1 → Risks → Kill/iterate decision rule. The entire PRD fits in one page, saved as PRD.md in the project root. The key insight from practitioners: "Traditional PRDs skip explicit behavior descriptions, example I/O, error states, and MVP constraints because human developers ask clarifying questions. AI tools don't ask—they assume."
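Because AI tools assume rather than ask, a missing section in a lean PRD is silently filled in by the agent. One way to guard against that is a trivial section-presence check run before handing the file to an agent — a sketch, using the formula's own headings (nothing here is a formal standard):

```python
# Section names taken from the lean formula above; the template and
# checker are illustrative, not an established tool.
LEAN_PRD_SECTIONS = [
    "Problem", "User & Job", "Non-goals", "Success metric (7 days)",
    "Scope v1", "Risks", "Kill/iterate decision rule",
]

LEAN_PRD_TEMPLATE = "\n\n".join(f"## {s}\n<fill in>" for s in LEAN_PRD_SECTIONS)

def missing_sections(prd_text: str) -> list[str]:
    """Return lean-formula sections absent from a PRD.md draft."""
    return [s for s in LEAN_PRD_SECTIONS if f"## {s}" not in prd_text]

# A draft that forgot its non-goals -- the section agents most need,
# since they assume scope instead of asking about it.
draft = "## Problem\nSignup drop-off\n\n## Scope v1\nEmail-only signup\n"
print(missing_sections(draft))
```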

Where human judgment is irreplaceable: defining the hypothesis and strategic fit, determining rollout strategy, setting experiment passing metrics, defining non-goals and acceptable side effects, and validating against actual user research. As Carl Vellotti puts it: "AI is most valuable when it helps you think better, not when it does all the thinking for you."

Evidence note: GitHub Spec Kit and Kiro documentation are primary sources. ChatPRD claims are vendor-sourced. The "6–9 hours saved" figure traces to surveys from Atlassian and Productboard cited by Chisel, without direct links to the original surveys. ERNI's 43-PRD case study is a genuine engineering blog post. Conference talk transcripts on this topic were limited in search results.


3. Acceptance criteria formats and what works for AI agents

Three established formats compete, and a hybrid is emerging as the clear winner for AI-agent workflows.

Given/When/Then (BDD Gherkin) forces structured thinking about preconditions, triggers, and outcomes. It is directly translatable to automated tests via Cucumber, SpecFlow, and Behave, creating "executable specifications." A 2025 academic study (SciTePress) tested using Gherkin-formatted acceptance criteria as standardized prompts for LLMs, finding that ChatGPT delivered the strongest coverage and accuracy in auto-generated tests. AWS Kiro specifically generates acceptance criteria in this format. The downside: it can be verbose for simple features, and Thoughtworks flags a common anti-pattern of confusing Given (precondition) with When (trigger) clauses.
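The machine-parseability claim is easy to demonstrate: a Gherkin scenario decomposes mechanically into keyword/step pairs. The scenario below is invented and the parser is a toy sketch, not Cucumber or Behave — but it shows why the format feeds cleanly into test generation:

```python
import re

scenario = """\
Scenario: Expired card is declined
  Given a customer with a card expiring 01/2020
  When they submit a payment of $10
  Then the payment is declined
  And the decline reason is "card_expired"
"""

def parse_steps(text: str) -> list[tuple[str, str]]:
    """Split a Gherkin scenario into (keyword, step) pairs.

    'And'/'But' inherit the keyword of the preceding step, as in
    Gherkin itself.
    """
    steps: list[tuple[str, str]] = []
    current = None
    for line in text.splitlines():
        m = re.match(r"\s*(Given|When|Then|And|But)\s+(.*)", line)
        if not m:
            continue
        keyword, body = m.groups()
        if keyword in ("And", "But"):
            keyword = current or keyword
        current = keyword
        steps.append((keyword, body))
    return steps

for keyword, body in parse_steps(scenario):
    print(f"{keyword:5} -> {body}")
```

Each (keyword, step) pair maps to a precondition setup, a trigger, or an assertion — which is also why confusing Given with When (the Thoughtworks anti-pattern) corrupts the generated test, not just the prose.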

Checklist/bullet-point format offers simplicity and low barrier to entry. It works well in Jira, Linear, and Notion. AI coding agents respond well to clear bullet points starting with verbs. ChatPRD's Cursor integration guide recommends maintaining this style when it's the team's convention. The weakness: lacking structured preconditions, agents may make incorrect assumptions about context.

Outcome-based format defines success in terms of measurable results and verification criteria: "Must pass all cases in conformance/api-tests.yaml." Addy Osmani (Google engineering lead) strongly advocates this for AI agents: "In the spec's Success Criteria, you might say 'these sample inputs should produce these outputs' or 'the following unit tests should pass.'" Simon Willison advocates YAML-based conformance suites as acceptance criteria contracts. The outcome focus aligns naturally with how agents verify their work in code→test→fix loops.
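A conformance suite of the kind Willison describes is just a table of input/expected pairs that the agent's code must pass. A sketch under stated assumptions: the cases would live in a YAML file like the conformance/api-tests.yaml example in practice but are inlined here to stay self-contained, and `slugify` is a hypothetical function standing in for whatever the spec targets:

```python
import re

# Inlined stand-in for a YAML conformance file.
CONFORMANCE_CASES = [
    {"input": "Hello, World!", "expected": "hello-world"},
    {"input": "  spaces  everywhere  ", "expected": "spaces-everywhere"},
    {"input": "already-a-slug", "expected": "already-a-slug"},
]

def slugify(text: str) -> str:
    """Hypothetical function under test."""
    return re.sub(r"[^a-z0-9]+", "-", text.lower()).strip("-")

def run_suite(fn, cases) -> list[str]:
    """One pass/fail line per case -- the agent's exit criterion is an
    all-pass run, not a human judgment call."""
    results = []
    for case in cases:
        got = fn(case["input"])
        verdict = "PASS" if got == case["expected"] else f"FAIL (got {got!r})"
        results.append(f"{verdict}: {case['input']!r}")
    return results

for line in run_suite(slugify, CONFORMANCE_CASES):
    print(line)
```

The design point is that the suite, not the implementation, is the contract: an agent can rewrite `slugify` freely as long as every case stays green.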

The emerging hybrid format combines elements from all three and is the converging best practice. The recommended structure for AI coding agents, based on convergent evidence from Osmani, GitHub's analysis of 2,500+ agent configuration files, and multiple practitioner accounts:

  • Functional requirements as testable checkbox statements with specific conditions and outcomes
  • Non-functional requirements with measurable thresholds (performance, security, accessibility)
  • Verification criteria pointing to specific test files, conformance suites, or expected input/output pairs
  • Boundaries specifying what NOT to do—files not to modify, dependencies not to add, scope not to expand

GitHub's analysis found that the most effective agent specs cover six areas: Commands, Testing, Project Structure, Code Style, Git Workflow, and Boundaries. The single most common helpful constraint was "never commit secrets." Martin Fowler's Thoughtworks team tested spec-driven tools and found agents still frequently ignored or over-interpreted spec instructions—one agent "took descriptions of existing classes as a new specification and generated them all over again, creating duplicates."

No formal standard exists for machine-readable acceptance criteria. De facto patterns are converging around Markdown as the lingua franca (every SDD tool uses it), Gherkin as the most established machine-parseable format, YAML for language-independent conformance suites, and tool-specific configuration formats (.cursor/rules/*.mdc, CLAUDE.md, agents.md, Kiro steering files). The term "specification engineering" is replacing "prompt engineering" in practitioner discourse.

Evidence note: Osmani's blog post and O'Reilly article are strong primary sources from a Google engineering lead. The SciTePress academic study provides empirical evidence on Gherkin + LLMs. Böckeler's analysis on martinfowler.com is a balanced, critical assessment. Long-term effectiveness comparisons between formats specifically with AI agents are largely absent from the literature—most evidence is from early adopters and tool documentation.


4. Why PRDs fail and how AI changes the equation

Requirements failures account for 37% of enterprise software project failures (PMI), with 68% of requirement defects discovered only at later development stages where correction costs 5–10x more (Carnegie Mellon SEI). These statistics predate AI augmentation. The question is whether AI helps or hurts.

The traditional failure modes are well-documented:

  • Vagueness and ambiguity — PRDs filled with "seamless," "intuitive," and "user-friendly" without measurable criteria.
  • Over-prescription — dictating the "how" instead of the "what," removing engineering creativity.
  • Missing stakeholder alignment — PRDs written in isolation without proper discovery. Marty Cagan's blunt assessment: "If you think you can get what you need by having product managers document PRDs instead of product discovery, then you may as well just give up on innovation."
  • Scope creep — PRDs that bloat with "just one more" feature until they're unread 20-page documents.
  • Stale documents — PRDs that fall out of sync the moment development begins.
  • Untestable requirements — success criteria so vague that engineers can't determine when a feature is "done."
  • Writing for the wrong audience — long strategic preambles for executives when engineers need actionable requirements.

AI augmentation introduces four new failure modes. The most dangerous is the illusion of completeness. One practitioner team reported: "The first draft was so polished it felt done. It was dangerously easy to miss the lack of deep, original thought." A comparative test of five AI tools (ChatGPT, Claude, Gemini, Grok, ChatPRD) found that AI-generated target user descriptions "could apply to any edtech product" and success metrics were "the obvious ones without any interesting thinking about leading indicators." The problem with generic AI output "isn't that it's wrong—it's that it's invisible."

Hallucinated requirements represent a second AI-specific risk. LLMs can fabricate constraints, cite non-existent standards, or specify integration points that don't exist. OpenAI's own research confirms: "Language models hallucinate because standard training and evaluation procedures reward guessing over acknowledging uncertainty." Context loss and architectural drift is the core failure mode of vibe coding. Zarar (payments engineering) documented: "When you're five prompts deep into a feature, the AI has no persistent memory of the architectural decisions you made in prompt one... Reconciliation logic scattered across three different modules because each prompt generated its own approach." Finally, non-functional requirements blindspot—AI-generated specs systematically miss scalability, security, accessibility, and regulatory compliance unless explicitly prompted.

The net impact matrix is sobering. AI significantly worsens the tendency to skip discovery (making polished PRDs trivially easy without user research), produces generic rather than specific language by default, and creates false confidence through professional formatting. AI modestly helps with generating acceptance criteria templates, enforcing prioritization frameworks, and keeping documents updated. The critical insight from Saeed Khan: "AI is NOT going to fix Product Management—the underlying problems need to be fixed first; AI amplifies whatever process you feed it."

Evidence note: Requirements failure statistics (PMI, SEI, Standish Group) are well-established across decades of studies. Zarar's payments engineering account and the Fireside PM tool comparison are genuine practitioner accounts. Few formal post-mortems specifically attribute project failure to AI-generated PRDs—this is still an emerging area. The vibe coding failure literature is growing rapidly but remains predominantly anecdotal.


5. The AI-native SDLC is taking shape around spec-driven development

Multiple documented frameworks now describe end-to-end AI-native development lifecycles. The most significant finding: deterministic orchestration with bounded agent execution has emerged as the dominant architectural pattern, replacing early experiments with fully autonomous agent self-orchestration.

McKinsey/QuantumBlack published the most rigorously documented pattern (February 2026). Their two-layer architecture separates a deterministic orchestration layer (workflow engine enforcing phase transitions, managing dependencies, tracking artifact state) from an execution layer (specialized agents doing creative work within bounded problems). Their critical finding: "Early on, we experimented with letting agents orchestrate themselves... On larger codebases, agents routinely skipped steps, created circular dependencies, or got stuck in analysis loops." Each agent output goes through deterministic checks plus critic agent evaluation, with a cap of 3–5 iteration attempts per phase to prevent infinite loops. Failures roll back for human intervention. They use a .sdlc/ folder structure with context/, templates/, specs/, and knowledge/ directories as machine-readable contracts.
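The two-layer shape can be sketched in a few dozen lines. Everything below is illustrative: the phase names are generic spec-driven stages, the agent and critic are stubs standing in for model calls, and the 3-attempt cap mirrors the low end of the 3–5 range the report describes:

```python
from typing import Callable

PHASES = ["specify", "plan", "implement", "verify"]  # deterministic order
MAX_ATTEMPTS = 3  # bounded retries per phase (reported range: 3-5)

def orchestrate(agent: Callable[[str, str], str],
                critic: Callable[[str, str], bool]) -> dict[str, str]:
    """Deterministic layer: fixed phase order, bounded agent retries,
    escalation to a human when the critic never approves."""
    artifacts: dict[str, str] = {}
    context = ""
    for phase in PHASES:
        for _attempt in range(MAX_ATTEMPTS):
            output = agent(phase, context)   # creative work, bounded problem
            if critic(phase, output):        # critic-agent gate
                artifacts[phase] = output
                context += output            # carry artifact state forward
                break
        else:
            raise RuntimeError(f"{phase}: no approved output after "
                               f"{MAX_ATTEMPTS} attempts; escalate to human")
    return artifacts

# Stubs: the "agent" succeeds on its second try in every phase.
attempts: dict[str, int] = {}
def stub_agent(phase: str, _ctx: str) -> str:
    attempts[phase] = attempts.get(phase, 0) + 1
    return f"{phase}-draft-{attempts[phase]}"
def stub_critic(_phase: str, output: str) -> bool:
    return output.endswith("-2")

print(orchestrate(stub_agent, stub_critic))
```

The point of the outer layer being plain deterministic code is exactly the report's finding: sequencing, retry caps, and rollback are enforced by the workflow engine, never left to agent judgment.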

AWS introduced the AI-Driven Development Lifecycle (AI-DLC) in July 2025 with three phases: Inception (requirements via "Mob Elaboration"), Construction (architecture/code via "Mob Construction"), and Operations (deployment). A key innovation: replacing sprints with "bolts"—shorter cycles measured in hours or days rather than weeks. AI creates plans, asks clarifying questions, and implements after human validation. This cycle repeats for every SDLC activity, with persistent context stored across all phases.

Microsoft/GitHub's Agentic DevOps (announced at Build 2025) covers the full lifecycle from idea to monitoring. The Copilot Coding Agent (GA September 2025, replacing the earlier Copilot Workspace preview) assigns issues, creates branches, and opens draft PRs. GitHub Spec Kit provides the specification backbone. The academic V-Bounce Model (Hymel, arxiv:2408.03416) formalizes what these tools imply: AI handles implementation while humans shift to validators and verifiers.

The tools landscape reveals clear capability tiers. Devin (Cognition AI, $10.2B valuation) shows the strongest results on well-scoped tasks: 67% PR merge rate, 20x efficiency on security fixes, 10–14x faster migrations. But an independent test by Answer.AI found only 3 of 20 tasks fully completed. GitHub Copilot delivers up to 55% faster task completion with its coding agent, now with multi-file edits and automated code review. Factory AI provides model-agnostic "Droids" with tunable autonomy levels, deployed at EY to 5,000+ engineers. Cursor achieves 85% accuracy on refactors with deep codebase context. Replit Agent is the only tool delivering true end-to-end from prompt to deployed app—their revenue jumped from $10M to $100M in 9 months after Agent launch—but context retention degrades around 15–20 components.

What's working: spec-driven development as the consensus pattern; TDD-based agent workflows (write failing test → implement → verify); code review augmentation (81% reporting quality improvements from AI-assisted review, per Qodo's 2025 report); well-scoped repetitive tasks at scale (migrations, security fixes, test generation). A longitudinal study of 300 engineers found 33.8% cycle time reduction (arxiv:2509.19708). What's not working: fully autonomous end-to-end development for complex tasks; self-orchestrating agent systems; context retention beyond moderate complexity; production-quality vibe coding. The 2024 DORA report found AI improved throughput but also increased software delivery instability. GitClear 2025 found a surge in duplicated code and decline in refactoring.
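The TDD-based agent workflow is itself a small bounded loop: run the failing test, ask the agent for a patch, re-run, stop at a cap. A sketch with a stubbed agent standing in for a real model call — the helper names and the 3-round bound are assumptions for illustration:

```python
from typing import Callable, Optional

def tdd_loop(run_test: Callable[[Callable], bool],
             agent_patch: Callable[[int], Callable],
             max_rounds: int = 3) -> tuple[Optional[Callable], int]:
    """write failing test -> implement -> verify, bounded like any agent loop.

    run_test returns True on green; agent_patch returns a new candidate
    implementation (stubbed here -- a real loop would call a model with
    the failing output as context).
    """
    for round_no in range(1, max_rounds + 1):
        candidate = agent_patch(round_no)
        if run_test(candidate):
            return candidate, round_no   # green: done
    return None, max_rounds              # still red: hand back to a human

# The "spec" is the test itself: add() must sum two ints.
def run_test(impl: Callable) -> bool:
    try:
        return impl(2, 3) == 5 and impl(-1, 1) == 0
    except Exception:
        return False

# Stubbed agent: first attempt is buggy, second is correct.
def agent_patch(round_no: int) -> Callable:
    if round_no == 1:
        return lambda a, b: a - b   # typical first-draft bug
    return lambda a, b: a + b

impl, rounds = tdd_loop(run_test, agent_patch)
print(f"green after {rounds} round(s)" if impl else "escalate to human")
```

Writing the test before the patch is what makes the agent's verification objective; the bound is what keeps a stuck agent from looping indefinitely.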

Six architectural patterns are converging:

  • Two-layer architecture (deterministic orchestration + bounded agent execution)
  • Spec-driven development (specifications as first-class, versioned artifacts)
  • Human-on-the-loop (monitoring agent performance, not reviewing every change)
  • Mob elaboration (cross-functional real-time validation of AI-generated artifacts)
  • Agent autonomy levels (analogous to self-driving: from assistive through conditional autonomy)
  • Critic agents (dedicated QA agents challenging the work of implementation agents)

Evidence note: The McKinsey/QuantumBlack pattern is based on real enterprise engagements. Devin's performance review draws from hundreds of thousands of PRs with named enterprise customers. The 300-engineer longitudinal study is peer-reviewed. AWS AI-DLC is a new methodology with limited public case studies. Many vendor claims are forward-looking. Very few controlled academic studies compare AI-native SDLC approaches head-to-head.


6. The PRD-to-design-doc handoff is collapsing into a single artifact

The traditional delineation is clear and well-documented across industry: PRD = problem space (what, why, for whom, success criteria, constraints), Design Doc/RFC = solution space (how, trade-offs, alternatives, architecture). As Google's Malte Ubl describes it: "The design doc documents the high level implementation strategy and key design decisions with emphasis on the trade-offs that were considered." The PRD should never include database schemas, API endpoint design, or deployment infrastructure. The design doc should never include business justification, market analysis, or feature prioritization rationale.

Published frameworks from major companies confirm this pattern with variations. Google uses design docs as 10–20 page documents that assume requirements are known and focus on solutions and trade-offs, with dedicated privacy design doc reviews required before launch. Uber evolved from simple "DUCKs" to formal RFCs to a tiered planning system storing both PRDs and engineering docs in one unified tool with linked approvers, as they scaled from dozens to 2,000+ engineers. Their service RFCs include sections for architecture changes, SLAs, dependencies, load testing, multi-datacenter concerns, and security. Stripe embeds writing in its culture—Patrick McKenzie calls it "a celebration of the written word which happens to be incorporated in the state of Delaware"—with docs included in career ladders and performance reviews. GitLab's public handbook cleanly separates responsibilities: Product defines "What" and "Why"; Engineering determines "How" and "When." Their design documents are version-controlled and require a "coach" for large changes. They deprecated their RFC process due to low participation (<20%) and replaced it with integrated design documents. Shopify uses technical design for "rapid consensus building" on "small technical areas" and has an async-friendly RFC process on GitHub. Amazon uses PR/FAQs for product requirements and separate technical design reviews for architecture.

Form3 (fintech) articulates the cleanest three-document framework: PRD (a problem that needs solving), RFC (a proposed solution referencing the PRD), and ADR (a record of a decision already made). This three-artifact model appears across many organizations in various forms, though the gray zone—particularly around non-functional requirements and API contracts—is explicitly acknowledged as blurry.

The AI-agent shift is collapsing the PRD and design doc into a single specification artifact. GitHub Spec Kit embodies this: Specify (≈ PRD) → Plan (≈ Design Doc) → Tasks → Implement. Addy Osmani recommends blending PRD and SRS for AI agents: "Writing it like a PRD ensures you include user-centric context so the AI doesn't optimize for the wrong thing. Expanding it like an SRS ensures you nail down the specifics the AI will need to actually generate correct code." ChatPRD's MCP integration lets PMs highlight PRD content in the IDE and prompt: "I want to work on this PRD, build a technical plan"—eliminating the context-switching between product doc and codebase. David Haberlah frames the shift precisely: "Traditional PRDs optimize for human comprehension, stakeholder alignment, and documentation. AI coding agents, by contrast, perform better with dependency-ordered, testable phases."

This convergence raises a structural question. Andrew Ng observed at Y Combinator AI Startup School (July 2025): "I'm seeing this ratio shift. For the first time in my life, managers are proposing having twice as many PMs as engineers." If specifications are becoming the primary input to AI-powered implementation, the PM-to-engineer ratio and the boundary between product and engineering documents will fundamentally change.

Evidence note: Google (Malte Ubl's article), GitLab (public handbook), Shopify (engineering blog), and GitHub (Spec Kit documentation) are direct primary sources. Most other company frameworks are reconstructed from Pragmatic Engineer interviews and former-employee blog posts. Netflix has no publicly documented PRD-to-design-doc process despite extensive searching. AI agent handoff patterns are nascent—most evidence comes from tool documentation and blog posts rather than long-term organizational case studies.


Conclusion

The state of practice is converging around a clear thesis: specifications are becoming executable contracts, not just alignment documents. The most effective teams treat PRDs as machine-readable artifacts that drive AI agent implementation while simultaneously serving as human alignment tools. Three actionable insights stand out.

First, the minimum viable PRD for the AI era needs sections that traditional templates often omit: explicit boundaries (what NOT to build, what files NOT to change), concrete input/output examples, and verification criteria pointing to specific test files or conformance suites. These additions matter because AI agents don't ask clarifying questions—they assume.

Second, deterministic orchestration of agent execution is the best-validated pattern for production use. Teams that experimented with letting agents self-orchestrate reported hitting the same wall: skipped steps, circular dependencies, and analysis loops. The McKinsey two-layer architecture—rule-based workflow engine controlling sequencing, bounded agents handling creative work within phases—is the emerging standard.

Third, the biggest risk in AI-augmented PRDs is not that AI writes bad requirements—it's that AI writes convincingly mediocre ones that look complete enough to skip the messy, essential work of product discovery, stakeholder alignment, and strategic thinking. The illusion of completeness is more dangerous than obvious inadequacy. Teams that use AI to accelerate documentation while maintaining rigorous discovery practices will outperform those that use AI to replace thinking with text.
