Your interviewer pattern is the right foundation—but the human interviewer is the bottleneck to eliminate. Production systems like compound-product already generate PRDs fully autonomously using a "self-clarification protocol" that replaces human Q&A with agent self-interrogation. The research consensus (2025–2026) strongly favors structured planning over single-prompt approaches: FeatureBench data shows 74% success for scoped bug fixes dropping to just 11% for complex feature development without structured decomposition. The optimal path forward combines your existing AgentTask CRD with compound-product's autonomous PRD generation, Ralph Loop execution, and the new kubernetes-sigs/agent-sandbox CRD for isolated agent workloads—all orchestrated through your Go reconcile loop.
Your PRD/DD interview approach is now the dominant paradigm in AI-assisted development. The BMAD Method (19,000+ GitHub stars, MIT licensed, v6.2) codifies exactly this pattern: a PM agent asks 20+ structured questions about assumptions, constraints, personas, data flows, SLAs, NFRs, risks, and test strategy before execution begins. Practitioners report the planning phase takes 4–5 hours by design, and this investment dramatically reduces execution failures.
The critical question is whether the human interviewee can be replaced. Three production systems already demonstrate this:
compound-product (snarktank) implements the most complete autonomous PRD pipeline. Its prd skill explicitly instructs the agent not to ask clarifying questions. Instead, it runs a self-clarification protocol, answering five questions internally from available context: What problem does this solve? What are the 2–3 key actions? What should this NOT do? How do we verify it works? What constraints exist? Context comes from an analysis.json generated in a prior analysis phase, plus AGENTS.md files containing codebase patterns. Output is a standardized 8-section PRD with numbered tasks and boolean pass/fail acceptance criteria that are autonomously verifiable.
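The five self-clarification questions above can be folded into a planning prompt. The following is a minimal sketch, not compound-product's actual implementation: the function name, the prompt wording, and the idea of passing context file paths as a list are all assumptions made for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// selfClarifyQuestions are the five questions listed in the text, which the
// agent answers internally instead of asking a human.
var selfClarifyQuestions = []string{
	"What problem does this solve?",
	"What are the 2-3 key actions?",
	"What should this NOT do?",
	"How do we verify it works?",
	"What constraints exist?",
}

// BuildSelfClarifyPrompt assembles a planning prompt from a goal statement
// and a list of context file paths. Both inputs and the prompt wording are
// hypothetical; they are not taken from compound-product.
func BuildSelfClarifyPrompt(goal string, contextPaths []string) string {
	var b strings.Builder
	fmt.Fprintf(&b, "Goal: %s\n\n", goal)
	b.WriteString("Do not ask the user clarifying questions. ")
	b.WriteString("Answer each question below from the provided context:\n")
	for i, q := range selfClarifyQuestions {
		fmt.Fprintf(&b, "%d. %s\n", i+1, q)
	}
	b.WriteString("\nContext files:\n")
	for _, p := range contextPaths {
		fmt.Fprintf(&b, "- %s\n", p)
	}
	return b.String()
}

func main() {
	fmt.Print(BuildSelfClarifyPrompt(
		"Evaluate a trending tool against the existing stack",
		[]string{"analysis.json", "AGENTS.md"},
	))
}
```

Keeping the questions in a slice makes it straightforward to swap in domain-specific questions later without touching the prompt assembly.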
BMAD Party Mode enables multiple agent personas (PM, Architect, Product Owner, QA) to collaborate in a single session. The BMad Master orchestrator selects 2–3 relevant agents per exchange, and agents agree, disagree, and build on each other's ideas. While currently human-steered, wrapping Party Mode in an outer orchestration layer (feeding one agent's output as input to the next) creates fully autonomous AI-to-AI planning.
MetaGPT implements multi-agent PRD automation with specialized researcher, planner, reviewer, and refiner agents iterating in structured loops. IBM has published tutorials for running this with DeepSeek/Ollama for local inference.
With no human patience to exhaust, raising the question count is a favorable tradeoff, but a bounded one. BMAD's PO asks 20+ questions; an AI-to-AI system could ask 100+ without fatigue. Practitioner Kovyrin reports using the largest available models for PRD creation ("absolutely no reason to save on tokens") and spending 20–30 minutes on voice-dictated context dumps. The risk is diminishing returns: quality is ultimately bounded by available context (codebase, existing docs, domain knowledge), not question count. The recommended hybrid: AI-to-AI for 90% of planning (technical feasibility, architecture, edge cases, testing), with human intervention reserved for product vision, business priorities, and risk tolerance.
Claude Code (already in your stack) should remain the primary execution agent. It holds the highest SWE-bench Verified score at 80.9% (Opus 4.5), now runs headless via claude -p for k8s job execution, has native subagent spawning for parallel work, checkpointing for crash recovery, and Linear MCP integration you're already using. The Agent SDK (Python/TypeScript, released June 2025) provides the same agent loop and tools as the CLI. Agent Teams (Feb 2026) enable multi-agent coordination, and the native /loop command (March 2026) absorbs the Ralph pattern directly.
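As a rough illustration of headless execution from Go, the sketch below builds a `claude -p` command without running it. It assumes the `claude` binary is on the pod's PATH and uses only the `-p` and `--output-format json` flags; the helper names and the `/workspace` directory are hypothetical, and the flag set should be checked against `claude --help` for the installed version.

```go
package main

import (
	"fmt"
	"os/exec"
)

// HeadlessArgs builds the argument list for a non-interactive run.
// Only -p (print mode) and --output-format are used here; any other
// flags should be verified against the installed CLI before use.
func HeadlessArgs(prompt string) []string {
	return []string{"-p", prompt, "--output-format", "json"}
}

// NewHeadlessCmd returns an unstarted command rooted at workdir.
// The caller decides how to run it and how to capture stdout.
func NewHeadlessCmd(workdir, prompt string) *exec.Cmd {
	cmd := exec.Command("claude", HeadlessArgs(prompt)...)
	cmd.Dir = workdir
	return cmd
}

func main() {
	cmd := NewHeadlessCmd("/workspace", "Summarize the repository layout")
	// Print the argv that would be executed; nothing is started here.
	fmt.Println(cmd.Args)
}
```

Separating argument construction from execution keeps the command easy to unit test and lets the reconcile loop decide whether to run it directly or hand it to a job pod.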
OpenHands (formerly OpenDevin, 38,800+ stars, MIT) is the strongest complement to Claude Code. Its event-stream architecture with Docker-based sandboxed runtimes maps naturally to k8s pods. The enterprise edition explicitly supports Kubernetes self-hosting. Its SDK enables programmatic orchestration—ideal for integration with your Go reconcile loop. It's model-agnostic, so it provides a fallback if Anthropic rate limits become a bottleneck. Use OpenHands for sandboxed PoC deployment testing where you don't want untrusted code running with full cluster access.
SWE-agent mini (~100 lines of Python, bash-only tools) is trivially deployable as k8s job pods. It achieves >74% on SWE-bench Verified and starts much faster than Claude Code. Use it for lightweight, targeted investigation tasks where Claude Code's full context window is unnecessary.
GetShitDone is not a coding agent—it's a browser distraction blocker extension (getshitdone.one) and an unrelated SME CRM platform (getshdone.ai). Neither is relevant to your pipeline. Devin is cloud-only with no self-hosting option, making it unsuitable for k8s-native integration despite its high autonomy. Factory AI has strong Terminal-Bench performance (58.8%) and Linear integration but is proprietary with no self-hosting. Aider (42,000+ stars) is an excellent pair programmer with git-native workflow but lacks fully autonomous headless operation. Sweep is JetBrains-focused and not pipeline-suitable.
Temporal is the strongest orchestration fit for your infrastructure. Its Go SDK is native to your reconcile loop language. Self-hosted Temporal Server runs on k8s. Workflows survive crashes and can run for days or weeks—critical for multi-hour investigation tasks. Activities wrap Claude Code SDK calls. Signals enable your AgentTask CRD to communicate with running workflows. Temporal Schedules handle cron-based trending tool discovery. Production-proven at NVIDIA and Uber scale.
LangGraph (v1.0, late 2025) is the best AI-specific orchestration framework. Its graph-based state machines with conditional edges map well to your pipeline's branching logic (for example, skipping the PoC if a tool is irrelevant). Built-in checkpointing with "time travel" state replay, durable execution, and LangSmith observability make it production-hardened. At 27,100 monthly searches, it is also the dominant framework by search interest. The tradeoff: it's Python-native, requiring a Python sidecar or bridge in your Go-based stack.
CrewAI offers the lowest barrier to entry for prototyping role-based agent teams (Discoverer → Investigator → Builder → Evaluator → Writer), with YAML-driven agent definitions. It is good for validating the pipeline concept quickly, but teams commonly migrate to LangGraph for production. AutoGen is in maintenance mode, so avoid it for new development.
Two projects directly complement your AgentTask CRD:
kubernetes-sigs/agent-sandbox (official K8s SIG Apps project, v0.1.0, March 2026) introduces Sandbox CRDs purpose-built for AI agent workloads. Unlike Deployments or StatefulSets, Sandboxes are singleton, stateful pods with persistent identity, storage that survives restarts, gVisor/Kata isolation for untrusted LLM-generated code, and scale-to-zero hibernation. SandboxTemplate defines reusable agent archetypes, SandboxClaim enables programmatic allocation, and SandboxWarmPool keeps pre-warmed pods ready for fast startup. This replaces your need to manually configure pods for agent workloads.
kagent (CNCF Sandbox, Solo.io) provides a Kubernetes-native agent framework with CRD-defined agents, MCP tool integration (kubectl, Prometheus, Helm, Argo, Grafana), and OpenTelemetry tracing. Pre-built agents include k8s-agent, helm-agent, and argo-rollouts-agent. Use kagent for the validation/review layer of your pipeline—agents that verify generated manifests, check metrics, and validate deployments.
The "just give it a high-level goal" approach will not work reliably for your pipeline. The evidence is overwhelming and quantitative.
FeatureBench (February 2026) provides the definitive benchmark: Claude Opus 4.5 achieves 74.4% on SWE-bench (scoped single-PR bug fixes) but only 11.0% on complex feature development. That is a roughly 7x drop in success rate (74.4 / 11.0 ≈ 6.8) as task complexity increases. Your pipeline (investigate, deploy PoC, evaluate, document) is closer to feature development than bug fixing.
Meta-Task Planning research shows structured hierarchical planning achieves 42.68% on the TravelPlanner benchmark versus 2.92% for unstructured approaches, a roughly 14x improvement (42.68 / 2.92 ≈ 14.6) from planning alone. Google DeepMind's "Towards a Science of Scaling Agent Systems" (December 2025) found that adding agents without deliberate coordination topology leads to worse performance, with accuracy saturating after ~4 agents without proper structure.
Documented Claude Code failure modes at high complexity include premature task completion (generates a plan, executes first phases, then stops and summarizes as if done), progress reversion on 50–60+ step tasks (drops back to previous prompt), and context degradation where earliest information is lost once the context window exceeds half capacity. Andrej Karpathy notes: "The models make wrong assumptions on your behalf and run with them without checking. They don't manage confusion, don't seek clarifications, don't surface inconsistencies."
Anthropic's own data shows the deployment overhang: while Claude Opus 4.5 can theoretically handle tasks taking a human ~5 hours, the longest practical Claude Code turns are ~42 minutes. Average human interventions per session decreased from 5.4 to 3.3 between August and December 2025—improving but far from zero. Experienced users (750+ sessions) interrupt more often than novices (9% vs 5% of turns), suggesting expertise leads to more active monitoring, not less.
The sweet spot is Level 3 autonomy: the agent plans and executes autonomously within defined scope, with human oversight at key phase transitions. For your pipeline specifically: provide structured phase boundaries and success criteria (not a full PRD, but clear gates), use Claude Code's Plan Mode for investigation (read-only), break execution into explicit phases with review points, and keep tasks to 3–5 steps max per phase.
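One way to enforce the "3–5 steps max per phase" guideline mechanically is to chunk an ordered task list before handing it to the agent. This is only a sketch; the function name and the example step names are invented for illustration.

```go
package main

import "fmt"

// SplitIntoPhases groups an ordered list of steps into phases of at most
// maxSteps each, so a review point can sit between consecutive phases.
// A maxSteps below 1 is treated as 1 to avoid an infinite loop.
func SplitIntoPhases(steps []string, maxSteps int) [][]string {
	if maxSteps < 1 {
		maxSteps = 1
	}
	var phases [][]string
	for start := 0; start < len(steps); start += maxSteps {
		end := start + maxSteps
		if end > len(steps) {
			end = len(steps)
		}
		phases = append(phases, steps[start:end])
	}
	return phases
}

func main() {
	// Hypothetical step names for a PoC evaluation.
	steps := []string{
		"clone repo", "read docs", "write manifest", "deploy",
		"smoke test", "collect metrics", "tear down",
	}
	for i, phase := range SplitIntoPhases(steps, 5) {
		fmt.Printf("phase %d: %v\n", i+1, phase)
	}
}
```

A fixed-size split is the simplest possible policy; a real planner would more likely cut at natural boundaries such as "deployed" or "tests green".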
Goal: Replace human interviewer with compound-product self-clarification protocol.
AgentTask CRD additions:
spec:
  phase: planning | execution | evaluation | documentation
  prdSource: ""                 # empty = generate autonomously
  goalStatement: "..."          # one-sentence product intent
  contextRefs:                  # existing docs, AGENTS.md paths
    - path: docs/stack-overview.md
    - path: AGENTS.md
  maxPlanningIterations: 5
  costLimit: "$10.00"
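On the controller side, the spec fields above might map to a Go type along these lines. This is a sketch under assumptions: the struct, its validation rules, and the `NeedsAutonomousPRD` helper are hypothetical and would normally live in the CRD's generated API types.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

// ContextRef points at an existing document the planner may read.
type ContextRef struct {
	Path string `json:"path"`
}

// AgentTaskSpec mirrors the spec fields proposed above (hypothetical type).
type AgentTaskSpec struct {
	Phase                 string       `json:"phase"`
	PRDSource             string       `json:"prdSource"`
	GoalStatement         string       `json:"goalStatement"`
	ContextRefs           []ContextRef `json:"contextRefs"`
	MaxPlanningIterations int          `json:"maxPlanningIterations"`
	CostLimit             string       `json:"costLimit"`
}

var validPhases = map[string]bool{
	"planning": true, "execution": true, "evaluation": true, "documentation": true,
}

// Validate applies basic sanity checks; the specific rules are assumptions.
func (s AgentTaskSpec) Validate() error {
	if !validPhases[s.Phase] {
		return fmt.Errorf("unknown phase %q", s.Phase)
	}
	if s.GoalStatement == "" {
		return errors.New("goalStatement is required")
	}
	if s.MaxPlanningIterations < 1 {
		return errors.New("maxPlanningIterations must be at least 1")
	}
	return nil
}

// NeedsAutonomousPRD reports whether the planning phase should generate the
// PRD itself, i.e. the phase is planning and prdSource is empty.
func (s AgentTaskSpec) NeedsAutonomousPRD() bool {
	return s.Phase == "planning" && s.PRDSource == ""
}

func main() {
	raw := `{"phase":"planning","prdSource":"","goalStatement":"Evaluate tool X",
	         "contextRefs":[{"path":"AGENTS.md"}],"maxPlanningIterations":5,"costLimit":"$10.00"}`
	var spec AgentTaskSpec
	if err := json.Unmarshal([]byte(raw), &spec); err != nil {
		panic(err)
	}
	fmt.Println("valid:", spec.Validate() == nil, "autonomous PRD:", spec.NeedsAutonomousPRD())
}
```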
- prd skill, adapt self-clarification questions to your domain (k8s PoC evaluation, stack compatibility assessment, documentation requirements)
- AgentTask.spec.phase == planning and prdSource is empty, spawn a Claude Code headless session with the self-clarification prompt
- prd.json committed to feature branch in monorepo

Key files to create in github.com/kylep/multi:
- agents/planning/self-clarify-prompt.md: domain-specific self-clarification template
- agents/planning/prd-template.md: standardized PRD format with k8s-specific sections
- internal/controller/phases/planning.go: planning phase reconcile logic

Goal: Implement isolated agent execution using agent-sandbox CRDs and Ralph pattern.
# Agent execution sandbox
apiVersion: agents.x-k8s.io/v1alpha1
kind: Sandbox
metadata:
  name: poc-evaluator-{{ .task.name }}
spec:
  runtimeClassName: gvisor
  containers:
    - name: claude-agent
      image: ghcr.io/kylep/multi/agent-runner:latest
      env:
        - name: ANTHROPIC_API_KEY
          valueFrom:
            secretKeyRef:
              name: anthropic-credentials
              key: api-key
        - name: PRD_PATH
          value: /workspace/prd.json
      volumeMounts:
        - name: workspace
          mountPath: /workspace
  storage:
    size: 20Gi
- kubernetes-sigs/agent-sandbox CRDs on your k3s cluster
- agent-runner container image containing Claude Code CLI + Ralph loop script + git configuration
- AgentTask.spec.phase == execution
- prd.json, implements tasks, commits to feature branch
- --max-cost per task, --timeout 45m per iteration
- prd.json task completion status from reconcile loop; when all tasks pass → phase transition

Key files:
- deploy/agent-sandbox/: Helm chart for agent-sandbox operator
- images/agent-runner/Dockerfile: Claude Code + Ralph + git-sync
- images/agent-runner/ralph-config.json: loop configuration
- internal/controller/phases/execution.go: execution phase reconcile logic
- internal/controller/sandbox.go: Sandbox CRD lifecycle management

Goal: Agent evaluates PoC against existing stack and generates wiki/blog documentation.
Evaluation phase uses a structured rubric committed as a ConfigMap:
apiVersion: v1
kind: ConfigMap
metadata:
  name: evaluation-rubric
data:
  rubric.yaml: |
    criteria:
      - name: k8s_native
        weight: 0.25
        question: "Does it deploy as native k8s resources manageable by ArgoCD?"
      - name: stack_overlap
        weight: 0.20
        question: "Does it replace or conflict with existing stack components?"
      - name: operational_complexity
        weight: 0.20
        question: "What is the operational burden (resource usage, maintenance)?"
      - name: value_add
        weight: 0.20
        question: "What capability does it add that the current stack lacks?"
      - name: community_health
        weight: 0.15
        question: "Is the project actively maintained with healthy community?"
    pass_threshold: 0.6
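The rubric above implies a weighted sum compared against the pass threshold. The sketch below shows one way that calculation could look; it assumes each criterion is scored on a 0 to 1 scale and that a missing score counts as 0, neither of which the rubric itself specifies.

```go
package main

import "fmt"

// Criterion pairs a rubric entry with its weight.
type Criterion struct {
	Name   string
	Weight float64
}

// rubric copies the names and weights from the ConfigMap; weights sum to 1.0.
var rubric = []Criterion{
	{"k8s_native", 0.25},
	{"stack_overlap", 0.20},
	{"operational_complexity", 0.20},
	{"value_add", 0.20},
	{"community_health", 0.15},
}

const passThreshold = 0.6

// WeightedScore sums weight*score over the rubric. Scores are assumed to be
// in [0, 1]; a criterion missing from the map contributes 0.
func WeightedScore(scores map[string]float64) (float64, error) {
	total := 0.0
	for _, c := range rubric {
		s := scores[c.Name]
		if s < 0 || s > 1 {
			return 0, fmt.Errorf("score for %s out of range: %v", c.Name, s)
		}
		total += c.Weight * s
	}
	return total, nil
}

// Passes reports whether the weighted score meets the pass threshold.
func Passes(scores map[string]float64) (bool, error) {
	total, err := WeightedScore(scores)
	if err != nil {
		return false, err
	}
	return total >= passThreshold, nil
}

func main() {
	// Illustrative scores only, not results from any real evaluation.
	scores := map[string]float64{
		"k8s_native": 1.0, "stack_overlap": 0.5, "operational_complexity": 0.5,
		"value_add": 0.75, "community_health": 1.0,
	}
	total, _ := WeightedScore(scores)
	ok, _ := Passes(scores)
	fmt.Printf("score=%.2f pass=%v\n", total, ok)
}
```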
- evaluation.json with scores, rationale, and recommendation
- apps/blog/) and wiki
- kyle.pericak.com via Cloudflare Tunnel

Key files:
- agents/evaluation/rubric.yaml: evaluation criteria template
- agents/evaluation/eval-prompt.md: evaluation agent instructions
- agents/documentation/blog-template.md: blog post structure
- internal/controller/phases/evaluation.go: evaluation phase logic
- internal/controller/phases/documentation.go: documentation phase logic

Goal: Close the loop with automated discovery feeding the pipeline.
Discovery agent runs on a Temporal Schedule (daily or weekly):
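// Temporal workflow definition (imports: "time", "go.temporal.io/sdk/workflow")
func DiscoverTrendingTools(ctx workflow.Context) error {
	// Activity options are attached to the context, not passed as an argument.
	// The 10-minute timeout is an assumed placeholder.
	ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
		StartToCloseTimeout: 10 * time.Minute,
	})
	// Activity 1: Search GitHub trending, HN, Reddit, CNCF landscape
	var discoveries []ToolCandidate
	if err := workflow.ExecuteActivity(ctx, SearchTrendingSources).Get(ctx, &discoveries); err != nil {
		return err
	}
	// Activity 2: Filter against existing stack (deduplicate, relevance check)
	var candidates []ToolCandidate
	if err := workflow.ExecuteActivity(ctx, FilterCandidates, discoveries).Get(ctx, &candidates); err != nil {
		return err
	}
	// Activity 3: Create AgentTask CRDs for top candidates (max 3 per cycle)
	if len(candidates) > 3 {
		candidates = candidates[:3]
	}
	for _, candidate := range candidates {
		if err := workflow.ExecuteActivity(ctx, CreateAgentTask, candidate).Get(ctx, nil); err != nil {
			return err
		}
	}
	return nil
}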
- temporalio/temporal)
- SandboxWarmPool with 2 pre-warmed pods for fast pipeline startup

Key files:
- internal/temporal/workflows/discover.go: discovery workflow
- internal/temporal/activities/search.go: trending source search
- internal/temporal/activities/filter.go: candidate filtering
- deploy/temporal/: Helm values for Temporal Server
- internal/controller/phases/discovery.go: discovery phase reconcile logic

┌──────────────────────────────────────────────────────────┐
│ Temporal Schedule (daily) │
│ → Discovery Workflow │
│ → Search GitHub trending, HN, CNCF landscape │
│ → Filter against existing stack │
│ → Create AgentTask CRDs (max 3/cycle) │
└──────────────┬───────────────────────────────────────────┘
│
┌──────────────▼───────────────────────────────────────────┐
│ Go Reconcile Loop (watches AgentTask CRDs) │
│ │
│ Phase 1: PLANNING │
│ → Spawn Claude Code headless in Sandbox │
│ → Self-clarification protocol → PRD + prd.json │
│ → Human approval gate via Linear MCP │
│ │
│ Phase 2: EXECUTION │
│ → Create agent-sandbox Sandbox (gVisor isolated) │
│ → Ralph Loop: iterate prd.json tasks │
│ → Circuit breaker + cost controls │
│ → Git commit to feature branch │
│ │
│ Phase 3: EVALUATION │
│ → Read-only Sandbox runs evaluation rubric │
│ → Structured scoring against existing stack │
│ → Pass/fail gate (threshold: 0.6) │
│ │
│ Phase 4: DOCUMENTATION │
│ → Generate blog post + wiki entry │
│ → Commit to apps/blog/ in monorepo │
│ → ArgoCD syncs → Cloudflare Tunnel → published │
│ │
│ Phase 5: CLEANUP │
│ → Scale Sandbox to 0 (hibernate) │
│ → Update Linear ticket with results │
│ → Archive evaluation artifacts │
└──────────────────────────────────────────────────────────┘
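The phase gates in the diagram can be read as a small state machine. The sketch below is one possible encoding, not the controller's actual logic: the `Status` fields are hypothetical, and routing a failed evaluation straight to cleanup is an assumption, since the diagram does not say what happens below the threshold.

```go
package main

import "fmt"

// Phase names follow the diagram above.
type Phase string

const (
	Planning      Phase = "planning"
	Execution     Phase = "execution"
	Evaluation    Phase = "evaluation"
	Documentation Phase = "documentation"
	Cleanup       Phase = "cleanup"
)

const passThreshold = 0.6

// Status holds the gate signals a controller might observe (hypothetical fields).
type Status struct {
	PRDApproved bool    // human approval gate after planning
	TasksTotal  int     // tasks listed in prd.json
	TasksPassed int     // tasks whose acceptance criteria pass
	EvalDone    bool    // evaluation has produced a score
	EvalScore   float64 // weighted rubric score
}

// NextPhase returns the phase to move to, or the current phase when its
// gate is not yet satisfied.
func NextPhase(cur Phase, st Status) Phase {
	switch cur {
	case Planning:
		if st.PRDApproved {
			return Execution
		}
	case Execution:
		if st.TasksTotal > 0 && st.TasksPassed == st.TasksTotal {
			return Evaluation
		}
	case Evaluation:
		if !st.EvalDone {
			return cur
		}
		if st.EvalScore >= passThreshold {
			return Documentation
		}
		return Cleanup // assumption: a failed evaluation skips documentation
	case Documentation:
		return Cleanup
	}
	return cur
}

func main() {
	st := Status{PRDApproved: true, TasksTotal: 4, TasksPassed: 4, EvalDone: true, EvalScore: 0.72}
	p := Planning
	for p != Cleanup {
		next := NextPhase(p, st)
		fmt.Printf("%s -> %s\n", p, next)
		p = next
	}
}
```

Keeping the transition function pure (no API calls, no side effects) makes each gate testable on its own and keeps the reconcile loop idempotent.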
Start with a human-in-the-loop gate after Phase 1. The research is unambiguous: fully autonomous end-to-end execution has an 88% project failure rate in enterprise contexts. Begin with human PRD approval, then progressively relax gates as you build confidence in the self-clarification protocol's output quality. Track a metric: "PRD approval rate without modification" — when it exceeds 80% over 20+ runs, remove the gate.
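The gate-removal rule above (more than 80% unmodified approvals over at least 20 runs) reduces to a small check. This is a sketch with hypothetical names; how approvals are recorded is left open.

```go
package main

import "fmt"

const (
	minRuns        = 20 // "20+ runs"
	minApprovalPct = 80 // rate must exceed this percentage
)

// ShouldRemoveGate reports whether the share of PRDs approved without
// modification exceeds 80% over at least 20 runs. Integer arithmetic is
// used to avoid floating-point edge cases at exactly 80%.
func ShouldRemoveGate(approvedUnmodified, totalRuns int) bool {
	if totalRuns < minRuns || approvedUnmodified < 0 || approvedUnmodified > totalRuns {
		return false
	}
	return approvedUnmodified*100 > totalRuns*minApprovalPct
}

func main() {
	// Illustrative counts only.
	fmt.Println(ShouldRemoveGate(16, 20)) // exactly 80%: does not exceed the bar
	fmt.Println(ShouldRemoveGate(17, 20)) // 85%
	fmt.Println(ShouldRemoveGate(10, 10)) // too few runs
}
```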
Use gVisor isolation for all execution Sandboxes. Your pipeline evaluates untrusted open-source tools and generates code to deploy them. The agent-sandbox CRD's runtimeClassName: gvisor provides kernel-level isolation without the overhead of full VMs. This is non-negotiable for a pipeline that autonomously deploys unknown software to your cluster.
Budget API costs aggressively. Gas Town-scale multi-agent runs cost $50–100+/hour. Your pipeline should enforce per-task cost limits (--max-cost $10 for planning, $25 for execution, $5 for evaluation/documentation). Temporal's activity timeout enforcement provides a hard backstop. Track cost-per-evaluation as a key metric.
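Independently of any CLI flag, the controller could track spend per phase itself. The sketch below uses the dollar limits suggested above, stored in cents; the `Budget` type is hypothetical, and treating evaluation and documentation as $5 each (rather than $5 combined) is an assumption.

```go
package main

import (
	"errors"
	"fmt"
)

// phaseLimitCents mirrors the suggested per-phase limits, in cents.
// Evaluation and documentation are assumed to get $5 each.
var phaseLimitCents = map[string]int64{
	"planning":      1000,
	"execution":     2500,
	"evaluation":    500,
	"documentation": 500,
}

// ErrBudgetExceeded is returned when a charge would pass the phase limit.
var ErrBudgetExceeded = errors.New("phase budget exceeded")

// Budget tracks spend for a single phase of a single task.
type Budget struct {
	Phase      string
	spentCents int64
}

// Charge records a cost. If the charge would push the phase past its limit,
// it returns ErrBudgetExceeded and leaves the recorded spend unchanged.
func (b *Budget) Charge(cents int64) error {
	limit, ok := phaseLimitCents[b.Phase]
	if !ok {
		return fmt.Errorf("no budget defined for phase %q", b.Phase)
	}
	if cents < 0 {
		return errors.New("charge must not be negative")
	}
	if b.spentCents+cents > limit {
		return ErrBudgetExceeded
	}
	b.spentCents += cents
	return nil
}

// RemainingCents returns how much of the phase limit is left.
func (b *Budget) RemainingCents() int64 {
	return phaseLimitCents[b.Phase] - b.spentCents
}

func main() {
	b := &Budget{Phase: "planning"}
	fmt.Println(b.Charge(750), b.RemainingCents())
	fmt.Println(b.Charge(400), b.RemainingCents())
}
```

Storing money as integer cents sidesteps floating-point rounding when many small charges accumulate.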
Monitor the compound-product and agent-sandbox maturity. compound-product was last indexed January 2026—verify active maintenance before production adoption. agent-sandbox is v0.1.0 alpha. Both are directionally correct but may require patching. The BMAD Method (v6.2, actively maintained, 19k stars) is the safest bet for the planning layer if compound-product stalls.
The Ralph Loop's coercive exit-prevention language has raised concerns (Claude Code Issue #23084 regarding "model welfare"). Anthropic's native /loop command (shipped March 2026) absorbs the core Ralph pattern with official support. Evaluate whether /loop is sufficient before adding the full ralph-claude-code wrapper—it reduces your dependency surface and aligns with Anthropic's evolving agent autonomy policies.
The most important finding across all four research questions is that architecture matters more than model choice. The same model scores 17 problems apart on different agent scaffoldings. Your instinct to build structured planning infrastructure (AgentTask CRD, interview patterns, phased execution) is validated by every production system and research paper examined. The teams succeeding with autonomous agents spend 70% of effort on problem definition and verification strategy, 30% on execution—your PRD/DD pattern encodes exactly this ratio.
The gap to close is not "more autonomy" but "smarter autonomy." Compound-product's self-clarification protocol, agent-sandbox's lifecycle management, and Temporal's durable workflows give you the building blocks to eliminate human bottlenecks at each phase boundary—selectively, measurably, and reversibly. The pipeline you've described (discover → investigate → PoC → evaluate → document) is achievable with today's tooling. The key is phased rollout with progressive autonomy gates, not a single leap to full automation.