Link to PRD: Autonomous Publisher Pipeline
The publisher pipeline (research, write, review, QA, security audit) has only ever run interactively on Kyle's MacBook. Moving it to K8s requires solving four problems the journalist agent never had to: Claude Max OAuth auth (replacing OpenRouter), Playwright browser verification in a container (the journalist has no QA step), branch lifecycle management (the journalist commits to main), and MCP server configuration for Playwright alongside the existing Discord and Google News servers.
The runtime image also needs a base image change: Playwright requires
glibc, and the current node:22-alpine base uses musl.
Goals:
CLAUDE_CODE_OAUTH_TOKEN
Non-Goals:
graph TD
  subgraph "Kyle's Machine"
    K[Kyle] -->|webhook / kubectl| WH[Controller Webhook]
  end
  subgraph "K8s: ai-agents namespace"
    WH --> AT[AgentTask CRD<br/>agent: publisher]
    AT --> CTRL[Agent Controller]
    CTRL -->|creates| JOB[K8s Job]
    subgraph "Job Pod"
      INIT[Init Container<br/>alpine/git] -->|git sync| MAIN
      MAIN[Entrypoint Script<br/>bin/run-publisher.sh] -->|creates branch| CC[Claude Code<br/>--agent publisher]
      CC -->|subagent| RES[Researcher]
      CC -->|subagent| REV[Reviewer]
      CC -->|subagent via MCP| QA[QA + Playwright]
      CC -->|subagent| SEC[Security Auditor]
      CC -->|on success| GIT[git push + gh pr create]
      CC -->|MCP| DISC[Discord Notification]
    end
    JOB --> PVC[(Per-Branch PVC<br/>/workspace)]
    NP[NetworkPolicy] -.->|allow non-RFC1918<br/>block LAN| JOB
  end
  CC -->|CLAUDE_CODE_OAUTH_TOKEN| ANTHROPIC[api.anthropic.com<br/>Claude Max]
  style ANTHROPIC fill:#e8d5f5
  style NP fill:#f5d5d5

Runtime image (ai-agent-runtime:0.4)
Responsibility: Provide Claude Code CLI, Playwright, Chromium, and all MCP server dependencies in a single image.
File path: infra/ai-agent-runtime/Dockerfile
Key change: Base image switches from node:22-alpine to the
official Playwright image mcr.microsoft.com/playwright:v1.58.2-noble
which includes Chromium, system fonts, and glibc. Node.js 22 is
available in this image.
Key interfaces:
- @playwright/mcp, Python 3, pip, MCP server deps, Git, gh CLI
- pwuser (UID 1001, from Playwright base image)
- /bin/sh (unchanged)
- ~/.claude.json with {"hasCompletedOnboarding": true} baked into the image

Entrypoint script (bin/run-publisher.sh)
Responsibility: Branch lifecycle, Claude Code invocation, git push, PR creation, and Discord notification. This is the command the controller runs instead of inline Claude Code invocation.
File path: apps/blog/bin/run-publisher.sh
Key interfaces:
- $1 = topic/prompt string
- CLAUDE_CODE_OAUTH_TOKEN, DISCORD_WEBHOOK_URL (or uses Discord MCP), GITHUB_TOKEN (for gh pr create)
- agent/publisher-$(date +%s) from current HEAD
- claude --mcp-config /tmp/mcp.json --agent publisher -p "$1" --output-format text
- git push, gh pr create --base main, posts PR link to Discord
- bin/start-dev-bg.sh

Responsibility: Extended buildCommand() to support publisher-specific configuration: Playwright MCP server in the MCP config JSON, OAuth token injection, and gh CLI auth.
File path: infra/agent-controller/pkg/controller/controller.go
Key changes:
- buildCommand() detects agent == "publisher" and uses the entrypoint script instead of inline command construction
- MCP config includes the Playwright server: {"playwright": {"type": "stdio", "command": "npx", "args": ["-y", "@playwright/mcp@latest", "--headless", "--no-sandbox"]}}
- /dev/shm emptyDir volume (required for Chromium)
- --init equivalent via shareProcessNamespace or a proper init process
- Per-branch PVCs for write agents (createBranchPVC), cleaned up after job completion (cleanupBranchPVCs). Read-only agents continue using the shared PVC.
- canRunWrite() serialization removed, since per-branch PVCs eliminate the need for write serialization
- GitHub App installation token (getInstallationToken) provides scoped auth for git push and PR creation

Responsibility: Store Claude Max OAuth token and GitHub token alongside existing secrets.
File path: infra/agent-controller/helm/templates/secret.yaml
Key additions:
- CLAUDE_CODE_OAUTH_TOKEN -- the 1-year token from claude setup-token
- GITHUB_TOKEN -- PAT for gh pr create (or use gh auth login --with-token)

Auth model change: For publisher runs, the controller must NOT
inject ANTHROPIC_API_KEY, ANTHROPIC_AUTH_TOKEN, or
ANTHROPIC_BASE_URL -- these conflict with CLAUDE_CODE_OAUTH_TOKEN.
The entrypoint script unsets them before invoking Claude Code.
Responsibility: Restrict agent pod network egress to non-RFC1918 addresses only (following the OpenClaw pattern).
File path: infra/agent-controller/helm/templates/networkpolicy.yaml
Key interfaces:
- 0.0.0.0/0 except 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
- agents.kyle.pericak.com/agent

Sample AgentTask (publisher-manual.yaml)
Responsibility: Declarative task definition for manual publisher runs.
File path: infra/agent-controller/config/samples/publisher-manual.yaml
Key fields:
- agent: publisher
- trigger: manual
- readOnly: false
- allowedTools: full set including Bash, Read, Write, Edit, Glob, Grep, Agent, and all mcp__playwright__* tools

No new CRD fields required. The existing AgentTaskSpec covers all needs:
- agent: publisher
- prompt: "Write a blog post about <topic>"
- trigger: manual
- readOnly: false

The entrypoint script handles branch naming and PR creation outside the CRD model.
Webhook trigger:
POST :8080/webhook
Authorization: Bearer <token>
{
"agent": "publisher",
"prompt": "Write a blog post about autonomous AI agent pipelines in K8s",
"runtime": "claude"
}
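
The request above could be issued from a shell roughly as follows. This is a minimal sketch, not the controller's documented client: `CONTROLLER_URL`, `WEBHOOK_TOKEN`, and both function names are hypothetical, and the payload builder assumes the prompt contains no double quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical helpers; CONTROLLER_URL and WEBHOOK_TOKEN are assumed names.

# Build the JSON body shown in the webhook example.
# Assumes the prompt has no double quotes or backslashes (no escaping is done).
webhook_payload() {
  printf '{"agent": "publisher", "prompt": "%s", "runtime": "claude"}\n' "$1"
}

# POST the payload to the controller webhook with a bearer token.
trigger_publisher() {
  curl -fsS -X POST "${CONTROLLER_URL:-http://localhost:8080}/webhook" \
    -H "Authorization: Bearer ${WEBHOOK_TOKEN}" \
    -H "Content-Type: application/json" \
    -d "$(webhook_payload "$1")"
}
```

Usage would look like `trigger_publisher "Write a blog post about autonomous AI agent pipelines in K8s"`.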
Entrypoint script contract:
Input: $1 = prompt string (from AgentTask.spec.prompt)
Output: exit 0 = success (branch pushed, PR created, Discord notified)
exit 1 = failure (Discord notified with error, branch preserved)
Env: CLAUDE_CODE_OAUTH_TOKEN (required)
GITHUB_TOKEN (required for gh pr create)
MCP config at /tmp/mcp.json (written by controller)
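
The contract above can be sketched as a POSIX shell script. This is an illustration under stated assumptions, not the actual `run-publisher.sh`: the function names are hypothetical, and the git push, PR creation, and Discord steps are left out to keep the sketch short.

```shell
#!/bin/sh
# Sketch of the entrypoint contract. All function names are hypothetical.

# Branch name format from the design: agent/publisher-<unix timestamp>.
branch_name() {
  printf 'agent/publisher-%s\n' "${1:-$(date +%s)}"
}

# Remove API-key style auth so it cannot conflict with CLAUDE_CODE_OAUTH_TOKEN.
clear_conflicting_auth() {
  unset ANTHROPIC_API_KEY ANTHROPIC_AUTH_TOKEN ANTHROPIC_BASE_URL
}

# Fail when any variable named in the arguments is unset or empty.
require_env() {
  for name in "$@"; do
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing required env var: $name" >&2
      return 1
    fi
  done
}

run_publisher() {
  prompt=$1
  require_env CLAUDE_CODE_OAUTH_TOKEN GITHUB_TOKEN || return 1
  [ -f /tmp/mcp.json ] || { echo "missing /tmp/mcp.json" >&2; return 1; }
  clear_conflicting_auth
  git checkout -b "$(branch_name)" || return 1
  claude --mcp-config /tmp/mcp.json --agent publisher -p "$prompt" --output-format text
}

# Run only when a prompt is supplied, so the functions can be sourced and tested.
if [ "$#" -gt 0 ]; then
  run_publisher "$1" || exit 1
fi
```

Keeping each step in its own function makes the exit-code contract easy to check: any failed step returns non-zero and the script exits 1.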
Discord notification format (success):
Publisher completed: "Write a blog post about <topic>"
PR: https://github.com/kylep/multi/pull/XX
Branch: agent/publisher-1710636000
Discord notification format (failure):
Publisher FAILED: "Write a blog post about <topic>"
Stage: QA verification
Error: Blog build exited with code 1
Branch: agent/publisher-1710636000 (preserved for debugging)
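
As a sketch of how the two message layouts above might be produced, the following formatters take the fields as positional arguments. The function names are hypothetical and not taken from the actual script.

```shell
#!/bin/sh
# Hypothetical formatters for the two notification layouts shown above.

# format_success <prompt> <pr-url> <branch>
format_success() {
  printf 'Publisher completed: "%s"\nPR: %s\nBranch: %s\n' "$1" "$2" "$3"
}

# format_failure <prompt> <stage> <error> <branch>
format_failure() {
  printf 'Publisher FAILED: "%s"\nStage: %s\nError: %s\nBranch: %s (preserved for debugging)\n' \
    "$1" "$2" "$3" "$4"
}
```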

Base image:

| Option | Pros | Cons | Verdict |
|---|---|---|---|
| mcr.microsoft.com/playwright:v1.58.2-noble | Chromium + deps pre-installed, officially supported, glibc | Larger image (~1.5GB), Ubuntu-based, breaks from Alpine convention | Chosen -- Playwright explicitly does not support Alpine (musl). This is the only supported path. |
| node:22-slim + manual Chromium install | Debian slim is smaller than Playwright image, more control | Fragile: must track Chromium deps manually, version mismatches | Rejected -- Playwright docs warn against this approach |
| Keep Alpine + Chromium from Alpine repos | Stays consistent with current image | Playwright does not support musl/Alpine. Browser builds require glibc. | Rejected -- technically not possible |

Pipeline orchestration:

| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Agent-driven (claude --agent publisher) | Publisher agent definition already handles subagent delegation, no controller changes for pipeline logic, same behavior as interactive | Controller can't observe intermediate stages | Chosen -- the publisher agent already works interactively. Changing orchestration for K8s would create divergent behavior. |
| Controller-driven multi-step | Controller can track each stage, retry individual steps | Duplicates orchestration logic, Go code becomes coupled to pipeline stages | Rejected -- too much controller complexity |
| Entrypoint script orchestration | Shell script calls each subagent separately | Loses publisher agent's context between stages, breaks adversarial review loop | Rejected -- the review loop requires agent context |

Authentication:

| Option | Pros | Cons | Verdict |
|---|---|---|---|
| CLAUDE_CODE_OAUTH_TOKEN via claude setup-token | 1-year token, works with Max subscription, community-proven | Not officially supported M2M path, Anthropic closed M2M request as NOT_PLANNED | Chosen -- only viable path for Max tokens. No fallback by design decision. |
| API key with OpenRouter passthrough | Proven, currently works for journalist | Costs money on OpenRouter, doesn't use Max allocation | Rejected -- defeats the purpose (PRD success metric: 100% Max billing) |
| Wait for official M2M auth | Would be the "right" way | Anthropic marked NOT_PLANNED. Could be months/years. | Rejected -- blocks project indefinitely |

Network egress policy:

| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Non-RFC1918 allowlist (OpenClaw pattern) | Simple, proven in this cluster, allows WebFetch research | Doesn't restrict specific domains | Chosen -- matches Kyle's preference, allows researcher subagent to fetch arbitrary public URLs |
| Domain-specific allowlist | Tightest control, explicit about what's allowed | WebFetch needs arbitrary domains, would require a proxy | Rejected -- too restrictive for research |
| Broad egress (no policy) | Simplest | No isolation at all | Rejected -- PRD requires sandboxing |

Branch management:

| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Entrypoint script in repo (bin/run-publisher.sh) | Logic lives in repo (not Go or agent def), easy to iterate, handles git + PR + notification in one place | Extra file to maintain | Chosen -- Kyle's preference. Keeps controller simple, keeps agent definition unchanged. |
| Controller creates branch in Go | Centralized, works for any agent | Couples controller to git branching, Go code bloat | Rejected -- controller should stay generic |
| Agent creates branch | No new scripts | Agent definition would need git-specific instructions that diverge from interactive use | Rejected -- divergent behavior between interactive and autonomous |

Workspace storage:

| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Per-branch PV/PVC (hostPath) | Full isolation between runs, enables parallelism, no stale state leakage, cleanup is simple (delete PV+PVC) | More PV/PVC churn, controller manages lifecycle | Chosen -- OOMKill from parallel resource contention and stale state leaking between runs motivated the switch. /tmp is ephemeral on Rancher Desktop so no manual hostPath cleanup needed. |
| Shared PVC with write serialization | Simple, already implemented | Prevents parallelism, stale state leaks between runs, caused OOMKill when two agents competed for resources | Rejected -- was the v1 approach but caused real problems |
| emptyDir per pod | Zero PV management, automatic cleanup | Data lost immediately on pod termination (can't inspect failed runs), size limited by node tmpfs | Rejected -- need to inspect workspace after failures for debugging |

GitHub authentication:

| Option | Pros | Cons | Verdict |
|---|---|---|---|
| GitHub App installation tokens | Scoped to specific repo, short-lived (1hr), no PAT to rotate, branch protection respected | Requires JWT generation in controller, needs App setup | Chosen -- installation tokens are scoped to kylep/multi only, expire in 1hr, and branch protection prevents direct pushes to main |
| Personal Access Token (PAT) | Simple, already partially implemented | Long-lived, broad scope, must rotate manually | Rejected -- too broad for autonomous agents |
| Deploy keys | Repo-scoped | Write deploy keys can push to any branch including main, no PR creation | Rejected -- no branch protection enforcement |

Browser verification:

| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Playwright MCP in-container | Identical behavior to interactive mode, QA agent uses same tools | Adds ~400MB to image, needs /dev/shm config | Chosen -- Kyle's preference. Same agent code works in both environments. |
| Playwright CLI (npx playwright test) | No MCP server needed | QA agent would need a different code path for container mode | Rejected -- divergent behavior |
| Build-only verification (no browser) | Simplest, smallest image | Misses render bugs, doesn't satisfy PRD acceptance criterion for browser verification | Rejected -- PRD requires "renders correctly in a browser" |

| Action | File | Rationale |
|---|---|---|
| MODIFY | infra/ai-agent-runtime/Dockerfile | Switch base to Playwright image, add @playwright/mcp, gh CLI, bake in ~/.claude.json onboarding bypass |
| CREATE | apps/blog/bin/run-publisher.sh | Entrypoint script: branch creation, Claude invocation, git push, PR creation, Discord notification |
| MODIFY | infra/agent-controller/pkg/controller/controller.go | Add publisher-specific command building (entrypoint script invocation), Playwright MCP in config, /dev/shm volume |
| MODIFY | infra/agent-controller/helm/templates/secret.yaml | Add CLAUDE_CODE_OAUTH_TOKEN and GITHUB_TOKEN fields |
| MODIFY | infra/agent-controller/helm/values.yaml | Add new secret defaults, bump runtime image tag to 0.3 |
| CREATE | infra/agent-controller/helm/templates/networkpolicy.yaml | Non-RFC1918 egress policy for agent pods (OpenClaw pattern) |
| CREATE | infra/agent-controller/config/samples/publisher-manual.yaml | Sample AgentTask for manual publisher runs |
| MODIFY | apps/blog/blog/markdown/wiki/devops/agent-controller.md | Document publisher task, Max auth, credential rotation |
| MODIFY | apps/blog/blog/markdown/wiki/custom-tools/docker-images/ai-agent-runtime.md | Document new base image, Playwright, version bump |

- setup-token works and tokens authenticate correctly
- claude setup-token on Kyle's MacBook, obtain token
- CLAUDE_CODE_OAUTH_TOKEN env var and run claude -p "echo hello" --output-format text -- confirms auth
- hasCompletedOnboarding: true in ~/.claude.json allows headless execution
- ~/.claude.json structure needed for the runtime image

- infra/ai-agent-runtime/Dockerfile
- ~/.claude.json structure)
- mcr.microsoft.com/playwright:v1.58.2-noble (latest stable at implementation time)
- claude --version succeeds
- npx @playwright/mcp@latest --help succeeds
- mcp[cli], httpx) installed
- gh CLI installed
- ~/.claude.json with {"hasCompletedOnboarding": true} exists at /home/pwuser/.claude.json
- pwuser from Playwright base)
- kpericak/ai-agent-runtime:0.3

- apps/blog/bin/run-publisher.sh
- agent/publisher-$(date +%s) from current HEAD
- ANTHROPIC_API_KEY, ANTHROPIC_AUTH_TOKEN, ANTHROPIC_BASE_URL before invoking Claude (prevents auth conflict)
- claude --mcp-config /tmp/mcp.json --agent publisher -p "$1" --output-format text
- git push -u origin <branch>, gh pr create --base main --title "<topic>" --body "Autonomous publisher run"
- chmod +x)

- infra/agent-controller/pkg/controller/controller.go
- buildCommand() detects agent == "publisher" and routes to entrypoint script
- playwright server: {"type": "stdio", "command": "npx", "args": ["-y", "@playwright/mcp@latest", "--headless"]}
- discord and google-news servers
- $1 to entrypoint script

- infra/agent-controller/pkg/controller/controller.go
- /dev/shm with medium: Memory and sizeLimit: 1Gi
- /dev/shm in the agent container
- shareProcessNamespace: true) to prevent zombie Chromium processes
- /workspace (PVC) and /dev/shm (emptyDir) -- no hostPath mounts, no access to host filesystem
- readOnlyRootFilesystem: false but allowPrivilegeEscalation: false and capabilities: drop: [ALL]
- /etc/shadow, cannot write outside /workspace and /home, cannot see host processes (TASK-009)

- infra/agent-controller/helm/templates/secret.yaml, infra/agent-controller/helm/values.yaml
- secret.yaml includes CLAUDE_CODE_OAUTH_TOKEN field with lookup-preserve pattern
- secret.yaml includes GITHUB_TOKEN field with lookup-preserve pattern
- values.yaml has empty defaults for new secrets
- helm template renders correctly with new fields
- OPENROUTER_API_KEY, DISCORD_BOT_TOKEN, etc.) are preserved

- infra/agent-controller/helm/templates/networkpolicy.yaml
- 0.0.0.0/0 except 10.0.0.0/8, 172.16.0.0/12, 192.168.0.0/16
- agents.kyle.pericak.com/agent (Exists operator)
- api.anthropic.com but cannot reach 192.168.1.1 (TASK-009)

- infra/agent-controller/config/samples/publisher-manual.yaml
- agent: publisher, trigger: manual
- allowedTools needed -- entrypoint script manages Claude invocation
- kubectl apply -f succeeds and controller recognizes the task (TASK-009)

- ai-agent-runtime:0.3
- agent-controller:0.6
- helm upgrade deploys both new images

- infra/agent-controller/pkg/controller/controller.go, infra/agent-controller/main.go, infra/agent-controller/helm/templates/rbac.yaml, infra/agent-controller/helm/templates/deployment.yaml, infra/agent-controller/helm/values.yaml
- agent-ws-<jobName>)
- agent-workspace PVC
- canRunWrite() serialization removed
- go build and helm template pass

- infra/agent-controller/pkg/controller/controller.go, infra/agent-controller/main.go, infra/agent-controller/helm/templates/secret.yaml, infra/agent-controller/helm/templates/deployment.yaml, infra/agent-controller/helm/values.yaml, apps/blog/bin/run-publisher.sh
- GITHUB_TOKEN injected into write agent pods
- run-publisher.sh pushes branch and creates PR
- GITHUB_APP_ID, GITHUB_APP_PRIVATE_KEY, GITHUB_INSTALL_ID

- apps/blog/blog/markdown/wiki/devops/agent-controller.md, apps/blog/blog/markdown/wiki/custom-tools/docker-images/ai-agent-runtime.md
- CLAUDE_CODE_OAUTH_TOKEN via claude setup-token and patching the K8s secret

Changes made during TASK-009 deployment that weren't in the original design.
Dropped google-news MCP from controller. The journalist agent
uses WebSearch/WebFetch instead of the custom google-news MCP server.
Removes the npm ci setup step from the agent command, simplifying
startup and eliminating a permission issue (node_modules owned by
wrong UID).
UID unified to 1001 for all agents. The design specified UID 1000 for non-publisher agents and 1001 for publisher. Since all agents now use the Playwright-based runtime image (pwuser=1001), the controller chowns to 1001 for all agents.
Dropped git push, PR creation, and Discord webhook from run-publisher.sh. The agent writes to a local branch on the PVC. Kyle reviews from the filesystem. GitHub App integration for push/PR will come later.
Discord #log channel for controller observability. The controller posts to Discord #log on job start (with UUID, agent, prompt preview) and on job completion/failure. Uses the Discord bot API directly from Go, not MCP.
Per-branch PVCs for write agents. The shared PVC caused two
problems: stale state leaking between runs (git worktree artifacts,
node_modules from previous runs) and OOMKill when parallel resource
contention occurred. Write agents now get isolated PV/PVC pairs
(agent-ws-<jobName>) with hostPath under
/tmp/agent-workspace/branches/. Read-only agents keep the shared PVC.
The canRunWrite() serialization was removed since per-branch PVCs
eliminate the need for write serialization.
GitHub App auth for git push and PR creation. The PericakAI
GitHub App (ID 3100834, installed on kylep/multi only) provides
scoped auth. The controller generates a JWT, exchanges it for a
short-lived installation token (1hr), and injects it as
GITHUB_TOKEN on write agent pods. run-publisher.sh uses this
token for git push and gh pr create. Branch protection on main
prevents direct pushes, so agents can only create branches and PRs.
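
The JWT-for-installation-token exchange described above can be sketched in shell with `openssl` and `curl`. The controller performs this in Go; the version below only illustrates the flow, and the function names are hypothetical. It assumes an RSA private key in PEM format and uses GitHub's REST endpoint for creating installation access tokens.

```shell
#!/bin/sh
# Illustration of the GitHub App token exchange. Function names are hypothetical.

# URL-safe base64 without padding, as JWT segments require.
b64url() {
  openssl base64 -A | tr '+/' '-_' | tr -d '='
}

# make_jwt <app-id> <private-key.pem>: RS256 JWT, backdated 60s, valid ~9 minutes.
make_jwt() {
  now=$(date +%s)
  header=$(printf '{"alg":"RS256","typ":"JWT"}' | b64url)
  claims=$(printf '{"iat":%s,"exp":%s,"iss":"%s"}' \
    "$((now - 60))" "$((now + 540))" "$1" | b64url)
  signature=$(printf '%s.%s' "$header" "$claims" \
    | openssl dgst -sha256 -sign "$2" -binary | b64url)
  printf '%s.%s.%s\n' "$header" "$claims" "$signature"
}

# installation_token <jwt> <installation-id>: prints the raw JSON response.
# Its "token" field is the short-lived value that would become GITHUB_TOKEN.
installation_token() {
  curl -fsS -X POST \
    -H "Authorization: Bearer $1" \
    -H "Accept: application/vnd.github+json" \
    "https://api.github.com/app/installations/$2/access_tokens"
}
```

The JWT itself never reaches the agent pod in this flow; only the installation token, which is scoped to the repositories the App is installed on, would be injected.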
Switched --output-format text to --output-format stream-json.
Enables streaming structured output to pod logs for real-time
monitoring via kubectl logs -f.
--allowedTools required for headless Claude Code. Discovered
that Claude Code in headless mode (-p flag) blocks all tool use
unless --allowedTools is explicitly passed. The journalist CRD
already had this; webhook-triggered tasks must include it too.
The agent controller now has a reproducible setup path:
- secrets/export-agent-controller.sh.SAMPLE lists every env var the controller needs, with comments on where to get each value.
- infra/agent-controller/bin/setup.sh bootstraps from scratch: checks prereqs, sources secrets, runs helm install, patches the K8s secret. Optional --build-images flag builds and pushes both Docker images first.

What's automated: namespace creation, helm install/upgrade, secret patching (including base64-encoding the PEM key).
What's manual: populating the exports file, GitHub App setup (creating the App, noting the install ID), and branch protection rulesets on main.
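
The automated steps can be sketched as follows. This is not the contents of `setup.sh`: the release name `agent-controller`, the secret name `agent-controller-secrets`, and all function names are assumptions made for illustration, while the chart path and the `ai-agents` namespace come from this document.

```shell
#!/bin/sh
# Sketch of the bootstrap flow. Release name, secret name, and function
# names are hypothetical.

# Succeeds only if every named tool is on PATH.
check_prereqs() {
  for tool in "$@"; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
  done
}

# Base64-encode a file on a single line, as Secret data values require.
encode_file() {
  base64 < "$1" | tr -d '\n'
}

# bootstrap <exports-file> <pem-key-path>
bootstrap() {
  check_prereqs kubectl helm || return 1
  . "$1" || return 1
  helm upgrade --install agent-controller infra/agent-controller/helm \
    --namespace ai-agents --create-namespace || return 1
  kubectl -n ai-agents patch secret agent-controller-secrets --type merge \
    -p "{\"data\":{\"GITHUB_APP_PRIVATE_KEY\":\"$(encode_file "$2")\"}}"
}
```

Pass the exports file with an explicit path (for example `./secrets/export-agent-controller.sh`), since the `.` builtin searches `PATH` for names without a slash.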
claude setup-token scope regression (PRD OQ #1). Issue #23703
reported a scope regression. TASK-001 validates this before any
implementation begins. If setup-token is broken, the entire project
is blocked until Anthropic fixes it or a workaround is found.
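OAuth refresh token race condition. Issue #24317 reports race conditions with concurrent sessions. The original design ran only one publisher at a time (write serialization), so this should not apply. However, canRunWrite() serialization was removed when per-branch PVCs were introduced, so the assumption holds only while publisher runs are not started concurrently. Monitor during TASK-009.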
Playwright image version pinning. The design specifies
v1.58.2-noble but should track the latest stable Playwright release
at implementation time. Pin to a specific version, don't use latest.
gh CLI auth in container. The script needs GITHUB_TOKEN env
var for gh pr create. Verify that gh reads this env var without
requiring gh auth login first.
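
A quick way to check this inside the container is sketched below. It relies on gh's documented behavior of reading `GH_TOKEN` or `GITHUB_TOKEN` from the environment; `check_gh_auth` is a hypothetical helper name.

```shell
#!/bin/sh
# gh reads GH_TOKEN or GITHUB_TOKEN from the environment, so no
# `gh auth login` step should be needed. check_gh_auth is hypothetical.
check_gh_auth() {
  if [ -z "${GH_TOKEN:-}${GITHUB_TOKEN:-}" ]; then
    echo "GITHUB_TOKEN is not set" >&2
    return 1
  fi
  # Exits non-zero when gh cannot authenticate with the supplied token.
  gh auth status
}
```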
Discord notification mechanism. The entrypoint script can either use a simple Discord webhook URL (curl POST) or invoke the Discord MCP server. The webhook is simpler and doesn't require MCP. Decision deferred to TASK-003 implementation.
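
The webhook option can be sketched as below. This is a minimal version that assumes a single-line message; `json_escape` and `notify_discord` are hypothetical names, and `DISCORD_WEBHOOK_URL` is the variable already named in this design. Discord webhooks accept a JSON body with a `content` field.

```shell
#!/bin/sh
# Webhook variant of the notification. Assumes a single-line message;
# json_escape and notify_discord are hypothetical names.

# Escape backslashes and double quotes so the text is safe inside a JSON string.
json_escape() {
  printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g'
}

# POST the message to the webhook URL as the "content" field.
notify_discord() {
  curl -fsS -X POST "$DISCORD_WEBHOOK_URL" \
    -H "Content-Type: application/json" \
    -d "{\"content\": \"$(json_escape "$1")\"}"
}
```

A multi-line message would also need its newlines converted to `\n` before being sent, which this sketch does not do.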
Auth instability (PRD risk #1). CLAUDE_CODE_OAUTH_TOKEN is a
community workaround. No fallback by design. Mitigation: TASK-001
validates before investing in other tasks. If auth breaks post-deploy,
publisher runs stop until Anthropic provides a fix or alternative.
Image size regression. The Playwright base image is ~1.5GB vs ~200MB for Alpine. This affects pull times and disk usage. Mitigation: image is cached on the node after first pull. Acceptable tradeoff for browser verification capability.
Playwright base image breaks existing agents. Switching from Alpine to Ubuntu could break Python/Node paths or MCP server deps. Mitigation: TASK-002 includes journalist regression test. TASK-009 includes full regression.
Token expiry mid-run (PRD risk #4). Publisher pipeline can take 30+ minutes. If the 1-year token expires during a run, the run fails partway. Mitigation: the token lasts 1 year, so the risk is negligible as long as credentials are rotated before expiry. Document rotation procedure in TASK-010.
Promotion deadline (PRD risk #2). March 28 deadline for doubled off-peak limits. Tasks are ordered for fastest path to a working prototype: TASK-001 (validation) gates everything, then TASK-002/003 (image + script) can run in parallel, enabling a test run by ~day 5.