Autonomous Publisher Pipeline -- Design Doc
Context
Link to PRD: Autonomous Publisher Pipeline
The publisher pipeline (research, write, review, QA, security audit) has only ever run interactively on Kyle's MacBook. Moving it to K8s requires solving four problems the journalist agent never had to: Claude Max OAuth auth (replacing OpenRouter), Playwright browser verification in a container (the journalist has no QA step), branch lifecycle management (the journalist commits to main), and MCP server configuration for Playwright alongside the existing Discord and Google News servers.
The runtime image also needs a base image change: Playwright requires
glibc, and the current node:22-alpine base uses musl.
Goals and Non-Goals
Goals:
- Authenticate autonomous publisher runs against the Claude Max
subscription via
CLAUDE_CODE_OAUTH_TOKEN - Run the full publisher pipeline (research, write, review loop, QA with Playwright, security audit) without human intervention
- Self-verify output: blog build succeeds, post renders in a headless browser, passes adversarial review
- Isolate the agent: filesystem to workspace only, network egress restricted to non-RFC1918 addresses (OpenClaw pattern)
- Notify Kyle via Discord on completion (with auto-PR link) and on failure (with error context)
Non-Goals:
- No auto-merge or production deployment (Kyle reviews and merges)
- No auto-topic selection (Kyle triggers manually in v1)
- No multi-pipeline concurrency in v1 (per-branch PVCs enable it but not yet tested with concurrent runs)
- No OpenRouter fallback -- commit fully to Max auth
- No database migration or new CRD fields beyond what exists
- No cloud K8s migration (stays on local Rancher Desktop)
Proposed Design
graph TD
subgraph "Kyle's Machine"
K[Kyle] -->|webhook / kubectl| WH[Controller Webhook]
end
subgraph "K8s: ai-agents namespace"
WH --> AT[AgentTask CRD
agent: publisher]
AT --> CTRL[Agent Controller]
CTRL -->|creates| JOB[K8s Job]
subgraph "Job Pod"
INIT[Init Container
alpine/git] -->|git sync| MAIN
MAIN[Entrypoint Script
bin/run-publisher.sh] -->|creates branch| CC[Claude Code
--agent publisher]
CC -->|subagent| RES[Researcher]
CC -->|subagent| REV[Reviewer]
CC -->|subagent via MCP| QA[QA + Playwright]
CC -->|subagent| SEC[Security Auditor]
CC -->|on success| GIT[git push + gh pr create]
CC -->|MCP| DISC[Discord Notification]
end
JOB --> PVC[(Per-Branch PVC
/workspace)]
NP[NetworkPolicy] -.->|allow non-RFC1918
block LAN| JOB
end
CC -->|CLAUDE_CODE_OAUTH_TOKEN| ANTHROPIC[api.anthropic.com
Claude Max]
style ANTHROPIC fill:#e8d5f5
style NP fill:#f5d5d5
Component Details
1. Runtime Image (ai-agent-runtime:0.4)
Responsibility: Provide Claude Code CLI, Playwright, Chromium, and all MCP server dependencies in a single image.
File path: infra/ai-agents/ai-agent-runtime/Dockerfile
Key change: Base image switches from node:22-alpine to the
official Playwright image mcr.microsoft.com/playwright:v1.58.2-noble
which includes Chromium, system fonts, and glibc. Node.js 22 is
available in this image.
Key interfaces:
- Inherits Chromium + system deps from Playwright base
- Installs: Claude Code CLI,
@playwright/mcp, Python 3, pip, MCP server deps, Git,ghCLI - User: non-root
pwuser(UID 1001, from Playwright base image) - Entrypoint:
/bin/sh(unchanged) - Claude onboarding bypass:
~/.claude.jsonwith{"hasCompletedOnboarding": true}baked into the image
2. Entrypoint Script (bin/run-publisher.sh)
Responsibility: Branch lifecycle, Claude Code invocation, git push, PR creation, and Discord notification. This is the command the controller runs instead of inline Claude Code invocation.
File path: apps/blog/bin/run-publisher.sh
Key interfaces:
- Input:
$1= topic/prompt string - Env vars consumed:
CLAUDE_CODE_OAUTH_TOKEN,DISCORD_WEBHOOK_URL(or uses Discord MCP),GITHUB_TOKEN(forgh pr create) - Creates branch
agent/publisher-$(date +%s)from current HEAD - Invokes
claude --mcp-config /tmp/mcp.json --agent publisher -p "$1" --output-format text - On success:
git push,gh pr create --base main, posts PR link to Discord - On failure: posts error context to Discord, exits non-zero
- Does NOT manage the dev server -- the QA subagent handles that
internally via
bin/start-dev-bg.sh
3. Controller Changes
Responsibility: Extended buildCommand() to support publisher-
specific configuration: Playwright MCP server in the MCP config JSON,
OAuth token injection, and gh CLI auth.
File path: infra/ai-agents/agent-controller/pkg/controller/controller.go
Key changes:
buildCommand()detectsagent == "publisher"and uses the entrypoint script instead of inline command construction- MCP config includes Playwright MCP server:
{"playwright": {"type": "stdio", "command": "npx", "args": ["-y", "@playwright/mcp@latest", "--headless", "--no-sandbox"]}} - Job pod spec adds
/dev/shmemptyDir volume (required for Chromium) - Job pod spec adds
--initequivalent viashareProcessNamespaceor a proper init process - Per-branch PVC lifecycle: write agents get isolated PV/PVC pairs
(
createBranchPVC), cleaned up after job completion (cleanupBranchPVCs). Read-only agents continue using the shared PVC. canRunWrite()serialization removed — per-branch PVCs eliminate the need for write serialization- GitHub App installation token generation (
getInstallationToken) provides scoped auth for git push and PR creation
4. Secret Changes
Responsibility: Store Claude Max OAuth token and GitHub token alongside existing secrets.
File path: infra/ai-agents/agent-controller/helm/templates/secret.yaml
Key additions:
CLAUDE_CODE_OAUTH_TOKEN-- the 1-year token fromclaude setup-tokenGITHUB_TOKEN-- PAT forgh pr create(or usegh auth login --with-token)
Auth model change: For publisher runs, the controller must NOT
inject ANTHROPIC_API_KEY, ANTHROPIC_AUTH_TOKEN, or
ANTHROPIC_BASE_URL -- these conflict with CLAUDE_CODE_OAUTH_TOKEN.
The entrypoint script unsets them before invoking Claude Code.
5. NetworkPolicy
Responsibility: Restrict agent pod network egress to non-RFC1918 addresses only (following the OpenClaw pattern).
File path: infra/ai-agents/agent-controller/helm/templates/networkpolicy.yaml
Key interfaces:
- Deny all ingress
- Allow DNS to kube-system
- Allow HTTPS (443) to
0.0.0.0/0except10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 - Applies to pods with label
agents.kyle.pericak.com/agent
6. AgentTask Sample (publisher-manual.yaml)
Responsibility: Declarative task definition for manual publisher runs.
File path: infra/ai-agents/agent-controller/config/samples/publisher-manual.yaml
Key fields:
agent: publishertrigger: manualreadOnly: falseallowedTools: full set including Bash, Read, Write, Edit, Glob, Grep, Agent, and allmcp__playwright__*tools
Data Model
No new CRD fields required. The existing AgentTaskSpec covers all
needs:
agent: publisherprompt: "Write a blog post about <topic>"trigger: manualreadOnly: false
The entrypoint script handles branch naming and PR creation outside the CRD model.
API / Interface Contracts
Webhook trigger:
POST :8080/webhook
Authorization: Bearer <token>
{
"agent": "publisher",
"prompt": "Write a blog post about autonomous AI agent pipelines in K8s",
"runtime": "claude"
}
Entrypoint script contract:
Input: $1 = prompt string (from AgentTask.spec.prompt)
Output: exit 0 = success (branch pushed, PR created, Discord notified)
exit 1 = failure (Discord notified with error, branch preserved)
Env: CLAUDE_CODE_OAUTH_TOKEN (required)
GITHUB_TOKEN (required for gh pr create)
MCP config at /tmp/mcp.json (written by controller)
Discord notification format (success):
Publisher completed: "Write a blog post about <topic>"
PR: https://github.com/kylep/multi/pull/XX
Branch: agent/publisher-1710636000
Discord notification format (failure):
Publisher FAILED: "Write a blog post about <topic>"
Stage: QA verification
Error: Blog build exited with code 1
Branch: agent/publisher-1710636000 (preserved for debugging)
Alternatives Considered
Decision: Base image for Playwright support
| Option | Pros | Cons | Verdict |
|---|---|---|---|
mcr.microsoft.com/playwright:v1.58.2-noble |
Chromium + deps pre-installed, officially supported, glibc | Larger image (~1.5GB), Ubuntu-based, breaks from Alpine convention | Chosen -- Playwright explicitly does not support Alpine (musl). This is the only supported path. |
node:22-slim + manual Chromium install |
Debian slim is smaller than Playwright image, more control | Fragile: must track Chromium deps manually, version mismatches | Rejected -- Playwright docs warn against this approach |
| Keep Alpine + Chromium from Alpine repos | Stays consistent with current image | Playwright does not support musl/Alpine. Browser builds require glibc. | Rejected -- technically not possible |
Decision: Publisher orchestration model
| Option | Pros | Cons | Verdict |
|---|---|---|---|
Agent-driven (claude --agent publisher) |
Publisher agent definition already handles subagent delegation, no controller changes for pipeline logic, same behavior as interactive | Controller can't observe intermediate stages | Chosen -- the publisher agent already works interactively. Changing orchestration for K8s would create divergent behavior. |
| Controller-driven multi-step | Controller can track each stage, retry individual steps | Duplicates orchestration logic, Go code becomes coupled to pipeline stages | Rejected -- too much controller complexity |
| Entrypoint script orchestration | Shell script calls each subagent separately | Loses publisher agent's context between stages, breaks adversarial review loop | Rejected -- the review loop requires agent context |
Decision: Auth approach for Claude Max
| Option | Pros | Cons | Verdict |
|---|---|---|---|
CLAUDE_CODE_OAUTH_TOKEN via claude setup-token |
1-year token, works with Max subscription, community-proven | Not officially supported M2M path, Anthropic closed M2M request as NOT_PLANNED | Chosen -- only viable path for Max tokens. No fallback by design decision. |
| API key with OpenRouter passthrough | Proven, currently works for journalist | Costs money on OpenRouter, doesn't use Max allocation | Rejected -- defeats the purpose (PRD success metric: 100% Max billing) |
| Wait for official M2M auth | Would be the "right" way | Anthropic marked NOT_PLANNED. Could be months/years. | Rejected -- blocks project indefinitely |
Decision: Network isolation approach
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Non-RFC1918 allowlist (OpenClaw pattern) | Simple, proven in this cluster, allows WebFetch research | Doesn't restrict specific domains | Chosen -- matches Kyle's preference, allows researcher subagent to fetch arbitrary public URLs |
| Domain-specific allowlist | Tightest control, explicit about what's allowed | WebFetch needs arbitrary domains, would require a proxy | Rejected -- too restrictive for research |
| Broad egress (no policy) | Simplest | No isolation at all | Rejected -- PRD requires sandboxing |
Decision: Branch management
| Option | Pros | Cons | Verdict |
|---|---|---|---|
Entrypoint script in repo (bin/run-publisher.sh) |
Logic lives in repo (not Go or agent def), easy to iterate, handles git + PR + notification in one place | Extra file to maintain | Chosen -- Kyle's preference. Keeps controller simple, keeps agent definition unchanged. |
| Controller creates branch in Go | Centralized, works for any agent | Couples controller to git branching, Go code bloat | Rejected -- controller should stay generic |
| Agent creates branch | No new scripts | Agent definition would need git-specific instructions that diverge from interactive use | Rejected -- divergent behavior between interactive and autonomous |
Decision: Branch workspace isolation
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Per-branch PV/PVC (hostPath) | Full isolation between runs, enables parallelism, no stale state leakage, cleanup is simple (delete PV+PVC) | More PV/PVC churn, controller manages lifecycle | Chosen -- OOMKill from parallel resource contention and stale state leaking between runs motivated the switch. /tmp is ephemeral on Rancher Desktop so no manual hostPath cleanup needed. |
| Shared PVC with write serialization | Simple, already implemented | Prevents parallelism, stale state leaks between runs, caused OOMKill when two agents competed for resources | Rejected -- was the v1 approach but caused real problems |
| emptyDir per pod | Zero PV management, automatic cleanup | Data lost immediately on pod termination (can't inspect failed runs), size limited by node tmpfs | Rejected -- need to inspect workspace after failures for debugging |
Decision: GitHub auth for agent push/PR
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| GitHub App installation tokens | Scoped to specific repo, short-lived (1hr), no PAT to rotate, branch protection respected | Requires JWT generation in controller, needs App setup | Chosen -- installation tokens are scoped to kylep/multi only, expire in 1hr, and branch protection prevents direct pushes to main |
| Personal Access Token (PAT) | Simple, already partially implemented | Long-lived, broad scope, must rotate manually | Rejected -- too broad for autonomous agents |
| Deploy keys | Repo-scoped | Write deploy keys can push to any branch including main, no PR creation | Rejected -- no branch protection enforcement |
Decision: QA verification in container
| Option | Pros | Cons | Verdict |
|---|---|---|---|
| Playwright MCP in-container | Identical behavior to interactive mode, QA agent uses same tools | Adds ~400MB to image, needs /dev/shm config | Chosen -- Kyle's preference. Same agent code works in both environments. |
| Playwright CLI (npx playwright test) | No MCP server needed | QA agent would need a different code path for container mode | Rejected -- divergent behavior |
| Build-only verification (no browser) | Simplest, smallest image | Misses render bugs, doesn't satisfy PRD acceptance criterion for browser verification | Rejected -- PRD requires "renders correctly in a browser" |
File Change List
| Action | File | Rationale |
|---|---|---|
| MODIFY | infra/ai-agents/ai-agent-runtime/Dockerfile |
Switch base to Playwright image, add @playwright/mcp, gh CLI, bake in ~/.claude.json onboarding bypass |
| CREATE | apps/blog/bin/run-publisher.sh |
Entrypoint script: branch creation, Claude invocation, git push, PR creation, Discord notification |
| MODIFY | infra/ai-agents/agent-controller/pkg/controller/controller.go |
Add publisher-specific command building (entrypoint script invocation), Playwright MCP in config, /dev/shm volume |
| MODIFY | infra/ai-agents/agent-controller/helm/templates/secret.yaml |
Add CLAUDE_CODE_OAUTH_TOKEN and GITHUB_TOKEN fields |
| MODIFY | infra/ai-agents/agent-controller/helm/values.yaml |
Add new secret defaults, bump runtime image tag to 0.3 |
| CREATE | infra/ai-agents/agent-controller/helm/templates/networkpolicy.yaml |
Non-RFC1918 egress policy for agent pods (OpenClaw pattern) |
| CREATE | infra/ai-agents/agent-controller/config/samples/publisher-manual.yaml |
Sample AgentTask for manual publisher runs |
| MODIFY | apps/blog/blog/markdown/wiki/devops/agent-controller.md |
Document publisher task, Max auth, credential rotation |
| MODIFY | apps/blog/blog/markdown/wiki/custom-tools/docker-images/ai-agent-runtime.md |
Document new base image, Playwright, version bump |
Task Breakdown
TASK-001: Validate Claude Max OAuth token
- Requirement: PRD story "Use Claude Max tokens" -- confirm
setup-tokenworks and tokens authenticate correctly - Files: None (manual validation task)
- Dependencies: None
- Acceptance criteria:
- Run
claude setup-tokenon Kyle's MacBook, obtain token - Set
CLAUDE_CODE_OAUTH_TOKENenv var and runclaude -p "echo hello" --output-format text-- confirms auth - Verify the run bills against Max subscription (check claude.ai usage dashboard)
- Confirm
hasCompletedOnboarding: truein~/.claude.jsonallows headless execution - Document the exact
~/.claude.jsonstructure needed for the runtime image
- Run
TASK-002: Rebuild runtime image with Playwright base
- Requirement: PRD story "Self-verifying output" -- runtime must support Chromium for QA verification
- Files:
infra/ai-agents/ai-agent-runtime/Dockerfile - Dependencies: TASK-001 (need
~/.claude.jsonstructure) - Acceptance criteria:
- Base image is
mcr.microsoft.com/playwright:v1.58.2-noble(latest stable at implementation time) - Claude Code CLI installed and
claude --versionsucceeds -
npx @playwright/mcp@latest --helpsucceeds - Python 3, pip, MCP server deps (
mcp[cli],httpx) installed -
ghCLI installed -
~/.claude.jsonwith{"hasCompletedOnboarding": true}exists at/home/pwuser/.claude.json - Image runs as UID 1001 (non-root,
pwuserfrom Playwright base) - Image tagged as
kpericak/ai-agent-runtime:0.3 - Existing journalist agent still works with the new image (regression test)
- Base image is
TASK-003: Create entrypoint script ✅
- Requirement: PRD stories "Trigger an autonomous publisher run" and "Self-verifying output" -- the script handles branch lifecycle, invocation, and notification
- Files:
apps/blog/bin/run-publisher.sh - Dependencies: TASK-001 (auth model confirmed)
- Completion notes: task-003-entrypoint-script.md
- Acceptance criteria:
- Script creates branch
agent/publisher-$(date +%s)from current HEAD - Script unsets
ANTHROPIC_API_KEY,ANTHROPIC_AUTH_TOKEN,ANTHROPIC_BASE_URLbefore invoking Claude (prevents auth conflict) - Script invokes
claude --mcp-config /tmp/mcp.json --agent publisher -p "$1" --output-format text - On success: runs
git push -u origin <branch>,gh pr create --base main --title "<topic>" --body "Autonomous publisher run" - On success: posts PR link to Discord via curl webhook
- On failure: posts error context to Discord, preserves branch, exits non-zero
- Script is executable (
chmod +x) - Script works end-to-end in container (deferred to TASK-009)
- Script creates branch
TASK-004: Add Playwright MCP to controller command building ✅
- Requirement: PRD story "Self-verifying output" -- QA subagent needs Playwright MCP tools available
- Files:
infra/ai-agents/agent-controller/pkg/controller/controller.go - Dependencies: TASK-002 (Playwright in image)
- Completion notes: task-004-008-controller-helm.md
- Acceptance criteria:
-
buildCommand()detectsagent == "publisher"and routes to entrypoint script - MCP config JSON includes
playwrightserver:{"type": "stdio", "command": "npx", "args": ["-y", "@playwright/mcp@latest", "--headless"]} - MCP config still includes
discordandgoogle-newsservers - Non-publisher agents are unaffected (same command as before)
- Publisher command passes prompt as
$1to entrypoint script
-
TASK-005: Add /dev/shm volume and Chromium pod config ✅
- Requirement: PRD story "Self-verifying output" -- Chromium requires shared memory and specific pod security settings
- Files:
infra/ai-agents/agent-controller/pkg/controller/controller.go - Dependencies: TASK-004
- Completion notes: task-004-008-controller-helm.md
- Acceptance criteria:
- Job pod spec includes emptyDir volume for
/dev/shmwithmedium: MemoryandsizeLimit: 1Gi - Volume is mounted at
/dev/shmin the agent container - Pod includes a proper init process (
shareProcessNamespace: true) to prevent zombie Chromium processes - Pod securityContext enforces filesystem isolation: agent container mounts only
/workspace(PVC) and/dev/shm(emptyDir) -- no hostPath mounts, no access to host filesystem - Agent runs as non-root (UID 1001 for publisher, 1000 for others) with
readOnlyRootFilesystem: falsebutallowPrivilegeEscalation: falseandcapabilities: drop: [ALL] - Verify from inside the pod: agent cannot access
/etc/shadow, cannot write outside/workspaceand/home, cannot see host processes (TASK-009) - Chromium launches successfully in the pod (TASK-009)
- Job pod spec includes emptyDir volume for
TASK-006: Add secrets for Max auth and GitHub token ✅
- Requirement: PRD story "Use Claude Max tokens" -- OAuth token must be injected into the pod
- Files:
infra/ai-agents/agent-controller/helm/templates/secret.yaml,infra/ai-agents/agent-controller/helm/values.yaml - Dependencies: TASK-001 (have the token)
- Completion notes: task-004-008-controller-helm.md
- Acceptance criteria:
-
secret.yamlincludesCLAUDE_CODE_OAUTH_TOKENfield with lookup-preserve pattern -
secret.yamlincludesGITHUB_TOKENfield with lookup-preserve pattern -
values.yamlhas empty defaults for new secrets -
helm templaterenders correctly with new fields - Existing secrets (
OPENROUTER_API_KEY,DISCORD_BOT_TOKEN, etc.) are preserved
-
TASK-007: Create NetworkPolicy ✅
- Requirement: PRD story "Sandboxed execution" -- restrict network egress to non-RFC1918 addresses
- Files:
infra/ai-agents/agent-controller/helm/templates/networkpolicy.yaml - Dependencies: None
- Completion notes: task-004-008-controller-helm.md
- Acceptance criteria:
- Policy denies all ingress to agent pods
- Policy allows DNS to kube-system namespace
- Policy allows HTTP (80) + HTTPS (443) to
0.0.0.0/0except10.0.0.0/8,172.16.0.0/12,192.168.0.0/16 - Policy selects pods with label
agents.kyle.pericak.com/agent(Exists operator) - Verification: agent pod can reach
api.anthropic.combut cannot reach192.168.1.1(TASK-009) - Existing journalist agent still works with the policy applied (TASK-009)
TASK-008: Create publisher AgentTask sample ✅
- Requirement: PRD story "Trigger an autonomous publisher run"
- Files:
infra/ai-agents/agent-controller/config/samples/publisher-manual.yaml - Dependencies: None
- Completion notes: task-004-008-controller-helm.md
- Acceptance criteria:
- Valid AgentTask YAML with
agent: publisher,trigger: manual - No
allowedToolsneeded -- entrypoint script manages Claude invocation -
kubectl apply -fsucceeds and controller recognizes the task (TASK-009) - Task prompt is a placeholder that Kyle replaces per run
- Valid AgentTask YAML with
TASK-009: Build, deploy, and end-to-end test
- Requirement: PRD success metric "Hands-off execution" -- full pipeline completes without human intervention
- Files: All modified files from TASK-002 through TASK-008
- Dependencies: TASK-002, TASK-003, TASK-004, TASK-005, TASK-006, TASK-007, TASK-008
- Deployment runbook: task-009-deploy-runbook.md
- Acceptance criteria:
- Build and push
ai-agent-runtime:0.3 - Build and push
agent-controller:0.6 -
helm upgradedeploys both new images - Trigger a publisher run via webhook with a test topic
- Run completes end-to-end without human intervention
- Blog post builds and renders (QA passes)
- Adversarial review loop executed (verify Claude Code output logs show at least one reviewer subagent invocation)
- Branch is pushed, PR is created
- Discord notification received
- Claude.ai usage dashboard shows the run billed against Max subscription
- Journalist agent still works (regression test)
- Network policy is active (verify with curl test from debug pod)
- Build and push
TASK-011: Per-branch PVCs for write agents
- Requirement: Eliminate stale state leakage and enable parallel runs
- Files:
infra/ai-agents/agent-controller/pkg/controller/controller.go,infra/ai-agents/agent-controller/main.go,infra/ai-agents/agent-controller/helm/templates/rbac.yaml,infra/ai-agents/agent-controller/helm/templates/deployment.yaml,infra/ai-agents/agent-controller/helm/values.yaml - Dependencies: TASK-009
- Acceptance criteria:
- Write agents get dedicated PV/PVC (
agent-ws-<jobName>) - Read-only agents continue using shared
agent-workspacePVC -
canRunWrite()serialization removed - PV/PVC cleaned up after job completes or fails
- RBAC includes PVC verbs (namespaced) and PV ClusterRole (cluster-scoped)
- Two simultaneous publisher runs both start (not serialized)
-
go buildandhelm templatepass
- Write agents get dedicated PV/PVC (
TASK-012: GitHub App auth for git push and PR creation
- Requirement: Agents push branches and create PRs autonomously
- Files:
infra/ai-agents/agent-controller/pkg/controller/controller.go,infra/ai-agents/agent-controller/main.go,infra/ai-agents/agent-controller/helm/templates/secret.yaml,infra/ai-agents/agent-controller/helm/templates/deployment.yaml,infra/ai-agents/agent-controller/helm/values.yaml,apps/blog/bin/run-publisher.sh - Dependencies: TASK-011, GitHub App setup (manual)
- Acceptance criteria:
- Controller generates JWT and exchanges for installation token
-
GITHUB_TOKENinjected into write agent pods - git-sync uses token-authenticated URL for clone
-
run-publisher.shpushes branch and creates PR - Branch protection prevents direct push to main
- Secrets template includes
GITHUB_APP_ID,GITHUB_APP_PRIVATE_KEY,GITHUB_INSTALL_ID
TASK-010: Update wiki documentation
- Requirement: PRD acceptance criterion "documented procedure to regenerate credentials"
- Files:
apps/blog/blog/markdown/wiki/devops/agent-controller.md,apps/blog/blog/markdown/wiki/custom-tools/docker-images/ai-agent-runtime.md - Dependencies: TASK-009 (document what actually works)
- Acceptance criteria:
- Agent controller wiki documents: publisher task, webhook trigger example, Max auth credential rotation procedure
- Runtime image wiki documents: new base image, Playwright inclusion, version 0.3
- Credential rotation procedure: step-by-step for regenerating
CLAUDE_CODE_OAUTH_TOKENviaclaude setup-tokenand patching the K8s secret
Implementation Additions
Changes made during TASK-009 deployment that weren't in the original design.
-
Dropped google-news MCP from controller. The journalist agent uses WebSearch/WebFetch instead of the custom google-news MCP server. Removes the
npm cisetup step from the agent command, simplifying startup and eliminating a permission issue (node_modules owned by wrong UID). -
UID unified to 1001 for all agents. The design specified UID 1000 for non-publisher agents and 1001 for publisher. Since all agents now use the Playwright-based runtime image (pwuser=1001), the controller chowns to 1001 for all agents.
-
Dropped git push, PR creation, and Discord webhook from run-publisher.sh. The agent writes to a local branch on the PVC. Kyle reviews from the filesystem. GitHub App integration for push/PR will come later.
-
Discord #log channel for controller observability. The controller posts to Discord #log on job start (with UUID, agent, prompt preview) and on job completion/failure. Uses the Discord bot API directly from Go, not MCP.
-
Per-branch PVCs for write agents. The shared PVC caused two problems: stale state leaking between runs (git worktree artifacts, node_modules from previous runs) and OOMKill when parallel resource contention occurred. Write agents now get isolated PV/PVC pairs (
agent-ws-<jobName>) with hostPath under/tmp/agent-workspace/branches/. Read-only agents keep the shared PVC. ThecanRunWrite()serialization was removed since per-branch PVCs eliminate the need for write serialization. -
GitHub App auth for git push and PR creation. The
PericakAIGitHub App (ID 3100834, installed on kylep/multi only) provides scoped auth. The controller generates a JWT, exchanges it for a short-lived installation token (1hr), and injects it asGITHUB_TOKENon write agent pods.run-publisher.shuses this token forgit pushandgh pr create. Branch protection on main prevents direct pushes — agents can only create branches and PRs. -
Switched
--output-format textto--output-format stream-json. Enables streaming structured output to pod logs for real-time monitoring viakubectl logs -f. -
--allowedToolsrequired for headless Claude Code. Discovered that Claude Code in headless mode (-pflag) blocks all tool use unless--allowedToolsis explicitly passed. The journalist CRD already had this; webhook-triggered tasks must include it too.
Portability
The agent controller now has a reproducible setup path:
secrets/export-agent-controller.sh.SAMPLElists every env var the controller needs, with comments on where to get each value.infra/ai-agents/agent-controller/bin/setup.shbootstraps from scratch: checks prereqs, sources secrets, runs helm install, patches the K8s secret. Optional--build-imagesflag builds and pushes both Docker images first.
What's automated: namespace creation, helm install/upgrade, secret patching (including base64-encoding the PEM key).
What's manual: populating the exports file, GitHub App setup (creating the App, noting the install ID), and branch protection rulesets on main.
Open Questions
-
claude setup-tokenscope regression (PRD OQ #1). Issue #23703 reported a scope regression. TASK-001 validates this before any implementation begins. Ifsetup-tokenis broken, the entire project is blocked until Anthropic fixes it or a workaround is found. -
OAuth refresh token race condition. Issue #24317 reports race conditions with concurrent sessions. Since we only run one publisher at a time (write serialization), this should not apply. Monitor during TASK-009.
-
Playwright image version pinning. The design specifies
v1.58.2-noblebut should track the latest stable Playwright release at implementation time. Pin to a specific version, don't uselatest. -
ghCLI auth in container. The script needsGITHUB_TOKENenv var forgh pr create. Verify thatghreads this env var without requiringgh auth loginfirst. -
Discord notification mechanism. The entrypoint script can either use a simple Discord webhook URL (curl POST) or invoke the Discord MCP server. The webhook is simpler and doesn't require MCP. Decision deferred to TASK-003 implementation.
Risks
-
Auth instability (PRD risk #1).
CLAUDE_CODE_OAUTH_TOKENis a community workaround. No fallback by design. Mitigation: TASK-001 validates before investing in other tasks. If auth breaks post-deploy, publisher runs stop until Anthropic provides a fix or alternative. -
Image size regression. The Playwright base image is ~1.5GB vs ~200MB for Alpine. This affects pull times and disk usage. Mitigation: image is cached on the node after first pull. Acceptable tradeoff for browser verification capability.
-
Playwright base image breaks existing agents. Switching from Alpine to Ubuntu could break Python/Node paths or MCP server deps. Mitigation: TASK-002 includes journalist regression test. TASK-009 includes full regression.
-
Token expiry mid-run (PRD risk #4). Publisher pipeline can take 30+ minutes. If the 1-year token expires during a run, it fails partway. Mitigation: token lasts 1 year; risk is negligible unless credentials are not rotated. Document rotation procedure in TASK-010.
-
Promotion deadline (PRD risk #2). March 28 deadline for doubled off-peak limits. Tasks are ordered for fastest path to a working prototype: TASK-001 (validation) gates everything, then TASK-002/003 (image + script) can run in parallel, enabling a test run by ~day 5.