For fresh installs, use infra/agent-controller/bin/setup.sh
instead of following this runbook manually. See the
design doc portability section.
Status: In progress — publisher running with full observability
| Component | Version | Status |
|---|---|---|
| ai-agent-runtime | 0.4 | Pushed, includes hooks + Playwright + jq |
| agent-controller | 0.6 | Pushed, Discord #log + RUN_ID + stream-json |
| Helm revision | 33 | Deployed |
Problem: The previous runtime (0.2) ran as UID 1000. The new Playwright
base image runs as UID 1001 (pwuser). npm ci failed with EACCES on
node_modules owned by the old UID.
Fix: The controller now chowns to UID 1001 for all agents, not just the publisher.
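The ownership mismatch described above can be detected before npm ci runs. The following is a minimal sketch, assuming GNU stat (as found in Linux containers); the function names and the example path are hypothetical and are not taken from the controller code.

```shell
# Print the numeric owner UID of a path (GNU coreutils stat).
owner_uid() {
  stat -c '%u' "$1"
}

# Succeed (exit 0) when the path is NOT owned by the expected UID,
# i.e. when a chown is needed before npm ci can write to it.
needs_chown() {
  [ "$(owner_uid "$1")" != "$2" ]
}

# Example (hypothetical path; chown to another UID requires root):
#   if needs_chown /workspace/node_modules 1001; then
#     chown -R 1001 /workspace/node_modules
#   fi
```

Checking first keeps the recursive chown off the hot path when ownership is already correct.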
Problem: kubectl patch and Helm both claim ownership of secret
fields. helm upgrade fails with "conflicts with kubectl-patch".
Fix: Delete secret, let helm recreate it, then re-patch values. Documented in agent-controller wiki.
Problem: helm upgrade -f values.yaml resets secrets to empty
defaults. The lookup pattern only preserves existing values when the
secret already exists — but after a delete-and-recreate, everything
starts empty.
Fix: Always re-patch secrets after helm operations. OAuth token
lives in apps/blog/exports.sh.
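Re-patching after a helm operation amounts to sending a JSON merge patch whose values are base64-encoded. The sketch below only builds that payload; the function name and EXAMPLE_KEY are hypothetical, while the namespace and secret name are the ones used elsewhere in this runbook.

```shell
# Build a JSON merge patch for a single secret field.
# Kubernetes Secret .data values must be base64-encoded.
secret_patch_json() (
  key=$1
  value=$2
  # tr strips the line wrapping that some base64 implementations add.
  encoded=$(printf '%s' "$value" | base64 | tr -d '\n')
  printf '{"data":{"%s":"%s"}}\n' "$key" "$encoded"
)

# Example (EXAMPLE_KEY and EXAMPLE_VALUE are placeholders):
#   kubectl -n ai-agents patch secret agent-secrets --type merge \
#     -p "$(secret_patch_json EXAMPLE_KEY "$EXAMPLE_VALUE")"
```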
--allowedTools required for headless mode

Problem: Claude Code in headless mode (-p flag) blocks all tool
use unless --allowedTools is explicitly passed or the agent
definition has tools: in frontmatter. The journalist agent has both
but was missing WebSearch/WebFetch in the CRD. The publisher agent's
frontmatter grants tools directly.
Fix: Journalist CRD updated with WebSearch/WebFetch. Runtime image
now includes settings.json with wide permissions to prevent prompts.
Problem: --output-format stream-json does not include subagent
events by default. Publisher subagent calls (researcher, reviewer, QA)
produced 5-10 minutes of complete silence.
Fix: Added --include-partial-messages flag. Subagent events now
appear in the parent stream with parent_tool_use_id populated. Full
real-time visibility via kubectl logs -f.
stream-json requires --verbose with -p

Problem: --output-format stream-json exits with error
"stream-json requires --verbose" when used with -p (print mode).
Fix: Added --verbose flag to all claude invocations.
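Taken together, the flags from the fixes above can be combined into one invocation. This is a sketch only: the wrapper name, the prompt, and the tool list are hypothetical, and the actual command built by the controller may differ.

```shell
# Run Claude Code headless with the flags named in this runbook.
# $1 = prompt, $2 = comma-separated tool list for --allowedTools.
run_claude_headless() {
  claude -p "$1" \
    --output-format stream-json \
    --verbose \
    --include-partial-messages \
    --allowedTools "$2"
}

# Example (hypothetical prompt and tool list):
#   run_claude_headless "Summarize today's headlines" "WebSearch,WebFetch"
```

Keeping the flags in one wrapper means a newly required flag only has to be added in one place.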
Problem: npm ci for google-news MCP server failed due to
UID mismatch on cached node_modules. The journalist agent doesn't
need it — WebSearch works.
Fix: Removed google-news MCP from controller command building. Journalist uses WebSearch/WebFetch.
Problem: Design doc specified Discord webhook URL for publisher notifications.
Fix: Publisher notifications handled by Discord MCP bot (same as journalist). Removed DISCORD_WEBHOOK_URL, GITHUB_TOKEN, and git push/PR from run-publisher.sh. Agent writes to local PVC branch.
Problem: The QA subagent starts next dev to verify blog post
rendering. In a K8s pod (4GB node, ~1.5GB available), Next.js page
compilation on first request either hangs indefinitely or triggers an
OOMKill. The first publisher run ended in OOMKilled status after 37
minutes, during the QA phase.
Diagnosis: Debug pod tests confirmed:
- next dev: TCP connects but HTTP response never arrives (compilation hangs)
- python3 -m http.server on static files: HTTP 200 in 0.005s

Fix: Added apps/blog/bin/start-static-server.sh, which builds static
files with bin/build-blog-files.sh, then serves out/ with
python3 -m http.server 3000. Updated qa.md to use this in container
environments.
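A QA step that depends on the static server should wait until it actually answers rather than assume it is up. The helper below is a hypothetical sketch, not the contents of start-static-server.sh; it assumes curl is available in the container.

```shell
# Poll a URL until it returns HTTP 200 or the attempts run out.
# $1 = URL, $2 = maximum number of attempts (default 30, one per second).
wait_for_http_200() (
  url=$1
  attempts=${2:-30}
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # --max-time bounds each probe so a hanging server cannot stall the loop.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 2 "$url" || true)
    [ "$code" = "200" ] && exit 0
    i=$((i + 1))
    sleep 1
  done
  exit 1
)

# Example, assuming out/ was already built by bin/build-blog-files.sh:
#   (cd out && python3 -m http.server 3000) &
#   wait_for_http_200 http://localhost:3000/ 30
```

The per-request timeout matters here because the failure mode seen with next dev was a connection that opened but never responded.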
Problem: MarkdownService.js imports ./RemarkMermaid.js but the
file on disk is remarkMermaid.js (lowercase r). The default macOS
filesystem is case-insensitive, so this works locally, but Linux
container filesystems are case-sensitive and the build fails with "Module not found."
Fix: Changed import to ./remarkMermaid.js.
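This class of bug can be caught locally, even on a case-insensitive filesystem, by comparing the imported name against the directory listing instead of testing for existence. A minimal sketch follows; the function name and example path are hypothetical.

```shell
# Succeed only if the file exists in its directory with exactly this
# spelling, including case. Unlike [ -e ], this compares against the
# directory listing, so it also fails on case-insensitive filesystems
# when only the case differs.
exists_exact_case() (
  dir=$(dirname "$1")
  name=$(basename "$1")
  ls -1A "$dir" 2>/dev/null | grep -Fxq -- "$name"
)

# Example (hypothetical path):
#   exists_exact_case src/RemarkMermaid.js || echo "case mismatch or missing file"
```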
Write agents (publisher, qa, journalist) now get dedicated PV/PVC pairs
instead of the shared agent-workspace PVC. This eliminates stale state
leakage between runs and removes the canRunWrite() serialization that
prevented parallel runs.
- agent-ws-<jobName> (jobName is <taskName>-<timestamp>)
- /tmp/agent-workspace/branches/<jobName>/
- agent-workspace PVC

RBAC changes: PVC verbs added to namespaced Role, new ClusterRole + ClusterRoleBinding for cluster-scoped PV management.
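The naming scheme above can be written out as three small helpers. This is an illustrative sketch: the function names are hypothetical, and the timestamp format (epoch seconds in the example) is an assumption, since the runbook does not specify it.

```shell
# jobName is <taskName>-<timestamp>. The timestamp format is an assumption.
job_name() {
  printf '%s-%s\n' "$1" "$2"
}

# Dedicated PV/PVC name for a job: agent-ws-<jobName>.
pvc_name() {
  printf 'agent-ws-%s\n' "$1"
}

# Per-job branch path: /tmp/agent-workspace/branches/<jobName>/.
branch_path() {
  printf '/tmp/agent-workspace/branches/%s/\n' "$1"
}

# Example:
#   job=$(job_name publisher "$(date +%s)")
#   pvc_name "$job"
#   branch_path "$job"
```

Because each name is derived from the job name, two runs of the same task get distinct volumes and paths as long as their timestamps differ.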
The PericakAI GitHub App provides scoped git auth for write agents.
The controller generates a JWT signed with the App's private key,
exchanges it for a short-lived installation token (1hr), and injects
it as GITHUB_TOKEN on write agent pods.
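The JWT step can be reproduced from a shell for debugging. The sketch below is not the controller's implementation; it assumes openssl is installed, and the function names and the 9-minute expiry are choices made for this example (GitHub rejects App JWTs that expire more than 10 minutes ahead).

```shell
# Base64url-encode stdin without padding, as JWTs require.
b64url() {
  openssl base64 -A | tr '+/' '-_' | tr -d '='
}

# Print an RS256-signed GitHub App JWT.
# $1 = App ID, $2 = path to the App's private key (PEM).
make_app_jwt() (
  app_id=$1
  key_file=$2
  now=$(date +%s)
  header=$(printf '{"alg":"RS256","typ":"JWT"}' | b64url)
  # iat is backdated 60s to tolerate clock drift; exp is 9 minutes ahead.
  payload=$(printf '{"iat":%d,"exp":%d,"iss":"%s"}' \
    "$((now - 60))" "$((now + 540))" "$app_id" | b64url)
  sig=$(printf '%s.%s' "$header" "$payload" \
    | openssl dgst -sha256 -sign "$key_file" -binary | b64url)
  printf '%s.%s.%s\n' "$header" "$payload" "$sig"
)

# Example: exchange the JWT for an installation token (placeholders as in this runbook):
#   jwt=$(make_app_jwt "$GITHUB_APP_ID" secrets/pericakai.private-key.pem)
#   curl -s -X POST \
#     -H "Authorization: Bearer $jwt" \
#     -H "Accept: application/vnd.github+json" \
#     "https://api.github.com/app/installations/$GITHUB_INSTALL_ID/access_tokens"
```

The response to that POST contains the short-lived token that would be injected as GITHUB_TOKEN.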
Setup steps:
kubectl -n ai-agents patch secret agent-secrets --type merge -p \
"{\"data\":{\"GITHUB_APP_PRIVATE_KEY\":\"$(base64 -w 0 < secrets/pericakai.private-key.pem)\",\"GITHUB_APP_ID\":\"$(echo -n 3100834 | base64 -w 0)\",\"GITHUB_INSTALL_ID\":\"$(echo -n <INSTALL_ID> | base64 -w 0)\"}}"