Kyle Pericak

"It works in my environment"

TASK-009: Build, Deploy, and End-to-End Test

For fresh installs, use infra/ai-agents/agent-controller/bin/setup.sh instead of following this runbook manually. See the design doc portability section.

Status: In progress — publisher running with full observability

What was deployed

| Component | Version | Status |
| --- | --- | --- |
| ai-agent-runtime | 0.4 | Pushed, includes hooks + Playwright + jq |
| agent-controller | 0.6 | Pushed, Discord #log + RUN_ID + stream-json |
| Helm revision | 33 | Deployed |

Issues found and fixed during deployment

1. Workspace file ownership (UID mismatch)

Problem: The previous runtime image (0.2) ran as UID 1000, while the new Playwright base image runs as UID 1001 (pwuser). npm ci failed with EACCES because node_modules was still owned by the old UID.

Fix: The controller now chowns the workspace to UID 1001 for all agents, not just the publisher.
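
To spot this kind of mismatch before npm ci trips over it, a small ownership audit can help. The following is a minimal sketch, not the controller's actual code: the function name `find_foreign_owned` is hypothetical, and it only reports paths rather than changing them.

```python
import os


def find_foreign_owned(root: str, expected_uid: int) -> list[str]:
    """Return paths under root (root included) not owned by expected_uid."""
    mismatched = []
    if os.lstat(root).st_uid != expected_uid:
        mismatched.append(root)
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            path = os.path.join(dirpath, name)
            # lstat inspects a symlink itself instead of following it
            if os.lstat(path).st_uid != expected_uid:
                mismatched.append(path)
    return sorted(mismatched)
```

Calling `find_foreign_owned(workspace, 1001)` against the workspace mount would list anything still owned by the old UID 1000; an empty list suggests the chown covered everything.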

2. Helm field manager conflicts

Problem: kubectl patch and Helm both claim ownership of secret fields. helm upgrade fails with "conflicts with kubectl-patch".

Fix: Delete secret, let helm recreate it, then re-patch values. Documented in agent-controller wiki.

3. Secrets wiped by helm upgrade

Problem: helm upgrade -f values.yaml resets secrets to empty defaults. The lookup pattern only preserves existing values when the secret already exists — but after a delete-and-recreate, everything starts empty.

Fix: Always re-patch secrets after helm operations. OAuth token lives in apps/blog/exports.sh.
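
As an illustration of the re-patch step, the sketch below builds a merge-patch body for a Kubernetes Secret. Values under a Secret's `data` field must be base64-encoded, which is easy to forget when patching by hand. The function name and the `EXAMPLE_TOKEN` key are hypothetical; the real key names live in the chart.

```python
import base64
import json


def build_secret_patch(values: dict[str, str]) -> str:
    """Return a JSON merge-patch body with base64-encoded Secret data."""
    data = {
        key: base64.b64encode(value.encode("utf-8")).decode("ascii")
        for key, value in values.items()
    }
    return json.dumps({"data": data}, sort_keys=True)


if __name__ == "__main__":
    # EXAMPLE_TOKEN is a placeholder key, not one taken from the chart.
    print(build_secret_patch({"EXAMPLE_TOKEN": "replace-me"}))
```

The resulting string could then be passed to something like `kubectl patch secret <name> --type merge -p '<json>'` after each helm operation.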

4. `--allowedTools` required for headless mode

Problem: Claude Code in headless mode (-p flag) blocks all tool use unless --allowedTools is explicitly passed or the agent definition has tools: in frontmatter. The journalist agent has both but was missing WebSearch/WebFetch in the CRD. The publisher agent's frontmatter grants tools directly.

Fix: Updated the journalist CRD to include WebSearch/WebFetch. The runtime image now also ships a settings.json with broad permissions so that permission prompts cannot block a headless run.

5. Zero output during subagent calls

Problem: --output-format stream-json does not include subagent events by default. Publisher subagent calls (researcher, reviewer, QA) produced 5-10 minutes of complete silence.

Fix: Added --include-partial-messages flag. Subagent events now appear in the parent stream with parent_tool_use_id populated. Full real-time visibility via kubectl logs -f.
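
When reading the pod logs, it can be useful to separate subagent events from the parent's. The sketch below relies on the one field named above, parent_tool_use_id; everything else about the event shape (including the sample `type` values in the test data) is an assumption, and the function name is hypothetical.

```python
import json
from typing import Iterable


def split_stream_events(lines: Iterable[str]) -> tuple[list[dict], list[dict]]:
    """Split stream-json lines into (parent_events, subagent_events)."""
    parent_events, subagent_events = [], []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip non-JSON log noise interleaved in the stream
        if not isinstance(event, dict):
            continue
        # A populated parent_tool_use_id marks an event emitted by a subagent.
        if event.get("parent_tool_use_id"):
            subagent_events.append(event)
        else:
            parent_events.append(event)
    return parent_events, subagent_events
```

Feeding it the output of `kubectl logs` for a publisher pod would give a quick count of how much of a run was spent inside researcher, reviewer, or QA calls.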

6. `stream-json` requires `--verbose` with `-p`

Problem: --output-format stream-json exits with error "stream-json requires --verbose" when used with -p (print mode).

Fix: Added --verbose flag to all claude invocations.
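
Issues 4 through 6 all come down to which flags the headless invocation carries. A minimal sketch of assembling that command follows, assuming the binary is on PATH as `claude` and that --allowedTools accepts a comma-separated list; the function name is hypothetical and the controller's real command building may differ.

```python
def build_claude_command(prompt: str, allowed_tools: list[str]) -> list[str]:
    """Assemble a headless claude invocation with the flags from issues 4-6."""
    cmd = [
        "claude",
        "-p", prompt,                       # print (headless) mode
        "--output-format", "stream-json",   # one JSON event per line
        "--verbose",                        # required by stream-json with -p
        "--include-partial-messages",       # added for subagent visibility
    ]
    if allowed_tools:
        # Assumption: tools are passed as a single comma-separated value.
        cmd += ["--allowedTools", ",".join(allowed_tools)]
    return cmd
```

Returning an argument list rather than a shell string avoids quoting problems when the prompt contains spaces or quotes.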

7. google-news MCP removed

Problem: npm ci for google-news MCP server failed due to UID mismatch on cached node_modules. The journalist agent doesn't need it — WebSearch works.

Fix: Removed google-news MCP from controller command building. Journalist uses WebSearch/WebFetch.

8. Discord webhook URL not needed

Problem: Design doc specified Discord webhook URL for publisher notifications.

Fix: Publisher notifications handled by Discord MCP bot (same as journalist). Removed DISCORD_WEBHOOK_URL, GITHUB_TOKEN, and git push/PR from run-publisher.sh. Agent writes to local PVC branch.

9. QA subagent OOMKill — next dev too heavy

Problem: The QA subagent starts next dev to verify blog post rendering. In a K8s pod (4GB node, ~1.5GB available), Next.js page compilation on first request either hangs indefinitely or triggers an OOMKill. The first publisher run ended in OOMKilled status after 37 minutes, during the QA phase.

Diagnosis: Debug pod tests confirmed:

- next dev: TCP connects but the HTTP response never arrives (compilation hangs)
- python3 -m http.server on static files: HTTP 200 in 0.005s
- Networking is fine; the issue is purely Next.js resource consumption

Fix: Added apps/blog/bin/start-static-server.sh — builds static files with bin/build-blog-files.sh, then serves out/ with python3 -m http.server 3000. Updated qa.md to use this in container environments.
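
The same idea can be sketched in Python: serve a pre-built directory with the standard library and probe it for an HTTP 200, as the debug pod tests did. This is an illustration only, not the contents of start-static-server.sh; the build step is omitted, both function names are hypothetical, and binding to 127.0.0.1 assumes the QA check runs in the same pod.

```python
import functools
import http.client
import http.server


def make_static_server(directory: str, port: int = 3000) -> http.server.ThreadingHTTPServer:
    """Create (but do not start) a static file server rooted at directory."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    return http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)


def probe(host: str, port: int, path: str = "/", timeout: float = 5.0) -> int:
    """Return the HTTP status code for GET path, raising on connection errors."""
    conn = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        conn.request("GET", path)
        return conn.getresponse().status
    finally:
        conn.close()
```

A caller would run `make_static_server("out").serve_forever()` in a background thread or process and then call `probe("127.0.0.1", 3000)` before handing the URL to the QA subagent.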

10. Case-sensitive import breaks Linux build

Problem: MarkdownService.js imports ./RemarkMermaid.js but the file on disk is remarkMermaid.js (lowercase r). macOS is case-insensitive so this works locally, but Linux containers are case-sensitive and the build fails with "Module not found."

Fix: Changed import to ./remarkMermaid.js.
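
This class of bug can be caught on macOS before it reaches a Linux build by comparing each path component against the exact names the directory listing returns. The helper below is a hypothetical sketch of that check, not part of the blog's tooling.

```python
import os


def exists_with_exact_case(base_dir: str, relative_path: str) -> bool:
    """True only if every component of relative_path matches on-disk casing."""
    current = base_dir
    for part in os.path.normpath(relative_path).split(os.sep):
        if part in ("", "."):
            continue
        if part == os.pardir:
            current = os.path.join(current, os.pardir)
            continue
        try:
            entries = os.listdir(current)
        except (FileNotFoundError, NotADirectoryError):
            return False
        # listdir returns names as stored, so this comparison is case-exact
        # even on a case-insensitive filesystem.
        if part not in entries:
            return False
        current = os.path.join(current, part)
    return True
```

Run against the directory containing MarkdownService.js, the original import path ./RemarkMermaid.js would be reported as missing on either platform, while ./remarkMermaid.js would pass.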

Observability stack (added during deployment)

- Controller → Discord #log: Posts job start (UUID, agent, prompt) and completion/failure
- stream-json + --include-partial-messages: Streams all events (parent + subagent) to pod logs
- PostToolUse hook: Baked into runtime image, posts tool calls to Discord #log (Write, Edit, Bash, Agent, MCP)
- 30-min activeDeadlineSeconds: Hard ceiling on all jobs

Per-branch PVCs (TASK-011)

Write agents (publisher, qa, journalist) now get dedicated PV/PVC pairs instead of the shared agent-workspace PVC. This eliminates stale state leakage between runs and removes the canRunWrite() serialization that prevented parallel runs.

- PV/PVC name: agent-ws-<jobName> (jobName is <taskName>-<timestamp>)
- hostPath: /tmp/agent-workspace/branches/<jobName>/
- Cleanup: controller deletes PV+PVC after job completes or fails
- Read-only agents still use the shared agent-workspace PVC
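
The naming scheme above can be sketched as follows. The name and path patterns come from this page; the timestamp format, the storage size, and the access mode are assumptions for illustration, and both function names are hypothetical.

```python
def workspace_names(task_name: str, timestamp: str) -> dict[str, str]:
    """Derive the per-job PV/PVC name and hostPath from the task name."""
    job_name = f"{task_name}-{timestamp}"
    return {
        "job_name": job_name,
        "pv_name": f"agent-ws-{job_name}",
        "pvc_name": f"agent-ws-{job_name}",
        "host_path": f"/tmp/agent-workspace/branches/{job_name}/",
    }


def persistent_volume_manifest(names: dict[str, str], storage: str = "1Gi") -> dict:
    """Build a hostPath PersistentVolume manifest for one job's workspace."""
    return {
        "apiVersion": "v1",
        "kind": "PersistentVolume",
        "metadata": {"name": names["pv_name"]},
        "spec": {
            "capacity": {"storage": storage},   # assumed size
            "accessModes": ["ReadWriteOnce"],   # assumed access mode
            "hostPath": {"path": names["host_path"]},
        },
    }
```

Because PersistentVolumes are cluster-scoped, creating and deleting a manifest like this is what requires the ClusterRole mentioned below, while the matching PVC stays within the namespaced Role.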

RBAC changes: PVC verbs added to namespaced Role, new ClusterRole + ClusterRoleBinding for cluster-scoped PV management.

GitHub App auth (TASK-012)

The PericakAI GitHub App provides scoped git auth for write agents. The controller generates a JWT signed with the App's private key, exchanges it for a short-lived installation token (1hr), and injects it as GITHUB_TOKEN on write agent pods.
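
For orientation, the first half of that flow (the JWT) can be sketched with the standard library alone. GitHub expects an RS256-signed token whose iss claim is the App ID and whose exp is at most 10 minutes out. The sketch below stops at the unsigned signing input, since RS256 signing needs a crypto library; the function names are hypothetical and this is not the controller's implementation.

```python
import base64
import json
import time
from typing import Optional


def app_jwt_claims(app_id: str, now: Optional[int] = None) -> dict:
    """Claims for a GitHub App JWT: backdated iat, exp within 10 minutes."""
    issued = int(time.time()) if now is None else now
    return {
        "iat": issued - 60,       # backdate to tolerate clock drift
        "exp": issued + 9 * 60,   # stays under GitHub's 10-minute ceiling
        "iss": str(app_id),
    }


def b64url(raw: bytes) -> str:
    """Base64url-encode without padding, as JWT segments require."""
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def signing_input(claims: dict) -> str:
    """Return '<header>.<payload>', the string to be signed with RS256."""
    header = {"alg": "RS256", "typ": "JWT"}
    segments = [
        b64url(json.dumps(part, separators=(",", ":")).encode("utf-8"))
        for part in (header, claims)
    ]
    return ".".join(segments)
```

Once signed with the App's private key, the JWT is sent as a Bearer token to POST /app/installations/{installation_id}/access_tokens, and the token in the response is what would be injected as GITHUB_TOKEN.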

Note (TASK-008): The earlier kubectl patch secret agent-secrets approach is superseded. Credentials are now stored in Vault and delivered via the Vault Agent Injector. Use bash infra/ai-agents/bin/store-secrets.sh to populate secret/ai-agents/github with the App ID, install ID, and private key PEM file path.

Setup steps (Vault-based, current):

1. Run bash infra/ai-agents/bin/store-secrets.sh and provide:
   - GitHub App ID (plain integer)
   - GitHub App private key file path (PEM file; the script reads it with cat)
   - GitHub App install ID
2. Get installation ID: GitHub → Settings → Installations → note ID from URL
3. Verify App permissions: Contents (R/W) + Pull requests (R/W)
4. Create branch protection ruleset on main (prevent App from pushing directly)

Verification checklist

Blog code last updated on 2026-05-23: 69fb0a25ee445afedbd0b2098cfb9334ed7b38fb
