Everything that runs inside Rancher on the Mac machines lives in
infra/ai-agents/. Two machines are in scope:
| Machine | Role | Rancher |
|---|---|---|
| pai-m1 (M1) | Always-on server, Pai's dedicated AI workstation | Always up |
| kyle-m2 | Gaming laptop for development and ad-hoc testing | Sometimes off |
ArgoCD (running on M1) watches the main branch and auto-syncs both
clusters on every push. When kyle-m2 is offline its apps show as Unknown —
no action needed.
```text
infra/ai-agents/
├── argocd/              ← ArgoCD ApplicationSet manifests
├── agent-controller/    ← Go controller (deprecated, not deployed)
├── ai-agent-runtime/    ← Runtime Docker image (Claude Code + Playwright)
├── cronjobs/helm/       ← Native K8s CronJobs (journalist, pai-morning)
├── pai-responder/       ← Discord polling bot Helm chart
├── vault/               ← HashiCorp Vault Helm values + network policy
├── environments/
│   ├── default.yaml     ← Fallback for manual helmfile runs
│   ├── pai-m1.yaml      ← pai-m1-specific values (all agents enabled)
│   └── kyle-m2.yaml     ← kyle-m2-specific values (all agents disabled)
├── bin/                 ← bootstrap.sh, configure-vault-auth.sh, store-secrets.sh
└── helmfile.yaml        ← Orchestration (used by bootstrap + ArgoCD fallback)
```
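For orientation, helmfile.yaml plausibly ties these pieces together along the lines below. This is a sketch only: the environment wiring, namespaces, and chart version are assumptions, not the real file.

```yaml
# Hypothetical helmfile.yaml shape; only the release names, chart
# sources, and values-file layout come from this document.
environments:
  default:
    values: [environments/default.yaml]
  pai-m1:
    values: [environments/pai-m1.yaml]
  kyle-m2:
    values: [environments/kyle-m2.yaml]

repositories:
  - name: hashicorp
    url: https://helm.releases.hashicorp.com

releases:
  - name: vault
    namespace: vault
    chart: hashicorp/vault
    values: [vault/values.yaml]
  - name: cronjobs
    namespace: ai-agents
    chart: ./cronjobs/helm
  - name: pai-responder
    namespace: ai-agents
    chart: ./pai-responder/helm
```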
Three Helm releases make up the active stack:
| Release | Chart | Purpose |
|---|---|---|
| vault | hashicorp/vault | Secret store with GCP KMS auto-unseal |
| cronjobs | ./cronjobs/helm | Native K8s CronJobs for scheduled agents |
| pai-responder | ./pai-responder/helm | Discord polling bot (M1 only) |
Vault is a hard dependency — CronJob pods and pai-responder both use
Vault Agent Injector to mount secrets at pod startup. The cronjobs
chart also manages the ai-agents namespace, ServiceAccount, NetworkPolicy,
and ResourceQuota (previously owned by the agent-controller).
Each scheduled agent runs as a native K8s CronJob in the ai-agents namespace.
The suspend field (true/false) controls whether the schedule fires:
| CronJob | Agent | Schedule (UTC) | M1 | kyle-m2 |
|---|---|---|---|---|
| journalist | journalist | 0 12 * * * (8am ET) | enabled | suspended |
| pai-morning | pai | 30 12 * * * (8:30am ET) | enabled | suspended |
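The enabled/suspended columns presumably map to the CronJob's spec.suspend field. A sketch of what the rendered journalist CronJob might look like follows; the image reference and pod template details are assumptions, not the chart's actual output.

```yaml
# Illustrative rendered manifest, not the real chart template.
apiVersion: batch/v1
kind: CronJob
metadata:
  name: journalist
  namespace: ai-agents
spec:
  schedule: "0 12 * * *"  # 8am ET
  suspend: false          # true on kyle-m2, so the schedule never fires
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cronjob-agent
          restartPolicy: Never
          containers:
            - name: agent
              image: ai-agent-runtime:latest  # image reference assumed
```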
CronJob pods follow this pattern:
1. Vault Agent Injector mounts secrets at /vault/secrets/
2. GitHub credentials are set up via config/shared/github-token.sh
3. The pod runs claude --agent <name> and pushes if needed

The journalist posts Discord start/done notifications to #log and
pushes its commit after the agent succeeds. pai-morning just sends
a greeting to #general, with no git writes.
environments/pai-m1.yaml and environments/kyle-m2.yaml control what runs:
```yaml
# environments/pai-m1.yaml: pai-m1 gets all scheduled agents + pai-responder
paiResponder:
  enabled: true
cronjobs:
  journalist:
    enabled: true
    schedule: "0 12 * * *"
  paiMorning:
    enabled: true
    schedule: "30 12 * * *"
```

```yaml
# environments/kyle-m2.yaml: kyle-m2 gets the stack but no active workloads
paiResponder:
  enabled: false
cronjobs:
  journalist:
    enabled: false
    schedule: "0 12 * * *"
  paiMorning:
    enabled: false
    schedule: "30 12 * * *"
```
To enable a CronJob on pai-m1, flip enabled: true in pai-m1.yaml and
merge to main — ArgoCD syncs within ~3 minutes.
ArgoCD is installed on M1 (Helm, argocd namespace). It manages all
releases across all registered clusters via ApplicationSets.
Each release has one ApplicationSet in infra/ai-agents/argocd/.
The cluster generator selects every ArgoCD-registered cluster
labeled cluster-role=ai-agents and generates one Application per cluster:
- argocd/vault.yaml → vault-pai-m1, vault-kyle-m2
- argocd/cronjobs.yaml → ai-agent-cronjobs-pai-m1, ai-agent-cronjobs-kyle-m2
- argocd/pai-responder.yaml → pai-responder-pai-m1, pai-responder-kyle-m2
The cluster name ({{name}}) selects the matching values file:
```yaml
helm:
  valueFiles:
    - ../../../infra/ai-agents/environments/{{name}}.yaml
```
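Putting the generator and template together, an ApplicationSet in argocd/ plausibly has this shape. Beyond the valueFiles path quoted above and the label selector, the field values here (repo URL, chart path, Application naming) are assumptions:

```yaml
# Sketch of argocd/cronjobs.yaml; repoURL and path are assumed.
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata:
  name: ai-agent-cronjobs
  namespace: argocd
spec:
  generators:
    - clusters:
        selector:
          matchLabels:
            cluster-role: ai-agents  # only labeled clusters get an Application
  template:
    metadata:
      name: "ai-agent-cronjobs-{{name}}"
    spec:
      source:
        repoURL: https://github.com/kylep/multi
        targetRevision: main
        path: infra/ai-agents/cronjobs/helm
        helm:
          valueFiles:
            - ../../../infra/ai-agents/environments/{{name}}.yaml
      destination:
        server: "{{server}}"
        namespace: ai-agents
```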
Adding a third machine takes three steps: register it in ArgoCD, label it,
and create an environments/<name>.yaml.
All ApplicationSets use:
```yaml
syncPolicy:
  automated:
    prune: true     # removes resources deleted from git
    selfHeal: true  # reverts manual kubectl changes
```
Vault's chart comes from the HashiCorp Helm repo, but its values file lives in this git repo. ArgoCD multi-source handles this:
```yaml
sources:
  - repoURL: https://helm.releases.hashicorp.com
    chart: vault
    targetRevision: "0.32.0"
    helm:
      valueFiles:
        - $values/infra/ai-agents/vault/values.yaml
  - repoURL: https://github.com/kylep/multi
    targetRevision: main
    ref: values
```
All secrets are injected by Vault Agent Injector. Vault paths:
| Path | Used by | Contents |
|---|---|---|
| secret/ai-agents/discord | journalist CronJob | discord_bot_token, discord_guild_id, discord_log_channel_id |
| secret/ai-agents/github | journalist CronJob | github_app_id, github_app_private_key (PEM), github_install_id |
| secret/ai-agents/anthropic | journalist CronJob | claude_oauth_token |
| secret/ai-agents/pai | pai-morning CronJob, pai-responder | discord_bot_token (Pai bot), claude_oauth_token, linear_api_key |
| secret/ai-agents/openrouter | (reserved) | openrouter_api_key |
| secret/ai-agents/webhook | (reserved) | webhook_token |
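Vault Agent Injector is driven by pod annotations. The journalist pod's annotations for the discord path might look roughly like this; the values assume a KV v2 secrets engine and a Vault role named cronjob-agent, neither of which is confirmed here:

```yaml
# Illustrative injector annotations, not the chart's actual template.
vault.hashicorp.com/agent-inject: "true"
vault.hashicorp.com/role: "cronjob-agent"
vault.hashicorp.com/agent-inject-secret-discord: "secret/data/ai-agents/discord"
# The rendered file lands at /vault/secrets/discord inside the pod.
```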
To update secrets:
```shell
bash infra/ai-agents/bin/store-secrets.sh
```
The Vault K8s auth role binds the cronjob-agent ServiceAccount (used
by all CronJob pods and pai-responder) to the ai-agents-read policy.
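That binding, which configure-vault-auth.sh presumably creates, would look something like this on the Vault CLI; the role name and TTL are assumptions:

```shell
# Hypothetical role definition; only the ServiceAccount, namespace,
# and policy names are taken from this document.
vault write auth/kubernetes/role/cronjob-agent \
  bound_service_account_names=cronjob-agent \
  bound_service_account_namespaces=ai-agents \
  policies=ai-agents-read \
  ttl=1h
```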
```shell
bash infra/ai-agents/bin/bootstrap.sh
```
The script is idempotent. On a fresh cluster it:
- installs ArgoCD (argocd namespace, argocd CLI)
- labels the cluster secrets cluster-role=ai-agents

After bootstrap, Vault must be initialized and secrets stored (one-time):
```shell
kubectl exec -n vault vault-0 -- vault operator init -format=json \
  > ~/.vault-init && chmod 600 ~/.vault-init
bash infra/ai-agents/bin/configure-vault-auth.sh
bash infra/ai-agents/bin/store-secrets.sh
```
See Bootstrap & Recovery for full Vault walkthrough and secret paths.
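The file saved by operator init is JSON; with GCP KMS auto-unseal it carries recovery keys plus the root token. A quick way to pull the root token back out, assuming jq is installed and the field names follow Vault's -format=json output:

```shell
# Print the root token from the saved init output. With auto-unseal,
# the JSON contains recovery_keys_b64 rather than unseal_keys_b64.
jq -r '.root_token' ~/.vault-init
```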
Run once from any machine that has kubeconfig for both clusters:
```shell
argocd cluster add <kyle-m2-context-name> --name kyle-m2
kubectl label secret -n argocd \
  -l argocd.argoproj.io/secret-type=cluster \
  cluster-role=ai-agents --overwrite
```
ArgoCD will immediately begin syncing kyle-m2. Since environments/kyle-m2.yaml
suspends all CronJobs and disables pai-responder, kyle-m2 gets Vault and the
CronJob infrastructure (namespace, ServiceAccount, NetworkPolicy) but
runs no scheduled workloads.
Watch ArgoCD sync status:
```shell
kubectl port-forward -n argocd \
  pod/$(kubectl get pod -n argocd -l app.kubernetes.io/name=argocd-server \
    -o jsonpath='{.items[0].metadata.name}') 8080:8080
# open http://localhost:8080
# Password: kubectl -n argocd get secret argocd-initial-admin-secret -o jsonpath='{.data.password}' | base64 -d
# Note: use pod port-forward (not svc); svc has a socat connection reset issue
```
Enable a CronJob on M1: edit environments/pai-m1.yaml, set enabled: true, merge to main.

Force an immediate sync:

```shell
argocd app sync ai-agent-cronjobs-pai-m1
```

See what's running:

```shell
kubectl get cronjobs -n ai-agents
kubectl get jobs -n ai-agents
kubectl get pods -n ai-agents
```

Watch a CronJob run in real time:

```shell
kubectl -n ai-agents logs -f <pod-name> -c agent
```

Manually trigger a CronJob:

```shell
kubectl create job --from=cronjob/journalist journalist-manual -n ai-agents
```

Kill a stuck job:

```shell
kubectl delete job <job-name> -n ai-agents
```