BlogWikiAbout

Kyle Pericak

"It works in my environment"

Created: 2026-03-15Updated: 2026-03-15

Applying Research to Claude Agents

Category:devTags:claude-codeai-agentsmcp
Research-backed guidance on getting coding agents to do useful work, applied to rebuild a 7-agent roster from Anthropic's published advice and three deep research reports.

I wanted research-backed guidance for my Claude Code agents. I had Claude read everything Anthropic has published on agent design, then ran deep research reports from ChatGPT, Gemini, and Claude on the same topic. Notes are in the wiki (Anthropic guidance, deep research synthesis). Here's what I applied.

Some key findings:

  • Most multi-agent failures are specification problems, not infrastructure. Bad task descriptions, ambiguous instructions, missing termination conditions. Berkeley's MAST taxonomy analyzed 150+ tasks across 5 frameworks and found specification and system design issues were the dominant failure category.
  • Teach delegation. Give subagents an objective, output format, tool/source guidance, and task boundaries. Without all four, agents duplicate work or leave gaps. (Multi-Agent Research System)
  • Evaluators need isolated context. If the evaluator shares context with the producer, it becomes "another participant in collective delusion." (Building Effective Agents)
  • Start wide, then narrow. Agents default to overly specific queries in unfamiliar territory. Begin with broad queries to map what exists before drilling in. (Gemini deep research, wiki)
  • Action-oriented descriptions improve auto-delegation. Trigger phrases like "MUST BE USED when..." in agent descriptions help Claude decide when to invoke subagents. (Claude deep research, wiki)
  • Agents without browser tools ship half-working features. Unit tests pass but the rendered page is broken. Playwright catches what tests miss. (Effective Harnesses)
  • Route models by risk, not task complexity. Opus for judgment calls that are hard to verify. Sonnet for mechanical work that's easy to check. A complex summarization is low-risk because you can read and discard it. A one-line security review is high-risk because getting it wrong has consequences. (Building Effective Agents, all three deep research reports)
  • Deny-by-default permissions. Start with no tool access and progressively allowlist. The opposite of what most people do (give the agent everything, restrict when something goes wrong). (Context Engineering, Claude deep research, wiki)
  • The harness enforces termination, not the agent. Don't rely on the agent to know when to stop. Set max iterations, token budgets, and repetition detection in the system that runs the agent. (Effective Harnesses, Claude deep research, wiki)
  • Write results to files, pass back references. Subagents writing full output through the orchestrator bloats its context window. Subagents should write to files and return a summary and a path. (Multi-Agent Research System)
  • Research, plan, then implement. Without explicit instructions, Claude jumps straight to coding. Say "read X, then make a plan, then implement." Three phases, stated explicitly. Claude won't reliably do them on its own. (Claude Code Best Practices)
  • One feature per prompt, commit after each piece. Agents try to do everything and exhaust context mid-task. Commits are rollback points. If the agent breaks something on step 3, you revert to step 2 instead of starting over. (Effective Harnesses)
  • Agents use ~4x more tokens than chat. Multi-agent uses ~15x. Don't pay that unless the task needs it. Multi-agent is an economic decision, not an architectural one. (Multi-Agent Research System)
  • Start simple, add layers only when measured outcomes improve. The canonical agent loop is ~50 lines. Don't build the full stack on day one. (Building Effective Agents)

Security baseline: the OWASP Top 10 for LLM Applications. All three deep research reports recommended it. I use it as a checklist for the Security Auditor agent.

2026-03-15's Agent Roster

The agents from the org-chart post were made off-the-cuff. 16 agents, too ambitious, didn't have the infrastructure to support them. I deleted most and got down to 7. This time: I had Claude load the deep research synthesis, I extracted the best practices that applied, and I updated the agent definitions with specific, traceable changes.

Agent Model Role & Applied Research
Publisher Opus Content pipeline orchestration, writing. Delegation specs, trigger phrase.
Analyst Opus Research ingestion, system improvements. Trigger phrase.
Synthesizer Opus Compare/contrast Deep Research reports. No changes needed.
Researcher Sonnet Sourced facts, research briefs. Start wide, then narrow search strategy, trigger phrase.
Reviewer Opus Style, substance, sourcing. Context isolation from producer.
QA Sonnet Build, render, links via Playwright. Trigger phrase.
Security Auditor Opus OWASP LLM Top 10, confidential data. Added LLM05 check.
graph TD
    Claude["Vanilla Claude"]
    Publisher["Publisher (Opus)"]
    Analyst["Analyst (Opus)"]
    Synthesizer["Synthesizer (Opus)"]
    Researcher["Researcher (Sonnet)"]
    Reviewer["Reviewer (Opus)"]
    QA["QA (Sonnet)"]
    Security["Security Auditor (Opus)"]

    Claude --> Publisher
    Claude --> Analyst
    Claude --> Synthesizer
    Claude --> Researcher
    Claude --> Reviewer
    Claude --> QA
    Claude --> Security
    Publisher --> Researcher
    Publisher --> Reviewer
    Publisher --> QA
    Publisher --> Security

Three report directly to me (Publisher, Analyst, Synthesizer). Four are subagents under Publisher. Model routing follows the research: Opus for judgment calls (review, security, editorial, synthesis), Sonnet for mechanical work (research, QA).

Calling the agents

Three ways to use them:

Vanilla Claude auto-delegates. Each agent definition has a description field that Claude reads when deciding whether to spawn a subagent. The trigger phrases ("MUST BE USED when writing blog posts", "MUST BE USED when gathering facts") tell Claude when to invoke each one. If I ask vanilla Claude to write a blog post, it sees the Publisher's description and delegates. It can also run multiple subagents in parallel when the tasks are independent.

CLAUDE.md tells Claude what's available. The project's CLAUDE.md includes a section listing all 7 agents with their roles and invocation commands. This gives vanilla Claude the full picture of what agents exist without having to discover them from the file system.

claude --agent publisher for the full pipeline. This starts Claude with the Publisher's system prompt as the main thread. Publisher then orchestrates the entire content pipeline: research, substance gate, write, review, QA, security audit. It spawns each subagent natively using Claude Code's Agent tool, not CLI calls. Independent stages (QA and security audit) run in parallel. This is the heavy path for writing a full post from scratch.

Claude Code enforces a single-level subagent hierarchy. Publisher can spawn subagents, but those subagents can't spawn their own. If Reviewer needs research, it doesn't call Researcher. It reports back to Publisher, which decides what to do next.

Blog code last updated on 2026-03-15: c04b780f9a9b20e56525019354100252a1c20141