Agent Skill in Multi-Agent Systems
People building agents today are mostly doing it one-shot, meaning they write an agent once and that's it. Yesterday, I was watching the YC Lightcone podcast "Inside Claude Code With Its Creator Boris Cherny," and one of the things Boris, creator and head of Claude Code at Anthropic, said is that they delete CLAUDE.md files a lot because they want the new models to take over. That insight tells us we cannot just settle for whatever prompts we have. Besides that, depending on how we write the prompt, we might use more or fewer tokens; there are ways to better structure agents, workflows, and skills.

For this blog post, I will cover some lessons learned while building and improving agents, workflows, and skills. I ran a bunch of experiments; in fact, I wrote 7 incarnations of my agent skill. To test the skill, I asked the agent to build a Twitter-like application so I could evaluate the quality of the code and the solution as a proxy for the skill's quality.

One important callout I want to make is that LLMs still hallucinate, but people don't realize it anymore because they are not really paying attention to the code, and the hallucinations have become much more elaborate. I have been saying this since 2023, and I will keep saying it: vibe coding is evil, and we need to pay attention to the code. Otherwise, how can you tell whether what you're doing is right, better, or even optimized? If you don't care about code, maybe you never did, but you surely care about the AWS bill, right?
The Agents
I built a couple of custom agents in order to shape, direct, and experiment with software engineering agents. The agents I built are:
Basically, I started with 5 agents for engineering, 5 for testing, and 6 for review and documentation tasks, a grand total of 16 agents. One thing I learned pretty quickly is that more agents is not the way: you get a huge increase in token usage, and the quality of the solution is not better. I was paying an agent tax. Bottom line: more agents are not the best approach, and you get even less determinism. You might wonder why there are only 3 core agents if there are 5 files (and therefore 5 agents). That's right; however, the skill asks the user to choose one backend stack, with the options being my preferred choices: Java, Rust, or Go.
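As an illustration, the stack-selection part of such a skill could look like this. This is a minimal sketch, not my actual SKILL.md, and the framework names in parentheses are illustrative assumptions, not necessarily the ones I picked:

```markdown
## Stack Selection

Before any code is written, ask the user to pick ONE backend stack:

1. Java (e.g. Spring Boot, Maven, JUnit)
2. Rust (e.g. Axum, Cargo, built-in test harness)
3. Go (e.g. net/http, go test)

Do not proceed to the engineering phase until a stack is chosen.
```

Pre-selecting the stacks up front means the engineering agents never have to guess, and only one backend agent gets spawned instead of three.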
Considering token usage, V5 consumes more tokens than V1; V5 is important because it reinforces verification checks. V1 was spawning 12 agents while V5 only spawned 5. Plus, V5 adds feedback loops via mistakes.md and a zero-tolerance loop: the agents must check STDERR and warnings, and actually run the scripts and verify they work.
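The zero-tolerance loop above can be sketched as a small verification step: a clean exit code is not enough, the command must also produce no STDERR output. This is a minimal sketch of the idea, not the actual checks from my skill:

```shell
# Zero-tolerance verification: a step only passes if it exits 0
# AND writes nothing to STDERR (warnings count as failures).
set -u

verify() {
  local desc="$1"; shift
  local stderr_file
  stderr_file="$(mktemp)"
  if ! "$@" 2>"$stderr_file"; then
    echo "FAIL: $desc (non-zero exit)"; cat "$stderr_file"; return 1
  fi
  if [ -s "$stderr_file" ]; then
    echo "FAIL: $desc (wrote to STDERR)"; cat "$stderr_file"; return 1
  fi
  echo "PASS: $desc"
}

# Compilation passing is not enough: actually run the program too,
# e.g. verify "app runs" ./target/release/app --smoke-test
verify "program runs cleanly" echo "hello"
```

The point is that the agent is forced to run the artifact, not just build it, and any warning or error output sends it back into the loop.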
One thing I did was to create a Markdown file for each agent. I built a simple tool in Rust that can deploy agents to Claude Code. When deploying to Claude Code, there are 2 things I'm doing. First, I'm deploying each agent as a custom command; you can trigger them directly with /agent-alias, as you can see here in my Claude Code:
The second thing I did was to create a skill. The skill also has a custom command, so with one slash command I can trigger the whole workflow.
If you wonder how I got /ad:wf, it's because there is a folder called ad and, inside it, a file called wf.md.
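The layout behind that namespacing can be sketched like this, assuming Claude Code's convention that markdown files under .claude/commands become slash commands and a subfolder becomes a namespace prefix (the placeholder file content here is mine, not the real wf.md):

```shell
# Work in a scratch directory so nothing in a real project is touched.
cd "$(mktemp -d)"

# .claude/commands/ad/wf.md  ->  shows up as /ad:wf in Claude Code
mkdir -p .claude/commands/ad
touch .claude/commands/ad/wf.md

ls .claude/commands/ad/
```

My Rust deploy tool essentially automates this: it copies each agent's markdown file into that folder structure.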
Custom Command
The command is pretty simple; it just instructs Claude to use the skill.
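For illustration, a minimal command file along those lines could look like this. This is a sketch of the idea, not my exact file:

```markdown
Use the agents-driven workflow skill for this task.
Follow every phase of the skill unless the user explicitly asks to skip one.
```

Keeping the command thin and putting all the real instructions in the skill means there is only one place to iterate on.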
Agent Skill
The skill itself is more complicated, and that's why I iterated 7 times to tweak it and make it better. This was the first version, V1, of the Skill. For this version, I made the following mistakes:
- No Mistakes Tracking: So the model could keep producing the same mistakes.
- How the Skill was created: V1 was all direct prompting.
- No Control: No control over phases or gates.
- Lack of progress Tracking: No tracking file like todo.txt or todo.md.
- Lack of some testing: Did not explicitly ask for frontend tests.
- Runtime Verification: I did not instruct the skill to verify the program at runtime (is this obvious?). I was really surprised that I had to say that to the LLM instead of it being the default behavior.
The first version does not matter as long as you don't stop there; it's just your first baseline. What matters is to keep iterating, changing, and doing little experiments. Here is a preview of the Skill in action in Claude Code:
The Skill also allows the user to choose the phases or even skip some phases:
Here you can see the multi-agent system work, or simply the agents at work:
Learning via Deming Cycles and Lean
Those who know me know I'm a big believer in Lean and Deming. The most important thing is not to settle. Claude Code is addictive, and it's very fast; it's so easy to call it a day and move on, especially when people keep praising Vibe Coding (which is a big mistake). Perhaps this is a human condition: we always want to scale something before making it sustainable and making it work. I saw this frenzy to scale in past movements like:
- Agile: OMG, Agile does not scale. We need JIRA, we need SAFe. The result was WASTE and people doing lots of things they didn't understand, because we had to scale to everyone very fast.
- DevOps: Perhaps the worst of all: what was a movement was turned into a department or a team, where most of the principles got twisted and lost because people had to scale it to everyone very fast.
- Microservices: Let's all do microservices like crazy, everything must be a microservice. What about a shared database? Don't bother, just keep creating services.
When companies move very fast without understanding the principles and digesting the things they are doing, control is lost and a lot of WASTE is created. AI has the potential to be the worst wave of waste we have ever seen, because AI is also mystified: the robots are coming, developers will cease to exist, and other lies. Pure hype leads to pure WASTE. It will not take long for someone to come up with Lean AI, like we saw Lean Startup, Lean Hospital, and DevOps (originally a form of Lean). If you take one thing from this post, it's STOP, wake up, and think. Don't call it a day so fast; iterate, experiment, and keep learning; don't assume anyone knows what they are doing.
Lessons Learned
Applying Lean to my agent skill, I made 5 waves of changes to see what worked best. This is what I did after version 1. In the first version of the skill, I had way too many agents, and that's not good, as Paddo calls out in the brilliant 19 Agents Trap. The final version was very different. Here is V5 (Final). Here are some of my learnings from each wave of experimentation:
What's cool about this skill is that it shapes and directs how people can work with agents, by asking them to choose from pre-defined stacks and by forcing a variety of tests to happen, forcing AI shift-left to happen.
How to Make it Better?
When building multi-agent systems, here are some recommendations for agent skills:
- Don't stop at V1; keep iterating, keep making it better.
- Don't create a lot of agents; it's a tax.
- 3-5 agents is ideal, rather than 10 or more.
- Keep an eye on token consumption.
- Keep an eye on the skill language; there are ways to convey information without repetition, like my rules section.
- Pre-select stacks and choose the frameworks and libraries that make sense for you.
- Make sure to tell the model about your choices; otherwise the model might make poor choices, like integration tests with bash and curl.
- Don't teach the LLM its own tools.
- Declarative > Prose: global context + rules work better than narratives like "When the agent does that..."
- Consolidate several agents into one, like a single tester agent rather than 10 specialized ones.
- Tracking mistakes via mistakes.md is a very good idea.
- Compilation != Runtime: force the agents to verify and to check for warnings and errors.
- Write a design doc before execution.
- Track progress via todo.txt or todo.md (you don't need beads).
- Look at the final code; otherwise you can't tell if it's better or not.
- Don't settle for what is good now; it might not be good in 6 months.
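To make the mistakes.md idea concrete, here is a minimal sketch of the kind of entries I mean. The format and the dates are illustrative, not prescriptive; the point is that every mistake becomes a rule the agents re-read on the next run:

```markdown
# mistakes.md

## 2025-01-15 - Integration tests written in bash + curl
- What happened: the tester agent reached for bash/curl instead of the test framework.
- Rule going forward: integration tests must use the chosen stack's test framework.

## 2025-01-16 - Build passed but the app crashed at runtime
- What happened: compilation succeeded, runtime failed on startup.
- Rule going forward: always run the binary and check STDERR before marking a phase done.
```

This is the feedback loop: the file is cheap to maintain, and it stops the model from repeating the same class of mistake across iterations.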
Cheers,
Diego Pacheco








