AI everywhere,
agents everywhere. We just finished the first quarter of 2026, and a lot happened in those first three months. It feels like three years have passed, not three months.
Opus 4.6 really changed the game, but software engineering and distributed systems are not solved by agents; we saw hype on top of hype, with huge,
unrealistic expectations that need to be dialed back and properly adapted. The less you know about AI and agents, the more impressed you are, and no, you cannot get rid of all engineers. AI alone does nothing and cannot self-verify to the point that systems of systems can be completely automated and run hands-off. Maybe we will get there one day, but we are not there yet, and no matter what people say, no one can predict when this will happen. It could be 30 years or even more.
Karpathy already said it: zero to demo is easy. Demos are not impressive anymore. Zero to production is still a very different story.
Multi-Agent Systems: Process as Agents
Paddo, in his masterpiece 19 Agents Trap, already called out the risks of having way too many agents and blowing the context window. With LLMs and agents, everything goes into the context window: your Claude.md, your prompt, your tool executions, your chat history, everything. We know that your Claude.md needs to be lean, and you cannot use your whole context window; otherwise, a lot of bad things happen. More agents mean more coordination and less determinism.
We started the year with Gas Town, which was crazy and innovative but still not ready for prime time. People forget that cost is still prohibitive; if it exceeds what people can afford, the whole thing collapses. And yes, we keep hearing models are cheaper, which is not true; models use tokens like never before, and token consumption just keeps increasing. Tokens are not even a standard unit of measurement across different providers.
We also saw the rise of many Claude Code superchargers, which are frameworks on top of Claude Code. To name a few: SuperClaude, Ruflow (formerly Claude Flow), cook, Ralph, ContinuousClaude, BMAD-METHOD (perhaps the worst of all), GSD, OH My Claude Code, StrongDM/Attractor, and many others. Although some of these solutions have interesting ideas, I found that lots of them suck more tokens and do not necessarily deliver better results.
Agent skills have also become a thing, as the new Anthropic way to avoid local MCPs. Before agents, all of the SDLC was done by people and ceremonies. Lots of those ceremonies were always a waste and never made sense, like daily meetings. However, agents are making the SDLC collapse much faster.
What is Claude Code after all?
The problem is that Claude Code today is very different from what it was a year ago, or even 6 months ago. Claude went from a chat to multi-agent systems, and it was very inspired by the community's ideas, like Gas Town and all those frameworks built on top of Claude Code. Beyond multi-agent systems, Claude Code is also like a platform, because you can code on the web, CLI, and desktop, and teleport, even from a mobile phone. It's a very big product at this point, for sure.
What people don't realize is that Claude Code was once a single agent with a chat, and that was called Claude Code. Now, with multi-agents (Claude teams), it has the same name, Claude Code. Claude incorporates several of the community's ideas back into the product, and it's 100% different than it was 3 months ago, but it's still called Claude Code.
PS: This timeline is not 100% accurate, and Codex is more OSS-friendly, but I hope it drives the point home.
This is a funny effect. Imagine that you have a primitive rock called Rock; that rock becomes a lizard, and the lizard is called Rock; then the lizard turns into a cheetah, and it's still called Rock. Now the rock evolves into a spacecraft and is still called Rock. That, for sure, does not make it easy to make sense of and digest the fundamental changes. If you follow Pokémon and Digimon, the monsters change their names when they evolve. That is a concept foreign to Claude Code.
I have serious doubts about how Claude Code can keep being a monolithic solution with hundreds of features. Eventually, it will need to become a platform with smaller components that could and should be used individually.
The Benchmarks Fallacy
You cannot trust benchmarks, for a couple of reasons. First of all, they do not reflect the reality of software engineering. Secondly, the LLMs are being trained to pass benchmarks, and that is not cool. Models also use tools to look things up on the web, and that's not really reasoning. Stop being impressed by benchmarks. Private benchmarks are way more interesting because LLMs can't train on them.
Have you ever had the feeling that Claude was slow or a bit cranky some days? Turns out that, like any software, models have bugs, and sometimes big bugs degrade quality; it's also very easy to hit back pressure, especially with AWS Bedrock. Don't believe me? Well, look at sites like Margin Lab, which runs SWE-bench daily against Opus 4.6, and the results are shocking, as there is a lot of variation.
Another great website to watch is
OpenRouter, where they have interesting metrics like Throughput, Latency, E2E Latency, Tool Call Error Rate, and Structured Output Error Rate per provider, like Amazon, Google, and Microsoft.
The Personal Agents Takeover
In the same quarter, we saw the rise of many claw-* solutions. Such solutions are called personal agents because they are like personal assistants that act on your behalf, doing things for you like booking a restaurant, buying groceries, managing your agenda, figuring out cheap prices on the web, and other tasks.
The first was OpenClaw (formerly ClowdBot). After that, there was an explosion of claw-* solutions, to name a few: ZeroClaw, NanoClaw, PicoClaw, MemoClaw, NemoClaw, Moltis, IronClaw, and many others. Let's not forget the social network for agents: Moltbook. Claw-* solutions push new expectations for people. Whether that will stick and in fact change people's behavior, only time will tell.
Expectations and consumer behavior take some time to change. But what we are seeing is:
* More time to find deals: As humans, at some point, we give up on things like brand loyalty, because we have other things to do and will not be searching the internet forever. But personal agents are a different story (IF the cost is not prohibitive). Plus, all those dark patterns for buying, like "buy now or lose it", might not fly with agents.
* Agent Experience (AX): Until last year, humans were doing things; now agents might be doing things on behalf of humans. So if agents are buying and using sites, they don't need HTML and the traditional UX for humans. One practical opportunity could be rediscovering REST and content negotiation. In other words: human? Get HTML. Agent? Get text or another structured format like JSON.
* Patience could be even shorter: Social media, mobile devices, TikTok, and other advances have already trained us to have less and less patience. With LLMs, we get results in seconds; with personal agents, we are doubling down in that direction, where fewer and fewer people will have the patience to wait. The danger here is that, even with AI, some things take time and require long-term thinking.
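The content negotiation idea from the AX point can be sketched with plain HTTP semantics: the server inspects the Accept header and renders the same resource differently for humans and agents. This is a minimal illustration; the route shape, payload, and helper name are hypothetical, not from any specific framework.

```python
import json

def render_product(accept_header: str, product: dict) -> tuple[str, str]:
    """Return (content_type, body) based on the client's Accept header.

    Agents asking for application/json get structured data;
    humans (browsers) asking for text/html get markup.
    """
    if "application/json" in accept_header:
        return "application/json", json.dumps(product)
    # Default to HTML for human browsers.
    rows = "".join(f"<li>{k}: {v}</li>" for k, v in product.items())
    return "text/html", f"<html><body><ul>{rows}</ul></body></html>"

# Usage: a personal agent and a browser hit the same endpoint.
product = {"name": "Espresso Machine", "price_usd": 249.0}
agent_ct, agent_body = render_product("application/json", product)
human_ct, human_body = render_product("text/html,application/xhtml+xml", product)
```

Same URL, same resource, two representations: that is the old REST idea of content negotiation, which suddenly matters again when half of your traffic is agents.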
Like I said, only time will tell if we will see this shift or not. But we do need to watch it.
Another interesting effect is happening in enterprise companies, where security was always understood and zero trust was always the default. But now the agents want permissions that users never had. That also changes expectations and puts more pressure on security.
The Security Nightmare
If you work in infosec, you have a job forever. You also have a whole nightmare unfolding very fast.
There is a lot going on, but let me focus on the 2 biggest things that happened this quarter. First, the LiteLLM disaster. LiteLLM is one of the biggest players in AI agent gateway solutions for the enterprise. Libraries compromised by a malicious package were capable of stealing credentials. If anyone had doubts that key rotation needs to happen all the time, those doubts are now gone.
Perhaps the scariest of all: the Axios RAT. NPM was never in good shape in terms of security, but now things are worse than ever. I wrote a script to check all my repositories for the RAT. Then I realized Claude Code could also be affected. In decades of working with software, it's the first time I felt insecure and thought, OMG, Claude Code can get me hacked. I really want to port Claude Code to Rust and move away from JS and NPM.
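The kind of repo-checking script I mean boils down to matching your lockfiles against a list of known-bad package versions. A minimal sketch, assuming a package-lock.json v2/v3 "packages" map; the package names and versions in the blocklist are hypothetical placeholders, not real indicators of compromise.

```python
# Hypothetical indicators of compromise; in a real scan you would load
# the published IoC list for the incident instead.
COMPROMISED = {
    ("evil-lib", "1.2.3"),  # placeholder
    ("bad-dep", "0.9.1"),   # placeholder
}

def scan_lockfile(lockfile: dict) -> list[tuple[str, str]]:
    """Return (name, version) pairs from a package-lock.json v2/v3
    'packages' map that match the compromised list."""
    hits = []
    for path, meta in lockfile.get("packages", {}).items():
        if not path:  # "" is the root project entry, not a dependency
            continue
        # Paths look like "node_modules/foo" or "node_modules/a/node_modules/foo"
        name = path.rsplit("node_modules/", 1)[-1]
        if (name, meta.get("version", "")) in COMPROMISED:
            hits.append((name, meta["version"]))
    return hits

# Usage against an in-memory lockfile:
lock = {"packages": {
    "": {"name": "my-app"},
    "node_modules/evil-lib": {"version": "1.2.3"},
    "node_modules/axios": {"version": "1.7.0"},
}}
hits = scan_lockfile(lock)
```

In practice you would walk every repo, `json.load` each lockfile, and fail the build on any hit.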
AI Transformation: Guardrails
Guardrails are very important for establishing safety. Perhaps we should learn from Amazon's mistakes with Kiro, SDD, and other
anti-patterns. Guardrails are not just traditional enterprise compensating controls. My take on guardrails is this:
These are the 4 fundamental building blocks of proper guardrails. All of these elements are code-based, provide determinism, and can catch agent mistakes before they cost companies too much. Let's take a look at each one of them.
1. Automated Tests: We need tests more than ever. I've been saying this for years at this point. Before 2023, we had humans, and humans feared change and were careful with it. Agents have no fear and will break all components all the time. We cannot trust agents to be deterministic, and for sure, we cannot count on them to be careful. What tests do we need? All of them: Unit, Integration, Chaos, Stress, Contract, Snapshot, E2E, CSS, Infra Testing, Observability Testing, Property-Based Testing, Mutation Testing. Once you have good tests, they will catch the mistakes agents might make. Plus advanced techniques like
state induction and
Testing Interfaces.
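To make one of those test types concrete, here is a hand-rolled property-based test sketch using only the stdlib (in practice a library like Hypothesis does the generation and shrinking for you). The function under test, `dedupe`, is a hypothetical example; the point is that properties checked over hundreds of random inputs catch the kind of subtle breakage an agent can introduce.

```python
import random

def dedupe(items: list[int]) -> list[int]:
    """Remove duplicates, keeping first-occurrence order."""
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def check_properties(trials: int = 200) -> None:
    rng = random.Random(42)  # fixed seed: the guardrail stays deterministic
    for _ in range(trials):
        data = [rng.randint(0, 9) for _ in range(rng.randint(0, 20))]
        result = dedupe(data)
        # Property 1: no duplicates in the output.
        assert len(result) == len(set(result))
        # Property 2: same elements, nothing invented or lost.
        assert set(result) == set(data)
        # Property 3: idempotence, dedupe(dedupe(x)) == dedupe(x).
        assert dedupe(result) == result

check_properties()
```

If an agent "optimizes" `dedupe` and breaks ordering or drops elements, one of these properties fails immediately, with no human in the loop.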
2. Observability: If we have good metrics, we can have good dashboards and good alerts. We can build systems that self-heal, or at a minimum, we can catch problems sooner, before they get a bigger blast radius.
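The simplest deterministic signal here is an error-rate alert over a window. A minimal sketch; the 5% threshold and the window shape are illustrative assumptions, not a recommendation.

```python
def should_alert(errors: int, total: int, threshold: float = 0.05) -> bool:
    """Fire an alert when the error rate over a window exceeds the threshold."""
    if total == 0:
        return False  # no traffic, nothing to alert on
    return errors / total > threshold

# Usage: a window of 1000 requests with 80 failures trips the alert.
trips = should_alert(80, 1000)      # 8% error rate
quiet = should_alert(10, 1000)      # 1% error rate
```

A check like this, wired to agent-driven deployments, is exactly the code-based determinism that catches a misbehaving agent before the blast radius grows.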
3. AI Agent Gateway: Solutions like
Portkey,
LiteLLM,
AWS AgentCore,
RouterLLM, and
OpenRouter provide a central layer where failover, routing, filtering, and rules can be enforced. This is important because it gives companies a central place to block leaks of PII, for instance (for Claude Code and AI agents, at least).
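The PII-blocking idea can be sketched as a redaction step applied before a prompt leaves the company boundary. The regex patterns and the `redact` hook are illustrative assumptions; real gateways (Portkey, LiteLLM, and others) ship their own policy mechanisms.

```python
import re

# Illustrative PII patterns; a production filter would cover many more
# categories (phone numbers, credit cards, API keys, ...).
PII_PATTERNS = {
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.]+"),
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}

def redact(prompt: str) -> str:
    """Replace detected PII with typed placeholders before the prompt
    is forwarded to any model provider."""
    for label, pattern in PII_PATTERNS.items():
        prompt = pattern.sub(f"[REDACTED-{label.upper()}]", prompt)
    return prompt

# Usage: the gateway applies redact() to every outbound request body.
safe = redact("Contact john.doe@acme.com, SSN 123-45-6789")
```

Because this runs at one central chokepoint, every agent and every Claude Code session gets the same policy, with no per-team opt-in.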
4. CI/CD: CI/CD was always needed. DevOps as a movement has been pushing it for
decades. However, companies never went all the way, and release trains/release calendars dominated all industries. Due to tech debt and often poor architectural decisions that lead to monoliths and distributed monoliths, achieving this is not trivial. But it is needed, because CI/CD provides the final keystone for proper guardrails. True
CI (without branches) enables issues to be anticipated and fixed daily. Real CD reduces deltas, allows a canary to reduce the blast radius, and, with split traffic and progressive rollout patterns, gives us the ability to dial up slowly and with confidence before affecting all users.
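The split-traffic part of progressive rollout can be sketched as deterministic bucketing: hash the user id instead of rolling a die, so each user stays pinned to the same version across requests while you dial the percentage up. The function name and bucketing scheme are illustrative, not from any particular rollout tool.

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    """Return 'canary' for roughly canary_percent of users and 'stable'
    for the rest, deterministically per user."""
    digest = hashlib.sha256(user_id.encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_percent else "stable"

# Usage: dial up slowly, e.g. 1% -> 5% -> 25% -> 100%, watching the
# dashboards (guardrail #2) between each step.
users = [f"user-{i}" for i in range(1000)]
canary_share = sum(route(u, 5) == "canary" for u in users) / len(users)
```

Because the routing is a pure function of the user id, a bad canary affects the same small, known slice of users until you roll back, which is exactly the blast-radius control the paragraph above describes.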
Now, these are the visible elements. There are also 2 invisible elements: the first is architecture, and the second is tech debt. Companies love to ignore tech debt and pretend it does not exist, but it does.
People often
confuse architecture with bad architecture, bad decisions, and bad abstractions. Architecture will always be needed. There is no context window that can fit all the software that big tech has. We always need to make decisions. What architecture is perhaps also confused with is technical debt. Bad architecture, bad decisions, and bad abstractions are technical debt.
Bad architecture prevents testing because it is not testable. Bad architecture prevents CI/CD because distributed monoliths cannot be release-independent. Having a central AI LLM Gateway is not the fix for all tech debt that was ignored.
Claude Code is a multi-agent system nowadays and gives amazing gains in productivity. However, we cannot give up on safety and hope things go well. We need to invest in proper guardrails and increase safety through deterministic engineering solutions to counterbalance how multi-agent systems operate.
Code review is not enough; it needs to
change. As awesome as
agent skills are, they are not the whole picture, and we need all the guardrails in place.
Cheers,
Diego Pacheco