AI Coding Agents Evolution
AI Coding Agents
The first wave of evolution was the birth of coding agents. Copilot and later Cursor were pioneers in this space. The idea was that the IDE would have a copilot, an assistant, while you were still the pilot. The first incarnations of these tools were pretty much part of the IDE (VSCode, IDEA, and others) and were primarily focused on auto-complete.
Quickly, such tools evolved to have a chat and then to execute actions for you, so you didn't need to copy and paste anymore. The word "agentic" described something that was not a fully autonomous agent but had some agentic properties. Suddenly, files could be created, edited, and deleted.
From Copilot to Claude Code and Codex, things changed quite a lot. With Claude Code, the terminal became the new place to be: pretty normal for backend engineers, maybe a bit strange for frontend engineers and normies. Claude Code changed everything.
Not only has typing speed improved, but Opus 4.6 has also gotten much better than previous models. I can barely use Sonnet nowadays. In parallel, another change was happening.
From Markdown to Code
Not long ago, there was an explosion of MCPs. The issue with MCPs was that they were preloaded into the context window, regardless of whether you used them. Plus, MCPs were all about text, which was pretty inefficient. What we discovered was that there is a better way: instead of pre-loading a bunch of text, let the model discover things on demand.
That discovery took a couple of years and came in stages:
Progressive Disclosure: Instead of pre-loading a lot of text into the context window (and maybe never using it), it's better to give the model a hint or pointer and let it discover more on demand.
From Local MCPs to Skills: The biggest shift from local MCPs to Skills is that Skills not only apply the progressive disclosure pattern but also shift from text to code. That has many advantages. First of all, if you give the LLM text, you give it the chance to trip, hallucinate, or just be nondeterministic. If you give it code, the LLM writes less, executes the code instead, and the result is less noise and more determinism, because running code is deterministic. Engineering is deterministic, so imagine that now you can write the skills for the LLM.
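To make the Skills idea concrete, here is a minimal sketch of a skill scaffold. The layout follows Claude Code's Skills convention (a directory with a SKILL.md whose frontmatter is the only part preloaded), but treat the details as assumptions; the skill name "csv-report" and the scripts/report.py path are hypothetical:

```shell
# Sketch of a minimal skill. Only the frontmatter name/description is preloaded;
# the body and any referenced scripts are discovered on demand.
# "csv-report" and scripts/report.py are hypothetical names for illustration.
mkdir -p .claude/skills/csv-report
cat > .claude/skills/csv-report/SKILL.md <<'EOF'
---
name: csv-report
description: Summarize a CSV file by running a deterministic script instead of reasoning over raw text.
---
Run `python scripts/report.py <file.csv>` and report only the summary it prints.
EOF
```

The point is the text-to-code shift: the model runs a deterministic script instead of narrating over raw data, and the context window only pays for a one-line description until the skill is actually needed.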
Local MCPs are dying, as they should; however, I don't see remote MCPs really dying. Remote MCPs cover things you could never run on your machine, like AWS or Figma. But there you have a problem: you would need AWS credentials on all Claude Code machines, and you would need to secure and rotate them. The right solution is to use an AI Agent Gateway and keep the remote MCPs on the server, where all the credentials live and are easy to secure and rotate.
You have probably heard that LLMs will kill the SaaS model. If you don't want to watch the whole video by Satya Nadella, here is the crux of the thesis. SaaS makes money based on seats: vendors sell a license per user, so if you have 2k or 5k users, that's how they make a lot of money. However, if agents are doing everything, why would you need to open a SaaS at all? For sure, you need many fewer licenses. That's the theory of why SaaS might be in big trouble.
Also, Build vs Buy has flipped; the cost of building is much smaller now. So do you need a whole product? Maybe you don't; maybe you need much less, and maybe you can use AI to get what you need much faster. But what if SaaS providers start providing MCPs and adjust their pricing? Well, that's something to watch out for.
Frameworks
The second wave of evolution comes in the form of frameworks built on top of Claude Code and Codex. Such frameworks are built on the basic constructs I mentioned at the beginning of this post: sub-agents, commands, hooks, and skills. They enforce a specific workflow or style of engineering, such as Test-Driven Development (TDD), a rapid loop, or a mini-SDLC like BMAD (to be nice, because in reality mimicking SAFe is an anti-pattern, anti-agile, and WASTE).
Some popular frameworks are:
Ralph is the simplest of them all. It's a while-true bash loop that keeps re-running Claude Code until your PRD tasks are done. Each iteration gets a fresh context window, and memory lives in git. It solves the annoying problem where Claude thinks it's done, but it's not. Anthropic liked it so much they shipped it as an official plugin (ralph-wiggum). At 7,000 tokens, it's the lightest thing in the space. The idea is brilliant in its simplicity: why build a complex orchestration platform when a bash loop and git do the job? Ralph is brilliant and dumb at the same time. It's brilliant because every run starts a new session, making it harder to exhaust the context and avoiding context rot. It's dumb because it creates a new session every run, which is terrible for caching.
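The core of the pattern can be sketched in a few lines of bash. This is a sketch, not the ralph-wiggum plugin itself; the DONE sentinel file, the iteration cap, and the assumption that `claude -p` runs one non-interactive session are all mine:

```shell
#!/usr/bin/env bash
# Ralph-style loop (sketch): re-run an agent command in a fresh session until a
# DONE sentinel appears or we hit an iteration cap. Durable memory lives in git
# and in files, never in the session itself.
ralph_loop() {
  local cmd="$1" max="${2:-50}" i=0
  rm -f DONE
  while [ ! -f DONE ] && [ "$i" -lt "$max" ]; do
    i=$((i + 1))
    sh -c "$cmd"   # each run is a brand-new session: no context rot, no cache reuse
  done
  echo "stopped after $i iterations"
}

# Real use would look roughly like this (flag is an assumption about the claude CLI):
# ralph_loop 'claude -p "Read PRD.md, do the next unfinished task, commit. Create a file named DONE when all tasks pass."'
```

Note how both the brilliance and the dumbness live in the same line: `sh -c "$cmd"` starts from zero every time, which is exactly what kills context rot and exactly what kills caching.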
OMC (oh-my-claudecode) is a teams-first multi-agent orchestration layer. It injects 32 agents, 37 skills, and 31 hooks into Claude Code via plugin. It runs a staged pipeline: plan, PRD, execute, verify, fix, loop. It does smart model routing across Haiku/Sonnet/Opus to save 30-50% on tokens, and the notepad system survives context compaction. At 31,600 tokens (15.8% of your context window), it's heavy but comprehensive. The magic keyword system is clever — you just type autopilot or Ralph or team in natural language and things happen.
GSD (GetShitDone) spawns fresh Claude subagent instances per task, so task 50 has the same quality as task 1. It enforces the Idea -> Roadmap -> Phase Plan -> Atomic Execution pattern with a maximum of 3 tasks per plan. The philosophy is that the orchestrator never does heavy lifting; it only spawns, waits, and integrates. At 283,800 tokens (141.9% of the context window), it literally cannot fit. It explicitly rejects what it calls "enterprise theater": no sprint ceremonies, no story points, just get things done. Keep in mind that with GSD, and all these frameworks, there is a base cost for every message you type in Claude Code, and depending on what you do, more gets loaded; GSD is pretty big and sucks up a lot of tokens.
Continuous Claude has two flavors. Anand Chowdhary's version is a single bash script that loops Claude Code through branch creation, PR opening, CI checks, and merge-or-retry. parcadei's Continuous-Claude-v3 is a different beast: 32 agents, 109 skills, 30 hooks, all focused on context preservation via ledgers and YAML handoffs. The motto is "compound, don't compact." Anand's version, at ~430 tokens, is basically free in terms of context cost. Continuous Claude is, IMHO, an attempt at a smarter Ralph. It's not super token-hungry, and you can add limits like running for 10 iterations or 10 USD.
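The limits idea is easy to replicate yourself. Here is a sketch of a loop capped by both iterations and a USD budget; it assumes the wrapped command prints its own per-run cost on stdout (Claude Code's JSON output mode reports a cost figure, but treat the exact plumbing as an assumption):

```shell
#!/usr/bin/env bash
# Budget-capped agent loop (sketch). $1 is a command whose last stdout line is
# that run's cost in USD; wiring this to a real CLI's cost report is assumed.
budget_loop() {
  local cmd="$1" max_iter="${2:-10}" budget="${3:-10}" spent=0 i=0 cost
  while [ "$i" -lt "$max_iter" ]; do
    i=$((i + 1))
    cost=$(sh -c "$cmd" | tail -n 1)
    # shell arithmetic is integer-only, so use awk for the USD math
    spent=$(awk -v a="$spent" -v b="$cost" 'BEGIN { print a + b }')
    if awk -v s="$spent" -v b="$budget" 'BEGIN { exit !(s >= b) }'; then
      break   # stop once accumulated spend crosses the budget
    fi
  done
  echo "$i iterations, ${spent} USD"
}
```

Either cap can fire first: a cheap run hits the iteration limit, an expensive one hits the budget, which is exactly the "10 iterations or 10 USD" guardrail described above.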
Gas Town is Steve Yegge's Go-based multi-agent framework that spawns 20-30+ parallel agents. A Mayor (Opus) distributes work to ephemeral Polecats (Sonnet) using git-backed "Beads" as external memory. It costs ~$100/hour in API costs. Yegge has publicly stated he never looked at the generated code, which is both impressive and terrifying. It supports multiple runtimes (Claude Code, Goose, Codex, Gemini CLI, Cursor, Amp). The idea is to treat coding like a factory: you talk to the foreman, and the foreman manages the workers. My experience with Gas Town so far was not the best: it sucked up my whole subscription token allowance in 15 minutes, plus the 14 USD I had as credit, and it was choking with an error; then Opus 4.6 said this to me:
Claude Flow is an npm-based platform that deploys 60+ agents in swarms with 6 topologies (hierarchical, mesh, pipeline, etc.). It has a Hive Mind, self-learning (SONA), a built-in vector DB (RuVector), and a WASM engine claiming 352x faster execution for deterministic transforms. It runs Claude Code and Codex in parallel with shared SQLite memory. Still alpha (v3), and heavy on marketing claims like "Ranked #1" without clear verification. At ~16,000 tokens (8% context window usage), it is moderate in size but massive in ambition. This picture is not an exact timeline, but you can get a sense of what is happening:
I measured token usage across all these frameworks; here's what I discovered:
┌─────┬─────────────────────────┬────────────┬────────┬─────────┬───────────┐
│ # │ Framework │ Tokens │ % ctxw │ Lines │ Chars │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 1 │ GSD │ 283,800 │ 141.9% │ 7,500 │ 1,135,000 │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 2 │ SuperClaude │ 80,000 │ 40% │ 6,700 │ 270,000 │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 3 │ OMC │ 31,600 │ 15.8% │ 3,195 │ 126,500 │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 4 │ Claude Flow │ ~16,000 │ 8% │ 1,000 │ 59,000 │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 5 │ Ralph Wiggum │ 7,000 │ 3.5% │ 745 │ 24,308 │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 6 │ BMAD │ 6,000 base │ 3% │ 156,840 │ 5,454,268 │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 7 │ Claude Reflect │ 3,150 │ 1.6% │ 2,273 │ 91,219 │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 8 │ Continuous Claude │ ~430 │ 0.21% │ 2,314 │ 86,550 │
├─────┼─────────────────────────┼────────────┼────────┼─────────┼───────────┤
│ 9 │ Diego Pacheco CLAUDE.md │ 354 │ 0.18% │ 34 │ 1,528 │
└─────┴─────────────────────────┴────────────┴────────┴─────────┴───────────┘

As you can see, GSD, SuperClaude, and OMC use a lot of tokens. I also measured my own local global CLAUDE.md, which was smaller than all these frameworks, consuming only 0.18% of my context window (ctxw). BMAD was the worst experience I had, and it was also pretty boring, since I was typing "y" and "1" most of the time. The final results were not really impressive.
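The measurement itself needs nothing fancy. Here is a sketch of the kind of heuristic you can use to size a framework before installing it, assuming roughly 4 characters per token (a common rule of thumb; the real tokenizer will differ a bit):

```shell
# Rough context-cost estimate for a framework's injected files.
# Heuristic: ~4 characters per token. Not exact, but good enough to rank
# frameworks against each other before you let one into your context window.
estimate_tokens() {
  local chars
  chars=$(cat "$@" | wc -c)
  echo $(( chars / 4 ))
}

# Example usage (path is illustrative):
# estimate_tokens ~/.claude/CLAUDE.md
```

Run it against every markdown file a framework injects, compare the total to the ~200k context window, and you immediately see which of these frameworks can even fit.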
We went from frameworks running on top of Claude Code to multi-agent systems that operate Claude Code, like Gas Town. However, we are not done yet...
Retrofit
The next wave is a wave of retrofit. Anthropic is paying attention to everything the community is doing and retrofitting it into Claude Code, and the same is happening with OpenAI Codex. Claude Code has introduced Claude Teams, with a clear influence from the third wave of multi-agents, such as Gas Town.
In the latest wave, we see the rise of "Threads" and AI coding agents like Codex embracing such patterns, as well as other tools like Superset and Conductor. Of course, Anthropic banning CLI auth for 3rd-party harnesses will make things harder for such tools, as they will be forced to use direct APIs only.
Deming Circles
Also known as Plan-Do-Check-Act (PDCA), it's a continuous improvement method from the 1950s. As a big believer in Deming's work and Lean, I need to bring this back. Every single company is trying to use AI and figure things out. However, it's easy to get hooked by the dopamine or the dark flow patterns (of gambling) and not reflect. We need to stop and think. Stop and digest things. I honestly see little benefit in any of these frameworks; the results were no more impressive than using default Claude Code with my custom CLAUDE.md.
What I mean is: if you write an agent, is it optimized for your use case? Is the agent token-efficient? There is a lot of BLOAT in these files nowadays. It's easy to do a lot of things and not get better results, so we need to keep using science, be careful, and reflect on our choices.
How to Do Better
- Watch out for token usage
- Watch out for the final result
- You must read the code; you must judge the LLM result fully, not only how it looks.
- The devil is in the details; you cannot do shallow work. The work must be Deep, especially in AI times.
- IF you do a skill, are you sure it is optimized?
- IF you do a command, are you sure it's efficient?
- Using a framework means nothing; you cannot assume it drives better results.
- Having an agent means nothing; you must make sure you are doing the best possible.
- Have a lean, small global CLAUDE.md
- Don't BLOAT your CLAUDE.md; make sure you have pointers/hints only
- Make a lot of POCs
- Beware that token caching can poison your experiments.
- Test all frameworks and solutions out there and have your own conclusions.
- Pay attention to the details.
- Do not outsource your learning
- Do not outsource your judgment
- You must understand how it works and understand the concepts.
- All these frameworks have markdown files, go open them and read them all.
- When you build something, don't call it done too fast; iterate many times.
- Be careful with the illusion of control; more text and more specs != more quality.
- SDD is not the way.
- Vibe coding is poison; do not do it. Generate all code with AI if you want, but read it, pay attention to the details, and do not be fooled by the hype.
- It's better to have a few agents than a lot of agents (more in a future post).
- Check things 3x. Do not trust the first thing AI coding agents tell you.
- Keep learning new skills and keep critical.
- LLMs don't ask why, don't push back, and don't enforce the right principles; you must do that.
Cheers,
Diego Pacheco



