AI Agent Patterns
Previously, I was blogging about AI Abstractions and the various levels of abstraction. Now, I want to delve deeper into some patterns we can leverage while writing software using LLM APIs and building AI Agents. Many people still question the value of Gen AI, and I understand why: scams, delusional objectives and expectations, FUD, and many other incidents. However, for engineering, it's clear that there is value, and engineering as we know it may never be the same again. Agents are interesting because we can tweak how much autonomy we give them; they could be bound to a straightforward and repetitive task like bumping the minor version of log4j, or used for complex tasks like booking a complete week of vacation with car, hotel, plane tickets, experiences, and plans. So imagine there is a dial we can turn up or down to decide how much autonomy we want to give to agents. Even with open-source models like DeepSeek or Llama, cost remains complicated, since cloud computing costs a significant amount of money and dominates the market; perhaps I'm wrong about that. Before diving deep into agent patterns, let's understand the context in which we can use these patterns.
A Note on the Context Window: How It Works
One factor that significantly impacts working with LLMs is the size of the context window. An LLM is typically accessed through an API behind a paywall. LLMs have limits on how much they "remember" of your previous prompts. Short of retraining the model, the only way to make it "remember" something is to send that information through the context window.
You interact with the LLM via the context window. This is a key concept. Your prompt goes into the context window, but that's not the only thing that goes there.
All files you share with your prompt in your IDE/Editor end up going into the context window: usually your open files, or your whole workspace if you've shared it.
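To make this concrete, here is a minimal sketch in Python of how a prompt plus open files end up in a single API call; it uses the OpenAI chat completions API as an example, and the file names and prompt are made up for illustration.

from pathlib import Path
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical open files in the editor; agentic IDEs do this for you.
open_files = ["billing/invoice.py", "billing/tax.py"]
file_context = "\n\n".join(
    f"--- {name} ---\n{Path(name).read_text()}" for name in open_files
)

messages = [
    {"role": "system", "content": "You are a helpful coding assistant."},
    # Everything below lands in the context window: the shared files plus the prompt.
    {"role": "user", "content": f"{file_context}\n\nAdd VAT support to the invoice module."},
]
response = client.chat.completions.create(model="gpt-4o", messages=messages)
print(response.choices[0].message.content)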
We have very few ways to interact with LLMs; even with MCP, all information must ultimately be shared via the context window. Popular IDEs nowadays have something called "agentic mode", and IMHO this is a terrible name and very confusing.
Agentic mode is offered by solutions like GitHub Copilot, Cursor, and Augment Code, among others, where the IDE/editor behaves like an agent and uses your terminal (most of the time) to run commands and figure things out. You cannot run an IDE in agent mode in production; this is intended for development purposes only.

Agentic mode is a confusing name because it is easily mixed up with AI Agents that you run in production using an LLM API, such as those provided by Anthropic or OpenAI. The patterns I will describe do not apply to IDE Engineering agents, but they do apply to AI Engineering agents that run as background applications and use LLM APIs (I told you it was confusing, not my fault).
Context Window Size Matters
Now, depending on how well or badly your code is structured, it will cost more or less money. All API calls incur a cost; everybody knows that. However, services and software that are well-written cost much less money because they are isolated and self-contained.
Distributed monoliths will cost more money. Imagine you are doing a refactoring on a distributed monolith with a lot of coupling: 10 classes coupled with 100 other classes. If you ask the LLM to do the refactor, you will always need to provide all 110 classes. The more coupling and entanglement, the more context window you will need, and the more tokens you will generate for input and output; all of that costs money.
Since LLM APIs are expensive, I hope this serves as extra motivation to have well-written software, so we use fewer tokens. Think about it: the more you put in the context window, the better the LLM will do, but the more it will cost. You want software to be well-written and modular; otherwise, in big distributed monoliths and enormous monoliths, you will burn tokens like crazy $$$$.
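Here is a back-of-the-napkin sketch of that cost argument; the characters-per-token heuristic, class sizes, and price per million tokens are made-up assumptions, not real vendor pricing.

def estimate_cost(num_classes, avg_chars_per_class, price_per_million_input_tokens):
    # Rough heuristic: about 4 characters per token.
    tokens = num_classes * avg_chars_per_class / 4
    return tokens / 1_000_000 * price_per_million_input_tokens

# Entangled distributed monolith: 110 classes ride along in every refactoring prompt.
print(f"${estimate_cost(110, 8_000, 3.00):.2f} per refactoring prompt")
# Well-isolated module: only the 10 classes under refactor go into the context window.
print(f"${estimate_cost(10, 8_000, 3.00):.2f} per refactoring prompt")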
LLM models have different context window sizes. The best model for coding right now is Claude Sonnet, currently version 4; its API context window is 200k tokens, although some IDE integrations cap it at 128k.
RAG Pattern
The RAG pattern has everything to do with the context window. The idea of RAG (Retrieval-Augmented Generation) is to extract relevant information and send it to the context window alongside your prompt. RAG can be used to close a gap that LLMs have with recency: LLMs are trained on past data, so they struggle with the latest versions of libraries and frameworks and may produce outdated code.
To implement RAG, we require a vector database to store embeddings. We perform a semantic search on the database and return relevant documents that can be injected into the prompt. Keep in mind some crucial aspects here: the database does not run inside the LLM model; it lives outside. In other words, we are performing pre-processing before sending data to the LLM. We do that all the time, since LLMs don't have a "database" or long-term memory. In a sense, you could easily see an LLM as a CPU.
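Here is a minimal sketch of the idea; the word-count "embedding" and the in-memory index below are toy stand-ins for a real embedding model and vector database, and the documents are made up.

from collections import Counter
import math

# Toy stand-in for a real embedding model; in practice you would call an
# embedding API and store the vectors in a vector database.
def embed(text):
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# "Vector database": documents indexed by their embeddings, outside the LLM.
docs = [
    "log4j 3.x changed the configuration file format.",
    "Spring Boot 3 requires Java 17 or newer.",
    "Our internal style guide forbids field injection.",
]
index = [(d, embed(d)) for d in docs]

def retrieve(query, k=2):
    q = embed(query)
    return [d for d, _ in sorted(index, key=lambda p: cosine(q, p[1]), reverse=True)[:k]]

question = "How do I configure log4j 3?"
context = "\n".join(retrieve(question))
prompt = f"Use the context below to answer.\n\nContext:\n{context}\n\nQuestion: {question}"
# 'prompt' is what actually gets sent to the LLM's context window.
print(prompt)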
RAG is primarily focused on text and documents, which may be suitable for some data points, but it will not work well with multiple systems that expose APIs (usually REST with JSON).
MCP Pattern
Model Context Protocol (MCP) is a way to describe APIs to LLMs. The MCP architecture usually involves a client, which calls the API and feeds the API results (usually JSON) back to the LLM model. Again, it all goes into the model's context window.
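Conceptually, the flow looks like the sketch below. This is not the official MCP SDK; it is a hand-rolled illustration of the same idea using OpenAI-style tool calling, with a made-up get_weather API: the model is told the tool exists, the client executes the real call, and the JSON result is fed back into the context window.

import json
from openai import OpenAI

client = OpenAI()

# Describe an API to the model (made-up weather service for illustration).
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

def get_weather(city):
    # In real life this would be an HTTP call to some API; hard-coded here.
    return {"city": city, "temp_c": 21, "condition": "sunny"}

messages = [{"role": "user", "content": "What is the weather in Lisbon?"}]
first = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
call = first.choices[0].message.tool_calls[0]

# The client (not the model) executes the API call...
result = get_weather(**json.loads(call.function.arguments))

# ...and the JSON result goes straight back into the context window.
messages.append(first.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id, "content": json.dumps(result)})
final = client.chat.completions.create(model="gpt-4o", messages=messages, tools=tools)
print(final.choices[0].message.content)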
MCP is a game-changer because we have lots of systems and APIs for everything. One significant problem with MCP is authentication, as you need to authenticate with all these APIs. This requires providing credentials to MCPs, and there are already MCPs on the internet leaking secrets. There are MCPs for everything nowadays.
At the end of the day, MCP is not much different from the RAG pattern; the main difference is that RAG is more suitable for documents and text. MCP follows a similar pattern but uses API calls. So we can rely on expert systems outside of the LLMs for better judgment.
Here is where things get very interesting, as we arrive at the full blend of AI and engineering (APIs). So AI is not just AI; it's mixed with engineering. Now let's dig into more AI Agent patterns.
Such patterns are architectural patterns, and they all existed before AI; guess what, they were also used in APIs, Services, Microservices, EIP, and SOA. So if you worked with engineering before, you will recognize all these patterns, and they are just being applied to AI now.
Cache Pattern
Perhaps the most basic pattern in software architecture history. We can cache the LLM output based on the prompt or based on similar prompts. This can be a great way to save money and expedite results. If you have lots of similar prompts, it will be very beneficial.
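A minimal sketch of exact-match caching keyed by a hash of the prompt; call_llm below is a placeholder for the real API call, and a fancier version could also match semantically similar prompts.

import hashlib

_cache = {}

def call_llm(prompt):
    # Placeholder for the real (and expensive) LLM API call.
    return f"answer for: {prompt}"

def cached_call(prompt):
    key = hashlib.sha256(prompt.encode()).hexdigest()
    if key not in _cache:
        _cache[key] = call_llm(prompt)  # only pay for the call on a cache miss
    return _cache[key]

print(cached_call("Summarize our retry policy"))
print(cached_call("Summarize our retry policy"))  # served from the cache, zero tokens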
Router and Filter Patterns
The next two patterns are the filter and the router. Imagine you have several agents or several LLM APIs; you could route different prompts to different agents or different LLMs. A filter can be used to remove content from the prompt. Let's say you don't want people to put PII in the prompt; you can detect PII and remove it from the prompt. Another possibility is to remove unwanted content like profanity or things that break company policy.
The router is also an interesting pattern because we can use it to save costs. Imagine that in non-production we could route to a less capable LLM model, and in production we could route to a more capable model.
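A sketch of both patterns together; the regexes, model names, and environment check are assumptions for illustration. The filter strips likely PII before the prompt leaves your system, and the router picks a cheaper model outside production.

import os
import re

# Filter: naive PII scrubbing before the prompt ever reaches an LLM.
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # SSN-like numbers
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # email addresses
]

def filter_prompt(prompt):
    for pattern in PII_PATTERNS:
        prompt = pattern.sub("[REDACTED]", prompt)
    return prompt

# Router: cheaper model outside production, stronger model in production.
def route_model():
    return "big-expensive-model" if os.getenv("ENV") == "production" else "small-cheap-model"

prompt = filter_prompt("Refund the order for jane.doe@example.com, SSN 123-45-6789")
print(route_model(), "->", prompt)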
Splitter and Aggregator Patterns
Splitter and Aggregator are already used by most AI engineering agents. Usually, when you give a prompt, AI engineering agents transform your prompt into a series of tasks (splitter).
Aggregators can be used to synchronize data. For instance, we could run several tasks in parallel and then aggregate all results at the end. We could also use the aggregator pattern to perform benchmarks between models or even A/B testing.
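A sketch using Python's concurrent.futures: split one prompt into smaller tasks, fan them out in parallel (run_task stands in for a real agent or LLM call), and aggregate the results at the end.

from concurrent.futures import ThreadPoolExecutor

def split(prompt):
    # Splitter: turn one big request into smaller, independent tasks.
    return [
        f"{prompt} -- step 1: write the data model",
        f"{prompt} -- step 2: write the service layer",
        f"{prompt} -- step 3: write the tests",
    ]

def run_task(task):
    # Stand-in for calling an agent or LLM API for each task.
    return f"[done] {task}"

def aggregate(results):
    # Aggregator: combine the partial results into a single answer.
    return "\n".join(results)

tasks = split("Add a discount feature to the checkout service")
with ThreadPoolExecutor() as pool:
    results = list(pool.map(run_task, tasks))
print(aggregate(results))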
Task Orchestrator Pattern
Agents are processes and programs like any other ordinary software.
However, AI Engineering Agents do run things in parallel, which is part of why we use so many tokens. Besides needing to split and aggregate, they often provide a CLI mode where they run your prompt and exit, and this can be used to perform orchestration outside of them.
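A sketch of orchestrating an agent from the outside via its CLI mode; the command name my-agent-cli and its -p flag are made up, since every tool has its own flags.

import subprocess

tasks = [
    "bump log4j to the latest minor version",
    "regenerate the OpenAPI client",
]

# Orchestration happens outside the agent: we drive its CLI mode,
# one prompt per invocation, and collect the results.
for task in tasks:
    result = subprocess.run(
        ["my-agent-cli", "-p", task],   # hypothetical CLI and flag
        capture_output=True, text=True,
    )
    print(task, "->", result.returncode)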
AI Agents Orchestrator Pattern
We can apply the same pattern but in a bigger scope, where we can have agents orchestrating other agents. Imagine having a specialized agent for UX Design, another for Frontend Engineering, and a third for Testing, so you could coordinate all three agents on a project and combine their work.
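A sketch of that coordination; the three agent functions are placeholders where each one would really be its own agent with its own prompt, tools, and model.

def ux_agent(feature):
    return f"wireframes for {feature}"

def frontend_agent(feature, design):
    return f"React components for {feature} based on {design}"

def testing_agent(feature, code):
    return f"e2e tests for {feature} covering {code}"

def orchestrator(feature):
    # The orchestrator agent coordinates the specialized agents and combines their work.
    design = ux_agent(feature)
    code = frontend_agent(feature, design)
    tests = testing_agent(feature, code)
    return {"design": design, "code": code, "tests": tests}

print(orchestrator("dark mode toggle"))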
Agents are interesting, and, in my opinion, there are compelling use cases for AI on the backend related to engineering. AI for the end consumer is more dangerous and a bit more unpredictable. I'm pretty certain we're not far from seeing an ESB with AI orchestrating agents.
Although AI and agents are cool, we need to keep in mind that we must always run security threat modeling to make sure we have protections and guardrails in place to prevent leaking credentials. We also must be aware of the costs and monitor them carefully.
Right now, AI is like a mainframe: it has limits and expensive APIs. I hope we can run it locally, with lower computing costs, so we can have it running in more places without worrying too much about cost, or even token limits and cooldowns.
Cheers,
Diego Pacheco