Service Chain
Services are a very important architectural construct. Proper services (POSAM definition and principles) are even more important. However, not all services are the same, and not all APIs are the same either. We can call something a service or an API; the real question is how good and how solid it is. I still put my money where my mouth is: services are a very powerful concept and the right level of abstraction for most problems. However, when a service is presented to you, you cannot assume high quality just because it is a service. If services are desired, multiple services must be desired too; after all, service orientation is about having services, not libraries. However, when the granularity is wrong and we move data and capability to the wrong place, say the wrong domain, we make things worse: we risk breaking isolation, and we end up with wrong abstractions, inefficiencies, and complexity.
We cannot see services in isolation. Services happen in a team context, and Conway's law is not hypothetical; it's a real thing. Teams tend to create architectures that mirror the organization's structure. However, an organization is sometimes defined not only by internal boundaries but also by external ones. There is almost always the question: should I build or should I buy?
Build vs Buy
Building is a good idea if you can make it better than what is on the market. However, building slows you down because you spend time building rather than using something that is already done. That trade-off needs to be analyzed calmly and very carefully.
Let's take two extremes. Build it all: now you are constantly reinventing the wheel and spending less time on your core business, fixing your customers' problems, and are therefore doing less valuable work. Not all problems are worth building a solution for.
Buy it all: now we are at the other extreme. Today there is an API for everything, but not all these APIs are the same. Keep that in mind. If some API goes bad, it will affect your user experience in several ways, so not everything should be solved by buying.
Centralization or Distribution?
Centralization is often considered a bad thing, but it really depends. Centralization can be bad in the case of a classical monolith (POSAM definition), or good in the case of a modular monolith. Distribution is the same. If the distribution is a proper service (POSAM definition), it is a good thing. If the service is a distributed monolith, that is a pretty bad thing (I have a whole chapter in POSAM about Distributed Monoliths). Now, there are some other questions we need to ask ourselves.
Consider the previous picture. Which one is better, (A) or (B)? Well, besides patterns and anti-patterns, we need to ask some questions like:
* Ownership: Do you own (A) but not (B)? If that is the case, (A) could be better.
* Observability: Does (B) have good observability? If yes, it could be better than (A).
* Anti-Fragility: Is (B) rock solid? How does it handle errors? Anti-fragility is something you need to test; it's hard to grasp through marketing materials :-)
Now we are moving to a more complex and comprehensive view of build vs. buy. Decisions drive design, and not all engineering decisions happen in the code. Some decisions happen at the product level or even in the business. Proper evaluation is a must, because not all APIs are created equal.
Not All APIs Are Created the Same
Let's consider a more complex scenario. Let's say you are building a website, a mobile application, or even an LLM chatbot application using React. You just want to add your functional components to render some useful HTML for the end user. However, you need to call the BFF (often in Node.js) to fetch the data. Then we call the service.
So far, everything is under your control. Your team may not manage all these components, but your company owns them in one way or another. So, the gray boxes (UI, BFF, and Service) are owned by you. Now you call a gateway service, which is an aggregator, let's say the purple box. Such aggregators call other external services (blue boxes), and one of them calls another aggregator - the second purple box.
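To make the picture concrete, here is a minimal sketch of the chain described above as a small call graph. The node names and the number of external services behind each aggregator are assumptions made for illustration; only the overall shape (UI, BFF, service, two aggregators, external services) comes from the text.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    owned: bool  # True for the gray boxes (your company owns the code)
    calls: list[Node] = field(default_factory=list)


def longest_chain(node: Node) -> int:
    """Number of boxes on the longest call path starting at `node`."""
    return 1 + max((longest_chain(child) for child in node.calls), default=0)


def not_owned(node: Node) -> list[str]:
    """Names of every box in the chain that you cannot fix yourself."""
    names = [] if node.owned else [node.name]
    for child in node.calls:
        names.extend(not_owned(child))
    return names


# Second aggregator (purple); its two external services are an assumption.
aggregator_2 = Node("aggregator-2", owned=False, calls=[
    Node("external-c", owned=False),
    Node("external-d", owned=False),
])
# First aggregator (purple): one of its external services calls aggregator_2.
aggregator_1 = Node("aggregator-1", owned=False, calls=[
    Node("external-a", owned=False),
    Node("external-b", owned=False, calls=[aggregator_2]),
])
service = Node("service", owned=True, calls=[aggregator_1])
bff = Node("bff", owned=True, calls=[service])
ui = Node("ui", owned=True, calls=[bff])

print(longest_chain(ui))   # 7 boxes on the deepest path
print(not_owned(ui))       # 6 of the 9 boxes are outside your control
```

Even in this small hypothetical layout, most of the request path runs through code you do not own.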
Here is where things get more interesting. When we think about services or APIs, we think of brand-new implementations written in Rust or Go, well-architected and with a high degree of quality. But what if some of these systems are old legacy systems, written in inefficient languages, using old, old, old technology? In that case we could be just wrapping things in an API, while under the hood you still have maybe a desktop system, maybe a classic monolith, or even worse, a distributed monolith.
We often think of APIs like AWS's, which work and are reliable, but that is not the reality for all APIs. Some systems are cranky and do not work well all the time, maybe because they are under heavy load, maybe because they suck, or both.
When everything works, it is good. But when things don't work...
Error Handling, Troubleshooting and Observability
Now imagine there is a 500 error. If the error is in the gray boxes where you have ownership, even if you don't have observability, you can go there, fix it, and add better observability as a result. Now what if the 500 error is in one of the other boxes? What if it is in one of the two purple aggregators?
Now there is a long service chain, and buying too much could lead to big problems, because you might not have a fix. Not only that, but you might not even be able to detect the problem. If you call 50 services, you need all 50 to behave and to have proper observability, error handling, and anti-fragility.
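A quick back-of-the-envelope calculation shows why chain length matters. This is a sketch under simplifying assumptions: every hop must succeed for the request to succeed, hops fail independently, and there are no retries or fallbacks. The 99.9% figure is just an illustrative number, not a measurement.

```python
def chain_availability(per_service: float, hops: int) -> float:
    """Availability of a request that needs every hop in the chain to succeed.

    Assumes independent failures and no retries or fallbacks.
    """
    return per_service ** hops


for hops in (1, 5, 10, 50):
    availability = chain_availability(0.999, hops)
    print(f"{hops:>2} services at 99.9% each -> {availability:.2%} for the chain")
```

Under these assumptions, 50 services at 99.9% each give the chain as a whole only about 95% availability, so a small weakness per hop compounds quickly.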
Now let's look at the same problem through a different lens. Do you want to predict or anticipate the problem so you can take some action? For that, you need observability. If the service chain does not provide observability, how can you do it? The service chain can be a "black box" for you, and then there is no way to understand what's going on in there. I don't know if you realize this, but here is where we affect the user experience in several different ways.
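Even when the chain itself is a black box, you can still observe its edge from the code you own. The sketch below wraps each outbound call and records which dependency was hit, whether it failed, and how long it took. The function name `observed_call`, the logger name, and the log format are all hypothetical; a real setup would likely use whatever metrics or tracing stack you already have.

```python
import logging
import time
import uuid
from typing import Callable, TypeVar

T = TypeVar("T")
log = logging.getLogger("service_chain")  # hypothetical logger name


def observed_call(dependency: str, call: Callable[[], T]) -> T:
    """Run `call` and log which dependency was hit, the outcome, and the latency."""
    correlation_id = uuid.uuid4().hex  # lets you tie this hop to a user report
    started = time.perf_counter()
    outcome = "ok"
    try:
        return call()
    except Exception as exc:
        outcome = f"error:{type(exc).__name__}"
        raise  # observe the failure, do not swallow it
    finally:
        latency_ms = (time.perf_counter() - started) * 1000
        log.info("dep=%s id=%s outcome=%s latency_ms=%.1f",
                 dependency, correlation_id, outcome, latency_ms)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    # The lambda stands in for a real call to an external aggregator.
    observed_call("aggregator-1", lambda: {"status": 200})
```

This does not tell you why a vendor failed, but it does tell you that it failed, how often, and how slowly, before the user has to tell you.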
User Experience
User experience is affected in several ways, which we can summarize as a bad experience. Let's understand them.
Errors: Things might not work during the user experience. Do we know about them? Maybe not. Can we know? Without observability, it will be hard. Can we do something? With a "buy" decision we don't know the code, so you can only work with the vendor, and that is not always easy.
Support: The user might decide to call support. Again, because there is no observability, we might not even know the problem is happening. If only the users know, it's a bad experience. Now the user describes the problem; can we reproduce it? We don't know the code, so it might be very hard to reproduce. Now think about a runbook or FAQ: what can support tell the user? Try again? Pray? Again, a bad experience.
Scalability: If you keep ramping up users, can you easily tell whether things will scale? Well, you need to do a stress test, but not all APIs have mocks, sandboxes, and test environments; sometimes you need to pay in order to test. Sure, you can always test against your mocks, but that is not the real thing; you will only see the real thing in production. So you must test in production. Now you have a gamble: add more users and see if it works or not. Like Google SRE says: hope is not a strategy. How do you know the providers are ready for more scale? How do you know everything is going to work? You see, here is another opportunity for a bad experience.
Doing Better
Legacy systems can suck, and they almost always do. But if you own the code, that gives you a superpower: the power of refactoring. You CAN refactor if you decide to. When you do not own the code, that option is not on the table. If you pay for an API, you need to make sure it scales, has observability, and has decent error handling; otherwise, the result might be a bad experience, possibly worse than with the monolith, and now your hands are kind of tied and you can't do much.
Now let's think about some lessons learned and things you can do to make it better:
- Do a very careful build vs buy analysis.
- A POC is not enough; you need to look for error handling, observability, and anti-fragility.
- Legacy systems such as monoliths and distributed monoliths can be awful, but an API is not automatically any better if you don't look into error handling, observability, and anti-fragility.
- Deep chains can only work well with good error handling, and all members of the chain need to do a good job.
- Error handling is not obvious and can be easily ignored; do not ignore it.
- It's possible to do some level of chaos testing on your end, mostly using mocks/fakes.
- Too many proxies and long service chains can be problematic for availability and troubleshooting; again, you need to ensure there is enough maturity and anti-fragility in place along the chain.
- Bad troubleshooting and bad observability are not just engineering problems; they are business problems because they affect the user experience.
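The point in the list above about chaos testing on your end with mocks/fakes can be sketched as follows. Everything here is hypothetical: `FlakyVendorFake`, `get_price`, and `price_with_fallback` are invented names standing in for a bought API and the code you own that calls it. The idea is only to show the shape of the experiment: inject failures into a fake, then assert that your side degrades instead of breaking.

```python
import random
from typing import Optional


class FlakyVendorFake:
    """Hypothetical stand-in for a bought API that fails a fraction of its calls."""

    def __init__(self, failure_rate: float, seed: int = 42) -> None:
        self.failure_rate = failure_rate
        self.rng = random.Random(seed)  # seeded so the experiment is repeatable
        self.calls = 0

    def get_price(self, sku: str) -> float:
        self.calls += 1
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("simulated vendor outage (HTTP 500)")
        return 9.99


def price_with_fallback(vendor: FlakyVendorFake, sku: str, retries: int = 2,
                        fallback: Optional[float] = None) -> Optional[float]:
    """Retry a bounded number of times, then degrade instead of failing the user."""
    for _ in range(retries + 1):
        try:
            return vendor.get_price(sku)
        except ConnectionError:
            continue
    return fallback


# Chaos experiment 1: the vendor is fully down; the caller must still answer.
down = FlakyVendorFake(failure_rate=1.0)
assert price_with_fallback(down, "sku-1", retries=2, fallback=None) is None
assert down.calls == 3  # one attempt plus two retries, then give up

# Chaos experiment 2: the vendor is healthy; no retries should be needed.
healthy = FlakyVendorFake(failure_rate=0.0)
assert price_with_fallback(healthy, "sku-1") == 9.99
assert healthy.calls == 1
```

A fake like this only exercises your own error handling; it says nothing about how the real vendor behaves under load, which is why it complements rather than replaces testing against the real thing.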
Building has value. Code ownership has even more value because you can fix problems and improve the overall user experience. Be careful, however: too much buying can lead to user experience problems. Evaluations are not always obvious or even possible. That's why testing in production is a mandatory practice. An API does not mean instant stability. Unfortunately, not all APIs are created equal; don't assume anything, or if you want to assume, always assume the worst. Expect the unexpected.
Cheers,
Diego Pacheco