Code for Reliability

It's normal to talk about failure when we talk about cloud-native architectures. Pretty much anything that runs in the cloud (a data center) means more distribution (networking), and distributed systems fail all the time. Chaos Engineering is great; however, it targets the infrastructure and is therefore focused on the big picture. When I say big picture, I mean what happens outside of a microservice. Reliability is not just for the outside boundary or the infrastructure but also for the inside of a service or system. Reliability inside the system goes by many names, such as Defensive Programming (very popular in Brazil during the 2000s). There is a lot of synergy between reliability, defensive programming, anti-fragility, and efficient internal system design. Efficient design is not only about making your system efficient in the economic sense (less code, readability, reduced maintenance cost) but also about reliability. Today I want to share some ways of thinking about internal system design in order to achieve better internal reliability. Unfortunately, there is not much fuss about that. I will be giving examples using Java 8; however, I'm sure this applies to other OOP languages. For the sake of simplicity, I will omit complete code and you will see some pseudocode, just enough to grasp the idea I'm talking about.

What is Reliability? What do I need to change?

Wikipedia definition of Reliability.

As you can see, reliability is described as "trust": something you can rely on. So what would trust be in the context of distributed systems and cloud-native? Well, IMHO it means that your system can resist several kinds of failure. It's easy to imagine failures at the macro level (external), such as:

  • A data center (region) fails
  • A microservice instance fails
  • A machine fails
  • A timeout happens
  • Latency increases
  • A service hangs and never returns
  • A service breaks the data protocol and returns dirty results
  • A downstream component fails and that creates issues for your service

However, reliability also needs to be applied inside the system, at the micro level, in other scenarios such as:

  • Lack of internal validations (i.e., you expect an email and receive a number)
  • A database call fails: you try to persist but the client/server fails
  • You get wrong input data and your code blows up (parsing issues, i.e., JSON/YAML)
  • A corner case in your code makes you fall over (untested corner case)
  • You were expecting one exception and another one happens (bad error handling)
  • Some HTTP call times out and you were not counting on that (i.e., lack of retry)

This means we can apply "degradation" thinking to internal design as well. This is a mindset: when we are coding, the decisions we make can leave our code more or less reliable.

Being Explicit

Being explicit is a very functional way of thinking. Basically, you should avoid having methods that return void. Methods returning void are hard to test and often have side effects. Being explicit is not only about what you return but also about the expectations you create for your caller. Interfaces are a way to describe behavior and a simple mechanism to communicate the right contracts. Every single class has a default interface, which is its public methods; even if you don't attach new interfaces, you already have a contract. External interfaces are interesting because they can give additional meaning to the call. For instance, you can implement Retryable or Cacheable. When you do that, you inform your consumer that this piece of code will try more than once and that results will be cached. Documentation is one dimension of explicitness; however, if you can translate the documentation into metadata such as interfaces and annotations (talking about Java), you will end up being more explicit.
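To make the idea concrete, here is a minimal sketch in Java 8. The names (Retryable, Cacheable, PriceService, CachedPriceService) are assumptions made up for illustration, not a real library: the marker interfaces carry no methods, they only tell the caller what behavior to expect, and the class actually retries and caches as the markers promise.

```java
import java.util.Map;
import java.util.Optional;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

class ExplicitContracts {

    // Hypothetical marker interfaces: no methods, only meaning for the caller.
    interface Retryable {}
    interface Cacheable {}

    // The business contract returns a value instead of void, so it is easy to test.
    interface PriceService {
        Optional<Double> priceOf(String productId);
    }

    static class CachedPriceService implements PriceService, Retryable, Cacheable {
        private static final int MAX_ATTEMPTS = 3;
        private final Map<String, Double> cache = new ConcurrentHashMap<>();
        private final Function<String, Double> remote; // stand-in for a remote call

        CachedPriceService(Function<String, Double> remote) {
            this.remote = remote;
        }

        @Override
        public Optional<Double> priceOf(String productId) {
            if (productId == null) {
                return Optional.empty();
            }
            Double cached = cache.get(productId);
            if (cached != null) {
                return Optional.of(cached); // Cacheable: served from the cache
            }
            for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
                try {
                    Double price = remote.apply(productId);
                    cache.put(productId, price);
                    return Optional.of(price);
                } catch (RuntimeException e) {
                    // Retryable: this attempt failed, try again until MAX_ATTEMPTS
                }
            }
            return Optional.empty(); // explicit "no result" instead of null
        }
    }
}
```

A caller that sees `implements Retryable, Cacheable` knows, without reading the method body, that a slow first call and a fast second call are expected behavior.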

The pseudocode above expresses explicitly some of the behavior that the class will apply. Interfaces are not only for the sake of contracts but also for the sake of "demarcation" and behavior enforcement.

Safe Return

Basically, there are 4 options when we think about safe returns. The first thing you can do is never return null. Null is dangerous because you can get an NPE (NullPointerException), and it is also bad because the caller can't distinguish null from an error. The second thing, on being explicit about returns, is to throw exceptions (runtime based) when there is a wrong state or there are bad input parameters. The third thing, going back to null, is that you can return good defaults in case you can't do the proper processing. Last but not least, you can return Optional. Optional is a monad (an FP concept), which means the result might or might not be there. The great thing about the Optional monad is that it makes that possibility explicit for the caller.
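A small sketch of these options in Java 8 could look like the following. UserDirectory and the ThreadSafe marker interface are hypothetical names used only for illustration; the class keeps no shared mutable state, so it can carry that marker honestly.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

final class SafeReturns {

    // Hypothetical marker interface: tells the caller the type is safe to share between threads.
    interface ThreadSafe {}

    static final class UserDirectory implements ThreadSafe {
        // Defensive copy wrapped as unmodifiable: no shared mutable state.
        private final Map<String, String> emailsByUser;

        UserDirectory(Map<String, String> emailsByUser) {
            this.emailsByUser = Collections.unmodifiableMap(new HashMap<>(emailsByUser));
        }

        // Options 1, 2 and 4: never return null, throw on bad input, return Optional.
        Optional<String> emailOf(String user) {
            if (user == null || user.isEmpty()) {
                throw new IllegalArgumentException("user must not be null or empty");
            }
            return Optional.ofNullable(emailsByUser.get(user));
        }

        // Option 3: return a good default when there is no proper answer.
        String emailOrDefault(String user, String defaultEmail) {
            return emailOf(user).orElse(defaultEmail);
        }
    }
}
```

The caller has to deal with the empty case, for example with `directory.emailOf("ana").orElse("unknown@example.com")`, instead of discovering a null at runtime.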

Here we are being explicit about the fact that you can call this code in a multi-threaded scenario without issues because we have no shared mutable state; therefore, this code scales and is ThreadSafe, as the interface explicitly tells us.

Retry and Timeouts

Every single HTTP call you make, or any network call (no matter the protocol), should have retries, timeouts, and sometimes exponential backoff. This is a standard SRE practice. Check this great material from Google GCP. If you are dealing with OS processes or anything else that might fail due to networking, it is also a good idea to have these mechanisms in place.
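As a rough sketch of how the three pieces fit together, the helper below runs a call with a per-attempt timeout, retries on failure, and doubles the wait between attempts. It uses only the JDK; the class name Retry, the method callWithRetry, and its parameters are assumptions for illustration, and a production version would usually add jitter and a cap on the backoff.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

final class Retry {
    private Retry() {}

    // Runs the call up to maxAttempts times; each attempt is limited to timeoutMillis.
    static <T> T callWithRetry(Callable<T> call, int maxAttempts,
                               long timeoutMillis, long initialBackoffMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        ExecutorService executor = Executors.newCachedThreadPool();
        try {
            long backoff = initialBackoffMillis;
            Exception last = null;
            for (int attempt = 1; attempt <= maxAttempts; attempt++) {
                Future<T> future = executor.submit(call);
                try {
                    return future.get(timeoutMillis, TimeUnit.MILLISECONDS);
                } catch (TimeoutException | ExecutionException e) {
                    future.cancel(true); // interrupt the attempt that timed out or failed
                    last = e;
                }
                if (attempt < maxAttempts) {
                    Thread.sleep(backoff); // wait before the next attempt
                    backoff *= 2;          // exponential backoff
                }
            }
            throw new IllegalStateException("all " + maxAttempts + " attempts failed", last);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            throw new IllegalStateException("interrupted while retrying", e);
        } finally {
            executor.shutdownNow();
        }
    }
}
```

A call such as `Retry.callWithRetry(() -> fetchProfile(id), 3, 500, 100)` (where fetchProfile is whatever network call you have) would try up to 3 times, waiting 100 ms and then 200 ms between attempts.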

Validations and Limits

Everything needs to be validated everywhere. We often have a multi-layer architecture with a front end, a backend, and multiple services. Adding validations at every layer, at least for simple input parameters when possible, saves latency and network round-trips and also improves the user experience through faster responses. Validations make your code more resilient; however, we also need to define limits. Every time you have something UNBOUNDED-* (an unbounded queue, for instance), you will run into problems. To fix this problem, we need to define limits.

Everything that is not capped (that is, has no limits) will eventually overflow. The overflow of one system might create an overflow in other systems. That's why we need CONTAINMENT: once a failure happens, we need to stop it instead of letting it cascade to other components and services. We will cover this in more detail in the next section.
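Here is a minimal sketch of validation plus limits working together. BoundedSignup, its size limit, and the deliberately simple email pattern are assumptions for illustration; the point is that bad input is rejected early and the queue refuses new work when it is full instead of growing forever.

```java
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.regex.Pattern;

final class BoundedSignup {
    // Illustrative limits; real values depend on your system.
    private static final int MAX_EMAIL_LENGTH = 254;
    // Deliberately simple shape check, not a full RFC-compliant validator.
    private static final Pattern SIMPLE_EMAIL =
            Pattern.compile("^[^@\\s]+@[^@\\s]+\\.[^@\\s]+$");

    private final BlockingQueue<String> pending;

    BoundedSignup(int capacity) {
        this.pending = new ArrayBlockingQueue<>(capacity); // bounded by construction
    }

    static boolean isValidEmail(String email) {
        return email != null
                && email.length() <= MAX_EMAIL_LENGTH
                && SIMPLE_EMAIL.matcher(email).matches();
    }

    // Returns false when the queue is full, so the caller can shed load or retry later.
    boolean submit(String email) {
        if (!isValidEmail(email)) {
            throw new IllegalArgumentException("invalid email");
        }
        return pending.offer(email); // offer never blocks and never grows past capacity
    }

    int pendingCount() {
        return pending.size();
    }
}
```

Returning false from submit is the containment point: the overflow stays inside this component and the caller decides what to do with the rejected request.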

Try/Catch as Resiliency Operator

Try/catch should be as granular as possible. A big try/catch block can be dangerous because it hides which step failed. On the other hand, a big try/catch block does make sure your code doesn't blow up. So you need to think about try/catch as the basic mechanism for CONTAINMENT and ISOLATION. Fine-grained try/catch lets you know precisely where your code is not working, so you get better understanding, logging, and observability if you publish internal metrics accordingly.
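A naive sketch of the idea, using a travel booking flow: the booking itself is essential, while car rental recommendations are optional. BookingFlow, BookingResult, and the two Supplier parameters are hypothetical stand-ins for real service clients.

```java
import java.util.Collections;
import java.util.List;
import java.util.function.Supplier;

final class BookingFlow {

    static final class BookingResult {
        final String confirmation;
        final List<String> carRentalOffers;

        BookingResult(String confirmation, List<String> carRentalOffers) {
            this.confirmation = confirmation;
            this.carRentalOffers = carRentalOffers;
        }
    }

    static BookingResult book(Supplier<String> bookingService,
                              Supplier<List<String>> carRentalService) {
        String confirmation;
        try {
            confirmation = bookingService.get();
        } catch (RuntimeException e) {
            // Essential step: without a booking there is nothing to return.
            throw new IllegalStateException("booking failed", e);
        }

        List<String> offers;
        try {
            offers = carRentalService.get();
        } catch (RuntimeException e) {
            // Optional step: log exactly what failed and degrade to an empty list.
            System.err.println("car rental recommendations unavailable: " + e.getMessage());
            offers = Collections.emptyList();
        }
        return new BookingResult(confirmation, offers);
    }
}
```

Because each call has its own try/catch, the log line (or a metric published in the same place) points at the exact component that failed.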

As you can see here, we are calling multiple services/components and each one might fail. But if I can't get a CarRental recommendation, that should not make the booking experience fail (which would be very bad for the user and the UX). So having fine-grained try/catch allows us to catch what's going on. This is a very naive sample; in real life, you might do things asynchronously (in a non-blocking IO way). Nevertheless, you would still have try/catch expressed in the form of callbacks, and you still would not blow up the whole process if one part fails. This leads us to NetflixOSS Hystrix, which we will talk more about in the Fallbacks section.

Good Defaults and Self-Tuning Systems

Good defaults often save time (configuration time) and also require less parametrization from the caller. Having the ability to change defaults at runtime is a must-have. The highest level of system maturity is when the system can perform self-tuning and change its internal defaults (thresholds) as it gathers data from the users and infers better defaults. This can avoid downtime, redeploys, and extensive tweak-and-reset cycles. In order to understand whether the system is self-regulating well, great observability is required.

Most of the time you won't need to code the WHOLE self-tuning system yourself, since something like an AWS Auto Scaling Group (ASG) already exists. However, you need to TRACK and PUBLISH internal system metrics, to CloudWatch for instance, so the ASG can auto-tune your component. Good defaults are more basic but also important in the sense that they can reduce configuration complexity.
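To illustrate the "good default that can change at runtime" part, here is a small sketch. TimeoutSettings, its bounds, and the "twice the observed p99 latency" rule are all assumptions chosen for illustration, not a recommendation for real thresholds.

```java
import java.util.concurrent.atomic.AtomicLong;

final class TimeoutSettings {
    // Illustrative default and bounds.
    static final long DEFAULT_TIMEOUT_MILLIS = 2000;
    static final long MIN_TIMEOUT_MILLIS = 100;
    static final long MAX_TIMEOUT_MILLIS = 30000;

    // AtomicLong lets the value change at runtime without a redeploy or a lock.
    private final AtomicLong timeoutMillis = new AtomicLong(DEFAULT_TIMEOUT_MILLIS);

    long timeoutMillis() {
        return timeoutMillis.get();
    }

    // Runtime override, clamped so a bad value cannot remove the limit.
    void update(long newTimeoutMillis) {
        long clamped = Math.max(MIN_TIMEOUT_MILLIS,
                Math.min(MAX_TIMEOUT_MILLIS, newTimeoutMillis));
        timeoutMillis.set(clamped);
    }

    // Naive self-tuning rule (an assumption): allow twice the observed p99 latency.
    void tuneFrom(long observedP99Millis) {
        update(observedP99Millis * 2);
    }
}
```

The caller never has to configure anything to get a sane timeout, and the same object can be fed by an operator, a config service, or the system's own metrics.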


Fallbacks

Fallbacks are the holy grail of internal system reliability. That's when we need to talk about Hystrix. Fallbacks are like catch blocks but with better instrumentation, since Hystrix already provides threads, caching, and metrics for each command. This is a much better and much more sophisticated mechanism than the try/catch we discussed before. What's important here (when possible) is to have multiple levels of fallbacks. This is the ultimate resiliency property: if one thing fails, you try another, and another, and another (multiple fallbacks before giving up).
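The sketch below shows the multi-level idea in plain Java 8, without Hystrix, so none of the instrumentation mentioned above is included. FallbackChain and firstSuccessful are hypothetical names; each source is tried in order and a static last-resort value is returned when everything else fails.

```java
import java.util.function.Supplier;

final class FallbackChain {
    private FallbackChain() {}

    // Tries each source in order; returns lastResort if all of them fail or return null.
    @SafeVarargs
    static <T> T firstSuccessful(T lastResort, Supplier<T>... sources) {
        for (Supplier<T> source : sources) {
            try {
                T value = source.get();
                if (value != null) {
                    return value;
                }
            } catch (RuntimeException e) {
                // This level failed; fall through to the next one.
            }
        }
        return lastResort;
    }
}
```

For example, `FallbackChain.firstSuccessful("default-recommendations", liveService, localCache)` would try the live service first, then the cache, and only then return the static default (liveService and localCache being whatever Supplier instances you have).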

It's important to keep in mind that fallbacks should be simple. However, in order to provide more resiliency, we need more code. A fallback can fail too, and you need to address that. I hope all these concepts and ideas help to shape your coding mindset and make you build more resilient systems, no matter if you are using the NetflixOSS stack or not.

Diego Pacheco
