Anti-Fragility Requires Chaos Engineering

There is a lot of buzz around microservices, and everybody wants to do it right now. The main problem is that people are not providing the right level of isolation: at the operating system or container level, at the database, and at the interface. Without that isolation, microservices are just web services with REST.

When you go from 10 services to 100, you create more distribution, and more distribution means more failure. The network is not reliable, even with a cloud provider like Amazon Web Services. Why do you need anti-fragility in the DevOps era we live in? Because otherwise we are just changing the names while making the same mistakes as before. You will build and operate software in an automated way, not only your deploy pipeline. But is that enough? No, it is not. Why not? Because your software and your infrastructure will fail, and as Werner Vogels from AWS said, everything fails all the time. You need to be able to recover quickly, and for that you need three basic things.

1. We need to design for failure

You can't bolt it on later, as in typical enterprise solutions; you need to architect and code your software so that failures are compartmentalized and one failure does not affect others. Otherwise you won't be able to provide experience degradation. What is experience degradation? Say you have a home banking site: if the credit card system is not operating, that's fine, you can still access your account, take out loans, and withdraw money. But wouldn't it be wrong if the credit card system being down took the whole bank down? Yet we design and code software exactly like that, as a huge monolith, and just pray that when a failure happens it does not spread.
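A minimal sketch of that idea (all names here are hypothetical, assuming each product area sits behind its own call): the home banking page is assembled from independent sections, so a failing backend degrades only its own section instead of taking the whole page down.

```python
# Hypothetical sketch: assemble a home banking page from independent
# sections so one failing backend degrades only its own section.

def accounts_section():
    return "balance: $1,200"

def credit_card_section():
    # Simulate the credit card backend being down.
    raise ConnectionError("credit card service unreachable")

def render_page(sections):
    """Render each section; a failure becomes a placeholder, not an outage."""
    rendered = {}
    for name, fetch in sections.items():
        try:
            rendered[name] = fetch()
        except Exception:
            rendered[name] = "temporarily unavailable"
    return rendered

page = render_page({"accounts": accounts_section,
                    "credit_card": credit_card_section})
print(page)
```

The accounts section still renders even though credit cards are down: that is the compartmentalization the design has to provide.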

2. Once you design it, you need to test it

That's when chaos engineering enters the room. You can't be sure your design works if you don't test it, right? So there are many kinds of tests you were not doing before that you need to do now. When you do TDD or agile with some Definition of Done (DoD), what was the DoD? Coded? PO accepted? Deployed? That's not enough anymore: it needs to be automated and battle tested. You need to test failure, not only the business logic we are used to testing with TDD.
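One concrete way to test the failure side and not just the business side (a hypothetical sketch, with made-up names): inject a failing dependency into the code under test and assert on the degraded behavior, exactly as you would assert on the happy path.

```python
# Hypothetical sketch: test the failure path, not only the happy path.

def get_recommendations(fetch):
    """Call a recommendations dependency; fall back to a safe default."""
    try:
        return fetch()
    except Exception:
        return ["top sellers"]  # degraded but still functional

def test_happy_path():
    assert get_recommendations(lambda: ["for you"]) == ["for you"]

def test_dependency_failure():
    def broken():
        raise TimeoutError("recommendations service timed out")
    # The failure is injected on purpose; the fallback must kick in.
    assert get_recommendations(broken) == ["top sellers"]

test_happy_path()
test_dependency_failure()
print("failure path tested")
```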

3. Get Ops with you

Even if you do all this on the Dev side, you need to do it on the Ops side too, because failures will happen; the best thing to do is be ready to handle them and keep improving. One great practice is incident training: you get your Dev and Ops folks together and go unplug some boxes to simulate failures, then you can see whether Ops can fix the problem. You should measure everything, both time and steps. This creates a baseline, so you can repeat the exercise from time to time, and your goal is to prevent that failure from happening again, or at least to reduce the time Ops needs to fix the problem. This is an ongoing exercise; it needs to happen regularly, as often as possible.

Chaos Engineering

It's about designing and testing both your infrastructure and your code. The Simian Army is a great cloud-native solution for AWS, created by Netflix, that can help you test whether your infrastructure recovers as you expected. My experience is that even if you are on AWS and using the NetflixOSS stack, you will need to tweak things and, believe me, you will find bugs. Better to figure this out early than late.

Design for Anti-Fragility

It's all about decoupling and having the ability to sustain partial failures (one failure does not affect or degrade other services, or affects as few of them as possible). And you, my friend, who are doing microservices without thinking about this: good luck :-) You need to embrace the shared-nothing model as much as possible; it is very strong in the functional programming and NoSQL communities. Scalability is about how apps manage state: the less you share, the easier it is to get things right. You don't want to block, either, so do everything async. Today there is a boom of Rx technologies, and you should go down that road (keep in mind that error handling is painful).
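A sketch of the non-blocking, partial-failure-friendly style (hypothetical service names, using plain `asyncio` rather than an Rx library): fan out to several services concurrently, and let a failing call degrade only its own result instead of blocking or killing the others.

```python
# Hypothetical sketch: non-blocking fan-out with asyncio; one failing
# call degrades only its own result, the others still complete.
import asyncio

async def call_service(name, delay, fail=False):
    await asyncio.sleep(delay)        # stand-in for a network call
    if fail:
        raise RuntimeError(f"{name} failed")
    return f"{name}: ok"

async def fan_out():
    # return_exceptions=True turns failures into inspectable values,
    # so one failure does not cancel the whole fan-out.
    return await asyncio.gather(
        call_service("accounts", 0.01),
        call_service("credit-cards", 0.01, fail=True),
        call_service("loans", 0.01),
        return_exceptions=True,
    )

results = asyncio.run(fan_out())
print(results)
```

Note the error-handling pain mentioned above: every result now has to be checked for being an exception before use.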

Failure is a natural state, and your code needs to address it explicitly. There is a difference between errors and failures, and you need explicit code to treat failures: the famous fallback mechanisms, which are simply callback code that runs when something fails. These ideas are embraced by Akka and NetflixOSS as well.
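A toy version of that mechanism (a hypothetical sketch in the spirit of Hystrix-style wrappers, not the NetflixOSS implementation): a fallback runs whenever the command fails, and after repeated failures the circuit opens and fails fast straight to the fallback.

```python
# Hypothetical sketch of a fallback mechanism with a simple circuit
# breaker: explicit code that runs when something fails.

class CircuitBreaker:
    """Open the circuit after max_failures errors; fail fast while open."""
    def __init__(self, max_failures=3):
        self.max_failures = max_failures
        self.failures = 0

    def call(self, command, fallback):
        if self.failures >= self.max_failures:
            return fallback()          # circuit open: fail fast
        try:
            result = command()
            self.failures = 0          # success closes the circuit again
            return result
        except Exception:
            self.failures += 1
            return fallback()          # explicit failure handling

breaker = CircuitBreaker(max_failures=2)

def flaky():
    raise ConnectionError("downstream service down")

for _ in range(4):
    print(breaker.call(flaky, lambda: "cached response"))
```

After two real failures the breaker stops calling the flaky service at all and serves the fallback directly, which is what keeps a failing dependency from dragging its callers down with it.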

Network Failures and Random Failures

Microservices call each other over the network, so you need to test this explicitly. There are tools like Comcast, iptables, Wireshark, and proxies you can use to simulate all kinds of TCP chaos: tampering, dropped packets, added delay, hanging forever, and so on. You need to test this between your service calls to make sure you have designed your services right. That's all great, but these are induced failures: you create them by design and expect your app to recover. Another cool thing to do is random failures. Once you have some level of maturity, you can have them running in production every day, when the code won't be expecting them. You should also randomize them, running at different hours with different methods and failure scenarios. Netflix does this with Chaos Monkey (the one that tears down instances), and for the last two years it has not been able to create trouble in prod anymore.
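A small self-contained version of one induced failure, the "hang forever" case (a hypothetical sketch using only the standard library, no external chaos tool): a local server accepts the connection and never replies, and the test verifies the client enforces a timeout instead of blocking.

```python
# Hypothetical sketch: induce a "hang forever" network failure locally
# and verify the client times out instead of blocking forever.
import socket
import threading

def silent_server(sock):
    conn, _ = sock.accept()        # accept, then never reply (simulated hang)
    threading.Event().wait(1)      # hold the connection open briefly
    conn.close()

server = socket.socket()
server.bind(("127.0.0.1", 0))      # bind to any free port
server.listen(1)
port = server.getsockname()[1]
threading.Thread(target=silent_server, args=(server,), daemon=True).start()

client = socket.create_connection(("127.0.0.1", port), timeout=0.2)
client.settimeout(0.2)             # the design decision under test
try:
    client.recv(1024)              # the server will never send anything
    outcome = "unexpected reply"
except socket.timeout:
    outcome = "timed out as designed"
finally:
    client.close()
print(outcome)
```

Tools like Comcast or tc do the same kind of thing at the network layer for real inter-host traffic; the point is the same either way: the failure is injected on purpose and the recovery behavior is asserted.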

Distributed computing is hard and failure is hard, but we need to start designing and testing for this from the beginning and be more proactive, instead of waiting for incidents in production to fix these kinds of issues. That's a natural step in our evolution.

Cheers,
Diego Pacheco
