Failure is Hard

A couple of days ago there was another massive outage on Amazon Web Services. This time it was around AWS DynamoDB, and it created lots of noise on the web.

It was not the first time an outage happened and it won't be the last, for sure. There are some summary reports from AWS you can check.

So is this the end of cloud computing? Should we stop using a single cloud provider? All these questions started to emerge on the web. I don't think that's the case, though for sure there are some lessons to learn from this last incident. The reality is that distributed computing is hard and always will be.

Distributed Systems Fallacies - The network is reliable - Fail

If you are doing distributed computing, failure will happen. The cloud is not magical. There are frameworks and approaches we need to embrace more in order to be more anti-fragile, like user experience degradation and failure containment. But even if we test for failure, there will be cases our tests won't catch. We just need to make sure we keep learning from failures. More distribution is not the answer, though: I mean, having multiple cloud providers actually creates more problems, because it requires more integration work.
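To make "user experience degradation and failure containment" a bit more concrete, here is a minimal sketch of a circuit breaker with a fallback. It is only an illustration: the `CircuitBreaker` class, its parameters and the two recommendation functions are hypothetical names I made up for this example, not the API of any real library.

```python
import time


class CircuitBreaker:
    """Stop calling a failing dependency for a while and serve a fallback instead."""

    def __init__(self, max_failures=3, reset_after=30.0, clock=time.monotonic):
        self.max_failures = max_failures  # consecutive failures before opening
        self.reset_after = reset_after    # cool-down in seconds while open
        self.clock = clock                # injectable clock (assumption, eases testing)
        self.failures = 0
        self.opened_at = None             # None means the circuit is closed (healthy)

    def call(self, primary, fallback):
        # While the circuit is open, skip the dependency until the cool-down ends.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None                  # cool-down over: allow one trial call
            self.failures = self.max_failures - 1  # one more failure re-opens the circuit
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()  # degrade instead of propagating the error
        self.failures = 0
        return result


def recommendations_from_db():
    # Hypothetical dependency; here it simulates the outage.
    raise ConnectionError("database unavailable")


def default_recommendations():
    # Degraded but still useful answer for the user.
    return ["top-10-most-popular"]


breaker = CircuitBreaker(max_failures=2, reset_after=30.0)
print(breaker.call(recommendations_from_db, default_recommendations))
```

The idea is that the user still gets an answer (a worse one) while the broken dependency is left alone to recover, so the failure does not spread to the rest of the system.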

Your Data Center 

One thing some companies do (often when they have BIG money and the expertise) is run their own data center. IF you do have the hardware and the expertise, you can go for it. You might get better availability because it is just your stuff there, but that doesn't make the task easy, and IMHO you are still going to have failures, and it could be even worse than using AWS or Azure.

Not being LOCKED in by Amazon or Microsoft is great, but carrying out this task is not easy, and most IT companies can't do it at the same level of expertise.

Multi-Region Architecture - COST / Complexity

Yes, we need multi-region architecture, and we need to be able to work with the active-active pattern. But to be clear, this is not easy at all: it introduces lots of COSTS and COMPLEXITY, not to mention LATENCY and REPLICATION ISSUES. Is it sometimes necessary? Yes, it is. Do we need it for every business? I don't think so. Are we going to need it more and more? Absolutely. This is not a FREE LUNCH: there are some solutions that deal with this better, but it is still lots of extra engineering work.
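Just to give a feeling for the simplest piece of the puzzle, here is a rough sketch of request failover between regions. Everything in it is an assumption for illustration: `call_with_failover`, `RegionUnavailable` and the region names are hypothetical, and the regions are plain Python functions instead of real endpoints. The expensive parts (data replication, latency, spare capacity) are not shown at all.

```python
class RegionUnavailable(Exception):
    """Raised by a region handler when that region cannot serve the request."""


def call_with_failover(regions, request):
    """Try each (name, handler) pair in order and return (name, response)
    from the first region that answers."""
    errors = {}
    for name, handler in regions:
        try:
            return name, handler(request)
        except RegionUnavailable as exc:
            errors[name] = str(exc)  # remember why this region was skipped
    raise RegionUnavailable(f"all regions failed: {errors}")


def us_east(request):
    # Hypothetical region handler simulating an outage.
    raise RegionUnavailable("simulated outage")


def us_west(request):
    # Hypothetical healthy region.
    return f"handled {request} in us-west"


regions = [("us-east", us_east), ("us-west", us_west)]
print(call_with_failover(regions, "GET /profile"))
```

Note that this only works if the second region already has the data and enough capacity to take the extra traffic, and that is exactly where the cost and the complexity come from.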

It takes time to mature, and remember there are 3 other forces happening at the same time: 1 is new features, 2 is new software updates and 3 is data migration. All these things together on a multi-region architecture create lots and lots of headaches.

Uber and others have their own data centers. Most companies, and most startups, can't afford this because it is very expensive. The thing is that public clouds like AWS and Azure are like celebrities: everything that happens to them is in the spotlight and gets amplified, for good and bad. This means you are going to have failures in your own data center too, but since it is just you there, it is less viral than when Amazon goes down.

The Steps Toward Evolution

IT operations are hard. Failure is even harder. There is a practice in the DevOps community called Blameless Incident Reviews (Post-Mortems), and from Agile coaching we have Retrospectives and Lessons Learned. It's crucial to do these exercises and to make changes in the tools and practices we use to build and operate software, so that it becomes more anti-fragile.

Failure in Hours per Cloud Providers 


This data is for 2014, so as you can see everybody fails - Amazon pretty much at least once a year - but things are getting better, and we need to continue investing and learning new ways to build, operate and test our services.

Failure will happen. Maybe we should just accept it and save some money for when it happens? Now, if you decide to build an in-house data center just because of failure, that is wrong and based on FEAR. And you know what? You will have failures in your company data center too. It's okay to have your own data center if you have the right reasons, people and money.

Update 24/09/2015: On 23 and 24 Sep I was in Santa Clara, CA for the 2015 Cassandra Summit and I talked with some Netflix guys. They told me Netflix did not go down during the DynamoDB issue at Amazon. They said Netflix went down only in the same data center where Amazon had issues, and the other data centers took over in the active-active pattern - they just operated at 50% capacity.

Cheers,
Diego Pacheco
