Experiences Building & Running a Stress Test / Chaos Platform

Stress testing is something everyone who cares about performance should be doing. I run software at AWS, so it is easy to spin up more boxes, but that is also more expensive, and it can be a trap: you might have inefficient software with memory leaks, hidden bugs, and untuned servers, all of which will show up under scale or stress tests, whichever happens first. Unfortunately, stress testing is still not that popular among microservice developers. Creating and running stress tests is hard and tedious because it involves lots of moving parts. A single microservice could call several other microservices, and some of the downstream dependency graphs can get complex quickly; this is just one of the many challenges involved. Today I want to share some experiences building and running a stress test and chaos platform. Do you know how many concurrent users or requests per second (RPS) your microservices can handle? Do you know what it will COST to scale your services? Do you know if you can SCALE your services in a COST-EFFECTIVE way? All these questions can be answered with proper stress tests.

Why Build a Stress Test / Chaos Platform

I run stress tests with Gatling. Gatling is a great tool, not only because you write your tests in Scala (not XML or some complex UI like JMeter). The Gatling Scala DSL is very simple and effective for testing microservices via HTTP / REST APIs. So why build a platform? Why is running locally not enough? First of all, running locally is fine for getting the stress test scenarios right, but as a baseline it is completely wrong, since you don't have the same hardware/infrastructure as production. Secondly, there are several other questions we need to answer in order to know what's going on, and these questions are impossible to answer locally:

  • Do you know what the previous results were? Are you faster or slower?
  • Does your service have more or less latency? What was it before?
  • Did you increase or decrease the RPS (requests per second)?
  • What about resource usage? Are you CPU, memory, IO, or network intensive?
  • What about chaos? If a downstream dependency dies, what happens to your service?
  • Do you test all that manually, or is it automated?
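For context, here is what a minimal Gatling simulation looks like (Gatling 3 syntax); the host, endpoint, and load numbers below are placeholders for illustration, not from the real project:

```scala
// Minimal Gatling simulation sketch: ramp 100 users over 30 seconds
// against a single REST endpoint and check for HTTP 200.
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class ProductApiSimulation extends Simulation {

  val httpProtocol = http
    .baseUrl("http://my-service.internal:8080") // placeholder host
    .acceptHeader("application/json")

  val scn = scenario("Get Product")
    .exec(
      http("get product")
        .get("/products/42") // placeholder endpoint
        .check(status.is(200))
    )

  setUp(
    scn.inject(rampUsers(100).during(30.seconds))
  ).protocols(httpProtocol)
}
```

Running this locally is enough to validate the scenario itself, but as the next sections argue, the numbers it produces only mean something on production-like infrastructure.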

I always remember the Google SRE book saying "Hope is not a strategy". Most developers don't care and/or don't have the tools to answer these questions in a productive way. To be able to answer them, we need the following capabilities:

  • Set up and run stress tests
  • Collect and store data for later comparison (baseline)
  • Analyze results
  • Isolation: make sure different people don't mess up each other's tests
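The baseline capability boils down to comparing a run against stored history. A small illustrative sketch (not the real platform code; names and the 10% tolerance are assumptions):

```scala
// Compare a test run against a stored baseline to answer
// "am I faster or slower than before?".
case class RunStats(rps: Double, p99LatencyMs: Double)

// Flags a regression when RPS drops or p99 latency grows
// beyond the given tolerance (default 10%).
def compareToBaseline(current: RunStats,
                      baseline: RunStats,
                      tolerance: Double = 0.10): List[String] = {
  val issues = scala.collection.mutable.ListBuffer[String]()
  if (current.rps < baseline.rps * (1 - tolerance))
    issues += f"RPS dropped: ${current.rps}%.0f vs baseline ${baseline.rps}%.0f"
  if (current.p99LatencyMs > baseline.p99LatencyMs * (1 + tolerance))
    issues += f"p99 regressed: ${current.p99LatencyMs}%.0f ms vs baseline ${baseline.p99LatencyMs}%.0f ms"
  issues.toList
}

// compareToBaseline(RunStats(850, 260), RunStats(1000, 200))
// flags both an RPS drop and a latency regression
```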

The last capability is complicated since, in theory, it would be possible to have a dedicated production environment per developer, but in practice this is not COST-EFFECTIVE. So some sort of SHARED environment is needed. Sharing an environment requires some kind of scheduling/placement, so we only run stress tests concurrently when they don't use the same downstream dependencies, in order to keep the tests fully isolated.
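The scheduling rule itself is simple: two tests may share the environment only if their downstream dependency sets are disjoint. A sketch of the idea (service names are hypothetical):

```scala
// Two stress tests can run together on the shared environment
// only if they touch no common downstream dependency.
case class StressTest(name: String, downstreamDeps: Set[String])

def canRunTogether(a: StressTest, b: StressTest): Boolean =
  a.downstreamDeps.intersect(b.downstreamDeps).isEmpty

val checkout = StressTest("checkout", Set("payment-svc", "cassandra-orders"))
val search   = StressTest("search",   Set("catalog-svc", "dynomite-cache"))
val cart     = StressTest("cart",     Set("payment-svc", "dynomite-cache"))

// checkout and search touch disjoint dependencies, so they can be
// scheduled together; cart shares "payment-svc" with checkout, so it waits.
```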

What about chaos engineering? My current project runs on AWS using the NetflixOSS stack. I run chaos tests with Chaos Monkey from the Simian Army. How do we check that, when a downstream dependency goes down, a service falls back to another AZ or to a static fallback and recovers from that failure in an automated way? The trick here was to use the stress tests for chaos verification: a stress test is running while the chaos testing is running. This way we re-use the same stress tests for chaos and don't need to write two sets of verifications.
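Gatling's global assertions are what make this re-use work: the same simulation that measures load can also assert that fallbacks kept the service healthy while a dependency was being killed. A hedged sketch (Gatling 3 syntax; the URL and thresholds are illustrative, not the project's real values):

```scala
// Stress test doubling as chaos verification: while Chaos Monkey kills
// a downstream dependency, these assertions check that fallbacks
// (other AZ or static fallback) keep requests succeeding and fast.
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class ChaosFallbackSimulation extends Simulation {

  val httpProtocol = http.baseUrl("http://my-service.internal:8080") // placeholder

  val scn = scenario("Survive dependency failure")
    .exec(
      http("read with fallback")
        .get("/products/42")
        .check(status.is(200))
    )

  setUp(scn.inject(constantUsersPerSec(50).during(5.minutes)))
    .protocols(httpProtocol)
    .assertions(
      // if the fallback works, the success rate barely moves during chaos
      global.successfulRequests.percent.gt(99),
      // fallbacks should be fast: circuit breakers, not long timeouts
      global.responseTime.max.lt(1000)
    )
}
```

If the fallback path is broken, the assertions fail the run, so the chaos verification comes for free with the stress test.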

What We Built / How It Works

There are 3 phases in the whole stress test / chaos process: Phase 1, plan and code; Phase 2, execution; Phase 3, analysis. Let's get started with Phase 1.

Phase 1 - Plan and Code

In Phase 1 (plan and code) you need to think about your exceptions in the sense of failure. This is only needed if you are running a chaos test; if you are just running a stress test, you don't need to worry about it. For a chaos test, we need to think about what should happen in each failure scenario and how the code/service should recover from it. Once you have that in mind, you can write your Gatling script. It is possible to write assertions in Gatling/Scala to check whether your assumptions were correct. You will write Scala code, and you might test it locally just to make sure the code is correct. Then you can move on and push this code to GitHub. We often create a project to make this easier, and folks often have multiple chaos/stress scenarios, so a project is handy. So this project is pushed to GitHub. Phase 2 is execution; let's explain what happens with the stress test / chaos code.

Phase 2 - Execution

The stress test / chaos platform runs in Jenkins (so we did not have to build a new UI). However, we only use Jenkins as the UI; the Gatling machine runs as a separate AMI, so if Jenkins fails we don't lose the tests. Doing it this way, we make sure the stress test and chaos platform is reliable, because Jenkins is not.

The Jenkins job is pretty simple and receives some parameters: the GitHub project URL, the scenario the developer wants to run, the number of users (threads) the test should run with, the duration of the test, and whether we should apply chaos or not, and if yes, to what IP. This information is sent from Jenkins to the Scala code; it is done this way so we can re-run the same test multiple times, increasing the load or the duration, which is quite handy. This is the main Jenkins job, but there is a second job, which we call Profile. This job runs your stress test multiple times with different numbers of users (threads), which we call rounds: first with 1 user, then 10 users, then 100, then 1k, 5k, 10k, and so on. The rounds come in as a parameter, so you can specify the sequence you want. Why do we do this? Because then we have an automated way to know when the service breaks, so we know how many users the service can handle and what the latency is as we increase users.
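One simple way for Jenkins parameters to reach the Scala code is JVM system properties passed on the Gatling launch command; this sketch assumes that mechanism, and the property names (`users`, `durationSec`, `targetUrl`) are made up for illustration:

```scala
// Parameterized Gatling simulation: Jenkins passes
// -Dusers=... -DdurationSec=... -DtargetUrl=... to the JVM,
// so the same test can be re-run with more load or a longer duration.
import io.gatling.core.Predef._
import io.gatling.http.Predef._
import scala.concurrent.duration._

class ParameterizedSimulation extends Simulation {

  val users       = sys.props.getOrElse("users", "10").toInt
  val durationSec = sys.props.getOrElse("durationSec", "60").toInt
  val targetUrl   = sys.props.getOrElse("targetUrl", "http://my-service.internal:8080")

  val httpProtocol = http.baseUrl(targetUrl)

  val scn = scenario("Parameterized load")
    .exec(http("request").get("/health").check(status.is(200)))

  setUp(scn.inject(rampUsers(users).during(durationSec.seconds)))
    .protocols(httpProtocol)
}
```

A Profile-style job can then just loop over the rounds, invoking the same simulation with `-Dusers=1`, `-Dusers=10`, `-Dusers=100`, and so on.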

Continuing with the flow: after the user triggers the Jenkins job, the stress test code project is cloned from GitHub, we spin up a new Gatling AMI, and then the target microservice is stressed during the period you specified. Every single machine we have has observability, so we send metrics via collectd to SignalFX. When we call a target microservice, that microservice might call another microservice, and that one another, and so on. Most of our services use Cassandra as the source of truth; others use Dynomite as the source of truth, and others might just use Dynomite as a cache. During the test, if you are running a chaos scenario, a new AMI with the Simian Army will pop up and kill some of your downstream dependencies. When the test is finished, all the Gatling metrics are sent to SignalFX and also zipped and sent to S3. We use Jasper Reports to generate PDF reports with all the metrics, so the developer has a nice PDF to do the analysis of that test. We also use D3 + Puppeteer to render the Gatling reports into images to show them in Jasper, so part of the platform has some Node.js + JavaScript code; most of the code is Scala. If developers want to do comparisons, they can go to SignalFX and get all the historical results; we have custom dashboards there.

Phase 3 - Analysis

Since all the data is in SignalFX, it is pretty easy to correlate Cassandra data with Gatling data and pretty much all the other information we have, like Hystrix metrics, OS-level metrics, and so on. Right now we do these comparisons manually, but in the future this will be done by a service that feeds into an Automated Canary Score, so if you degrade the performance or increase the latency, your deploy will fail. Often the common problems/bottlenecks are file descriptors not being tuned in Linux, connection pools, thread configurations, lack of caching, and too much logging.

Lessons Learned

Building a stress test / chaos platform is not that hard; getting developers to use it with discipline is the hard part. There are interesting challenges here, don't get me wrong, but some evangelism and support is the core part, I would say. Stress tests / chaos need to be on the Definition of Done or the production-ready checklist of the microservice teams, otherwise people might not use them as much as they should.

Another big learning for me was that not every developer likes to do stress testing / chaos engineering. I have worked with engineers who love it and others who hate it, so cultural FIT is something I care a lot about nowadays, making sure whoever works on my team wants to do the kind of work I'm doing and cares about DevOps.

Diego Pacheco
