Experiences Building a Cassandra Orchestrator (CM)
Cassandra is a kick-ass NoSQL database, battle tested by Netflix, Apple, Uber and many other cloud-scale companies. If you want to run Cassandra in production you either buy DataStax Enterprise or engineer your own solution, since the community edition alone is not enough. My company loves open source and my customers love open source as well, so we decided to build our own Cassandra orchestrator; for lack of a better name we called it CM (Cassandra Manager). CM runs on AWS (EC2) and manages single- and multi-region Cassandra clusters. CM has Java interfaces, so even though today it only runs on AWS it could easily be ported to containers or other fabric runtimes, which may happen in the long term. It might sound crazy to build this kind of engineering around Cassandra, but it actually is not: several companies in Silicon Valley, like Netflix, and outside the valley do similar things for their databases, NoSQL or relational. Automation is a must if you want to scale, and automation is way more than "automated deploys" - that part is easy. The ultimate automation is when you automate your operations, and that is what allows you to truly scale.
Why Build CM?
Basically, we wanted to be able to scale AWS single- and multi-region deploys and operations without having to write new Jenkins jobs or bash scripts every time we need to serve more clusters. Before CM we were doing backups with Jenkins jobs and we did not have an ASG on top of the Cassandra nodes, so the cloud ops team always had issues when a node went down and had to run bash scripts to put the node back online; with CM all of that changes. A single region is not that complex, but when we talk about multi-region things get more complicated. So automation is really the solution.
How do we build a complex system on the fly?
In order to build CM we had to do some POCs and also draw some Color UML diagrams based on Responsibility-Driven Design (RDD). This might sound a bit old school, like big design up front; however, there is no way to build such a system without some up-front thinking and strong design and architecture leadership. My team works with Kanban and I play multiple roles in the team: Tech Manager / Agile Coach, Engineer, Architect, DevOps Engineer, Testing and Developer Support. My team plays multiple roles as well, so an engineer in my team does Software Engineering, DevOps Engineering, Testing and Developer Support. My team and I work with the idea of self-service systems, where a developer can easily consume what we do, so there are generic Jenkins jobs and good documentation in place (internal wiki).
We use Kanban and track stories per week; with the simple math of items/week we could easily keep developing and deliver CM on time. CM took 3 months to get done. Thanks to our Kanban management and the internal system design with RDD and Color UML we were able to work normal hours. However, I underestimated the effort to get the system STABLE, and the stabilization period that should have been 2 weeks became 2 months. During CM tests my team and I found ~30 bugs in CM. I work with a small team right now (me and 2 people); when we started coding CM we were 4 people and 2 guys left, so after stabilization everything got a bit complicated and we had to work extra hours to keep up with the smaller team and lots of CM bugs. This was a bit stressful, but we overcame it and delivered to production :D
It's all about Availability
When I was thinking about CM I knew we would need to make it highly available. In order to do that, we made CM optional, so Cassandra can operate without it if CM dies or crashes after deploy. If CM dies or crashes, Cassandra keeps operating normally, and since CM is an orchestrator, the developers and applications that read/write to Cassandra never see CM. So if CM goes down we lose some operations like:
- Backups
- Repairs
- Node_Replaces
CM has an Auto Scaling group for itself and persists its internal state in S3, so we built a recovery process for CM: if that happens, the ASG spins up a new EC2 instance and CM keeps running again. CM was built with multiple layouts in mind: you can have 1 CM for several Cassandra clusters, since CM has its own thread pools, or you can have 1 CM per Cassandra cluster. This is great because it provides more availability and allows us to save costs in lower environments like DEV and TEST, where we can use a shared CM; in production we can also share a CM for small clusters, while sensitive clusters can have their own CM. This really works: we had an outage in production with CM, and Cassandra was not affected and kept running fine. All of this can be seen as reliability or anti-fragility and as a failure degradation mode (meaning a CM failure does not make Cassandra fail) - all great Netflix ideas we stole :D
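To make the recovery idea concrete, here is a minimal sketch of that state backup/restore cycle. This is not CM's actual code; the bucket name, key and JSON format are assumptions, and it uses the AWS SDK for Java v1:

```java
import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;

// Hypothetical sketch: persist CM internal state to S3 so a replacement
// instance launched by the ASG can resume where the old one stopped.
public class StateBackup {

  private final AmazonS3 s3 = AmazonS3ClientBuilder.defaultClient();
  private final String bucket;                      // e.g. "cm-internal-state" (assumed name)
  private final String key = "cm/internal-state.json";

  public StateBackup(String bucket) {
    this.bucket = bucket;
  }

  // Called periodically by CM: serialize the in-memory state and upload it.
  public void backup(String stateAsJson) {
    s3.putObject(bucket, key, stateAsJson);
  }

  // Called on startup: if a previous snapshot exists, reload it,
  // otherwise start with an empty state.
  public String restoreOrEmpty() {
    if (s3.doesObjectExist(bucket, key)) {
      return s3.getObjectAsString(bucket, key);
    }
    return "{}";
  }
}
```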
CM Philosophy - Self-Healing and Self-Operating
CM is built on a set of philosophies, like being a self-healing and self-operating system. Ops teams are used to operating systems in production because those systems lack a built-in philosophy. CM has a different philosophy: basically, you don't operate CM - CM does everything by itself (unless there is a BUG in CM - then you need to RECYCLE / OPERATE CM). CM was highly inspired by Dynomite, Dynomite-Manager and Priam from Netflix, although CM has a different architecture. I have worked a lot with Dynomite and Dynomite-Manager over the last 3 years, and that experience was very important and allowed me to get here.
Why not use Priam? Priam is great, don't get me wrong. However, for my use case I need to support Cassandra 2, Cassandra 3 and soon Cassandra 4 (when it gets released). Priam supports Cassandra 2 and 3.x; however, it does so with branches, and Netflix is migrating its whole fleet to the latest Cassandra, while we want longer-term support for Cassandra 2 and 3 and our own time windows. Priam is a co-process, also known as a sidecar, while Cassandra Manager is an orchestrator, meaning it does not live on the same node as Cassandra, so we can precisely control what runs on each node and when.
We don't want to run multiple things at the same time for the same cluster, like backups and repairs, so we make TASKS run SERIALIZED: one at a time per cluster, and in parallel across clusters within CM - this will be better covered later in this post.
Jenkins Builds
There are a few Jenkins jobs - basically, there are 2 main ones. One bakes AMIs: we can bake AMIs for Cassandra 2.x, Cassandra 3.x and CM. Then we have a job to launch CM. For DevOps engineering we use Ansible to provision things on Amazon Linux (our Cassandras run on the latest Amazon Linux version - we don't use Ubuntu anymore). We use Packer to build the AMIs for AWS EC2 - we build in 2 regions. We create the LC, ASG and EC2 instances for CM. Once CM is deployed, we use CM to deploy the Cassandra clusters. CM has Java code to provision all the Cassandra clusters; this way we do almost all the work in Java and keep as little bash code as possible, which is great because we take advantage of the great troubleshooting infrastructure and tools for Java, the JVM compiler, and good IDEs for refactoring like Eclipse.
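Just to illustrate the "provision from Java" point, here is a hedged sketch of launching Cassandra nodes from a baked AMI with the AWS SDK for Java v1. The instance type and parameter values are assumptions, not CM's real provisioning code:

```java
import com.amazonaws.services.ec2.AmazonEC2;
import com.amazonaws.services.ec2.AmazonEC2ClientBuilder;
import com.amazonaws.services.ec2.model.RunInstancesRequest;
import com.amazonaws.services.ec2.model.RunInstancesResult;

// Hypothetical sketch of the provisioning idea: given a baked AMI,
// launch Cassandra nodes from Java instead of from bash scripts.
public class NodeProvisioner {

  private final AmazonEC2 ec2 = AmazonEC2ClientBuilder.defaultClient();

  public RunInstancesResult launchNodes(String cassandraAmiId, String subnetId, int count) {
    RunInstancesRequest request = new RunInstancesRequest()
        .withImageId(cassandraAmiId)        // AMI baked by Packer + Ansible
        .withInstanceType("i3.2xlarge")     // assumed instance type
        .withSubnetId(subnetId)             // one subnet per AZ
        .withMinCount(count)
        .withMaxCount(count);
    return ec2.runInstances(request);
  }
}
```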
Flaky Integration Tests
Since almost all of the system is built in Java it was very easy to build integration tests. We had to have 2 test suites: one suite for Cassandra 2.x and another for Cassandra 3.x. We were able to share the same code and the same integration (end-to-end) tests for Cassandra 2.x and 3.x; however, we had to create a configuration that, at the beginning of the suite, sets the AMI and configs for Cassandra 2.x and later switches to Cassandra 3.x. My goal was to use JUnit 5; however, it looks like JUnit is moving further away from integration tests and removing suite features and runners (and we don't want to rely on legacy support, thanks), so we ended up using JUnit 4, which was better for integration tests.
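Here is a minimal sketch of what such a JUnit 4 suite can look like. The test class names and system properties are placeholders for the shared end-to-end tests, not our real ones:

```java
import org.junit.BeforeClass;
import org.junit.runner.RunWith;
import org.junit.runners.Suite;

// Hypothetical sketch: both suites reuse the same end-to-end test classes
// and only swap the AMI/config before the run. Class names are placeholders.
@RunWith(Suite.class)
@Suite.SuiteClasses({ BackupRestoreIT.class, SeedsManagerIT.class, RepairIT.class })
public class Cassandra2xSuite {

  @BeforeClass
  public static void pointTestsAtCassandra2() {
    // The shared tests read these properties to decide which AMI to deploy.
    System.setProperty("cm.test.cassandra.version", "2.x");
    System.setProperty("cm.test.cassandra.ami", "ami-cass2x"); // assumed id
  }
}
```

A Cassandra3xSuite would be identical except for the properties it sets.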
The integration tests started out great, but they easily took 20 minutes or more to run, since we deploy real Cassandra clusters (2.x and 3.x) during the tests, have to wait for AWS to create the objects, and later in the same suites shut those Cassandra clusters down. We test all major capabilities, like backups, restores and seeds management. So, as expected, our tests got flaky. Flaky tests suck. The main issues with our tests (discussed in an agile retrospective) were:
1. Timeouts are evil - we can't keep chasing timeouts - so we decided to switch to progressive timeouts.
2. Lack of verification - in math you can verify everything and check whether the result is correct. We were not doing that in the tests, nor in several tasks in the code. If one line does something like create a file, the next line should check that the file was properly created (see the sketch after this list).
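A hypothetical helper that combines both ideas - progressive timeouts and verifying that the action actually happened - could look like this (plain Java, no test framework assumed):

```java
import java.util.concurrent.TimeUnit;
import java.util.function.BooleanSupplier;

// Hypothetical helper illustrating both retrospective items: instead of a
// single fixed timeout, poll with a progressively longer interval, and
// always verify that the expected state was really reached before moving on.
public final class WaitFor {

  public static void condition(String description, BooleanSupplier check,
                               long maxWaitSeconds) throws InterruptedException {
    long waited = 0;
    long interval = 1; // seconds; grows progressively on each retry
    while (waited < maxWaitSeconds) {
      if (check.getAsBoolean()) {
        return; // verified: the expected state was reached
      }
      TimeUnit.SECONDS.sleep(interval);
      waited += interval;
      interval = Math.min(interval * 2, 30); // progressive backoff, capped
    }
    throw new AssertionError("Timed out waiting for: " + description);
  }
}
```

Usage would be something like WaitFor.condition("snapshot uploaded to S3", () -> s3.doesObjectExist(bucket, key), 600) instead of a fixed sleep followed by no check at all.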
We still have a lot of work to do to improve our automated integration tests. What really saved us for the release was the combination of chaos testing with exploratory tests and checklists. I don't want to paint a rosy picture of CM: I still have mixed feelings about the integration tests, and we lost a lot of time on this.
Generic Remediation for Cassandra
Since Dynomite, I had already built remediation systems. With Dynomite I was able to replace the AWS AMI or scale up Dynomite without data loss or downtime. This was possible because of 2 things: first, Dynomite-Manager had the cold bootstrap feature, and second, I coded a remediation process which would call the DM (Dynomite-Manager) health checker, check that the node was OK and not doing anything, then kill the node, wait for the new node to finish bootstrapping, and move on to the next one, and so on. I basically refactored this code (which was and is written in Java) and extended it into a generic remediation, so the same code works for Dynomite and Cassandra. With a few properties we can remediate any Cassandra node without data loss or downtime. How? Same idea: the generic remediation process goes node by node, but it calls the CM (Cassandra-Manager) health checker and asks whether CM is doing anything with that node; then we run a node repair in Cassandra so the data is copied from other nodes, which is very similar to the node_replace operation when a node dies. If you want to learn more about this I made separate posts in the past; they are here and here.
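The core loop is simple; here is a hedged sketch of the idea, with the health checker and node operations reduced to interfaces (the real DM/CM calls are REST calls, and all names here are just illustrative):

```java
import java.util.List;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch of the generic remediation loop: the interfaces stand in
// for the real CM/DM health-check and node operations, which are not shown here.
public class GenericRemediation {

  interface HealthChecker {
    boolean isIdle(String node);      // true when the manager is not operating on the node
    boolean isHealthy(String node);   // true when the node finished bootstrapping/repair
  }

  interface NodeOperator {
    void replaceAndRepair(String node); // e.g. terminate, let ASG replace, run repair
  }

  private final HealthChecker health;
  private final NodeOperator operator;

  public GenericRemediation(HealthChecker health, NodeOperator operator) {
    this.health = health;
    this.operator = operator;
  }

  // One node at a time: never touch a node the manager is busy with, and
  // wait until the replacement is healthy before moving to the next one.
  public void remediate(List<String> nodes) throws InterruptedException {
    for (String node : nodes) {
      while (!health.isIdle(node)) {
        TimeUnit.MINUTES.sleep(1);
      }
      operator.replaceAndRepair(node);
      while (!health.isHealthy(node)) {
        TimeUnit.MINUTES.sleep(1);
      }
    }
  }
}
```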
CM Architecture
CM is built primarily with Java 8, and inside each Cassandra node we have a heartbeat daemon written in Python which calls CM and decides whether the node needs to check in with CM (new node) or recover (meaning node_replace) in Cassandra.
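On the CM side, that decision can be as simple as tracking which instance last owned each slot; this is a hypothetical sketch of such logic, not CM's actual implementation:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of the check-in/recovery decision the heartbeat triggers:
// if CM has never seen an instance for that slot, it is a new node and checks in;
// if a previous instance owned the slot, this is a replacement and a
// node_replace/recovery flow starts. Names and logic are illustrative only.
public class NodeRegistry {

  public enum Decision { CHECK_IN, RECOVER }

  // slot (cluster + rack/token position) -> last instance id seen there
  private final Map<String, String> slots = new ConcurrentHashMap<>();

  public Decision onHeartbeat(String slot, String instanceId) {
    String previous = slots.put(slot, instanceId);
    if (previous == null || previous.equals(instanceId)) {
      return Decision.CHECK_IN;   // brand-new slot, or the same node heartbeating again
    }
    return Decision.RECOVER;      // a new instance took over an existing slot
  }
}
```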
We deploy highly available Cassandra clusters, with at least 1 node per AZ and at least 3 nodes per cluster. For multi-region we use clusters of at least 6 nodes. CM replicates its internal state to other CMs, mainly in other regions, in order to sync multi-region check-in for seeds management. CM was built using PEM files (for the Cassandra nodes), and we are currently looking to change this code to use the AWS multi-region VPC peering feature so we can switch from EIPs and public IPs to private IPs.
Let's take a closer look at the CM architecture now.
CM Architecture - More Internal Details
Basically, we have REST interfaces exposing all of CM's core functionality. These interfaces can be accessed with any REST tool - we built ours into a tool called CMSH, which is written in Go and will be covered later in this post.
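To give a feel for that surface, here is a minimal JAX-RS-style sketch. The post does not describe CM's real REST framework, paths or payloads, so everything here is an assumption:

```java
import javax.ws.rs.GET;
import javax.ws.rs.POST;
import javax.ws.rs.Path;
import javax.ws.rs.PathParam;
import javax.ws.rs.Produces;
import javax.ws.rs.core.MediaType;

// Hypothetical JAX-RS sketch of the kind of REST surface CM exposes;
// the real paths, payloads and framework may differ.
@Path("/clusters/{cluster}")
@Produces(MediaType.APPLICATION_JSON)
public class ClusterResource {

  @GET
  @Path("/health")
  public String health(@PathParam("cluster") String cluster) {
    // Returns what CM is currently doing (or not doing) with the cluster.
    return "{\"cluster\":\"" + cluster + "\",\"busy\":false}";
  }

  @POST
  @Path("/backup")
  public String triggerBackup(@PathParam("cluster") String cluster) {
    // Enqueues a backup task for the cluster; tasks of the same cluster
    // run serialized, one at a time.
    return "{\"cluster\":\"" + cluster + "\",\"task\":\"backup\",\"status\":\"queued\"}";
  }
}
```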
CM has a core class called the Internal State Manager, which manages all Cassandra clusters and state transitions. There is a bunch of Quartz tasks responsible for doing backups, repairs, restores, seeds management and so on. The internal state is backed up to S3, and from time to time CM sends internal metrics to SignalFX. We use collectd to send all OS-level metrics, and CM calls the ingestion API of SignalFX to send application metrics.
CM has an internal Tracker class which knows what every class is doing, and we have an internal framework called the Step Framework which keeps track of all sub-task-level execution state and stores that metadata internally. This internal state of CM is persisted to disk and, from time to time, also sent to S3. In the future we will back up CM state in the Cassandra nodes.
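As a rough idea of what the Step Framework does (the names and states here are made up for the example), every task is broken into steps, and a tracker records each step's state so it can be persisted:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of the Step Framework idea: every task is broken into
// steps, and the tracker records the state of each step so it can be
// persisted to disk/S3. These are not CM's real classes.
public class StepTracker {

  public enum StepState { PENDING, RUNNING, DONE, FAILED }

  public interface Step {
    String name();
    void run() throws Exception;
  }

  private final Map<String, StepState> states = new LinkedHashMap<>();

  public void execute(Iterable<Step> steps) {
    for (Step step : steps) {
      states.put(step.name(), StepState.RUNNING);
      try {
        step.run();
        states.put(step.name(), StepState.DONE);
      } catch (Exception e) {
        states.put(step.name(), StepState.FAILED);
        break; // stop the task; the failed step stays visible in the tracker
      }
    }
  }

  // Snapshot used by the periodic backup of CM internal state.
  public Map<String, StepState> snapshot() {
    return new LinkedHashMap<>(states);
  }
}
```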
When CM talks to Cassandra it uses SSH to manipulate the Cassandra filesystem and run local nodetool commands for backups and repairs. All the heavy work happens on the Cassandra node, not in CM - CM just does light state and coordination work. Even the S3 upload is done on the Cassandra node, not in CM.
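For illustration, this is roughly what running nodetool over SSH with a PEM key looks like from Java. The post does not say which SSH library CM uses; JSch is just an assumed example here:

```java
import com.jcraft.jsch.ChannelExec;
import com.jcraft.jsch.JSch;
import com.jcraft.jsch.Session;
import java.io.BufferedReader;
import java.io.InputStream;
import java.io.InputStreamReader;

// Hypothetical sketch of running a nodetool command on a Cassandra node over
// SSH with a PEM key. Library choice, user name and command are assumptions.
public class RemoteNodetool {

  public String run(String host, String pemFile, String command) throws Exception {
    JSch jsch = new JSch();
    jsch.addIdentity(pemFile);                        // PEM key for the Cassandra node
    Session session = jsch.getSession("ec2-user", host, 22);
    session.setConfig("StrictHostKeyChecking", "no");
    session.connect();

    ChannelExec channel = (ChannelExec) session.openChannel("exec");
    channel.setCommand(command);                      // e.g. "nodetool snapshot ks1"
    InputStream stdout = channel.getInputStream();
    channel.connect();

    StringBuilder output = new StringBuilder();
    try (BufferedReader reader = new BufferedReader(new InputStreamReader(stdout))) {
      String line;
      while ((line = reader.readLine()) != null) {
        output.append(line).append('\n');
      }
    }
    channel.disconnect();
    session.disconnect();
    return output.toString();
  }
}
```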
CM Thread Model and Issues
The first thread scheduling implementation used Quartz and reflection, and we realized it was opening too many threads. Basically, you want to create your threads up front and then just reuse them, like a thread pool. However, with CM we have an interesting problem, since we don't know how many Cassandra clusters we will need to deploy. The first implementation created a group of threads to run each cluster's operations, basically 4 threads per group. However, we were running things on Quartz threads without needing to, so the first refactoring was to move some of the work that was in Quartz threads, like heartbeats, state requests and the health checker, to REST-only operations, since they only interacted with CM's internal state and were very short-lived. The second refactoring was to drop Quartz and open the pool just once; however, this created a problem: with a big pool we would run things (TASKS) in parallel that we did not want in parallel (like backups and repairs), and with 1 thread it would be inefficient, since the machine could handle more load and we would have 2 different Cassandra clusters competing for the same resources. So it was clear that we needed a new thread model.
CM Old Thread Model - Thread Issues
So basically the new thread model needed to provide the following properties/requirements:
- Within the same cluster, be serial - run tasks in sequential arrival order
- Across different clusters - run everything in parallel
- Run CM's own tasks in parallel - SignalFX metrics, inter-node communication, recovery
- Don't break the current TASKS - we don't want to rewrite all the task code (Quartz tasks)
- Decouple threads from tasks and reuse the tasks
What's the solution? Simple: we use QUEUEs and workers. We create and destroy queues and workers all the time, as Cassandra clusters check in (created) and check out (destroyed). In order to make this work I had to create a QueueManager and a WorkerManager: the QueueManager is responsible for assigning tasks to queues, and the WorkerManager for allocating workers to threads from the pool. This solution turned out to be more efficient and provided all the design properties we wanted.
CM New Thread Model - Queues and Workers
Now Quartz is gone and all threads are managed by java.util.concurrent Executors. There are 4 main thread pools: 2 pools for the Cassandra clusters and 2 pools for CM itself. One pool serves the queues and the other schedules the recurrent tasks (like backups and repairs) onto the queues. Since it is a queue, once you consume an item there is nothing more to do with it.
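Here is a minimal sketch of that queue/worker model using plain java.util.concurrent: one queue and one worker per cluster, with workers drawn from a shared pool, so tasks are serialized per cluster but clusters run in parallel. It is a simplification (check-out/teardown is omitted) and the class names are illustrative, not CM's real QueueManager/WorkerManager:

```java
import java.util.Map;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: one task queue per cluster, one worker per queue taken
// from a shared pool, so tasks never overlap within a cluster but different
// clusters make progress in parallel.
public class ClusterTaskScheduler {

  private final ExecutorService workerPool = Executors.newCachedThreadPool();
  private final Map<String, BlockingQueue<Runnable>> queues = new ConcurrentHashMap<>();

  // Called when a Cassandra cluster checks in: create its queue and its worker.
  public void checkIn(String cluster) {
    queues.computeIfAbsent(cluster, name -> {
      BlockingQueue<Runnable> queue = new LinkedBlockingQueue<>();
      workerPool.submit(() -> consume(queue));
      return queue;
    });
  }

  // QueueManager role: assign a task (backup, repair, restore...) to the
  // cluster's queue; tasks of the same cluster run strictly one at a time.
  public void submit(String cluster, Runnable task) {
    BlockingQueue<Runnable> queue = queues.get(cluster);
    if (queue != null) {
      queue.add(task);
    }
  }

  // WorkerManager role: one worker drains one queue, one task at a time.
  private void consume(BlockingQueue<Runnable> queue) {
    try {
      while (!Thread.currentThread().isInterrupted()) {
        Runnable task = queue.take();
        task.run();
      }
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt(); // cluster checked out, stop the worker
    }
  }
}
```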
Footnote: back in 2010 I built a Scala middleware using these concepts with ActiveMQ and JBoss 5, so I basically applied similar ideas here on a smaller scale.
CMSH - Building a REPL in Go
If you work with Cassandra you might be familiar with cqlsh. CMSH is like cqlsh, but for CM and for the Cassandra clusters. CMSH is written in Go; it allows us to stress test Cassandra, create a schema, insert and remove data, and it also plays the role of a centralized control panel for CM: we can trigger backups and repairs from CMSH, see logs, and call all CM REST operations pretty easily. CMSH is a REPL, so it is much more productive than SSHing into each box and running alias scripts. It was also created so we can add functionality without having to rebake the Cassandra or CM AMIs. CMSH is very productive because you can connect to multiple CMs with it, and there is tab autocomplete, reverse search, persistent history and much more. At first I was coding my own REPL; later I moved to ishell, which was great and saved me a lot of time.
What's Next
There are many cool and challenging aspects of CM that were not covered in this post, like telemetry and self-healing, multi-region deploys and check-in, TTL eviction for nodes, stress testing CM, unit testing, tuning and optimizations, and much more; these might be covered in future posts. Right now the main focus of my team is to keep improving CM and to support VPC peering for multi-region deployments; maybe later this year we will do experiments with containers (Docker). Although I'm the tech leader and lead engineer, this is a team effort, and I want to thank Jackson and Tarzan for the hard-working hours and the fun of working together as an awesome team on the tech challenge that is CM.
I hope you guys like it, take care.
cheers,
Diego Pacheco