Generic and Script Remediation

Some time ago I shared my experience with a Dynomite Remediation process I wrote. I have been using remediation systems for a while and IMHO they add lots of value, since they automate manual Cloud Operations work and save people time. I recently refactored my remediation code, and now the same code supports Dynomite / Dynomite Manager and also Apache Cassandra. Dynomite and Cassandra remediation share very similar concepts: discovering the AWS EC2 IPs to be remediated, AWS resourcing (creating SGs, deleting LCs, updating ASGs) and health checking (Is the node up and running? Can I remediate right now, or is it in the middle of a backup or just booting up?). Once I identified those core concepts I was able to create a high-level, generic design for Cassandra and Dynomite.

This is great for Patch / Fix (apply a new AMI) or Scale Up (increase the memory or CPU, for instance). Although those are the main and most important use cases, there are corner cases, and there will always be URGENT fixes that don't and can't wait for a new AMI. Remediation allows AMI rollback and is therefore great for immutable infrastructure. I respect and agree with the immutable infrastructure principle; however, there are use cases where we need to do things differently. Don't get me wrong: if you don't craft your use case with care, it is easy to end up hurting immutable infrastructure and creating lots of side effects. That's not my goal here. However, the road to hell is paved with good intentions, so a tool like this needs to be used with lots of care, otherwise bad things will happen.

Right now you must be wondering: why do Script remediation? Why not use something like Terraform or Ansible and that's it? Ansible is a great tool and Terraform too; however, there are several kinds of problems and solutions. My problem is quite unique (considering my customer, my current problems, and my current stack). I do believe most of the things I'm sharing here can be used by a broader audience; however, if you don't have these problems or don't value DevOps automation, some things here might sound like too much.
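
As a rough illustration of the shared concepts above (discovery and health checking), here is a minimal Java sketch. It is not the actual code from this project: `ClusterAdapter`, `RemediationPlanner` and all method names are hypothetical, and AWS resourcing is left out on purpose. The idea is that Dynomite and Cassandra would each provide their own adapter, while the planning logic stays generic.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical abstraction: one implementation per data store (Dynomite, Cassandra).
interface ClusterAdapter {
    List<String> discoverNodeIps(String clusterName); // e.g. backed by an EC2 lookup
    boolean isUp(String ip);                          // is the node up and running?
    boolean isBusy(String ip);                        // mid-backup or still booting?
}

final class RemediationPlanner {
    // Returns the nodes that can be remediated right now, in discovery order.
    static List<String> readyNodes(ClusterAdapter adapter, String clusterName) {
        List<String> ready = new ArrayList<>();
        for (String ip : adapter.discoverNodeIps(clusterName)) {
            if (adapter.isUp(ip) && !adapter.isBusy(ip)) {
                ready.add(ip);
            }
        }
        return ready;
    }
}
```

Nodes that are down or busy are simply skipped here; a real remediation would more likely wait and retry them.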

Terraform, why not? 

First of all, Terraform is awesome. However, I have a very unique kind of problem here. I use NetflixOSS Dynomite and Dynomite Manager. Dynomite is written in C and Dynomite Manager in Java. DM (Dynomite Manager) does all kinds of automation work in Java, such as security groups and backups and restores to S3. So I don't have an Ansible or Terraform script to change; instead I have Java code, which is better IMHO. If you are curious about this DevOps style of work, please read this. Secondly, the kind of problem we have here (data layer, not microservices) requires very dynamic work. This kind of work could be done with Terraform, but it would require writing a plugin in Go or generating Terraform code. Since my stack (based on the NetflixOSS stack) is Java and I work in a Java shop, it makes sense to keep it in Java.

The Issue with Ansible

Ansible is cool. However, it runs into the same issues as Terraform. I would need to write a plugin in Python or generate Ansible code, which would not be ideal. So again, staying in Java is better, since DM is in Java already and, for someone with an engineering background like me, delivering DevOps with code (for data problems) is so much better. With Ansible I have a second issue: in my current project we don't use the Ansible inventory. We just use Ansible for provisioning, so this also makes discovery hard. Even if I had an inventory, there would be other problems: microservices developers often don't like doing anything related to DevOps work, so they might not even have access to the AWS console, and asking them to use Ansible might not be ideal. IMHO developers should be responsible for DevOps as well; however, I think core and platform teams need to provide better tools to make that work easy.

The case for Script Remediation

The Script remediation is generic. It's possible to apply bash scripts to Cassandra and Dynomite clusters using a generic Jenkins job. The Script remediation receives a bash script, which is base64-encoded so it can be passed as a Java property from Jenkins to Gradle, and also a timeout. The timeout is required because the generic script remediation connects to each cluster node and applies the script. Some scripts take more time than others, which is why we need to pass the timeout.
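
To make the base64 step concrete, here is a minimal sketch of how the script and the timeout could be encoded on one side and parsed on the Java side. It is an illustration under assumptions, not the project's code: the class name `ScriptArgs` and the property names `remediation.script` and `remediation.timeoutSeconds` are made up. Encoding matters because a bash script contains quotes, spaces and newlines that would otherwise break when passed as a single property value.

```java
import java.nio.charset.StandardCharsets;
import java.time.Duration;
import java.util.Base64;

// Hypothetical helper; the class and property names are illustrative only.
final class ScriptArgs {
    final String script;
    final Duration timeout;

    ScriptArgs(String script, Duration timeout) {
        this.script = script;
        this.timeout = timeout;
    }

    // Caller side: turn the script into a single token that is safe to pass as a property.
    static String encode(String script) {
        return Base64.getEncoder().encodeToString(script.getBytes(StandardCharsets.UTF_8));
    }

    // Java side: decode the script and validate the timeout (given in seconds).
    static ScriptArgs parse(String encodedScript, String timeoutSeconds) {
        if (encodedScript == null || encodedScript.isEmpty()) {
            throw new IllegalArgumentException("script is required");
        }
        byte[] raw = Base64.getDecoder().decode(encodedScript);
        long seconds = Long.parseLong(timeoutSeconds);
        if (seconds <= 0) {
            throw new IllegalArgumentException("timeout must be positive");
        }
        return new ScriptArgs(new String(raw, StandardCharsets.UTF_8), Duration.ofSeconds(seconds));
    }

    public static void main(String[] args) {
        // Assumed invocation: -Dremediation.script=<base64> -Dremediation.timeoutSeconds=60
        ScriptArgs parsed = parse(
                System.getProperty("remediation.script", encode("echo hello")),
                System.getProperty("remediation.timeoutSeconds", "60"));
        System.out.println(parsed.script + " (timeout " + parsed.timeout.getSeconds() + "s)");
    }
}
```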

The Generic Script Remediation is just simple Java code that connects to each node and applies the script. Some information is captured to generate a final report: host, AZ, time to execute the script, and whether it was successful or there were errors. That's it.
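
The loop described above could look roughly like the following sketch. All names (`ScriptRemediation`, `NodeExecutor`, `Node`, `Result`) are hypothetical, and how the script actually reaches a node (SSH, an agent, something else) is hidden behind an interface because the post does not say. Nodes are visited one at a time, each run is bounded by the timeout, and one report row is collected per node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Illustrative sketch only: every class and method name here is hypothetical.
final class ScriptRemediation {

    // How the script reaches a node is an assumption left to the caller.
    interface NodeExecutor {
        int run(String host, String script) throws Exception; // returns the exit code
    }

    static final class Node {
        final String host;
        final String az;
        Node(String host, String az) { this.host = host; this.az = az; }
    }

    // One row of the final report.
    static final class Result {
        final String host;
        final String az;
        final long millis;
        final boolean success;
        final String error; // null when the script succeeded
        Result(String host, String az, long millis, boolean success, String error) {
            this.host = host; this.az = az; this.millis = millis;
            this.success = success; this.error = error;
        }
        @Override public String toString() {
            return host + " [" + az + "] " + millis + "ms " + (success ? "OK" : "FAILED: " + error);
        }
    }

    // Applies the script to each node in turn, bounded by the timeout per node.
    static List<Result> apply(List<Node> nodes, String script, long timeoutSeconds, NodeExecutor executor) {
        ExecutorService pool = Executors.newCachedThreadPool();
        List<Result> report = new ArrayList<>();
        try {
            for (Node node : nodes) {
                Callable<Integer> task = () -> executor.run(node.host, script);
                long start = System.nanoTime();
                Future<Integer> future = pool.submit(task);
                String error = null;
                try {
                    int exitCode = future.get(timeoutSeconds, TimeUnit.SECONDS);
                    if (exitCode != 0) error = "exit code " + exitCode;
                } catch (TimeoutException e) {
                    future.cancel(true); // interrupt the task that is still running
                    error = "timed out after " + timeoutSeconds + "s";
                } catch (ExecutionException e) {
                    error = String.valueOf(e.getCause());
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    error = "interrupted";
                }
                long millis = (System.nanoTime() - start) / 1_000_000;
                report.add(new Result(node.host, node.az, millis, error == null, error));
            }
        } finally {
            pool.shutdownNow();
        }
        return report;
    }
}
```

Going node by node keeps the blast radius small: if the script fails on the first node, the report shows it before the whole cluster is touched. Stopping at the first failure would be an easy variation.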

This code works fine for both Dynomite and Cassandra. There are some use cases for this approach in my project, like:

  • Data Munging: Sometimes developers need to clean up data (stress test or investigation).
  • Ops Emergency: Security patch, telemetry or quick rollbacks.

Most of the time Patch / Fix remediation will be used, which enforces immutable infrastructure, but for the 2 use cases I described Script remediation is better and also faster. Regular (Patch / Fix) remediation takes a lot of time (~16 minutes for a 3-node cluster). Script remediation can be done in less than a minute.

Since you can pass any bash script, this could be a backdoor to hurt immutable infrastructure or even destroy the cluster, lose data, or create availability issues or even downtime. So this needs to be used with care and lots of caution. Used wisely, this is a great tool to save people time, reduce manual work and avoid repetitive tasks such as cleaning up data before a stress test.

What's Next

Currently, I'm wondering if I should have a 3rd kind of remediation based on security groups, because from time to time you need to open ports. DM generates all security groups (there is no script to change), and creating Java code just for that is too much. However, I'm not sure if that would be a real abstraction with real value or just another interface, since I don't see developers opening ports; this is a task for a cloud ops team, and they don't have issues with Ansible or Terraform. So they could go to the AWS console or run Ansible from a bastion node. Thinking this way, the 3rd kind of remediation based on SGs might be just another UI (Jenkins), and that's where I am. I also could not find a Java API for Terraform. I don't want to create a new DSL for remediation either; it would be better to leverage Terraform syntax, but then this would require coding a plugin or generating Terraform code. So for now I will wait and see if this business case becomes more important. Right now it has not come up many times and we need to be lean: automation is great, but there is a cost associated with it, and sometimes it is better to invest in new features.

Diego Pacheco
