Lessons Learned Using AWS Lambda as a Remediation System
Today I want to share some experiences I had over the last 2 years with AWS Lambda as a Serverless solution for Microservices Remediation. This was not a solo effort but a team effort. The lessons learned I want to share with you today are not only about Serverless but also about DevOps Engineering.
DevOps Engineering is the norm today. I really can't see a world without it. Code is the lingua franca of everything around IT. Currently, there are more and more abstractions and solutions rising for DevOps Engineering, so chances are we will need to code fewer things as time passes. However, as we push the boundaries of innovation we face new problems, and new problems will always require better solutions.
DevOps Engineering is great, however it's not a FREE LUNCH at all. Like microservices, which are great too, it introduces lots of COSTS. There is a requirement to structure and lay down teams in a different way, and there is also a need for some common components, also called architecture or platform. There are COSTS related to DevOps Engineering because when you add a new component, that component has standard properties that need to be fulfilled. Am I talking about requirements? No. I'm talking about the Infrastructure COST, which is not only money but a set of properties that need to be provided in order for you to have something usable that adds value.
The Hidden COSTS of DevOps Engineering
Besides the Hidden costs, there are explicit costs like Design, Development, Testing, and Troubleshooting. However, everybody knows these explicit COSTS, so I'm not focusing on them today. Let's talk about the HIDDEN COSTS.
This could also be read as "The Hidden Costs of Cloud Computing if you want to avoid a complete lock-in to your cloud vendor" or even as "The Hidden Costs of Microservices if you really want to do microservices". What are these hidden costs? Well, IMHO they are not really hidden, but IF you are not doing these things yet, everything might sound new, so let's get to the list:
- Deploy: You need to have something that can be deployed when you commit code, or with a simple PUSH in Jenkins, for instance.
- Provisioning: Everything you do needs to be fully AUTOMATED. This means not only the OS part but also the infrastructure - here we work with solutions like Ansible and Terraform.
- STABILITY: Since this is CORE and will be used by many microservices teams, the shared component needs to be STABLE.
- Observability: Everything you do might have issues and you might not know what is really going on, so it's crucial to have full observability, which means: Telemetry (Dashboards and Alerts), Centralized Logging, Distributed Tracing, Notifications (Slack) and so on.
- Operations: So what happens if you need to change a config or tweak something? If you need to write CODE or open a ticket, Houston, we have a problem. Automated Operations are hard but really pay off in the long run, since they let you scale and provide a proper experience for your users (developers). Automated Operations are also known as Remediation: when you have software that reacts to events and does not require manual human intervention.
So every time you need to introduce a new shared infrastructure component, you will need to provide the properties above. This is great, however, it's not free. This is also true when we talk about Databases, since DBs are common shared infrastructure components. There is a tradeoff between having the best Application Design and the Infrastructure Cost.
The Problem: Zombies
My current project uses the NetflixOSS Stack, which is great by the way. We do Java-based Cloud-Native Microservices. We use NetflixOSS Eureka as our mid-tier Discovery and Registry system for microservices. We run the microservices on AWS. We don't use ELBs on top of the microservices since we use Eureka, but we do use AutoScaling Groups. Sometimes we ran into strange scenarios where the EC2 machine was Up and Running but the JVM had crashed. The ASG would not recycle instances in this case and, eventually, this scenario created Zombies. Zombies created serious issues - leading to downtime and availability problems.
There was a need to build a simple remediation system which could detect these "Zombies" and kill the instances, letting the ASG recycle those instances and boot up fresh new ones. This, of course, was not the solution for all problems, since sometimes you might have a bug in your microservice or a connectivity issue (missing security group rule). So this was the main rationale for building a remediation system for microservices running with NetflixOSS on EC2.
The Solution V1
The first solution was very simple: we basically would call Eureka from time to time, get the list of all microservices which were UP and RUNNING, and then call each microservice's Health Checker. If the health checker returned anything different from HTTP Status Code 200, or timed out (on the HTTP request to the health checker), that service was not OK - so maybe we were talking about a Zombie. So if the Health Checker returned !200 more than 3 times, we would kill that EC2 instance and let the ASG spin up a new one.
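To make the idea concrete, here is a minimal sketch of that V1 loop - not the actual production code. It assumes Eureka's standard /eureka/apps REST endpoint, a /health path on each instance, and boto3 for terminating the zombie; the failure counter is kept in memory just for illustration.

```python
# Minimal sketch of the V1 remediation loop (illustrative, not production code).
# Assumptions: EUREKA_URL points to a Eureka server exposing the standard
# /eureka/apps REST endpoint and each instance exposes a /health endpoint.
import requests
import boto3

EUREKA_URL = "http://eureka.internal:8080/eureka/apps"  # hypothetical endpoint
MAX_FAILURES = 3
failures = {}  # instance_id -> consecutive failure count (in memory for the sketch)

def get_instances():
    """Fetch all registered instances from Eureka as JSON."""
    resp = requests.get(EUREKA_URL, headers={"Accept": "application/json"}, timeout=10)
    resp.raise_for_status()
    for app in resp.json()["applications"]["application"]:
        for instance in app["instance"]:
            yield instance

def is_healthy(instance):
    """Call the instance health checker; anything but 200 (or a timeout) is a failure."""
    url = "http://{}:{}/health".format(instance["ipAddr"], instance["port"]["$"])
    try:
        return requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        return False

def remediate():
    ec2 = boto3.client("ec2")
    for instance in get_instances():
        instance_id = instance["instanceId"]
        if is_healthy(instance):
            failures.pop(instance_id, None)
            continue
        failures[instance_id] = failures.get(instance_id, 0) + 1
        if failures[instance_id] > MAX_FAILURES:
            # Kill the zombie; the AutoScaling Group will replace it.
            ec2.terminate_instances(InstanceIds=[instance_id])
            failures.pop(instance_id, None)

if __name__ == "__main__":
    remediate()
```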
This solution was coded in Python and was initially running on Jenkins. We had several issues running it there. First of all, the solution was coded without proper timeout control, so the running time varied too much and executions were overlapping the scheduled runs in Jenkins. The second issue was the Jenkins queue. Back then, we did not have a dedicated replica, so we were competing with everybody else for the same queue. This design brought many pains to my team, since we ended up creating an outage in Jenkins. Besides that, Jenkins was not, and was never designed to be, a RELIABLE system, so we realized this solution should never be running there.
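Just to illustrate the timeout point: the fix is simply to bound every HTTP call and enforce an overall deadline per run, so an execution can never overlap the next scheduled one. A minimal sketch, with illustrative names and values:

```python
# Sketch of the timeout control that was missing in V1: a per-call timeout
# plus an overall deadline so a run always finishes before the next schedule.
import time
import requests

RUN_DEADLINE_SECONDS = 240  # stay well under the scheduling interval (assumption)

def check_with_deadline(urls):
    start = time.time()
    results = {}
    for url in urls:
        if time.time() - start > RUN_DEADLINE_SECONDS:
            break  # give up instead of overlapping the next run
        try:
            results[url] = requests.get(url, timeout=5).status_code  # per-call timeout
        except requests.RequestException:
            results[url] = None
    return results
```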
We picked Jenkins as the running fabric for a few reasons, but the biggest one was that Jenkins was FREE, in the sense that we would not have any of the infrastructure or DevOps Hidden COSTS I described above.
AWS Lambda: Solution V2
AWS Lambda was a FREE solution as well. Not as free as Jenkins, but really close. We got better reliability, and since Lambda had support for Python we could re-use our code and just change the running fabric almost for free.
So the Python code was split into 2 AWS Lambda functions. The first lambda was calling Eureka and, for each microservice, sending a message to an SNS topic. SNS has a nice property with AWS Lambda: you can spin up a lambda instance for each SNS message that arrives. The second lambda was refactored to hold the rest of the Python code. It was responsible for calling the microservice health checker, and if the health checker returned something different from 200, 3 times, the EC2 instance would be killed. This system had other components, like AWS CloudWatch for triggering the first lambda every 1 minute. We used NetflixOSS Dynomite as the tracking and record system, and Slack as the Developer and Operations notification channel, so when something goes wrong, i.e. an unexpected Exception or an instance being killed, we pop up notifications there.
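Here is a rough sketch of how the two handlers could be wired together. The topic ARN, endpoints and key names are placeholders, and since Dynomite speaks the Redis protocol the tracking counter is illustrated with a plain redis client; the Slack notification is omitted for brevity.

```python
# Sketch of the two-Lambda design (illustrative names, not the production code).
# Lambda 1 fans out one SNS message per registered microservice instance;
# Lambda 2 checks health and terminates zombies after 3 consecutive failures.
import json
import boto3
import redis
import requests

sns = boto3.client("sns")
ec2 = boto3.client("ec2")
tracking = redis.StrictRedis(host="dynomite.internal", port=8102)    # assumption
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:zombie-check"        # placeholder
EUREKA_URL = "http://eureka.internal:8080/eureka/apps"               # assumption

def eureka_poller_handler(event, context):
    """Lambda 1: triggered by CloudWatch every minute, fans out via SNS."""
    resp = requests.get(EUREKA_URL, headers={"Accept": "application/json"}, timeout=10)
    for app in resp.json()["applications"]["application"]:
        for instance in app["instance"]:
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps({
                "instanceId": instance["instanceId"],
                "ip": instance["ipAddr"],
                "port": instance["port"]["$"],
            }))

def health_check_handler(event, context):
    """Lambda 2: one invocation per SNS message, one microservice instance each."""
    msg = json.loads(event["Records"][0]["Sns"]["Message"])
    url = "http://{}:{}/health".format(msg["ip"], msg["port"])
    try:
        healthy = requests.get(url, timeout=5).status_code == 200
    except requests.RequestException:
        healthy = False

    key = "failures:" + msg["instanceId"]
    if healthy:
        tracking.delete(key)
        return
    if tracking.incr(key) >= 3:
        # Zombie detected: terminate and let the ASG bring up a fresh instance.
        ec2.terminate_instances(InstanceIds=[msg["instanceId"]])
        tracking.delete(key)
```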
This solution worked way better than the previous one with Jenkins, and we still benefited from a lower infrastructure COST. You might be wondering about Dynomite; this was not a problem since we already had a full infrastructure set up for it, with proper self-service generic deploy, telemetry, and a driver solution.
AWS Lambda Issues / Limitations
I don't want to paint a rosy picture of Lambda. Like everything in life, there are tradeoffs and things that could be better - so these were the pain points we had:
- There were some issues setting up the connection between AWS Lambda and our AWS VPC.
- Troubleshooting in Lambda is still painful: CloudWatch logs take time to appear, and the search is far from ideal.
- There is a limit of 1k concurrent executions: after 1k you get QUEUED, and after 2k your requests will be dropped.
- This is not a Lambda issue per se, but the code gets very ugly since there are lots of ifs. Later I realized this problem could be a better FIT for an FSM design. However, AWS Step Functions is not quite there yet. One day it might be like Apache Camel or some old ESB, but today it is very limited in the sense of EIP Patterns.
- Lambda has 5 minutes to run - if it takes more time, you might be terminated. There are more limitations beyond this one.
- The execution limit is global; in other words, it's not per lambda. It's possible to set it per function, but that is not the default.
- There is some Hidden cache - IF you re-call your lambda at a frequency lower than roughly 5 minutes, you will see some "cache" behavior depending on how you structure your Python code (see the sketch after this list).
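The "hidden cache" item is really container reuse: anything defined at module level survives between warm invocations. A minimal sketch, assuming a plain Python Lambda handler:

```python
# Sketch of the "hidden cache" (container reuse) behavior in AWS Lambda.
# Module-level state persists across warm invocations, which surprises you
# if you expect a clean slate on every call.
import time

boot_time = time.time()   # runs once per container, NOT once per invocation
call_count = 0            # persists across warm invocations

def handler(event, context):
    global call_count
    call_count += 1
    # On a warm container you will see call_count > 1 and the same boot_time,
    # so caches, connections and counters kept at module level "leak" between calls.
    return {"boot_time": boot_time, "call_count": call_count}
```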
In general, I'm happy with Lambda as the runtime fabric for a Remediation Solution. Although I think an FSM solution would be a better design FIT, it would also require more DevOps Engineering COST, as I said before. If you are curious about FSMs, here are some interesting solutions (a small sketch with the Python one follows the list):
- Akka FSM
- Spring Statemachine
- Transitions(Python)
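For a taste of what the FSM shape could look like, here is a rough sketch using the Transitions(Python) library; the states and triggers are my own illustration, not the production design.

```python
# Sketch of the remediation flow modeled as an FSM with the transitions library.
# States and triggers are illustrative: healthy -> suspect -> zombie.
from transitions import Machine

class InstanceRemediation(object):
    """Tracks one EC2 instance through the remediation states."""

    def __init__(self, instance_id):
        self.instance_id = instance_id

    def on_enter_zombie(self):
        # In a real system this is where the instance would be terminated
        # and a Slack notification sent.
        print("terminating zombie instance %s" % self.instance_id)

states = ["healthy", "suspect", "zombie"]
transitions_table = [
    {"trigger": "check_failed", "source": "healthy", "dest": "suspect"},
    {"trigger": "check_failed", "source": "suspect", "dest": "zombie"},
    {"trigger": "check_ok", "source": ["healthy", "suspect"], "dest": "healthy"},
]

instance = InstanceRemediation("i-0123456789abcdef0")  # hypothetical instance id
Machine(model=instance, states=states, transitions=transitions_table, initial="healthy")

instance.check_failed()  # healthy -> suspect
instance.check_failed()  # suspect -> zombie, on_enter_zombie fires
print(instance.state)    # "zombie"
```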
I hope these lessons learned help you somehow, and that you consider writing more and more automated operations solutions, because that's the way to go and the evolution of DevOps Engineering toward Continuous Autonomous Operations. I think AWS Lambda has interesting features, Serverless in general is growing a lot and reducing some infrastructure costs, and for some use cases it looks like a nice FIT.
Cheers,
Diego Pacheco