Telemetry and Microservices: Part 2
This is the second blog post of the Telemetry blog series. If you did not read the first one, check it out here.
Telemetry is more than just HOST monitoring and alerting. You can find lots of tools for basic telemetry, but it is hard to find a comprehensive solution that addresses all modern needs.
Microservices are quite interesting for operations and for the business; however, they require better telemetry platforms and tools.
Especially if you have Cloud-Native microservices and you are doing A/B testing and lots of live experimentation instead of relying on old business analyst predictions.
Telemetry is hard and should scale as your architecture scales, I would say at the same pace. Otherwise, you will have trouble!
Automated Canary Analysis
This is one of the coolest things ever IMHO if you are doing DevOps Engineering. Continuous Deployment is easy nowadays, I would say it's fairly trivial. However, verifying whether the build, feature, or experiment went well is not that simple.
There are so many key metrics you can look at. For instance, there are basic HOST metrics like CPU, load average, disk, memory, latency, and network that you can compare to tell if a deploy went well. Sometimes a release can degrade performance. This requires your telemetry platform to have a Time Series Database (TSDB) with several levels of resolution in order to compare against previous metrics.
HOST metrics are one thing, a very basic thing. The error rate is a good metric, but it is not enough. As microservices give us more flexibility to use different data stores and engines, you will need specific metrics for each of them. The key thing is: correlate these metrics and calculate a SCORE, on which you can set levels of confidence, and based on that you can automate the canary analysis. You can and should look at more complex metrics. The sketch below shows the basic idea.
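Here is a minimal sketch of how such a canary score could be computed. This is not any specific tool's API; the metric names, weights, tolerance, and the deviation-based scoring are assumptions I made up just to illustrate the idea of combining metric comparisons into a single confidence score.

```python
# Hypothetical canary scoring sketch: compare each metric of the canary
# against the production baseline and combine the deviations into one score.
# Metric names, weights and thresholds below are made up for illustration.

def metric_score(baseline: float, canary: float, tolerance: float) -> float:
    """1.0 when the canary matches the baseline, degrading towards 0.0
    as the relative deviation grows past the tolerance."""
    if baseline == 0:
        return 1.0 if canary == 0 else 0.0
    deviation = abs(canary - baseline) / baseline
    return max(0.0, 1.0 - deviation / tolerance)

def canary_score(baseline: dict, canary: dict, weights: dict,
                 tolerance: float = 0.2) -> float:
    """Weighted average of per-metric scores, between 0.0 and 1.0."""
    total = sum(weights.values())
    return sum(
        weights[m] * metric_score(baseline[m], canary[m], tolerance)
        for m in weights
    ) / total

# Fabricated example: compare a canary deployment against production.
baseline = {"cpu": 0.45, "p99_latency_ms": 120, "error_rate": 0.01}
canary   = {"cpu": 0.50, "p99_latency_ms": 180, "error_rate": 0.03}
weights  = {"cpu": 1.0, "p99_latency_ms": 2.0, "error_rate": 3.0}

score = canary_score(baseline, canary, weights)
if score >= 0.9:
    print(f"score={score:.2f} -> promote the canary")
elif score >= 0.7:
    print(f"score={score:.2f} -> keep observing")
else:
    print(f"score={score:.2f} -> roll back")
```

The interesting design decision is the weights: the error rate and latency usually matter more than raw CPU, so a small regression there should drag the score down faster than a small CPU bump.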
Dynamic Thresholds with ML
Static thresholds suck. They are a good starting point, but as your microservices grow in number they start to create lots of trouble. Static thresholds are based on your current knowledge, which does not take into account future needs, spikes, and growth.
The worst part is that you will create alerts on top of these thresholds, and it is likely you will be running into false positives. Another issue you can run into with Cloud-Native microservices is the effect of scaling up and scaling down: your alert manager will trigger lots of occurrences when the fire might not be real at all.
Intelligent, modern monitoring software works with dynamic thresholds and Machine Learning (ML). People often use topological maps to correlate HOST alerts with system-wide events. This is crucial for Cloud-Native microservices; otherwise, you will run into madness. The sketch after this paragraph shows the simplest form of the idea.
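A very simple way to get a feeling for dynamic thresholds, without any real ML, is to derive the threshold from the metric's own recent history instead of hardcoding it. A real ML-based approach would handle seasonality and anomaly detection much better; the window size and sigma multiplier below are just assumptions for illustration.

```python
import statistics

# Hypothetical sketch: the alert threshold comes from the metric's recent
# history (mean + N standard deviations) instead of a hardcoded number.
# Window size and sigma multiplier are made-up values for illustration.

def dynamic_threshold(history: list[float], sigmas: float = 3.0) -> float:
    """The threshold adapts to whatever 'normal' looks like for this service."""
    mean = statistics.mean(history)
    stdev = statistics.pstdev(history)
    return mean + sigmas * stdev

# Last hour of CPU load samples for one service (fabricated data).
last_hour = [2.1, 2.4, 1.9, 2.2, 2.8, 2.5, 2.0, 2.3, 2.6, 2.2]

current = 6.5
threshold = dynamic_threshold(last_hour)

if current > threshold:
    print(f"ALERT: load {current} above dynamic threshold {threshold:.2f}")
else:
    print(f"OK: load {current} within dynamic threshold {threshold:.2f}")
```

The point is that a service which normally sits at load 2 gets a threshold around 3, while a chunkier service that normally sits at 8 would get a higher one, with no per-service tuning by hand.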
Most of the tools have a way to do timeshift analytics. For instance, let's say that when CPU load is bigger than 10 you want to alert by mail and Slack. First of all, your telemetry solution needs to collect and store the data points (often as time series). Secondly, you will need an analytical function that does:
Today's 1h CPU average / Yesterday's 1h CPU average > 2
Then you will have a way better mechanism to alert on it. However, this is not the default model we find in solutions like Nagios and Sensu, which are based on the checker model. Of course, they can work with TSDBs like InfluxDB, but that is not the default behavior.
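As a rough sketch of that timeshift check, assuming you can query your TSDB for a window of data points: the `query_tsdb` function here is a made-up placeholder, not a real InfluxDB or Prometheus client call.

```python
from statistics import mean

# Hypothetical timeshift check: compare today's 1h CPU average against the
# same 1h window yesterday, instead of a static "CPU load > 10" rule.

SECONDS_IN_DAY = 24 * 3600

def query_tsdb(metric: str, start: int, end: int) -> list[float]:
    """Placeholder: return data points for `metric` between two unix
    timestamps. In real life this would issue an InfluxQL/PromQL query."""
    raise NotImplementedError

def cpu_ratio_vs_yesterday(now: int, window: int = 3600) -> float:
    today = query_tsdb("cpu.load", now - window, now)
    yesterday = query_tsdb("cpu.load",
                           now - SECONDS_IN_DAY - window,
                           now - SECONDS_IN_DAY)
    return mean(today) / mean(yesterday)

def should_alert(now: int) -> bool:
    # Fire the mail/Slack alert only when today's average is more than
    # twice yesterday's average for the same window.
    return cpu_ratio_vs_yesterday(now) > 2
```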
Cheers,
Diego Pacheco