Telemetry and Microservices part 3

In the previous posts of this series, Telemetry and Microservices part 1 and part 2, I described the features a modern observability / telemetry platform should have in order to address microservices monitoring needs. I've also blogged about alerting and monitoring with Sensu, part 1 and part 2. In this post I will cover some components of the open source telemetry landscape that you can take advantage of in your microservice monitoring architecture.

I will talk a little bit more about the following solutions:

1. Sensu
2. Prometheus
3. OpenTSDB
4. Netflix Atlas
5. Riemann
6. Graphite


Sensu

Sensu is one of the most popular monitoring / alerting solutions nowadays. Sensu does not have a Time Series Database, so if you want proper dynamic thresholds / ML you will need something else, and it is all about the OS. You will also need something like Ansible or Chef to get Sensu working properly. Sensu has some issues and annoying things, like the internal usage of the /tmp folder. In general it is an OK solution with lots of plugins, because it follows the same model as Nagios and Zabbix (a script returns 0, 1 or 2).
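To make the "script returns 0, 1 or 2" model concrete, here is a minimal sketch of a Nagios-style check in Python. The check name, the thresholds and the function names are my own assumptions for illustration; this is not an official Sensu plugin.

```python
import shutil

# Exit codes shared by Nagios-style checks.
OK, WARNING, CRITICAL = 0, 1, 2
LABELS = {OK: "OK", WARNING: "WARNING", CRITICAL: "CRITICAL"}


def classify(used_percent, warn=80.0, crit=90.0):
    """Map a usage percentage to an exit status (thresholds are assumptions)."""
    if used_percent >= crit:
        return CRITICAL
    if used_percent >= warn:
        return WARNING
    return OK


def render(status, used_percent, path):
    """Build the single status line that the check prints."""
    return f"CheckDisk {LABELS[status]}: {used_percent:.1f}% used on {path}"


def main(path="/"):
    """Run the check once, print the status line and return the exit code."""
    usage = shutil.disk_usage(path)
    used_percent = 100.0 * usage.used / usage.total
    status = classify(used_percent)
    print(render(status, used_percent, path))
    return status
```

To use it as a real check script you would end the file with `raise SystemExit(main())`. Note that only the status line and the exit code leave the machine, which is why this model does not give you the raw data.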

Highlights 

- Cases: Tesla, Cisco, Yahoo, Yelp
- Client/Server
- Checks (exit codes 0, 1, 2 -> monitoring) -> Don't collect the RAW data (just the status, OK or not)
- Handlers(Transport: RabbitMQ, pipe(STDIN), TCP/UDP)
- Filters: Filter Key/Value Match
- Mutators: Transform Data, Reduce code duplication
- Plugins(Nagios compatible) Scripts -> Ruby/Python (Lots of plugins)
- UI(Uchiwa)
- Storage(Redis)
- MOM -> RabbitMQ -> Erlang Cluster
- Notifications: Email, Slack, PagerDuty
- API / Events

Prometheus

Prometheus is a very promising technology that is growing fast. It fits the container / Docker world very well. The retention is not that great (with regard to the Prometheus TSDB). The bad news is that alerting is based on the Prometheus DSL, so you can't leverage existing plugins. However, you can do some dynamic thresholds, although it is not perfect.

Highlights

- Cases: Docker, CoreOS, SoundCloud, DigitalOcean
- Data Model: Time Series Streams(K/V) -> <metric name>{<label name>=<label value>,..}
- Query Language: HTTP REST -> sum(http_requests_total{method="GET"} offset 5m) // GOOD.
- Dashboard: PromDash | Prometheus recommends Grafana*
- Storage: LevelDB (in memory)
- Client Libraries: Go, Java, Scala, Python, Ruby
- Exporters (import metrics into Prometheus): Collectd, StatsD, Graphite, JMX, InfluxDB, AWS CloudWatch https://prometheus.io/docs/instrumenting/exporters/
- Alerting: DSL -> ALERT InstanceDown IF up == 0 FOR 5m LABELS { severity = "page" }
- Notifications: Email, Slack, PagerDuty
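
The data model and the query language from the list above can be sketched in a few lines of Python. This only builds the selector string and the URL for an instant query against the Prometheus HTTP API (`/api/v1/query`); the base URL `http://localhost:9090` and the helper names are assumptions for illustration, and label values are not escaped here.

```python
from urllib.parse import urlencode


def series_selector(metric, labels):
    """Render <metric name>{<label name>="<label value>",...} with sorted labels."""
    if not labels:
        return metric
    pairs = ",".join(f'{name}="{value}"' for name, value in sorted(labels.items()))
    return f"{metric}{{{pairs}}}"


def instant_query_url(base_url, expression):
    """Build the URL of an instant query; the expression is URL-encoded."""
    return f"{base_url}/api/v1/query?{urlencode({'query': expression})}"


selector = series_selector("http_requests_total", {"method": "GET"})
expression = f"sum({selector} offset 5m)"
url = instant_query_url("http://localhost:9090", expression)

print(expression)
print(url)
```

You could then fetch that URL with any HTTP client and get a JSON document back.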

OpenTSDB

OpenTSDB is focused on cheap and scalable storage because it builds on Hadoop. This is its greatest strength: you can have long and cheap retention with this TSDB. Alerting needs to be addressed with other solutions like Nagios.

Highlights

- Data Model: Time Series with K/V
- Storage: HBase and Cassandra http://opentsdb.net/docs/build/html/user_guide/backends/cassandra.html
- Stores everything -> filters are used to retrieve data from storage
- UI with charts (no Dashboard)
- Has concepts like Downsampling, Aggregators, Counters, Rates
- Trees -> http://opentsdb.net/docs/build/html/user_guide/trees.html
- Simple Query Language:
  http://opentsdb.net/docs/build/html/user_guide/query/examples.html
  http://opentsdb.net/docs/build/html/user_guide/cli/query.html
- Plugins: RabbitMQ, ES, CollectD
- For alerting it uses Nagios http://opentsdb.net/docs/build/html/user_guide/utilities/nagios.html
- You can use Grafana as Dashboard
- Ecosystem: http://opentsdb.net/docs/build/html/resources.html#monitoring
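
To give a feeling for what downsampling and rates mean, here is a small Python sketch that works on plain (timestamp, value) pairs. It is not OpenTSDB code: OpenTSDB does this on the server side at query time, and the function names and sample data below are made up for illustration.

```python
from collections import defaultdict


def downsample_avg(points, interval):
    """Average (timestamp, value) points into fixed buckets of `interval` seconds."""
    buckets = defaultdict(list)
    for timestamp, value in points:
        buckets[timestamp - timestamp % interval].append(value)
    return [(start, sum(vals) / len(vals)) for start, vals in sorted(buckets.items())]


def rate(points):
    """Per-second rate of change between consecutive points of a counter."""
    ordered = sorted(points)
    return [
        (t1, (v1 - v0) / (t1 - t0))
        for (t0, v0), (t1, v1) in zip(ordered, ordered[1:])
        if t1 > t0
    ]


# Hypothetical counter samples: (seconds, value)
raw = [(0, 10.0), (30, 20.0), (60, 40.0), (90, 60.0)]

print(downsample_avg(raw, 60))  # one averaged point per 60 second bucket
print(rate(raw))                # how fast the counter grows per second
```

Because everything is stored, the same raw points can be downsampled with a different interval later on without losing information.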

Netflix Atlas

A very promising solution, built on top of Akka and Spray. However, it looks like not all components are open sourced yet, and some pieces of the puzzle are missing. There is a killer REST API which can return JSON or a PNG image. The math support is amazing and you can do great analytics with it.

Highlights

Key Concepts:
- Time Series:
  * Sequence of Data Points
  * Interval between data points is called Step Size
- Tags:
  * SET of K/V associated with the Time Series
- Metric:
  * Specific Quantity being measured
- Data Point:
  * Combination of: Tag + Timestamp + Value.
- Step Size:
  * The amount of time between 2 successive data points.
- Values:
  * Gauge: AS IS value. Last value received will be the value for the interval.
  * Counter: Numeric value that is incremented monotonically.
  * Rate: RATE per Second. Normalized.
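
The concepts above (data point, step size, gauge, counter, rate) fit together roughly like the Python sketch below. This is my own simplified model for a single time series, not Atlas code; the class, the function names and the tag values are assumptions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DataPoint:
    tags: tuple       # sorted (key, value) pairs identifying the time series
    timestamp: int    # seconds
    value: float


def last_per_step(points, step):
    """Gauge semantics: the last value received in a step wins for that interval."""
    latest = {}
    for point in sorted(points, key=lambda p: p.timestamp):
        boundary = point.timestamp - point.timestamp % step
        latest[boundary] = DataPoint(point.tags, boundary, point.value)
    return [latest[boundary] for boundary in sorted(latest)]


def counter_to_rate(points, step):
    """Normalize a monotonically increasing counter to a per-second rate per step."""
    aligned = last_per_step(points, step)
    return [
        DataPoint(cur.tags, cur.timestamp,
                  (cur.value - prev.value) / (cur.timestamp - prev.timestamp))
        for prev, cur in zip(aligned, aligned[1:])
    ]


TAGS = (("app", "checkout"), ("name", "requests"))  # hypothetical tags
counter = [DataPoint(TAGS, t, v) for t, v in [(0, 100.0), (60, 160.0), (120, 280.0)]]
gauge = [DataPoint(TAGS, t, v) for t, v in [(0, 5.0), (20, 7.0), (60, 3.0), (110, 9.0)]]

print([(p.timestamp, p.value) for p in last_per_step(gauge, 60)])
print([(p.timestamp, p.value) for p in counter_to_rate(counter, 60)])
```

With a step size of 60 seconds the gauge keeps 7.0 and 9.0 (the last values in each interval), and the counter becomes a rate of 1.0 and then 2.0 per second.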

Riemann

Riemann is the sexiest monitoring solution because it is built on top of Clojure. You get an ultra powerful stream processing model thanks to Clojure. Riemann has superpowers and is easy to work with. However, traditional ops people might not feel the groove here because Clojure could scare them.

- Site: http://riemann.io/
- Written in Clojure
- Event Streams
- Event Composition - CEP like
- Queries and outstanding math(Clojure)
- Visualization / Rendering / Dashboard*
- Graphite, PagerDuty, Slack integrations
- Alerting
- Clients (Plugins, Collectors, Exporters) -> http://riemann.io/clients.html
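
The event stream idea is that a stream is just a function that receives an event and passes it on to child streams, so streams compose. The Python sketch below only illustrates that idea. It is loosely inspired by Riemann's `where` and `with` streams, but the Python names, the event fields I picked and the threshold are my own assumptions, and real Riemann configuration is written in Clojure.

```python
def where(predicate, *children):
    """Pass an event to the child streams only when the predicate holds."""
    def stream(event):
        if predicate(event):
            for child in children:
                child(event)
    return stream


def with_fields(overrides, *children):
    """Pass a copy of the event with some fields overridden to the child streams."""
    def stream(event):
        updated = {**event, **overrides}
        for child in children:
            child(updated)
    return stream


alerts = []  # stands in for a PagerDuty or Slack integration

pipeline = where(
    lambda e: e["service"] == "api latency" and e["metric"] > 0.5,
    with_fields({"state": "critical"}, alerts.append),
)

events = [
    {"host": "web-1", "service": "api latency", "metric": 0.2, "state": "ok"},
    {"host": "web-2", "service": "api latency", "metric": 0.9, "state": "ok"},
    {"host": "web-2", "service": "cpu", "metric": 0.95, "state": "ok"},
]
for event in events:
    pipeline(event)

print(alerts)
```

Only the slow `api latency` event from web-2 reaches the alert sink, and it arrives there with its state changed to critical.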

Graphite

An old and OK monitoring solution. There are lots of plugins and integrations with other tools. IMHO the Graphite setup is kind of complicated, and it is an OLD solution in the sense of architecture.

- Site: http://graphite.wikidot.com/
- PypeD -> Collector
- Carbon -> Cache
- Whisper -> DB (fixed-size DB, similar to RRD (Round-Robin Database)). A kind of TSDB.
- Graphite -> WebApp / Graph Rendering
- Graphite uses Memcached
- Ceres -> Whisper replacement, only stores values and calculates timestamps
- Graphite has visualization and rendering
- Awesome Math: http://graphite.readthedocs.org/en/latest/functions.html
- Integrations: http://graphite.readthedocs.org/en/latest/tools.html
- You can use it with CollectD, InfluxDB, OpenTSDB
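
The fixed-size, round-robin idea behind Whisper can be sketched in Python like this. It is a toy model and not the Whisper file format: the class name, the single archive and the in-memory lists are assumptions made to show why the storage never grows.

```python
class RoundRobinArchive:
    """Fixed number of slots; new points overwrite the oldest ones."""

    def __init__(self, step, slots):
        self.step = step                # seconds per slot
        self.values = [None] * slots
        self.stamps = [None] * slots

    def _index(self, aligned):
        # The slot position is derived from the timestamp itself.
        return (aligned // self.step) % len(self.values)

    def update(self, timestamp, value):
        aligned = timestamp - timestamp % self.step
        index = self._index(aligned)
        self.stamps[index] = aligned
        self.values[index] = value

    def fetch(self, start, end):
        """Return (timestamp, value) per step; None where data is missing or overwritten."""
        result = []
        for aligned in range(start - start % self.step, end, self.step):
            index = self._index(aligned)
            value = self.values[index] if self.stamps[index] == aligned else None
            result.append((aligned, value))
        return result


archive = RoundRobinArchive(step=60, slots=3)
for timestamp, value in [(0, 1.0), (60, 2.0), (120, 3.0), (180, 4.0)]:
    archive.update(timestamp, value)

print(archive.fetch(0, 240))
```

With only 3 slots, the point written at second 180 lands in the slot that held second 0, so the oldest point is gone and the size of the archive stays the same.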

When we talk about telemetry, so far there is no one-stop-shop solution. You have to build your own or compose several solutions in order to achieve your telemetry goals.

Cheers,
Diego Pacheco
