Observability & Domain Observability: From Understanding to Value

Observability is a must-have property of any mature distributed system and/or digital product. There is no way to "buy" observability; you need to earn it. Observability is like car insurance. No one likes to pay for insurance; however, if there is a car crash, you really, really want it. And you cannot acquire the insurance at the very moment of the crash; it does not work that way. The car insurance metaphor only partially works if you have a monolithic system. With microservices, and therefore a distributed system, you have many more failure points, and detecting that the "car crashed" can be much harder and more complicated. There are many challenges in getting observability into a system. The main idea is that Observability is something LIVE, which means it is not a one-time job. You gain a better understanding of your system via instrumentation, which leads to better observability, and this loop keeps going. Some time ago I blogged about Telemetry and microservices, so you might want to check those posts out here 1, 2 and 3.

Observability Pillars

Observability is about Metrics, Logs, Traces, Dashboards, Alerts, and Testing. Logs are important for troubleshooting or debugging; however, you do not want to start with logs. You won't start with alerts and dashboards either. In order to have good alerts you need to know a couple of things:

* What to monitor?
* What is an anomaly?
* How to fix it?

If you create alerts without answering these 3 simple questions, it is very likely you will have noisy alerts and false positives, and therefore the alerts will not help you out. Dashboards are nice because you can spot patterns and, during troubleshooting, correlate things. As I said before, Observability should not be done in a waterfall approach, so you can start with something and improve as you go.
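To make the three questions concrete, here is a minimal sketch of an alert rule in Python. Everything in it is an assumption for illustration: the AlertRule class, its field names, and the "N consecutive breaches" definition of an anomaly are hypothetical and not tied to any particular monitoring product.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class AlertRule:
    """Hypothetical alert rule that answers the three questions up front."""
    metric: str        # What to monitor?
    threshold: float   # What is an anomaly? (a value above this threshold...)
    consecutive: int   # (...seen this many times in a row, to reduce noise)
    runbook: str       # How to fix it? (a pointer to the remediation steps)


def should_fire(rule: AlertRule, samples: Sequence[float]) -> bool:
    """Fire only when the last `consecutive` samples all breach the threshold."""
    if rule.consecutive <= 0 or len(samples) < rule.consecutive:
        return False
    recent = samples[-rule.consecutive:]
    return all(value > rule.threshold for value in recent)
```

For example, a rule such as AlertRule("checkout_error_rate", 0.05, 3, "restart the payment gateway client") would stay quiet on a single spike and fire only after three bad samples in a row. A rule you cannot fill in completely is probably not ready to become an alert.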

Traces are a great tool. Traces are much easier to collect than custom metrics, which require application/service instrumentation. Traces are good for seeing where things are getting slow when you have several downstream microservice calls.
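As a rough illustration of how traces show where the time goes, the sketch below models the downstream calls of one request as a flat list of spans and picks the slowest one. The Span class, its fields, and the service names in the usage note are hypothetical; a real tracing system also records parent/child relationships, which are left out here.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class Span:
    """One downstream call made while serving a request (hypothetical shape)."""
    trace_id: str
    service: str
    start_ms: int
    end_ms: int

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def slowest_span(spans: Iterable[Span], trace_id: str) -> Optional[Span]:
    """Return the longest span of the given trace, or None if the trace is empty."""
    in_trace = [span for span in spans if span.trace_id == trace_id]
    return max(in_trace, key=lambda span: span.duration_ms, default=None)
```

Given spans for, say, "pricing", "inventory" and "shipping" that share the same trace_id, slowest_span points at the call to look at first.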

Domain Observability

Domain Observability is quite new. The idea is to instrument the application with domain/business metrics, not only system or technology metrics. However, while domain observability is a new term, the idea of sending business metrics to a centralized telemetry solution is quite old.

Why is domain observability important? Because how else do we know the business is being effective? How do we know that the customers are using the product? Are they opening some UIs more than others? So Domain Observability has a lot to do with A/B Testing, Split Traffic, and other pipeline patterns.
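A minimal sketch of what domain instrumentation could look like, assuming a simple in-memory counter. The DomainMetrics class, the "screen_opened" event, and the "screen" and "variant" tags are hypothetical names; in a real system these counts would be shipped to your centralized telemetry solution instead of being kept in memory.

```python
from collections import Counter


class DomainMetrics:
    """Hypothetical in-memory recorder for business events."""

    def __init__(self) -> None:
        self._counts: Counter = Counter()

    def record(self, event: str, **tags: str) -> None:
        # Key each event by its name plus its sorted tags, e.g. screen or A/B variant.
        self._counts[(event, tuple(sorted(tags.items())))] += 1

    def by_tag(self, event: str, tag: str) -> Counter:
        """Total occurrences of `event`, grouped by the value of one tag."""
        totals: Counter = Counter()
        for (name, tags), count in self._counts.items():
            value = dict(tags).get(tag)
            if name == event and value is not None:
                totals[value] += count
        return totals
```

Calling metrics.record("screen_opened", screen="checkout", variant="B") at the point where the screen is rendered, and later metrics.by_tag("screen_opened", "screen"), answers the "are they opening some UIs more than others?" question; grouping by "variant" instead gives a first look at an A/B test.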

Observability is about understanding the system and what is going on; Domain Observability is about understanding the business. Any serious product discovery person doing real customer science needs to have domain observability.

SRE and Proactive Work

SRE is about keeping the water flowing through the pipes. As with plumbing, few people care about it or see it while it works, but if your site is down everybody will see it and complain. Not only will you disrupt your customers' experience, you will lose money and sometimes even harm your brand. Digital products need to care about SRE and reliability.

Due to Digital Transformation initiatives and waterfall approaches, deploys to production are often delayed, and therefore Observability is delayed. At some level it makes sense: why would you pay for something in production if you are not deploying anything there, right? Right; however, observability is not only about "incidents" or "reliability".

Have you ever felt your development is getting slow, with trouble seeing where the errors are? Well, you might be lacking some Stability Mindset and practices. You might have tests and still take a lot of time to figure out where the issues are, and this will slow down your team's productivity. Having said that, Observability is not only a DevOps, Architecture, or Product Discovery practice but also an engineering practice that increases the reliability of the system.

From Understanding to Value

Observability is about understanding: understanding what is going on, understanding how your system behaves, accelerating your troubleshooting when you are coding new features, and understanding how your customers are interacting with your features, not only to prioritize features but also to learn more about your customers and deliver better solutions.

In that sense, Observability shifts from a very technical thing to a business concern, central to delivering VALUE in Digital products. In order to deliver value to your customers, you need to understand, deliver as fast as possible, and improve the solutions as you go. Observability and Domain Observability need to be prioritized and taken seriously. Have you ever started accounting for how much money you lose with incidents? How much money you lose by hunting bugs? How much money you lose by searching for where the issue is in your code?

Instrument the system, Observe, Understand and Repeat! That's the way to go.

Cheers,
Diego Pacheco
