Lessons Learned with DevOps Engineering Experiences

Almost 20 years ago the Agile movement was born. After 20 years there is still a lot of confusion and misunderstanding about its principles, practices, and mindsets. DevOps is barely 10 years old and does not have a method like Scrum, Kanban, or XP. Although DevOps relies on Lean / Agile, there is no strict definition, method, or formal guidance on how to do proper DevOps.

Like everything in life, this is good and bad at the same time. DevOps is a movement built on the experiences some companies and people have had while looking for better ways to develop and operate software. Currently, DevOps is a mess. DevOps is a mess in Brazil, but it is also a mess in the USA. For lots of people, DevOps means Ops writing code. For other companies, DevOps means a Cloud Ops team doing ops. There are DevOps folks from a development background like me, and there are many more besides.

There is so much confusion in the market that you really need to be careful to make sure you are talking about the same thing when you talk about DevOps. Software is a pretty big industry when you stop to think about it. It's very hard, and unlikely, to find someone who is good in all disciplines.

IMHO it is easier for developers to code, but it is hard to get developers to worry about stability and availability. That's the default for Ops; however, there are lots of Ops folks who don't want to code or who will not learn proper design or software engineering. There are struggles on both sides. I have given up trying to define DevOps, since there are so many branches and at the end of the day it does not matter. What matters is to get the job done and get better.

Here are some lessons learned over the last few years from my experiences working with DevOps Engineering. These are my personal experiences, and you might disagree and have different ones. I'm not pitching a specific flavor of DevOps or anything like that.

Self-Service - the ultimate bottleneck killer

When we talk about scaling, people often try to fix the problem with process. That happened before and after Agile; look at SAFe, for instance. Lots of people (including myself) argue that culture is another important element when we talk about scaling something at the corporate level.

What I think few people talk about is the role of the Software Architect in scaling people and teams. If your software is coupled, you end up doing lots of branches and merges, and that's not how you scale (add more people).

Speeding things up depends on a good, decoupled, modular architecture, which for several teams could mean SOA (Service Oriented Architecture) or microservices (to be more precise nowadays). However, that's not enough; it's only an enabler. We need to build software that enables people to do things faster without having to interact with other people. This is the very definition of self-service to me. In other words, some kind of abstraction that gives you more productivity and reduces time. This can also be seen as simple automation. However, if the automation is not done right, it can easily become a pain (remember what happens with poorly designed and maintained software?).

So let's say you need to create a user for someone in one of your tools, Git for instance. The old way would be to open a Jira ticket, let people do some manual work, and track the process every day to see how things are going. In other words, we are adding people, so we are adding latency. The right way to do it would be to have a UI or chat interface where a developer or non-developer could self-serve and create the user themselves using software. This could be done in several ways, like using a simple Jenkins job or a provisioning tool like Ansible.
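To make the idea concrete, here is a minimal sketch in Python of what the self-service entry point could look like. Everything in it is an assumption for illustration: `UserDirectory` is a hypothetical in-memory stand-in for the real tool's user store, and `handle_request` is a hypothetical function that a chat bot, web form, or Jenkins job might call. It is not the API of Git, Jenkins, or Ansible.

```python
import re
from dataclasses import dataclass, field


@dataclass
class UserDirectory:
    """Hypothetical stand-in for the tool's user store (e.g. a Git server)."""
    users: dict = field(default_factory=dict)

    def create_user(self, username: str, email: str) -> str:
        # Validate up front so the requester gets immediate feedback
        # instead of waiting on a ticket.
        if not re.fullmatch(r"[a-z][a-z0-9_-]{2,31}", username):
            raise ValueError(f"invalid username: {username!r}")
        if "@" not in email:
            raise ValueError(f"invalid email: {email!r}")
        # Idempotent: asking twice for the same user is not an error.
        if self.users.get(username) == email:
            return "unchanged"
        if username in self.users:
            raise ValueError(f"username already taken: {username!r}")
        self.users[username] = email
        return "created"


def handle_request(directory: UserDirectory, username: str, email: str) -> str:
    """Hypothetical entry point a chat bot, web form, or CI job could call."""
    try:
        status = directory.create_user(username, email)
        return f"{status}: {username}"
    except ValueError as err:
        # Return the reason to the requester; no human in the loop.
        return f"rejected: {err}"
```

The point of the design is that validation and idempotency live in the software, so nobody has to review a ticket: the requester either gets the user or gets the reason it was rejected, right away.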

Observability is a must have

Telemetry is important, but we need to go beyond the basics. Microservices are the de facto standard way to develop software currently. In other words, this means more distribution and more network calls. Failure is very normal in this context because of:
  • Breaking change in any API
  • Timeout
  • Throttling
  • Bugs

How do we know what's going on? Observability is the path to understanding. However, the process of understanding what happened is not straightforward by any means. Metrics like CPU, disk, network, and memory are too basic nowadays. These metrics might have served us well in monolith times; however, now we need more and we can do better. Observability means:
  • Having distributed logs - and logging the important things
  • Knowing all your network calls - Zipkin, X-Ray, and Jaeger can help you with that
  • Controlling your thread pools
  • Capping / limiting everything - don't have unbounded queues
  • Having application metrics like exception counters, latency percentiles, and specific business counters
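Two of the items above, bounded queues and application metrics, can be sketched with the Python standard library alone. This is a minimal sketch under assumptions, not a production metrics system: `AppMetrics`, `submit`, and the counter names are hypothetical, and a real service would export these values to a metrics backend rather than keep them in memory.

```python
import queue
import statistics
import time
from collections import Counter


class AppMetrics:
    """Tiny in-process metrics holder; a real service would export these."""

    def __init__(self):
        self.counters = Counter()
        self.latencies_ms = []

    def timed(self, name, func, *args):
        """Run func, record its latency, and count any exception by type."""
        start = time.perf_counter()
        try:
            return func(*args)
        except Exception as err:
            self.counters[f"{name}.exceptions.{type(err).__name__}"] += 1
            raise
        finally:
            self.latencies_ms.append((time.perf_counter() - start) * 1000)

    def percentile(self, p):
        """Return the p-th latency percentile (1 <= p <= 99).

        Needs at least two recorded latencies to be meaningful.
        """
        # quantiles(n=100) returns the 99 cut points between percentiles.
        return statistics.quantiles(self.latencies_ms, n=100)[p - 1]


def submit(work_queue, metrics, item):
    """Enqueue without blocking; reject and count when the queue is full."""
    try:
        work_queue.put_nowait(item)
        return True
    except queue.Full:
        metrics.counters["queue.rejected"] += 1
        return False
```

A caller would create the queue with an explicit cap, for example `queue.Queue(maxsize=100)`, so overload shows up as a growing `queue.rejected` counter instead of unbounded memory use.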

Beyond automation - Design Matters

It's not hard to find Ops guys coding today. Today everything has become an API. Hardware is an API. Infrastructure has become an API. The network is an API. However, software engineering is not only about coding; it is also about:
  • Configuration Management
  • Proper Naming
  • Design
  • Architecture
  • Testing
  • Data Structures
  • Algorithms
  • Abstractions

Automation can be tricky, like any software. Is your automation solution properly designed and tested? Are you using something like Serverspec? How easy is it to create a new microservice in your pipeline, and how much code is required?

Automation is just one part of the problem. Don't get me wrong: automation is hard and takes time to perfect. However, how much abstraction does your automation provide? How easy is it to change? These are important things because at the end of the day it's software.
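As a rough illustration of treating automation as software, the sketch below puts a small abstraction in front of the pipeline and tests it like any other code. It is written in Python rather than Serverspec (which is a Ruby tool), and every name in it is hypothetical: `ServiceSpec`, `render_pipeline`, the stage names, and the health check URL are assumptions made for the example, not the format of any real CI system.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceSpec:
    """Hypothetical minimal input a team provides for a new microservice."""
    name: str
    port: int
    replicas: int = 2


def render_pipeline(spec: ServiceSpec) -> dict:
    """Expand a small spec into a full (hypothetical) pipeline definition."""
    if not 1024 <= spec.port <= 65535:
        raise ValueError(f"port out of range: {spec.port}")
    if spec.replicas < 1:
        raise ValueError("replicas must be at least 1")
    return {
        "service": spec.name,
        "stages": ["build", "test", "deploy"],
        "deploy": {"port": spec.port, "replicas": spec.replicas},
        "healthcheck": f"http://{spec.name}:{spec.port}/health",
    }


def test_render_pipeline_defaults():
    """The automation itself gets a unit test, like any other software."""
    pipeline = render_pipeline(ServiceSpec(name="orders", port=8080))
    assert pipeline["stages"] == ["build", "test", "deploy"]
    assert pipeline["deploy"]["replicas"] == 2
    assert pipeline["healthcheck"] == "http://orders:8080/health"
```

With an abstraction like this, creating a new microservice costs one line, `ServiceSpec(name="orders", port=8080)`, and changing the pipeline for every service means changing one tested function.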

Don't Build everything - Build X Buy

There are lots of gaps and room to create tools. Today there are way more tools than we had 3 years ago. As time passes, we should have more and more open source solutions for DevOps Engineering. Even so, there are some great tools you can leverage today. I'm in favor of open source but also in favor of focus. Not all problems are interesting or deserve your energy; sometimes it is better to buy something and focus your energy on strategic problems. That decision is not simple and changes from company to company.

Build X Buy exercises are important, and you should neither always build nor always buy. If you do a Build X Buy evaluation with requirements and PoCs, you will definitely be in a better place to build, if that is what makes more sense.

The software you build is like a kid. It will become your responsibility, and sometimes it makes more sense to use managed services or buy SaaS solutions and remove some of the burden and responsibilities. Again, this really depends on your business and priorities, but it is a trade-off that needs to be thought through carefully.

Stability Mindset

Finally, last but not least. How is the developer experience? How easy is it to develop new things? How often do things break? We can't avoid bugs and mistakes, but we can reduce lots of issues by having the right mindset.

This is hard, especially for developers. We are not used to not breaking things. But as you take responsibility for the things you do, this starts to sound right and normal after some time. It really changes the way you see the code and make calls.

There are several ways to address the stability mindset when we talk about solutions. You can ISOLATE solutions and provide easy ways to test, deploy, and roll back code. You should do that; however, you will always have SHARED things that are not microservices, and that is where things get more tricky and, I can say, fun.

I take retrospectives very seriously: every 30 days I do tech retrospectives with the teams I work with, and this is also a very important way to imprint things. People tend to see retrospectives as a managerial practice to talk about the product, people, attitudes, and feedback, and that's fine; however, they are also an important tool to review, assess, and improve your tech solutions. You can read more on the Stability Mindset here.

Diego Pacheco
