The Death of CI/CD

XP (Extreme Programming) leaders popularized CI (Continuous Integration) in the late 90s. The idea was very simple: continuously integrate code into a common branch, usually called trunk; nowadays we call it main. The principles behind CI were (A) reduce feedback cycles and (B) respond faster and release software sooner. CI is a powerful concept, and it requires automation of the build process and, of course, automated tests. In order to do CI you also need an SCM (Source Code Management system); in the past there were various solutions like CVS (not a drugstore) and SVN, and Git rules today. By definition, CI also requires continuous merging, which leads to interesting questions like: how often should we merge? The answer is as much as possible, as frequently as possible. Today's environments are more complicated than they were in the 90s; back then, the CI process generated one single binary.
Even compared with 10 years ago, we have more complexity for a variety of reasons, such as:

  • Architectures evolved: today we don't use one single repository; we have multiple codebases, multiple services, and service orientation (SOA) / microservices. So you would not have one single CI build but multiple ones, unless you have one monolith.
  • Cloud Computing: more and more systems are distributed across several types of workloads and compute capabilities, like EC2 (virtualization), serverless, and containers. Software can run in more than one cloud (polycloud) and also on-premises on a bare metal machine. CI is more complicated because you need to deploy to all these new kinds of compute.
  • The explosion of Devices: Android phones, iPhones, tablets, Arduinos (for IoT), and software running on all kinds of specialized devices like toasters, fridges, video games, etc. Guess what? You also need to deploy to these devices, so there are even more challenges.
  • Engineering Specialization: we have mobile engineers, frontend engineers, backend engineers, DevOps engineers, QA engineers, and many other forms of specialization, which tends to create more repositories and more specialized tools, resulting in more segregation...
Now the question you may be asking yourself is: does CI still make sense nowadays? Yes, 100%. Based on the title you can probably guess what I think and have to say: yes, CI/CD is dead nowadays, and I will explain why. The death of CI/CD is an industry problem; for sure there are companies out there doing real CI/CD, so if this post does not apply to you, feel free to ignore it.

Ever Given

The year was 2021. In the Suez Canal, a giant ship called Ever Given got stuck for days, providing the internet with an infinity of memes about Kubernetes and deployment complexity. The ship was too big and could not really turn in the canal, creating an interesting and, at the same time, funny deadlock.

Ever Given stuck in the Suez Canal, 2021

To be precise, the ship Ever Given blocked the canal for 6 days, and nobody could use the canal at all. In software terms, this is like having a production incident where production is down for 6 days and the users cannot do anything but cry. Those 6 days of blockage created a backlog of 369 ships that were queued and could not move, impacting 9.6 billion USD in traded goods.

Before this incident, the Ever Given had passed through the canal 22 other times, but the passage was considered a "very complex and highly risky operation". Have you ever heard a similar story in software? Pretty sure you have.

Making a metaphor with software, we can take some lessons from the Ever Given incident:
  • Reduce batch size (always keep reducing batch size).
  • Deployment frequency needs to keep improving at all times.
  • Complex operations should be 100% automated; the automation percentage cannot be stuck and needs to always be increasing.
It's not uncommon to see Ever Given deploys happening all the time... Why? Because if you keep doing the wrong thing for years, the cost of change won't be cheap. The Ever Given problem gets a lot of attention, and this attention is sometimes good because it generates momentum and justifies investments ($$$) to change things and improve.

We usually fight a much more complex and malevolent problem, one that creeps and lurks around us and can be simply put as "inertia": improvements get "stuck" and the "status quo" stays the same. This problem creates an environment where other problems and anti-patterns flourish.

Gates

Release trains, also known as release calendars, are a way to improve things. They are highly popular and deeply used across our industry; however, they can also be a way to make sure nothing ever improves. If you were doing 1 deployment per year and managed to improve the frequency to 1 deployment per month, or 1 deployment per sprint (every 2 weeks), you are doing better, right? Yes you are, but you cannot stop there.

The effect such "inertia" creates is "delays" in batches. Let me tell you how evil manages to survive: it stays hidden. Visibility is the ultimate evil killer; if you see the problem you can kill it, but if you don't see it, that's where the danger is. It's like a water leak silently infiltrating 10 floors of apartments: by the time it becomes visible, it is too late and too expensive.

If you had a big, massive delay like the Ever Given incident, the business would notice, would probably ask for explanations, and there would be pressure for something to be done. But if you manage to introduce inefficiency in a hidden form into everything you do, and make people feel good about doing the wrong things, that is perfect for evil, because no one will see it and it will do a lot of damage (of course we don't want that, but we still allow it to happen).

Gates create delays and inefficiencies in batches. Consider the following diagram:

Release Train Gates 

Consider, for the sake of simplicity, that we have 3 teams; nowadays it's more like dozens to hundreds, but still, the problem is the same. Let's say Team A is doing bug fixes, Team B is doing features, and Team C is doing a mix of features and bugs.

Release trains or release calendars are very rigid because they are full of manual work, often done by a very small team. Now Team A has a bug fix that is done; however, the team needs to wait in order to release it. That does not sound like a good deal for the customers.

Team B is doing features, but they are more like services: if no one calls them, there is no problem, so they could also be released into prod right now. But the release calendar does not care that it is a different repository or a service no one calls yet; you need to wait. The issue this creates is that people wait to merge code, tests are delayed, and code is not continuously integrated.

Team C is doing some features, but they use feature flags, so no user will see them unless the team wants that to happen. Team C knows the difference between deploy and release: they are ready to deploy at any moment and release (make visible) to users with a simple flag switch. However, they are also stuck on the release train.
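
For illustration, here is a minimal sketch of Team C's flag switch in plain Java. The flag name, the in-memory store, and the checkout example are all my inventions; a real system would back the flags with a config service or a dedicated flag provider so they can be flipped without a deploy.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class FeatureFlags {

    // In-memory flag store; stands in for a config service or flag provider.
    private static final Map<String, Boolean> FLAGS = new ConcurrentHashMap<>();

    static boolean isEnabled(String flag) {
        return FLAGS.getOrDefault(flag, false); // unknown flags stay off
    }

    static void set(String flag, boolean enabled) {
        FLAGS.put(flag, enabled);
    }

    // The new code path is deployed but stays dark until the flag flips.
    static String checkout() {
        return isEnabled("new-checkout") ? "new checkout flow" : "old checkout flow";
    }

    public static void main(String[] args) {
        System.out.println(checkout()); // deployed, not released: old flow

        set("new-checkout", true);      // "release": flip the flag, no deploy
        System.out.println(checkout()); // now users see the new flow
    }
}
```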

The usual way people work around this limitation is to game the system and do hotfixes like there is no tomorrow, which works to some degree but creates other problems, because again the code is not being integrated and bugs may only appear in the "stabilization" phase. The effect this creates is: don't touch it, do as little as possible, so we "reduce" the risk of something breaking. This thinking is broken.

Complex systems at scale have power laws. It could be that 90% of your systems and use cases don't need to be gated, but Conway's Law strikes: the same process gets applied to everybody, and the 10% that really needs gating will delay everybody else.

Distributed Monoliths

Distributed monoliths are a strong reason to have gates. You might think that if you have 100 microservices they should be independent, since each has its own codebase, but because they share the same central relational database, the reality is that they need to be in sync. Such sync means they can't be deployed in isolation, since there is a risk of a cascade effect and a big blast radius.

A distributed Monolith
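
To make the coupling concrete, here is a hypothetical Java sketch. The in-memory map stands in for the shared relational database, and the service behavior and column names are invented for illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class DistributedMonolith {

    // Stand-in for the shared "orders" table in the central database.
    static final Map<String, Map<String, Object>> ORDERS = new HashMap<>();

    // Service A (its own repo, its own deploy) writes orders.
    static void serviceAPlaceOrder(String id) {
        Map<String, Object> row = new HashMap<>();
        row.put("total_cents", 4999); // the schema as Service A knows it
        ORDERS.put(id, row);
    }

    // Service B (another repo, another deploy) reads the same rows,
    // hard-coding the very same column name.
    static int serviceBReadTotal(String id) {
        // If Service A's team renames the column to "amount_cents" and
        // deploys alone, this lookup returns null and Service B breaks
        // at runtime: the two "independent" services must ship in lockstep.
        return (Integer) ORDERS.get(id).get("total_cents");
    }

    public static void main(String[] args) {
        serviceAPlaceOrder("order-1");
        System.out.println(serviceBReadTotal("order-1"));
    }
}
```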

Distributed monoliths can and should be fixed; there are techniques and approaches that can be applied to fix them. But if you are not doing that, here is a reason to have a release train. Many engineers think CI/CD is a tooling problem; not really, it is an architecture problem.

You can't do proper CI/CD with the wrong architecture.

Batch Sizes

Because we have the "gates" in the release train, another phenomenon happens: it does not matter how much you deliver, since you are delivering every two weeks, there are dependencies, and there are lots of excuses not to improve. One of the big Lean / Agile principles is to reduce batch sizes and increase frequency.
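
To put a rough number on the cost of the gate (a back-of-the-envelope estimate of mine, not a figure from the Lean literature): if finished changes pile up steadily and only ship when the train leaves, each change waits on average about half a release interval:

\[
\text{average wait of a finished change} \approx \frac{\text{release interval}}{2}
\]

With a two-week train, that is roughly 7 days of done work sitting idle per change; with daily deploys, it drops to about 12 hours. That is roughly a 14x improvement in feedback latency without adding any engineering throughput.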

Let's consider the following scenarios:

Developer #1: Does 1 PR every two weeks. It is a big PR with lots of files, so the team thinks this is good and normal; because there are a lot of files, it manages to hide the inefficiencies. No one complains about the fact that it is just 1 item for the whole sprint, because there were a lot of files in that PR.

Developer #2: Does 1 PR per week. That's better than Developer #1, but because Developer #1 exists, this one is seen as the "exception" or someone "outside of the curve". However, no one is looking to increase his release frequency: why is he not doing 1 PR per day? Maybe he is just lucky and getting smaller tasks; are we comparing apples to apples here?

Developer #3: Has had a single PR open for a long time. We have visibility issues here, and clearly we have bigger problems: why does no one complain about this? You should be questioning other things, like:

  • Are we doing retrospectives? 
  • Why is it okay to have a PR open for so long?
  • Maybe this is a big feature and should be broken into several?
  • Do we have a Discovery issue here? Maybe we don't know what we need to do?
  • Is it a dependency issue? Are we blocked by some other team? 
  • Are we measuring teams properly? Pretty sure daily meetings are happening every day.
The question you need to be asking yourself is: why were these questions not asked on day #1? Are we paying attention to detail? Maybe this is revealing a bigger problem: we don't have efficient feedback in the system and we are not learning effectively. But for sure there are releases every 2 weeks, so it must be all good, right?

When you don't reduce the batch size, you create scenarios for inefficiencies. 

Fixing the Train Inertia


Change is not required; companies can go bankrupt, and that's okay if that's your goal. However, if you care and want to do better, you need to work in a way that enables change. We can't fix all problems in one day, but we need to be able to look back at every retrospective and see that we are going in a better direction and making progress consistently.

If your team is not changing how it works, or not making progress consistently, maybe the team has become the train itself: too busy to improve. If you don't want to be a train (I won't call you Thomas), here is what needs to happen:

Fighting the Train Inertia with principles

Do this: 

  • Care about it, and be proactive.  
  • Do not fear change, fear stagnation, fear lack of improvements. 
  • The team talks and finds better ways to improve the code as a daily activity; otherwise code review (CR) dies too.
  • Always reduce the batch sizes. If you do 1 per sprint, try to do 1 per week; if you do 1 per week, aim for 1 per day. Even if you can't achieve the goal, aim to get better, and don't accept things as they are.
  • Invest time in refactoring.
  • Do proper software design.
  • Do retrospectives and discuss this topic forever; this is a never-ending game.
  • Slice and dice your monoliths in order to remove the gates.
  • Move towards a distributed release model, where every team releases its own software.
  • Automate everything, so no manual work is required to release software.
  • Change your process, change your pipelines, change everything; rolling stones gather no moss. Change is best introduced with POCs and experiments.
  • Have collective ownership (XP principle) and do things, even if you are not the formal owner.
  • Be curious, see what the market is doing, what other companies are doing.
  • Be the change you want to see.  

Cheers,

Diego Pacheco
