The Death of CI/CD
- Architectures evolved: today we don't use a single repository; we have multiple codebases, multiple services, and service orientation (SOA) / microservices. So you don't have a single CI build but many, unless you have a monolith.
- Cloud Computing: More and more systems are distributed across several types of workloads and compute capabilities, like EC2 (virtualization), serverless, and containers. Software can run in more than one cloud (polycloud) and also on-premises on bare-metal machines. CI is more complicated because you need to deploy to all these new computing services.
- The explosion of Devices: Android phones, iPhones, tablets, Arduinos (for IoT), and software running on all kinds of specialized devices like toasters, fridges, video games, etc. Guess what? You also need to deploy to these devices, so there are more challenges.
- Engineering Specialization: We have mobile engineers, frontend engineers, backend engineers, DevOps engineers, QA engineers, and many other forms of specialization, which tends to create more repositories and specialized tools, resulting in more segregation.
Ever Given
The year was 2021. In the Suez Canal, a giant ship called Ever Given got stuck for days, providing the internet with an infinity of memes about Kubernetes and deployment complexities. The ship was too big and could not really turn in the canal, creating an interesting and at the same time funny deadlock.
- Reduce batch size (always keep reducing batch size).
- Deployment frequency needs to be improved at all times.
- Complex operations should be 100% automated. The automation percentage cannot get stuck; it needs to be always increasing.
Gates
Release trains, also known as release calendars, are a way to improve things. They are highly popular and deeply used across our industry; however, they can also be a way to make sure nothing ever improves. If you were doing 1 deployment per year and managed to improve the frequency to 1 deployment per month or 1 deployment per sprint (every 2 weeks), you are doing better, right? Yes, you are, but you cannot stop there.
The effect such "inertia" creates is "delays" in batches. Let me tell you how evil manages to survive: it stays hidden. Visibility is the ultimate evil killer. If you see the problem you can kill it, but if you don't see it, that's where the danger is. It's like a water leak silently infiltrating 10 floors of apartments: when it becomes visible, it is too late and too expensive.
If you had a big, massive delay, like the Ever Given incident, the business would notice and would probably ask for explanations, and there would be pressure for something to be done. But if you manage to introduce inefficiency in a hidden form in everything you do, and make people feel good about doing the wrong things, this is perfect because no one will see it and you will do a lot of damage (of course we don't want that, but we still allow it to happen).
Gates create delays and inefficiencies in batches. Consider the following diagram:
Release Train Gates
For the sake of simplicity, consider that we have 3 teams. Nowadays it is more like dozens to hundreds, but the problem is the same. Let's say Team A is doing bug fixes, Team B is doing features, and Team C is doing a mix of features and bug fixes.
Release trains or release calendars are very rigid because they are full of manual work, often done by a very small team. Now Team A has a bug fix that is done; however, the team needs to wait in order to release. That does not sound like a good deal for the customers.
Team B is doing features, but they are more like services: if no one calls them, there is no problem, so they could also be released into prod right now. But the release calendar does not care that it is a different repository and a service no one calls; you need to wait. The issue this creates is that people wait to merge code, tests are delayed, and code is not continuously integrated.
Team C is doing some features, but they use feature flags, and no user will see them unless the team wants that to happen. Team C knows the difference between deploy and release, so Team C is ready to deploy at any moment and release (make visible) to the users with a simple flag switch. However, they are also stuck on the release train.
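To make the deploy vs. release distinction concrete, here is a minimal sketch in Python. Everything in it is an assumption for illustration: the `FeatureFlags` class, the `new_discount` flag name, and the checkout function are hypothetical, and a real system would most likely read flags from a configuration service instead of memory.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical in-memory flag store, for illustration only."""
    enabled: set = field(default_factory=set)

    def is_enabled(self, name: str) -> bool:
        return name in self.enabled

    def turn_on(self, name: str) -> None:
        self.enabled.add(name)


def checkout_total(prices, flags: FeatureFlags) -> float:
    """The new discount code is deployed, but stays dark until the flag is on."""
    total = sum(prices)
    if flags.is_enabled("new_discount"):  # hypothetical flag name
        total *= 0.9  # new behavior: 10% discount
    return round(total, 2)


flags = FeatureFlags()
before_release = checkout_total([10.0, 20.0], flags)  # deployed, not released
flags.turn_on("new_discount")                         # release = flag switch
after_release = checkout_total([10.0, 20.0], flags)   # now visible to users
```

The same code is in production in both calls; only the flag changes what the user sees, which is why a team working this way does not need to wait for a train to deploy.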
The usual way people work around this limitation is to game the system and do hotfixes like there is no tomorrow, which works to some degree but creates other problems because, again, the code is not being integrated and bugs may only appear in the "stabilization" phase. The effect this creates is a mindset of "don't touch it, do as little as possible" so we "reduce" the risk of something breaking. This thinking is broken.
Complex systems at scale follow power laws. It could be that 90% of your systems and use cases don't need to be gated, but Conway's law strikes: the team uses the same process for everybody, so the 10% that really needs gating delays everybody else.
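One way to picture the alternative is a per-service gating rule instead of one process for everybody. The sketch below is only an illustration under assumed criteria: the `Service` fields, the `needs_gate` rule, and the service names are hypothetical, and your own criteria for what truly needs a gate may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Service:
    name: str
    shares_database: bool           # coupled to other services via a shared schema
    has_manual_release_steps: bool  # release still depends on manual work


def needs_gate(service: Service) -> bool:
    """Hypothetical rule: gate only what can hurt others or cannot ship unattended."""
    return service.shares_database or service.has_manual_release_steps


services = [
    Service("billing", shares_database=True, has_manual_release_steps=False),
    Service("search", shares_database=False, has_manual_release_steps=False),
    Service("recommendations", shares_database=False, has_manual_release_steps=False),
]

gated = [s.name for s in services if needs_gate(s)]
independent = [s.name for s in services if not needs_gate(s)]
```

With this made-up inventory, only `billing` waits for the train, while `search` and `recommendations` can deploy whenever they are ready.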
Distributed Monoliths
Distributed monoliths are a strong reason to have gates. You might think you have 100 microservices and that they should be independent since they have their own codebases, but as they share the same central relational database, the reality is they need to be in sync. Such sync means they can't be deployed in isolation, since there is a risk of a cascade effect and a big blast radius.
Distributed monoliths can and should be fixed; there are techniques and approaches that can be taken to fix them. But if you are not doing that, here is a reason to have a release train. Many engineers think CI/CD is a tooling problem. Not really: it is an architecture problem.
You can't do proper CI/CD with the wrong architecture.
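The shared-database coupling described in this section can be made visible with a very small script. This is a sketch under assumptions: the inventory of services and database names is made up, and in practice you would need to build that mapping from your own configuration or infrastructure data.

```python
from collections import defaultdict

# Hypothetical inventory: service name -> database it reads and writes.
service_databases = {
    "orders": "central_db",
    "payments": "central_db",
    "catalog": "central_db",
    "search": "search_db",
}


def group_by_database(inventory):
    """Group services by the database they depend on."""
    groups = defaultdict(list)
    for service, database in inventory.items():
        groups[database].append(service)
    return {db: sorted(names) for db, names in groups.items()}


groups = group_by_database(service_databases)

# Any database used by more than one service is a candidate distributed monolith:
# those services likely cannot be deployed in isolation.
coupled = {db: names for db, names in groups.items() if len(names) > 1}
```

Here `coupled` contains only `central_db` with `catalog`, `orders`, and `payments`, which is the group that would still need to move together until the database is split.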
Batch Sizes
Because we have the "gates" in the release train, another phenomenon happens: it does not matter how much you deliver, since delivery happens every two weeks anyway, there are dependencies, and there are lots of excuses not to improve. One of the big Lean / Agile principles is to reduce batch sizes and increase the frequency.
Let's consider the following scenarios:
Developer #1: Does 1 PR every two weeks. It is a big PR with lots of files, so the team thinks this is good and normal, because the number of files manages to hide the inefficiencies. No one complains about the fact that it is just 1 item for the whole sprint, because there were a lot of files in that PR.
Developer #2: Does 1 PR every week. That is better than developer #1, but because developer #1 exists, this developer is seen as the "exception" or someone "outside of the curve". However, no one is looking to increase his release frequency. Why is he not doing 1 PR per day? Maybe he is being lucky and getting smaller tasks; are we comparing apples to apples here?
Developer #3: We have visibility issues here, and clearly we have bigger problems. Why does no one complain about this? You should be questioning other things, like:
- Are we doing retrospectives?
- Why is it okay to have a PR open for so long?
- Maybe this is a big feature and should be broken into several?
- Do we have a Discovery issue here? Maybe we don't know what we need to do?
- Is it a dependency issue? Are we blocked by some other team?
- Are we measuring teams properly? Pretty sure daily meetings are happening every day.
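Questions like these are easier to ask when batch size and frequency are visible. Below is a rough sketch of how such numbers could be computed per developer; the pull-request records, dates, and author names are invented for illustration, and real data would have to come from your own source control tool.

```python
from datetime import date
from statistics import mean

# Hypothetical pull-request records: (author, opened, merged, files_changed).
pull_requests = [
    ("dev1", date(2024, 1, 1), date(2024, 1, 12), 48),
    ("dev2", date(2024, 1, 1), date(2024, 1, 5), 12),
    ("dev2", date(2024, 1, 8), date(2024, 1, 12), 10),
]


def summarize(prs, author):
    """Count merged PRs and average their open time and size for one author."""
    own = [pr for pr in prs if pr[0] == author]
    return {
        "merged_prs": len(own),
        "avg_days_open": mean((merged - opened).days for _, opened, merged, _ in own),
        "avg_files": mean(files for _, _, _, files in own),
    }


summary = {author: summarize(pull_requests, author) for author in ("dev1", "dev2")}
```

In this made-up sprint, `dev1` merges one 48-file PR after 11 days, while `dev2` merges two smaller PRs that each stay open for 4 days. The numbers do not judge anyone; they just make the batch size visible so the team can discuss it in the retrospective.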
Fixing the train inertia
Change is not required, because companies can go bankrupt, and that's okay if that's your goal. However, if you care and want to do better, you need to work in a way that enables change. We can't fix all problems in one day, but we need to be able to look back in every retrospective and see that we are going in a better direction and making progress consistently.
If your team is not changing how it works or making progress consistently, maybe the team has become the train itself and is too busy to improve. If you don't want to be a train (I won't call you Thomas), what needs to happen is:
Do this:
- Care about it, and be proactive.
- Do not fear change, fear stagnation, fear lack of improvements.
- The team talks and finds better ways to write code as a daily activity, or CR dies too.
- Always reduce the batch sizes. If you do 1 per sprint, try to do 1 per week; if you do 1 per week, aim for 1 per day. Even if you can't achieve the goal, aim to get better, and don't accept things as they are.
- Invest time in refactoring.
- Do proper Software Design
- Do retrospectives and discuss this topic forever; this is a never-ending game.
- Slice and Dice your monoliths in order to remove the gates
- Move towards a distributed release mode, where every team releases its software.
- Automate everything, so no manual work is required to release software.
- Change your process, change your pipelines, change everything, rolling stones don't gather moss. Change can be better introduced with POCs and Experiments.
- Have collective ownership (XP principle) and do things, even if you are not the formal owner.
- Be curious, see what the market is doing, what other companies are doing.
- Be the change you want to see.
Cheers,
Diego Pacheco