Environments and People

If you talk with engineers, software architects, managers, pretty much everyone, the answer will be the same: they are not happy with shared environments. We can narrow down shared environments to common envs like dev, test, stage, demo, etc. Pretty much all non-prod environments suck. That is the reality in our technology industry. I don't think I have ever liked any env in any company, so think about this: did you ever see a non-prod env that you liked? Now, the usual answer to this problem is very weird, because if you talk to people, what do they say to you? "If we could have one more environment, then we would get it right." So my question is, why can't we get it right in the environments that we have in the first place? Why do we need a new one? In fact, we don't need it.

Why do we need environments? 

Some people might argue that envs are necessary for testing. I would say that this is a very expensive kind of testing. Why do you need an environment? Why can't you test on your machine? Why can't you have a less expensive testing mechanism? Some problems are harder than others for sure. For instance, let's say you have legacy proprietary software that you don't have the source code for. That is much harder to test; it's not impossible, but harder for sure.

Just because people don't know better and proper ways of testing things, they assume the only way is to have a shared environment. Real engineers don't have this problem because they run things on their own machines as much as possible and avoid shared environments at all costs. 

One could argue that we need a shared environment for complex, intensive, long-endurance testing. However, if you can reproduce such scenarios locally, you don't need the shared environment. For Continuous Integration (CI) reasons, you do want a shared environment to integrate the code and make sure the tests run there all the time.

What IF you can't run locally? 

Well, then we have something that is very expensive to test. Are you sure there is no Docker/Podman container available that would allow you to run locally? Sometimes there are solutions; it's just that such solutions are not integrated into companies' day-to-day workflows, or people don't know how to make such changes. Is it really possible to have software so bad that you can't test it locally? I think that is very hard... The reason is that engineers usually build tools when they cannot do what they need.

Take AWS, for instance. One of the things I really dislike about AWS is that you have to pay to learn. Several solutions cannot run on your machine, so you need to have an AWS account, and you will pay an AWS bill in order to do a POC with a service, for instance, Bedrock, agent-core, or SQS.

However, engineers usually find a way. There is an open source project called LocalStack, which I've been using for years at this point, that mocks AWS APIs. It is not for production use cases; it is for testing, so we can test things locally.
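As a minimal sketch of how a codebase could switch between real AWS and a local mock, the helper below builds the keyword arguments for an SDK client. It assumes LocalStack is listening on its default edge port (4566) and that the client is created with boto3, whose `client()` accepts `endpoint_url`. The `USE_LOCALSTACK` variable and the `aws_client_kwargs` function are hypothetical names made up for this example.

```python
import os

# Assumption: LocalStack is running locally on its default edge port.
LOCALSTACK_URL = "http://localhost:4566"


def aws_client_kwargs(env=None):
    """Build keyword arguments for an AWS SDK client.

    USE_LOCALSTACK is a hypothetical variable name used only in this sketch.
    """
    env = os.environ if env is None else env
    kwargs = {"region_name": env.get("AWS_REGION", "us-east-1")}
    if env.get("USE_LOCALSTACK") == "1":
        # Point the SDK at the local mock and use dummy credentials,
        # so no real AWS account (or bill) is involved.
        kwargs["endpoint_url"] = LOCALSTACK_URL
        kwargs["aws_access_key_id"] = "test"
        kwargs["aws_secret_access_key"] = "test"
    return kwargs


# With boto3 installed, this would be used as:
#   import boto3
#   sqs = boto3.client("sqs", **aws_client_kwargs())
```

The same application code then runs against the mock on a laptop and against real AWS elsewhere, with only one variable changing.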

What IF I don't have an API?

It's pretty normal that you won't have APIs for all the "state" you need to "induce". For instance, Services/Microservices might not have an API ready for you to simulate a "state" where a purchase in an e-commerce is absurd, like someone buying 300 iPhones, which might be a fraud scenario. How will you test it? If you have the code, you can create a testing interface and expose an API, which will be used in non-production environments only, for the sole purpose of creating the right state for testing.

Not having an API is just an inconvenience; it's not a problem because if you do have the source code, you can go there and add the testing interface (API) you need, and boom, problem solved. 
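A minimal sketch of what such a testing interface could look like, using an in-memory stand-in for a real service. Everything here is hypothetical and only for illustration: `OrderService`, `OrderTestingInterface`, the `environment` argument, and the fraud threshold of 100.

```python
import uuid


class OrderService:
    """Hypothetical stand-in for an e-commerce order service."""

    FRAUD_THRESHOLD = 100  # assumed limit, for illustration only

    def __init__(self):
        self.orders = {}

    def is_suspicious(self, order_id):
        return self.orders[order_id]["quantity"] >= self.FRAUD_THRESHOLD


class OrderTestingInterface:
    """Induces state directly; must never be wired up in production."""

    def __init__(self, service, environment):
        if environment == "prod":
            raise RuntimeError("testing interface is disabled in prod")
        self._service = service

    def induce_order(self, product, quantity):
        # Each induced order gets a fresh ID, so tests never collide.
        order_id = str(uuid.uuid4())
        self._service.orders[order_id] = {"product": product, "quantity": quantity}
        return order_id


service = OrderService()
testing = OrderTestingInterface(service, environment="dev")
order_id = testing.induce_order("iPhone", 300)
print(service.is_suspicious(order_id))
```

The guard in the constructor is the important design choice: the interface exists only to create state for tests, so it refuses to start where real customers live.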

What IF you don't have the Source Code?

Let's say you have an old legacy proprietary system where there are no APIs. Well, if there is a UI (even if it is desktop software), there is a way to create the state you need, even if it is via clicks. We can write software that does the clicks for us and creates the data we need. That might be slow, but we can still create the "state" we need.

What IF you can't click on the UI?

It's getting challenging... :-) Well, in that case, and I say this as a last resort really (you should never do this, never ever), if you can't build testing interfaces or drive the UI, then you need to do it directly in the database. It's the last resort because this is how we created distributed monoliths, and it's not cool to access the database directly. If this is the last resort, so be it, but again, it must be the last resort.

We want to approach the problem in this order:

What I'm saying here is: if there is an API, go use the API that is there. If you don't have the API but have the source code, no problem, create a testing interface, which should be the default behavior for most of the problems, assuming you have the source code. For the special cases and difficult scenarios, go via the UI, and really, only if it's the last resort, go via the database.
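The order of preference above can be written down as a tiny decision function. This is only a sketch to make the fallback chain explicit; the function name and the returned labels are made up for illustration.

```python
def choose_induction_path(has_api, has_source_code, has_ui):
    """Return the cheapest acceptable way to induce state."""
    if has_api:
        return "api"
    if has_source_code:
        return "testing-interface"
    if has_ui:
        return "ui-automation"
    return "database"  # last resort only


# A legacy system with no API and no source code, but with a desktop UI:
print(choose_induction_path(has_api=False, has_source_code=False, has_ui=True))
```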

It's a Data problem... (Really?)

It's not a data problem. It's a state problem. Let's go back to my e-commerce example. Let's say we need to test a scenario where the user tries to buy an absurd number of items (300 iPhones). Some might argue: "I need production data that represents a real user buying 300 iPhones." Well, I might not have that. So we're never ever gonna test it?

So it's easy to think that it's a data problem. But in reality, it's a state problem. What you need is not production data. I would argue it does not matter if the name of the user is John or Dave. Who really cares? What matters is to have the user in that "specific state", and such a state must be induced. Now, do we have the infrastructure in place to deliver such a state? If not, then we need to build it.

One might say this is a test data creation problem, but I would argue it is not, because otherwise we are back saying that this is a data problem, and again it's not a data problem. You don't need special data; you need to be able to "provoke" or "induce" a variety of states. 
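To make the "state, not data" point concrete, here is a minimal sketch. The function names, the list of names, and the limit of 100 are assumptions made for illustration; the point is only that the check depends on the induced state and never on who the user is.

```python
import random


def induce_bulk_purchase(quantity, product="iPhone"):
    """Build a user in a specific state; the identity is deliberately arbitrary."""
    return {
        "name": random.choice(["John", "Dave", "Maria"]),  # irrelevant to the test
        "cart": [{"product": product, "quantity": quantity}],
    }


def looks_like_fraud(user, limit=100):
    """Hypothetical rule: any cart line at or above the limit is suspicious."""
    return any(line["quantity"] >= limit for line in user["cart"])


user = induce_bulk_purchase(300)
print(looks_like_fraud(user))  # same answer whether the user is John or Dave
```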

The Problem is not the environment

People love to blame envs. But they forget, or they don't even know, that their bad practices are what is causing the environment to be bad in the first place. That's why a new environment doesn't work: people will make the same mistakes there. Here are a couple of examples:

Hard-Coded IDs: People look up an ID in a database and hard-code that ID in a test. A different person does the same, but in a test that deletes that record. Boom, we have a recipe for a time bomb: an intermittent, flaky test. The ideal practice would be for me to insert data using an API or testing interface before the test runs, run my tests in an automated fashion, and then delete the data at the end. Now it's very, very, very hard to have a different test using the same ID, because we have isolation, and there is no problem running this in a shared environment.
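A minimal sketch of that create, test, delete cycle. The `SHARED_STORE` dictionary is a hypothetical in-memory stand-in for whatever API or testing interface the shared environment exposes, and `isolated_record` is a name made up for this example.

```python
import uuid
from contextlib import contextmanager

# Hypothetical in-memory stand-in for a shared environment's data store.
SHARED_STORE = {}


@contextmanager
def isolated_record(payload):
    """Create a record with a fresh ID and always delete it afterwards."""
    record_id = f"test-{uuid.uuid4()}"
    SHARED_STORE[record_id] = dict(payload)
    try:
        yield record_id
    finally:
        # Cleanup runs even if the test body fails.
        SHARED_STORE.pop(record_id, None)


def test_rename_in_isolation():
    with isolated_record({"name": "John"}) as record_id:
        SHARED_STORE[record_id]["name"] = "Dave"
        assert SHARED_STORE[record_id]["name"] == "Dave"
    # Nothing is left behind for another test to trip over.
    assert record_id not in SHARED_STORE


test_rename_in_isolation()
```

Because every run generates its own ID, two people running tests against the same shared environment never fight over the same record.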

Lazy Re-use: Another common anti-pattern in a shared environment goes like this: I need to test something, so I go and look for a Jenkins job to test my changes. I have my branch, but I take an existing Jenkins job and just point it to my branch. Never mind that the Jenkins job was a CI job, and now there is no CI running anymore... What is the fix here? Well, leave the CI job alone, don't touch it. Clone that Jenkins job so you have a job just for you; that way you have isolation, and we still have CI happening.

Server Take Over: There is a server deployed on the cloud env called stage. Now someone needs to "test" a very specific branch there. So what do they do? They go on the machine and put their branch on the server. Now, other tests will break because they are running against a server that was hijacked. How do we fix this? Well, first of all, why don't we have real CI in this case? Why does it need to be a special branch? Why is the trunk, main, development (however you call it) not there? Why do we need to change the branch? Another point is, why don't you create a new instance of the server? Or even better, why don't you run this locally? Let's understand the anti-patterns here. First, it sounds like some "developers" don't have a local environment and are testing in the most expensive way possible. Second, if we have proper isolation and proper automation, will people need to touch envs? Maybe they don't. Maybe this is evidence of manual tests and bad engineering practices.

Again, the problem is not the environment. Create as many envs as you want; you will have the same issues. Why do people need to touch the environment? Think about that. Maybe they are using the environment as a very expensive local environment, and that, again, is wrong.

Considering all the things I'm saying here, we are heading toward principles and best engineering practices like isolation, automation, real CI, and testing locally.

I have to say that without proper principles and the right engineering practices, we will never fix this problem, and no matter how many envs we have, we will always be blocked by "Oh, I don't have an environment". Why don't we have the same problems in prod? Because people can't touch prod. Very few people can touch production; that's why it's much more stable. Again, the problem is not the environment; it is the lack of proper practices.

How can we make it better? 

Here are some tips to make it better:
  • Always automate all tests
  • Make sure all tests are isolated
  • Make sure you have real CI (stop using feature branches)
  • Make sure you have a proper Induction Platform
  • Make sure you create Testing Interfaces on the right services if you don't have APIs
  • Always look for the cheapest way of testing, which is local, and embrace local envs.
  • Make envs immutable: don't touch dev envs, do everything via automation, don't hammer anything there, and don't put your branch there.
  • Talk about how people behave, and if that behavior is right or wrong. If you don't have these kinds of talks (usually in retrospectives), you can't fix such behavior.
  • Train managers to understand bad behavior and fix such behavior. 
  • Stop creating environments, start changing the culture and how people behave.
Cheers,
Diego Pacheco
