Testing Queues and Batch Jobs

Testing could be considered a solved problem. Everybody knows the importance of testing, and unit testing and integration testing are not rocket science. However, it is still not uncommon to see a lack of testing, poor coverage, and flaky tests in the industry. Internal service implementation can be tested using fakes and mocks, and testing classes is a simple task if you have a good design. Refactoring a legacy system toward a better design makes it easier to test; legacy code can be more entangled, but refactoring is still possible and desirable. Sometimes, however, you have a good design that is still hard to test, because some architectures are naturally more complex to test.

If we consider standard RPC services, testing them is pretty vanilla. However, things get messier when we consider integration tests or end-to-end testing, mainly because of dependencies and state. From the test point of view, RPC services are simple: you have a request, very likely a REST call, and you have some assumptions; you make a call or a couple of calls, wait for results, and perform some assertions to check your assumptions. Is that it? Not really. You will need to handle dependencies and state. State can be complex to manipulate into a specific value, moment, or even shape, especially externally (as a consumer or tester). Now, if we consider batch jobs and queues, which are hidden by nature, testing gets more complex, and being able to handle state to perform proper testing is even more challenging.

Testing Batch Jobs and Queues

It's possible to just ignore the problem and not test batch jobs and queues. However, that is not a good practice and can bite you pretty hard. Another option would be not to use batch jobs and queues at all, but that is a pretty bad idea since they are good patterns and enable efficient asynchronous computing. So, we need to test them; before we dive into more problems and solutions, let's think about some desired properties and goals for testing.

Desired Goals

Isolation:  We want isolation in two degrees.

Design Degree of Isolation

Isolation needs to be preserved at all times; we do not want to leak implementation details or unnecessary complexity to the consumers. We also want to hide as many implementation details as possible to retain flexibility in the implementation. The less the consumers know about the service implementation, the better. We don't want to break isolation by sharing databases, contracts, internal shared libraries, or hidden contracts.

Testing Execution Isolation

We should be able to support multiple exploratory tests at the same time. Imagine multiple QA engineers testing at the same time, while various engineers run automated tests and do their own exploratory testing; we should never be stepping on each other's toes. No one should break anyone else, no matter the environment or the kind of testing.

Testable: Obviously, we want to be able to test the queues and batch jobs. However, we don't want testability at the cost of breaking isolation or leaking the design. Testing needs to be done in a way that does not weaken the design, make it less secure, or even make it incorrect. This is quite a challenge; for instance, if you make everything public and accessible externally via REST endpoints, you create a security concern and leak the abstraction.

Separation of Concerns: We want to separate the test code from the production code as much as possible. Ideally, we do not ship test code to production: it reduces the bundle size (if you are in JavaScript), makes the cold bootstrap faster (in any non-binary/compiled language), and uses less memory. Unless we also want to test in production, but that is a different story.

The Distributed Monolith

One solution you are probably thinking of is: OK, let's have two components: one is the service itself, and the second is the batch job. That results in an architecture just like this:


Distributed Monolith - Sharing the database 

We are breaking isolation here by sharing the database between the service and the batch system where we would have the batch jobs. We are also inviting code duplication, or pushing business logic into internal shared libraries (another very bad practice).

Such an approach also does not fix the testing problem; how can we test the batch jobs if they start every midnight? How can we trigger them? How do we pass parameters? How do we check the results? Sharing databases is like eating candy: can you do it just once, or will you eat the whole package? Often, you start with two systems accessing the same database, but as time passes, you realize you have hundreds.

Distributed monoliths result in coupling, make systems harder to change and evolve, require complex maintenance and painful migrations, and, as we are seeing, are also hard to test. Just because we can does not mean we should.

Although the design can suffer here, we might have a good option in terms of performance. We are going directly to the database. However, we might create a reliability issue because if the batch job hits the database too hard, the service consumers might experience latency or even downtime.

Leaking the Contract/Implementation

We could make it better and avoid the distributed monolith. We can enforce a contract on the service and make sure the batch job does not access the database directly anymore. Now, the distributed monolith is gone. You have an architecture like this:

The batch system consumes the Service

OK, we made it better. Actually, depending on what the batch job needs, this solution could be perfect. For it to be perfect, we need two things:

  1. Contract Operations: The service needs to concisely expose what the batch job needs. In that case, the batch system is no different from the UI, the mobile app, or any other consumer.
  2. Performance: We use batch systems in the first place for the sake of better performance. Calling a service could be slow, so this architecture might not work.

This option only works if we can have #1 and #2; otherwise, it does not work. Now, there is a second risk with this option: we could be leaking details of the contract/implementation that we don't want to leak. Because, again, we need to keep two more things in mind:

  1. We need to test it: To be able to trigger the batch job, we also need to change state and perform assertions to check the results. 
  2. Leaking: Because of item #1, we might end up leaking things we would not want our regular consumers to see, because they could break the consistency of the service or misuse it. However, if you are on this route, the chances that you will need to expose things in the contract (since the batch job is a consumer of the service) are very high.

This option is tricky; it might sound very good on the surface, but it is actually not that good because handling state is harder than it looks. Let's dive into that, starting with the simplest form of handling state: the database via SQL.

The Database 

The good news is that this approach works in some cases; the bad news is that it only works in some cases. At a glance, it looks like a great option because it is separated from the code. We have an architecture like this:


Database Approach to change state

We have a dual mode here: how we run in production is different from what happens in non-prod. In production, the system works like the left side of the picture; in non-prod, it works like the right side. The good thing about this approach is that the architecture is the same. The only difference is that during testing, someone runs some extra SQL scripts on the database to change state.
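For instance, here is a minimal sketch of what running those extra SQL scripts from a test could look like; the orders table, the JDBC URL, and the credentials are hypothetical, used only for illustration:

```java
// Hypothetical state-induction step run only in non-prod: the test writes the
// rows the batch job is expected to pick up, straight into the shared database.
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

public class BatchStateSetup {
    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/orders_db", "test", "test");
             Statement st = conn.createStatement()) {
            // Induce the state the batch job expects: a few pending orders.
            st.executeUpdate("INSERT INTO orders (id, status) VALUES (101, 'PENDING')");
            st.executeUpdate("INSERT INTO orders (id, status) VALUES (102, 'PENDING')");
            // After the batch run, the test would assert the rows moved to
            // 'PROCESSED' and then delete them so other runs are not affected.
        }
    }
}
```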

The problem is, let's talk about queues: usually, they are not in the database, and not all state is in the database, so this approach is limited. Plus, it does not deliver the second degree of isolation we want; it is inevitable that everybody steps on each other's toes since the database is a shared component.

Such a problem can be mitigated with containers in a local environment. Still, as we move to shared environments, it becomes a nightmare and a source of flaky tests, instability, and wasted time. Let's remember that if we put another system writing to this database, we go down the distributed monolith anti-pattern route, so that is not good.

Also, you still have one problem: how will you trigger the jobs? Let's see how we can move in a better direction.

Testing Interfaces

Now, what we really want is to have our cake and eat it too. In order to do that, we need testing interfaces. Testing interfaces allow us to have an endpoint that exposes some API only for testing and hides such an API in production. Consider the following architecture:

Testing Interfaces and Triggering Batch Jobs

Now, we have a REST controller inside the service that is capable of starting the batch processing immediately. This is great and very desirable: we no longer need to wait 24 hours to test it.
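As a sketch, such a testing interface could be a small Spring controller that launches the job on demand; the endpoint, job name, and parameters below are hypothetical, and we will see later how to keep this controller out of production:

```java
// Hypothetical test-only endpoint that triggers the batch job immediately,
// instead of waiting for the midnight schedule.
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.launch.JobLauncher;
import org.springframework.web.bind.annotation.PostMapping;
import org.springframework.web.bind.annotation.RequestParam;
import org.springframework.web.bind.annotation.RestController;

@RestController
public class BatchTestTriggerController {

    private final JobLauncher jobLauncher;
    private final Job reconciliationJob;

    public BatchTestTriggerController(JobLauncher jobLauncher, Job reconciliationJob) {
        this.jobLauncher = jobLauncher;
        this.reconciliationJob = reconciliationJob;
    }

    @PostMapping("/internal/test/batch/run")
    public String runNow(@RequestParam("runId") String runId) throws Exception {
        // Parameters passed by the tester cascade into the job execution.
        return jobLauncher.run(reconciliationJob,
                new JobParametersBuilder()
                        .addString("runId", runId)
                        .addLong("startedAt", System.currentTimeMillis())
                        .toJobParameters())
                .getStatus().toString();
    }
}
```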

We can pass parameters to the REST controller in order to cascade them to the batch jobs and even to the queues if necessary. OK, now we have fixed 50% of the problems: we no longer have to wait to trigger the jobs, and we can pass parameters. But what about the state? Now, we need to complicate things a little bit and introduce a blackboard architecture.

Blackboard architecture with testing interfaces 

Now, we have a centralized service (the blackboard) where we coordinate state change, state induction, parameters, result values, and anything else necessary to induce a specific state in the system. With such an architecture, we can achieve the two degrees of isolation. Because it's done inside the service, it's transparent to the consumers (who don't see any of this in production) and also to QA engineers testing, or any other engineer doing exploratory or integration tests.

A side effect is that we are mixing the production code with the testing code, represented in the diagram by a small gray box called "Test Code," which, as you can see, is mixed in with all the batch job classes. In languages like C++ or Rust, we could use macros to make sure such code does not go to production. Java and other languages also have techniques to separate such code from the application; Spring has a nice feature, profiles, that can be used to disable such code in production.
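A minimal sketch of the Spring-profiles idea, assuming a hypothetical test-only controller: when the application starts with the prod profile active, the bean is simply never registered.

```java
// Hypothetical test-only component guarded by a Spring profile. In production
// the app is started with --spring.profiles.active=prod, so this bean is never
// registered and the endpoint does not exist.
import org.springframework.context.annotation.Profile;
import org.springframework.web.bind.annotation.GetMapping;
import org.springframework.web.bind.annotation.RestController;

@RestController
@Profile("!prod") // excluded whenever the "prod" profile is active
public class TestStateController {

    @GetMapping("/internal/test/ping")
    public String ping() {
        return "test interface enabled";
    }
}
```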

Since we now have full control of state, we can induce different states for different test executions; let's say request #1 gets ids [1,2,3] and request #2 gets ids [4,5,6]. We can control and make sure tests are always isolated and never operating on the same ids. We can insert data and delete data at the end, and it is all transparent to the caller.
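As an illustrative sketch (class and method names are hypothetical), the blackboard can hand out disjoint id ranges so concurrent test runs never operate on the same rows:

```java
// Hypothetical blackboard piece: each test execution reserves its own block of
// ids, so request #1 and request #2 never collide.
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;
import java.util.stream.Collectors;
import java.util.stream.LongStream;

public class StateBlackboard {

    private final AtomicLong sequence = new AtomicLong(1);

    /** Reserves a block of ids that belongs to a single test run only. */
    public List<Long> reserveIds(int count) {
        long start = sequence.getAndAdd(count);
        return LongStream.range(start, start + count).boxed().collect(Collectors.toList());
    }

    public static void main(String[] args) {
        StateBlackboard blackboard = new StateBlackboard();
        System.out.println("request #1 -> " + blackboard.reserveIds(3)); // [1, 2, 3]
        System.out.println("request #2 -> " + blackboard.reserveIds(3)); // [4, 5, 6]
        // The blackboard would also insert the seed data for those ids and
        // delete it at the end of the run, transparently to the caller.
    }
}
```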

Now, imagine we want to do more: what if we also want to do chaos testing with the batch system? What if we want to do failure testing and inject a variety of different states? We can evolve this architecture even further.

Mock Server

The blackboard architecture is excellent. However, if we have multiple scenarios, it will start to get complicated. Ideally, we should externalize such rules and states to a different system. We could have a mock server containing the fake endpoints and fake states we want to load, like this:

Mock Server for Chaos Testing and more complex failure profiles

Imagine we have a REST service that can fetch files from S3. We add a bunch of JSON files containing different profiles for chaos testing; imagine that in one profile, we instruct the system to always return null, and in another profile, we instruct the system to always hang. Not only can we inject chaos and failures, but we can also have different scenarios. Think of it like A/B testing, where there could be hundreds of experiments, but for testing. Now, we can choose how we want the system to behave, making it possible to thoroughly test a variety of complicated yet important scenarios.
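As a sketch of the idea (the profile format and all names are made up), a chaos profile could be a small JSON document fetched from S3 by the mock server, and the test code decides how to behave based on it:

```java
// Hypothetical chaos profile and how test code could apply it. In a real setup
// the profile would come from a JSON file such as
// {"name":"always-null","behavior":"RETURN_NULL"} served by the mock server.
public class ChaosProfileDemo {

    enum Behavior { NORMAL, RETURN_NULL, HANG }

    record ChaosProfile(String name, Behavior behavior) { }

    static String callDependency(ChaosProfile profile) throws InterruptedException {
        switch (profile.behavior()) {
            case RETURN_NULL -> { return null; }        // dependency answers with null
            case HANG -> Thread.sleep(Long.MAX_VALUE);  // dependency never answers
            default -> { }                              // NORMAL: fall through to the real call
        }
        return "real response";
    }

    public static void main(String[] args) throws Exception {
        ChaosProfile profile = new ChaosProfile("always-null", Behavior.RETURN_NULL);
        System.out.println(callDependency(profile)); // prints: null
    }
}
```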

Now we are capable of:

  • Triggering a batch job execution at any moment, as many times as we want.
  • Isolating batch runs (without race conditions).
  • Avoiding flakiness and allowing concurrent testing safely.
  • Testing a variety of scenarios, failure scenarios, and chaos scenarios.
  • Mocking any external dependency.
  • Injecting/inducing any form of state.
  • Having a system that is fully testable, but preserves design integrity and does not leak implementation details.
  • Leveraging automation and safety without compromising good design.

Not all tests are created equal; some things are harder to test than others, and this is a good example of how we can test something hard with sound engineering and good practices. As much as black-box testing is desirable, it is not the answer for everything; sometimes you need to do it from within and allow testing internally. Testing interfaces are great, but they are just the beginning; state induction is also necessary and can be good if done right.

Good testing is quality of life and a better user experience, and it pays off in the long run.

Cheers,

Diego Pacheco









