CDC vs DW Patterns part 1
There are 3 reasons why these patterns might be beneficial and relevant for you. First of all, you might want to replicate a database because you want to run some analytical workload (I made a recent BigData post that might interest you and relates to this topic). The second reason could be that you want a new cluster or a new database technology. The third use case can be that you are slicing and dicing your monolith and want to split one service into 2 or even more services. No matter your use case, the architectural patterns I will describe in this blog post will help you when you need to migrate data at scale. Dealing with data is not the same as dealing with services. Data migrations are much harder than service migrations. It's pretty easy to do some rolling update and migrate service by service, re-deploying or even restarting them one by one. However, when talking about data, there are extra challenges.
To copy the data, you can take a dump of the source (legacy database) and import it into the new database. What happens if the schema is different? Well, we can approach it in two ways: (A) keep the schema AS-IS, and after switching the SOT (Source of Truth) to the new DB, change the schema under the hood. (B) Use a more complex way to deal with the data migration/replication if you want bi-directional sync or want to do things incrementally (code changes).
The Dual Writer
So if you can tear down a system and bring it offline, data migrations can be much easier and much less risky since there are no writes going on. That can be negotiated with the business partners in your organization. The good thing about this approach is that it is straightforward, low risk, and easy to implement. The challenge is that, first of all, you will disrupt your user experience. Secondly, you could easily be losing money or harming your brand by doing that. Thirdly, this might not be so easy: if you have a legacy monolith, you might be required to tear down the whole business due to multiple systems accessing the same database, which is a much harder sell and has much bigger consequences. Not all businesses have the same coupling issues, and not all businesses are equal; if you can find a window and make this strategy work for you, fine, otherwise keep reading :-) I personally have used this approach sometimes, and this is not shaming; there is nothing wrong with it if it makes sense and is affordable.
Plus, this process will work well when the migration is one-directional (even ONLINE). Imagine one system WRITES/READS (Source of Truth), but the other just READS data. This pattern can work if you can afford the downtime during the switch and make the whole thing atomic. So it is all or nothing; you cannot work incrementally with the data here. Code changes might still be done in an incremental fashion, which will be covered when I explain the Strangler pattern in detail (later in this post).
Dual-Writes: the Hard Route
Some NoSQL systems might not support CDC (Change Data Capture). In that case, you might need to go the hard route, which means Dual Write + Forklifts. Why is this hard? Well, let's understand how this pattern works, and then we can talk about the challenges. Dual Write means we do sort of a shotgun write, meaning we write to 2 different databases. Dual-write works well for new data from now on, let's say, but what happens with the old data? Well, we need to migrate that data with a different process, which is the Forklift process. Both processes combined can make sure the 2 DBs are in sync. Sounds easy, right? Well, not really. Lots of things can go wrong.
With Dual-Write and Forklift, we have a component between the legacy and the new service that dual-writes to both databases. That component (the Dual Writer) can either be a 3rd component or the new service itself. I would prefer a separate component to keep the new service interface/contract in a cleaner state. The Dual-write component needs to handle transactions and make sure the write happens atomically in both systems, so the challenges are:
* What happens if one system is down? Do we fail the whole thing or partially persist?
* If we are doing partial persistence, how do we make sure we don't lose data? Do we persist that state in one of the DBs? Do we use a 3rd DB?
* What happens if the schemas of both apps (legacy and new DB) are different? What happens if they use different database systems like MySQL and Cassandra?
It's possible to overcome all these problems, but as you can see, this solution can easily get pretty complicated. The Dual Writer can serve as a Transformation layer; it's possible to apply any schema or data transformation. However, that might need to be done table by table, and this component might get big, so you will need to make sure it stays modular. Any change on either side will break this component, since the Dual Writer is pretty much the GoF Adapter pattern and easily breakable.
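To make the idea more concrete, here is a minimal sketch of a Dual Writer, assuming hypothetical LegacyDao and NewDao interfaces and a Customer record (all names are illustrative, not from any specific library). It keeps the legacy DB as the Source of Truth and records failed writes to the new DB for later repair:

```java
// A minimal sketch of the Dual Write component; names and error handling are illustrative only.
public class DualWriter {

  interface LegacyDao { void save(Customer c); }
  interface NewDao { void save(Customer c); }
  record Customer(long id, String name) {}

  private final LegacyDao legacy;
  private final NewDao next;

  DualWriter(LegacyDao legacy, NewDao next) {
    this.legacy = legacy;
    this.next = next;
  }

  // Shotgun write: the legacy DB stays the Source of Truth, so we fail the request
  // if it rejects the write; a failure on the new DB is recorded for later repair
  // (the Forklift / a retry process has to reconcile it).
  public void save(Customer c) {
    legacy.save(c); // if this throws, nothing was written anywhere
    try {
      next.save(c);
    } catch (Exception e) {
      // Hypothetical repair path: persist the miss somewhere durable and retry later.
      System.err.println("New DB write failed for id=" + c.id() + ", queueing repair: " + e);
    }
  }
}
```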
The Forklift - LAST_MODIFIED approach
We are not done yet. We still have the forklift. The forklift needs to move the old data. The main problem is that you want to avoid re-scanning the same table multiple times, which can be done by having a LAST_MODIFIED column on every table in all schemas. That column would hold the TIMESTAMP of the last modification. This works well, but you need to augment the legacy systems: every time there is a change, we need to make sure this column gets updated. The challenges here are numerous, such as:
* Since we are talking about legacy systems, you need to make sure you don't miss any part of the applications - you might not have a central place to do it. If you miss it, you can lose data.
* Possibly there is also database code, meaning not only application code but also triggers, stored procedures, functions, and other obscure and poor forms of computing/coding.
* Strangler Pattern might help you do some centralization, but it does not change the problem's nature or the risks.
* What happens if the legacy systems have different clocks? Well, that can also blow up the whole system.
* There is a big refactoring to introduce this change and also a big effort to keep it up to date; remember, this is for every table in every schema. This LAST_MODIFIED approach also would not have the old values, only the new values.
Keep in mind this approach also has a big flaw for deleted data. You won't capture deletes since the data will be gone from the real table. You can fix this problem by doing LOGICAL deletion, but you might need to refactor your legacy to introduce that, so it might not be a good idea.
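Here is a minimal sketch of what the forklift polling could look like, assuming a hypothetical CUSTOMER table that already has the LAST_MODIFIED column, a MySQL legacy DB, and a PostgreSQL target (all connection details and table names are placeholders):

```java
import java.sql.*;
import java.time.Instant;

// Minimal forklift polling sketch; in a real forklift this would run on a schedule
// and the checkpoint would be persisted somewhere durable.
public class Forklift {
  public static void main(String[] args) throws Exception {
    Timestamp checkpoint = Timestamp.from(Instant.EPOCH); // last successfully copied change

    try (Connection legacy = DriverManager.getConnection("jdbc:mysql://legacy:3306/app", "user", "pass");
         Connection target = DriverManager.getConnection("jdbc:postgresql://newdb:5432/app", "user", "pass");
         PreparedStatement read = legacy.prepareStatement(
             "SELECT ID, NAME, EMAIL, LAST_MODIFIED FROM CUSTOMER " +
             "WHERE LAST_MODIFIED > ? ORDER BY LAST_MODIFIED");
         PreparedStatement write = target.prepareStatement(
             "INSERT INTO CUSTOMER (ID, NAME, EMAIL) VALUES (?, ?, ?) " +
             "ON CONFLICT (ID) DO UPDATE SET NAME = EXCLUDED.NAME, EMAIL = EXCLUDED.EMAIL")) {

      read.setTimestamp(1, checkpoint);
      try (ResultSet rs = read.executeQuery()) {
        while (rs.next()) {
          // Copy only rows changed since the last run, avoiding a full table re-scan.
          write.setLong(1, rs.getLong("ID"));
          write.setString(2, rs.getString("NAME"));
          write.setString(3, rs.getString("EMAIL"));
          write.executeUpdate();
          checkpoint = rs.getTimestamp("LAST_MODIFIED"); // advance only after a successful copy
        }
      }
    }
  }
}
```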
The Forklift - Trigger approach
We could have a Trigger on all tables for every INSERT or UPDATE to create a record in a SHADOW table (one per real table) with the primary key or ROW_ID and the snapshot of the change. That approach would be more centralized and would not require changing the application code. For sure, it would be a much smaller refactoring and much easier to maintain. The issue here is that we introduce overhead on the database for regular transactions.
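As an illustration, here is a rough sketch of the shadow-table setup done via JDBC, assuming MySQL and a hypothetical CUSTOMER(ID, NAME, EMAIL) table (the names and syntax are assumptions, and an equivalent AFTER INSERT trigger would be needed as well):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;

// Sketch of the trigger-based approach: a shadow table plus a trigger that
// snapshots every change, keyed by the source primary key.
public class ShadowTableSetup {
  public static void main(String[] args) throws Exception {
    try (Connection c = DriverManager.getConnection(
            "jdbc:mysql://localhost:3306/legacy", "user", "pass");
         Statement st = c.createStatement()) {

      // Shadow table: one snapshot row per change.
      st.execute("CREATE TABLE IF NOT EXISTS CUSTOMER_SHADOW (" +
                 "  ID BIGINT, NAME VARCHAR(255), EMAIL VARCHAR(255)," +
                 "  CHANGED_AT TIMESTAMP DEFAULT CURRENT_TIMESTAMP)");

      // Trigger copies the new values into the shadow table on every UPDATE,
      // adding overhead to every regular transaction on CUSTOMER.
      st.execute("CREATE TRIGGER CUSTOMER_AU AFTER UPDATE ON CUSTOMER " +
                 "FOR EACH ROW INSERT INTO CUSTOMER_SHADOW (ID, NAME, EMAIL) " +
                 "VALUES (NEW.ID, NEW.NAME, NEW.EMAIL)");
    }
  }
}
```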
Remember that you might be doing this for 3 reasons (BigData/Analytics, Upgrades, Slice and Dice the Monolith/Legacy). So if your use case is BigData, paying extra overhead for analytical workloads is not ideal and might not be acceptable depending on your scale.
Plus, while this trigger approach might sound pretty sexy, keep in mind that multiple changes could be lost in this design: you will only have the last change, or you would need to keep multiple records, making it even more expensive and hard to maintain - potentially even creating space issues (if you are not on the cloud or not fully on the cloud).
The Forklift - Diff/Delta approach
Another option, without using modified columns or triggers, is to build a diff/delta system, but you will need to store this state somewhere, and you will need to capture INSERTs, UPDATEs, and DELETEs. A Diff/Delta system can deal with deletes better than a LAST_MODIFIED column, but would it be less overhead than the trigger? Remember, we are not considering CDC logging yet. Without considering CDC event logs, I would say the Diff would be pretty expensive and resource-intensive. Diff/Delta can work well for offline DBs or data loaded into replicas, but for real-time, data-intensive DBs it would be pretty bad since it is resource-intensive.
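Just to illustrate the idea, here is a toy diff/delta sketch comparing two snapshots of the same table keyed by primary key (loading and hashing the snapshots, which is the expensive part, is assumed to happen elsewhere):

```java
import java.util.HashMap;
import java.util.Map;

// Toy diff/delta: compare the previous snapshot (from the state store) against the
// current one (from the legacy DB), both as Map<primaryKey, rowHash>.
public class DeltaDetector {

  static void diff(Map<Long, String> previous, Map<Long, String> current) {
    for (Map.Entry<Long, String> row : current.entrySet()) {
      String before = previous.get(row.getKey());
      if (before == null) {
        System.out.println("INSERT id=" + row.getKey());
      } else if (!before.equals(row.getValue())) {
        System.out.println("UPDATE id=" + row.getKey());
      }
    }
    for (Long id : previous.keySet()) {
      if (!current.containsKey(id)) {
        System.out.println("DELETE id=" + id); // deletes are visible, unlike LAST_MODIFIED
      }
    }
  }

  public static void main(String[] args) {
    Map<Long, String> previous = new HashMap<>(Map.of(1L, "hashA", 2L, "hashB"));
    Map<Long, String> current  = new HashMap<>(Map.of(1L, "hashA2", 3L, "hashC"));
    diff(previous, current); // prints UPDATE id=1, INSERT id=3, DELETE id=2
  }
}
```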
Okay, so you can see at this point that Dual-Writes + Forklift is a lot of work, and if you can avoid it, better. Unfortunately, sometimes you might not be able to avoid it, but I would not say that is the case for most use cases.
Enter CDC - Change Data Capture
CDC is not much different from the Forklift we saw before. There are 4 types of CDC:
1. Modified Column
2. Trigger
3. Diff/Delta
4. Log Based
I won't talk more about types 1, 2, and 3 because I mentioned them a lot when describing the Forklift. Now I will focus on what really matters, type 4 - Log Based. This is what we call in Brazil the Jump of the Cat ("O Pulo do Gato"). Most databases have a binary event log with the changes that happen. If you read from that log, you can actually know exactly what's going on, with no refactoring needed and no performance issues. So that's really the way to go at Scale.
Log-based CDC is the way to go. Debezium is a framework that can help you with that. What about bi-directional changes? We can have CDC on the new DB as well; in this case, we would be able to deal with bi-directional changes. How is that different from what ORACLE GoldenGate did ages ago? Not much different, actually. So if you need to do transformations, you still can code and make changes before/after the events go to Kafka. Sure, we are still fragile to changes here, so ideally you want to piggyback on the Strangler pattern and deal with schema changes later.
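Just to give a taste, here is a minimal sketch using Debezium's embedded engine to tail a MySQL binlog and hand each change event to a callback. The connection details are placeholders, and the exact property names depend on the Debezium and connector version you use, so treat the config as an assumption to adapt (you also need the debezium-embedded and debezium-connector-mysql dependencies on the classpath):

```java
import io.debezium.engine.ChangeEvent;
import io.debezium.engine.DebeziumEngine;
import io.debezium.engine.format.Json;

import java.util.Properties;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class CdcTail {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.setProperty("name", "legacy-connector");
    props.setProperty("connector.class", "io.debezium.connector.mysql.MySqlConnector");
    // Offsets and schema history stored locally for this sketch; use durable storage in real life.
    props.setProperty("offset.storage", "org.apache.kafka.connect.storage.FileOffsetBackingStore");
    props.setProperty("offset.storage.file.filename", "/tmp/offsets.dat");
    props.setProperty("database.hostname", "localhost");
    props.setProperty("database.port", "3306");
    props.setProperty("database.user", "debezium");
    props.setProperty("database.password", "dbz");
    props.setProperty("database.server.id", "85744");
    props.setProperty("topic.prefix", "legacy");
    props.setProperty("schema.history.internal",
        "io.debezium.storage.file.history.FileSchemaHistory");
    props.setProperty("schema.history.internal.file.filename", "/tmp/schema-history.dat");

    DebeziumEngine<ChangeEvent<String, String>> engine =
        DebeziumEngine.create(Json.class)
            .using(props)
            .notifying(record -> {
              // Each record is a JSON change event read from the binlog; this is where
              // you could transform it and apply it to the new database or Kafka.
              System.out.println(record.value());
            })
            .build();

    ExecutorService executor = Executors.newSingleThreadExecutor();
    executor.execute(engine); // the engine keeps streaming changes until closed
  }
}
```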
So that's it for this blog post. Big enough :-) In the next one, I will cover the Strangler pattern in more detail and how it's relevant to CDC. I hope you guys like it.
Cheers,
Diego Pacheco