S3: The Next Distributed Monolith?

S3 is an object store that scales with reasonable performance. S3 can be used for a variety of use cases like Big Data analytical workloads, data pipelines, and AI/ML pipelines. S3 is big in the data universe but is also used in services and microservices for many other use cases like file sharing, document management, and legal signatures. S3 is not a relational database; however, from an isolation point of view, S3 should be seen the same way we see data stores. Since 2020, S3 guarantees strong read-after-write consistency. Performance-wise, S3 handles up to 5,500 read requests per second and 3,500 write requests per second per prefix. S3 is easy to use and reliable, and like everything that is popular and useful, it can be misused. 
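As a minimal sketch of that read-after-write behavior, assuming the AWS SDK for Java v2 and a made-up bucket and key, a read issued right after a write now returns the data that was just written:

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class ReadAfterWrite {
    public static void main(String[] args) {
        // Bucket and key are hypothetical placeholders.
        String bucket = "my-booking-images";
        String key = "houses/house-123/front.jpg";
        try (S3Client s3 = S3Client.builder().region(Region.US_EAST_1).build()) {
            // Write the object...
            s3.putObject(PutObjectRequest.builder().bucket(bucket).key(key).build(),
                    RequestBody.fromBytes(new byte[]{1, 2, 3}));
            // ...and a read right after sees what was just written (strong consistency).
            byte[] data = s3.getObjectAsBytes(GetObjectRequest.builder().bucket(bucket).key(key).build())
                    .asByteArray();
            System.out.println("Read " + data.length + " bytes right after the write");
        }
    }
}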

S3 and Data formats


S3 allows us to store several kinds of data formats: txt files, binary files, CSV, JSON, XML, Parquet, Iceberg, Avro, ORC. Data formats have different tradeoffs like schema evolution, compression, suitability for a given workload, and row or column orientation, which determines whether they are better for reading or for writing. 

S3 - All kinds of formats and use cases

Depending on your use case, one format might make more sense than another. For instance, CSV, TXT, and JSON are easier to read but will not necessarily provide the best performance. Clearly, your use case will determine how you should organize the data on S3. 

Enter the Distributed Monolith


Replace S3 with a Relational Database and you have a distributed monolith. Here we have something much simpler, and it might not sound as bad as a shared relational database. Consider a simple image sharing use case. Let's say we are building a system for booking vacation homes, cars, and plane tickets. Cars and vacation homes require pictures; no one will book a car or a house without seeing the pictures. 

Imagine we have 3 services. First, the Vacation House Service, where you can book a vacation house; this service needs to retrieve the pictures of the house, nearby attractions, and maybe even outdoor pictures of restaurants, nature, and any other good stuff around the vacation address. 
S3 - Image Sharing use case (Distributed Monolith)

The Car Rental Service takes care of booking cars, insurance, driver's license documentation, contract signature, and all the nitty-gritty details related to car booking. Cars require pictures, not only of the exterior but also of the interior of the vehicle. The rental service could also take pictures when the rental is returned to assess damages that happened during the rental. The Car Rental Service is the second service that needs to read and write pictures in S3.

Finally, we have the Reports Service, where fraud detection will happen and recommendations will be processed. In order to produce such reports, the Reports Service needs to read the images from S3 and perform big data processing on them.

All services read all images from the same S3 bucket, using the same folder. The first issue here is that the Big Data system can slow down the online services and vice-versa. Secondly, what happens if car images are deleted by the Car Rental Service? Well, Big Data cannot run fraud detection anymore. Okay, this second problem is easy to fix: don't allow deletions to happen from the Car Rental Service. 
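To make the coupling concrete, here is a hypothetical sketch (bucket, prefix, and key names are made up) of what the three services end up doing when they all share the same bucket and folder:

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.DeleteObjectRequest;
import software.amazon.awssdk.services.s3.model.ListObjectsV2Request;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

public class SharedBucketCoupling {
    // Every service hard-codes the same bucket and the same "images/" prefix.
    static final String BUCKET = "booking-images";
    static final String PREFIX = "images/";

    static void houseServiceUpload(S3Client s3, byte[] photo) {
        s3.putObject(PutObjectRequest.builder().bucket(BUCKET).key(PREFIX + "house-1.jpg").build(),
                RequestBody.fromBytes(photo));
    }

    static void carServiceCleanup(S3Client s3) {
        // Car Rental Service deletes "its" images, but the Reports Service reads the same prefix.
        s3.deleteObject(DeleteObjectRequest.builder().bucket(BUCKET).key(PREFIX + "car-9.jpg").build());
    }

    static void reportsServiceScan(S3Client s3) {
        // The Reports Service scans everything under the shared prefix, competing for the same
        // per-prefix request limits as the online services.
        s3.listObjectsV2(ListObjectsV2Request.builder().bucket(BUCKET).prefix(PREFIX).build())
                .contents()
                .forEach(obj -> System.out.println("processing " + obj.key()));
    }
}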

What happens if we want to process the images? Do we need to change all 3 services? What if we want to change the format of the images to a higher resolution? How can we transparently generate thumbnails? How can we have centralized, fine-grained control over how the data is written vs. how it is read?

What if the House Service needs to change the resolution but the Car Rental Service doesn't? Will we be forced to handle the change in both services? How can you enforce different resolutions for different services if they all write to the same place? 

Different Folders


By adding more folders, we can improve security and performance. However, we still have many of the previous problems. Folders also help us have granular policies for access and security. Ultimately, house pictures are very different from car pictures: houses will have more pictures and require more than cars do. 
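In practice the granular access would come from IAM policies scoped to each folder (prefix); on the code side, a hedged sketch of the same idea is to give each service a writer that is locked to its own prefix (the PrefixScopedWriter name and bucket are hypothetical):

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// Hypothetical helper: each service gets a writer scoped to its own folder (prefix),
// so house keys can never land under cars/ and vice-versa.
public class PrefixScopedWriter {
    private final S3Client s3;
    private final String bucket;
    private final String prefix; // e.g. "houses/" or "cars/"

    public PrefixScopedWriter(S3Client s3, String bucket, String prefix) {
        this.s3 = s3;
        this.bucket = bucket;
        this.prefix = prefix.endsWith("/") ? prefix : prefix + "/";
    }

    public void write(String name, byte[] bytes) {
        // The prefix is prepended here, not chosen by the caller.
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(prefix + name).build(),
                RequestBody.fromBytes(bytes));
    }
}

// Usage: new PrefixScopedWriter(s3, "booking-images", "houses/").write("house-1.jpg", photoBytes);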
 
Adding some level of Isolation

Such a change already provides some benefits, because now when we change something related to cars it will not affect houses. We can do better, and we should. Maybe this is all you need and it is good enough. However, SOA should be applied here, and all communication should happen via service interfaces, not via direct storage access. We still can do better.

Another Bucket


Now we could literally move houses to a different bucket. Instead of one bucket, we have 2. From an isolation perspective it is cleaner, easier to maintain, and easier to understand the cost implications. We can tag buckets, which can be used both for cost observability and for security.
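A minimal sketch of tagging one of the buckets with the AWS SDK for Java v2 (the bucket name and tag values are hypothetical):

import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.PutBucketTaggingRequest;
import software.amazon.awssdk.services.s3.model.Tag;
import software.amazon.awssdk.services.s3.model.Tagging;

public class TagBucket {
    public static void main(String[] args) {
        try (S3Client s3 = S3Client.create()) {
            // Tags show up in cost allocation reports and can be referenced by access policies.
            s3.putBucketTagging(PutBucketTaggingRequest.builder()
                    .bucket("house-images")
                    .tagging(Tagging.builder()
                            .tagSet(Tag.builder().key("owner").value("vacation-house-service").build(),
                                    Tag.builder().key("cost-center").value("bookings").build())
                            .build())
                    .build());
        }
    }
}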

Different S3 Buckets

Here we have two buckets, which easily could be dozens to hundreds. Some use cases might require much more performance and lower latency, where EFS could be a better fit. Perhaps we need some level of indirection.

Adding a Library


We could add a wrapper library that would act as an "Image Service". You must be wondering, why not create a service? If most of the time you are just moving the file to S3, an internal shared library will be better because you will upload the file directly to the final destination. 
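A minimal sketch of what such a wrapper library could look like (the ImageStore contract and the S3-backed implementation are hypothetical, assuming the AWS SDK for Java v2):

import software.amazon.awssdk.core.sync.RequestBody;
import software.amazon.awssdk.services.s3.S3Client;
import software.amazon.awssdk.services.s3.model.GetObjectRequest;
import software.amazon.awssdk.services.s3.model.PutObjectRequest;

// The contract every service codes against; no bucket names or S3 types leak out.
interface ImageStore {
    void put(String imageId, byte[] bytes);
    byte[] get(String imageId);
}

// S3-backed implementation shipped inside the shared library.
class S3ImageStore implements ImageStore {
    private final S3Client s3;
    private final String bucket;

    S3ImageStore(S3Client s3, String bucket) {
        this.s3 = s3;
        this.bucket = bucket;
    }

    @Override
    public void put(String imageId, byte[] bytes) {
        s3.putObject(PutObjectRequest.builder().bucket(bucket).key(imageId).build(),
                RequestBody.fromBytes(bytes));
    }

    @Override
    public byte[] get(String imageId) {
        return s3.getObjectAsBytes(GetObjectRequest.builder().bucket(bucket).key(imageId).build())
                .asByteArray();
    }
}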

A service will make sense if we do extra processing like thumbnail generation. Even then, we do not necessarily need a service, because we could do it with a Big Data approach and just have a data pipeline to process the thumbnails as a background job. 

Having an interface in front of S3 has benefits. Let's say you want to save costs. There are open-source solutions like Ceph, GlusterFS, and many others that could be leveraged in various scenarios for cost savings, but also to migrate out of S3 if necessary. Even the opposite is possible: starting with Ceph or Gluster and later offloading some or all use cases to S3. It would all be transparent by having a contract in front of the datastore. Such a contract can be a library or can be a service. 
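Since Ceph's object gateway (RGW) exposes an S3-compatible API, one hedged sketch of such a migration, reusing the hypothetical ImageStore and S3ImageStore from the earlier sketch, is simply pointing the same client at a different endpoint inside the library, without touching any consumer (the endpoint URL is a placeholder):

import java.net.URI;
import software.amazon.awssdk.regions.Region;
import software.amazon.awssdk.services.s3.S3Client;

public class StoreFactory {
    // Consumers still see only the ImageStore contract; only this factory knows
    // whether the bytes land in AWS S3 or in an on-prem Ceph cluster.
    static ImageStore awsBacked() {
        return new S3ImageStore(S3Client.builder().region(Region.US_EAST_1).build(), "house-images");
    }

    static ImageStore cephBacked() {
        // Hypothetical on-prem endpoint for Ceph's S3-compatible gateway.
        S3Client ceph = S3Client.builder()
                .region(Region.US_EAST_1) // required by the SDK, ignored by the gateway
                .endpointOverride(URI.create("https://ceph-rgw.internal.example:7480"))
                .forcePathStyle(true) // path-style addressing is usually needed for self-hosted gateways
                .build();
        return new S3ImageStore(ceph, "house-images");
    }
}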

It is all about Self-Containment

No matter if you go with an internal shared library or a service, you must apply service thinking. For instance, all your S3 buckets should be in a catalog, which can be dynamically discovered or statically structured, like a simple JSON file on GitHub. 
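A minimal sketch of what such a static catalog could look like in code (the bucket names, owners, and the BucketCatalog type are all hypothetical; in practice the entries would be parsed from that JSON file rather than hard-coded):

import java.util.Map;
import java.util.Optional;

// One entry per bucket: who owns it, and what it is for.
record BucketEntry(String bucket, String ownerService, String purpose) {}

class BucketCatalog {
    // In a real setup this map would be loaded from the JSON file kept on GitHub.
    private static final Map<String, BucketEntry> ENTRIES = Map.of(
            "house-images", new BucketEntry("house-images", "vacation-house-service", "house pictures"),
            "car-images", new BucketEntry("car-images", "car-rental-service", "car pictures"));

    static Optional<BucketEntry> lookup(String bucket) {
        return Optional.ofNullable(ENTRIES.get(bucket));
    }
}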

Having a library/service allows you to introduce validations and observability, and to better understand the consumer use cases. However, keep in mind that libraries have other traps associated with them. Distributed Monoliths are not fun. S3 can hold all sorts of formats and use cases, and the same principles should be applied. We should isolate MySQL in the same way we would isolate S3 or even Redis. 
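For instance, a hedged sketch of adding validation and a basic timing metric on top of the hypothetical ImageStore contract from earlier (the size limit and the println metric are made-up examples):

import java.util.Objects;

// Decorator over the hypothetical ImageStore contract: validates input and
// records a simple timing before delegating to the real store.
class ValidatingImageStore implements ImageStore {
    private static final int MAX_BYTES = 5 * 1024 * 1024; // made-up 5 MB limit
    private final ImageStore delegate;

    ValidatingImageStore(ImageStore delegate) {
        this.delegate = Objects.requireNonNull(delegate);
    }

    @Override
    public void put(String imageId, byte[] bytes) {
        if (imageId == null || imageId.isBlank()) {
            throw new IllegalArgumentException("imageId is required");
        }
        if (bytes == null || bytes.length == 0 || bytes.length > MAX_BYTES) {
            throw new IllegalArgumentException("image must be between 1 byte and " + MAX_BYTES + " bytes");
        }
        long start = System.nanoTime();
        delegate.put(imageId, bytes);
        System.out.println("put " + imageId + " took " + (System.nanoTime() - start) / 1_000_000 + " ms");
    }

    @Override
    public byte[] get(String imageId) {
        return delegate.get(imageId);
    }
}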

One might be thinking: why add complexity when we can just call everything directly? Because isolation is the only way to protect your team and your solutions from high coupling. You can end up with hundreds or thousands of buckets; how will you manage them at scale? How do you know whether you can delete something or need to keep it? How do you assign ownership? S3 should be managed like any other datastore, and it carries the same risk of creating Distributed Monoliths, which are not as bad as a shared Relational Database but still can have the same blast radius issues.

Cheers,
Diego Pacheco
