Why is Big Data Hard?

Yes, Big Data is hard. It's hard not only because of the number of technologies a good data engineering team needs to master: Spark, Flink, and Kafka Streams (batch and streaming), Hadoop, HDFS, and Hive if you have a DW legacy (most likely you do), plus the Data Science side of it, with discovery and execution at scale. There is a need for different kinds of storage and design/modeling, and that's not even the hard part. The technology landscape gets bigger and bigger as time passes. We have many specializations, such as Frontend/Mobile Engineering, Backend Engineering, Architecture, DevOps (which is a movement, not a department, but all companies decided it's a role, so you know what I mean), QA (a dying one?), Product, Management, and Data Engineering, which often has Data Scientists working alongside Data Engineers. To some degree, Data Engineering and Data Science have the same issues Product has today. Unfortunately, the product folks are still too much about project management, just as data engineering is still too much about plain old SQL.

Extracting value from Data

Early Big Data vendors were pushing for accumulating data (like in HDFS and later on S3): first you pile up all the data, and later you figure out what to do with it. Often that does not work out so simply. Information alone is useless; we need wisdom, which means converting data into actions. Actions are more than insights and require different skills and an agile approach to learn and create value for customers. Running a data pipeline that cross-joins 50% of your datasets and produces a report at the end does not necessarily mean value. Big Data still requires product discovery.

Value does not mean just having good insights from your data; it means using them to drive value to your customers by fixing their problems, which is far greater than just giving them raw analytics and dashboards so they can analyze trends by themselves. Domain observability is great, don't get me wrong, but we need more than that.

Inaccurate Data and Cleaning

A big problem is having up-to-date and reliable data. Easily 60-80% of a data engineer's time can be consumed by data cleaning activities. This problem is about how we structure data and how SOA services and microservices deal with data. We need to remember that most companies have legacy systems, often in old languages, and consistency is much more than having primary keys on the database.

Why is that a problem? Well, if you are training your models with old and inaccurate data, you can easily get wrong predictions and easily go in the wrong direction. Data munging is a necessary evil, but it is boring and time-consuming.
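Just to make "munging" concrete, here is a minimal sketch of the kind of cleaning steps that eat that 60-80%, using Spark; the file paths, column names, and rules are all hypothetical, only there for illustration:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import static org.apache.spark.sql.functions.*;

public class CleaningSketch {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("cleaning-sketch")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical raw dataset: orders exported from a legacy system.
        Dataset<Row> raw = spark.read()
                .option("header", "true")
                .csv("/data/raw/orders.csv");

        Dataset<Row> clean = raw
                .dropDuplicates(new String[]{"order_id"})          // remove duplicated exports
                .na().drop(new String[]{"order_id", "amount"})     // drop rows missing critical fields
                .withColumn("amount", col("amount").cast("double"))
                .filter(col("amount").gt(0))                       // negative amounts are legacy noise
                .withColumn("country", upper(trim(col("country")))); // normalize inconsistent values

        clean.write().mode("overwrite").parquet("/data/clean/orders");
        spark.stop();
    }
}
```

None of it is rocket science; it's the sheer volume of little rules like these, multiplied by every legacy source, that consumes the time.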

Complexity Resulting from Debt

Complexity often results from a lack of hygiene and cleaning up. If you just do features and never have time to improve your solutions, over time complexity and slowness start to charge a high price: not only a lack of productivity but also slow and painful maintenance (mostly during migrations), or being stuck because any change becomes too expensive, so a rewrite is needed from time to time.

Debt is everywhere: technical, management, product, and in big data as well. Debt is like a snowball in an avalanche; as time passes it never gets better, it just gets worse. The reality is that at scale you will always have something on fire and something that needs to be rewritten, so management tends to prioritize and accept that things will never be perfect. That could be fine if teams would stop the bleeding, do things properly, have proper reviews, and keep improving things as they go, but that does not happen, and the combination produces disaster over time.

Coupling, Governance, and Too much SQL

SQL per se is not bad, but at the end of the day it can become bad. If you use SQL as a high-level language, you are fine, meaning that the SQL gets converted to something low-level, as happens in Spark and Flink, for instance. But if SQL is all you do, you are going to have a problem; that problem often means coupling and comes with a lack of governance.
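A minimal sketch of what I mean by SQL as a high-level language, using Spark (the table and query are hypothetical): the SQL is just the interface, and Catalyst compiles it into an optimized low-level physical plan, which explain() prints:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class SqlAsHighLevel {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("sql-as-high-level")
                .master("local[*]")
                .getOrCreate();

        // Hypothetical cleaned dataset registered as a temp view.
        spark.read().parquet("/data/clean/orders")
                .createOrReplaceTempView("orders");

        Dataset<Row> top = spark.sql(
                "SELECT country, SUM(amount) AS total " +
                "FROM orders GROUP BY country ORDER BY total DESC");

        // The SQL above is only the surface: Catalyst turns it into an
        // optimized logical plan and then a physical plan of low-level operators.
        top.explain(true);
        spark.stop();
    }
}
```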

Analytics by nature means more coupling than regular services. There is value in augmenting data, don't get me wrong, but there is a need for some sort of domains or smaller lakes. One big massive lake where you can join anything with anything is not the answer.

Security and Privacy

Security and privacy are hard. The best way to deal with PII data is data anonymization; in that case you get the simplest solution possible with the best performance possible. However, sometimes you need to deal with PII data, and then the answer is not to have one key for the whole data lake. You need to introduce more keys, and caching can be an interesting mechanism to balance security with performance.
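As an illustration, here is a minimal sketch of pseudonymizing a PII field with an HMAC token (the key handling is hypothetical; in real life the key lives in a KMS or vault, not in code). The point is that the raw value never lands in the lake, only a stable token you can still join on:

```java
import javax.crypto.Mac;
import javax.crypto.spec.SecretKeySpec;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

public class Pseudonymize {

    // Hypothetical: in production this key comes from a KMS/vault, never a constant.
    private static final byte[] KEY =
            "replace-with-kms-managed-key".getBytes(StandardCharsets.UTF_8);

    static String token(String pii) throws Exception {
        Mac mac = Mac.getInstance("HmacSHA256");
        mac.init(new SecretKeySpec(KEY, "HmacSHA256"));
        byte[] digest = mac.doFinal(pii.getBytes(StandardCharsets.UTF_8));
        // Stable, deterministic token: safe to store and join on.
        return Base64.getEncoder().encodeToString(digest);
    }

    public static void main(String[] args) throws Exception {
        // The email never hits the lake; the token does.
        System.out.println(token("user@example.com"));
    }
}
```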

Privacy is even harder (and security is already hard). Depending on your requirements, you might need to do some extra tracking, which can be super hard. Imagine this: the user opts out of you using his data. How are you 100% sure you deleted all his data? So, first of all, is all the data from your users even linked? Are you going to pull everything archived in S3 or Glacier and load all that data just to figure out which of it needs to be erased? Dealing with privacy is possible, but if you don't start from the beginning with a solid foundation, it's very hard to introduce after the fact without massive impact on existing systems (just like security).
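One pattern that helps with opt-out (my sketch, not a prescription; all names here are hypothetical) is crypto-shredding: encrypt each user's records with a per-user key, and on opt-out delete just the key, so every copy of those records, even the ones sitting in S3 or Glacier, becomes unreadable without ever reloading the archives:

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import javax.crypto.spec.GCMParameterSpec;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.security.SecureRandom;
import java.util.Base64;
import java.util.HashMap;
import java.util.Map;

public class CryptoShredding {

    // Hypothetical key store: in production this would be a KMS/HSM, one key per user.
    private final Map<String, SecretKey> keysByUser = new HashMap<>();
    private final SecureRandom random = new SecureRandom();

    String encryptForUser(String userId, String plaintext) throws Exception {
        SecretKey key = keysByUser.computeIfAbsent(userId, id -> newKey());
        byte[] iv = new byte[12];
        random.nextBytes(iv);
        Cipher cipher = Cipher.getInstance("AES/GCM/NoPadding");
        cipher.init(Cipher.ENCRYPT_MODE, key, new GCMParameterSpec(128, iv));
        byte[] ciphertext = cipher.doFinal(plaintext.getBytes(StandardCharsets.UTF_8));
        // Prepend the IV so each record is self-contained in the lake/archive.
        byte[] record = ByteBuffer.allocate(iv.length + ciphertext.length)
                .put(iv).put(ciphertext).array();
        return Base64.getEncoder().encodeToString(record);
    }

    // Opt-out: dropping the key makes every copy of the user's records unreadable,
    // including the ones archived in S3/Glacier, without touching the archives.
    void forgetUser(String userId) {
        keysByUser.remove(userId);
    }

    private static SecretKey newKey() {
        try {
            return KeyGenerator.getInstance("AES").generateKey();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }
}
```

This is exactly the kind of foundation that is cheap on day one and brutally expensive to retrofit.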

Data Meshes and the Future

IMHO, Data Meshes are the future. Data Meshes aim for a self-service data platform with product thinking using DDD. Data Meshes aim for a different way to decentralize the big data monolith; it's all about locality and ownership. It's all about owning data and providing easy ways for teams to consume it. Data Mesh is about reversed thinking: instead of thinking about ETL and scoping (the traditional form of push and ingestion), it shifts to serving and pull models.

Data Meshes aim to see dataset domains as part of the product, and even advocate having product owners, in this case a DDPO (Domain Data Product Owner), a role that can be filled by a Data Engineer or a Data Architect. IMHO, one of the main and key aspects is ownership. Without ownership, it's very hard to have proper governance, and things tend to fall apart. Domains being part of the product means they have governance attributes: discoverable, addressable, trustworthy, self-describing, relying on open standards, and secure. Data Mesh provides guidance for architecture, and its governance is a totally different approach from the previous two waves of big data. Just doing Kappa architecture is not enough.
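To give a feel for the serving/pull side, here is a rough sketch (my illustration, not any official Data Mesh API; all names are hypothetical) of a domain exposing its dataset as an addressable, self-describing product that consumers pull from:

```java
import java.net.URI;
import java.util.List;

// Hypothetical contract for a domain-owned data product:
// addressable (address), self-describing (schema), owned (owner).
interface DataProduct<T> {
    URI address();              // stable, addressable location of the dataset
    String schema();            // self-describing schema consumers can rely on
    String owner();             // the DDPO accountable for this domain
    List<T> pull(String asOf);  // consumers pull versioned snapshots; nothing is pushed
}

class OrderEvent {
    final String orderId;
    final String country;
    final double amount;

    OrderEvent(String orderId, String country, double amount) {
        this.orderId = orderId;
        this.country = country;
        this.amount = amount;
    }
}

// Example: the orders domain serving its cleaned events as a product.
class OrdersDataProduct implements DataProduct<OrderEvent> {
    public URI address() { return URI.create("s3://mesh/orders/v1"); }
    public String schema() {
        return "{\"name\":\"OrderEvent\",\"fields\":[\"orderId\",\"country\",\"amount\"]}";
    }
    public String owner() { return "orders-ddpo@company.example"; }
    public List<OrderEvent> pull(String asOf) {
        // A real implementation would read the snapshot for the given version from storage.
        return List.of(new OrderEvent("o-1", "BR", 42.0));
    }
}
```

The shape matters more than the code: the domain team owns the contract, and consumers discover and pull it instead of some central team pushing ETL.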

Cheers,
Diego Pacheco

