Fun with Apache Kafka

Apache Kafka is a high-throughput messaging system created at LinkedIn in 2011 and written in Scala. High throughput means, in practice, that it can absorb the load you throw at it, so you can use it as a buffer to handle back pressure or as a spooling mechanism.

Nowadays people use Kafka a lot alongside other Big Data technologies such as Apache Storm, Apache Hadoop or Apache Spark. A common pattern is to read data from Kafka, process it in Spark/Storm/Hadoop and finally store it in a NoSQL database such as Cassandra. People end up using Kafka for analytics (that's what I meant by big data), monitoring, activity logs and sometimes as a building block in bigger architectures / systems. You can see more of LinkedIn's view in this post. Kafka is durable (messages can be persisted to disk), very fast, and scales to very large loads, such as 800 billion messages per day at LinkedIn. Besides LinkedIn, companies such as Twitter, Netflix, Spotify and Mozilla also use Kafka.

Kafka Overview

Kafka is pretty simple to understand, not so simple to tune :-) It is built around two concepts: message Producers, the components that write data into Kafka, and message Consumers, the components that read data from Kafka. Kafka does not implement the Java Message Service specification (a.k.a. JMS).

Kafka runs as a cluster and uses Apache ZooKeeper to distribute and coordinate the cluster's work. Apache ZooKeeper is mature and battle tested. If you use ZooKeeper yourself, it's recommended that you use the recipes from the Apache Curator project to avoid common issues.

Data in Apache Kafka is stored in Topics. Topics are split into multiple Partitions, and the Partitions are replicated across the cluster.

When you write to a Topic, you can write to multiple Partitions at the same time.
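To make the topic/partition split more concrete, here is a minimal in-memory sketch in Python. It is not Kafka's client API: the names `choose_partition` and `write_all`, the partition count of 3, and the use of CRC32 as the hash are all assumptions made for illustration. The idea it shows is that each message key is hashed to a partition, so one stream of writes fans out over several partitions while a given key always lands in the same one.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for the topic


def choose_partition(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition index.

    CRC32 is only a stand-in hash for this sketch, not what Kafka uses.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def write_all(messages, num_partitions: int = NUM_PARTITIONS):
    """Append (key, value) pairs to per-partition lists, preserving arrival order."""
    partitions = defaultdict(list)
    for key, value in messages:
        partitions[choose_partition(key, num_partitions)].append((key, value))
    return dict(partitions)


if __name__ == "__main__":
    events = [("user-1", "login"), ("user-2", "login"), ("user-1", "logout")]
    for partition, records in sorted(write_all(events).items()):
        print(partition, records)
```

Because the partition is derived from the key, all events for `user-1` stay in one partition and keep their relative order.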

One great thing is that you can read from the beginning of a Partition or from its current end. It's also possible to have consumers at different offsets of the same Partition.
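The offset idea can be sketched with a tiny in-memory model. Again, this is not the Kafka consumer API: `PartitionLog`, `Consumer`, and the `"earliest"` / `"latest"` start modes are hypothetical names chosen for this example. Each consumer simply remembers its own position in the partition, so one can replay from the beginning while another only sees new messages.

```python
class PartitionLog:
    """A toy partition: a list of messages addressed by offset."""

    def __init__(self):
        self._messages = []

    def append(self, message) -> int:
        """Add a message at the end and return the offset it was stored at."""
        self._messages.append(message)
        return len(self._messages) - 1

    def end_offset(self) -> int:
        """Offset that the next appended message will receive."""
        return len(self._messages)

    def read(self, offset: int, max_messages: int = 10):
        """Return a copy of up to max_messages starting at offset."""
        return self._messages[offset:offset + max_messages]


class Consumer:
    """A toy consumer that tracks its own offset in one partition."""

    def __init__(self, log: PartitionLog, start: str = "earliest"):
        self.log = log
        # "earliest" replays everything; "latest" starts at the current end.
        self.offset = 0 if start == "earliest" else log.end_offset()

    def poll(self, max_messages: int = 10):
        batch = self.log.read(self.offset, max_messages)
        self.offset += len(batch)  # advance only this consumer's position
        return batch


if __name__ == "__main__":
    log = PartitionLog()
    for msg in ["a", "b", "c"]:
        log.append(msg)
    replay, tail = Consumer(log, "earliest"), Consumer(log, "latest")
    log.append("d")
    print(replay.poll(), tail.poll())
```

Reading never removes anything from the log, which is why two consumers can sit at different offsets without affecting each other.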

Kafka keeps each Partition ORDERED and IMMUTABLE: new messages are always APPENDED at the END of the sequence. It's also possible to configure the number of partitions of a topic, which gives you control over the maximum parallelism of a consumer group.
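Why the partition count caps the parallelism of a consumer group can be shown with a small sketch. It assumes each partition is owned by exactly one consumer of the group at a time, and it uses a plain round-robin assignment; the function name `assign_partitions` and the strategy are assumptions for illustration, not Kafka's actual assignment code.

```python
def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Assign each partition to exactly one consumer, round-robin.

    Returns a dict mapping consumer name to the list of partitions it owns.
    """
    if not consumers:
        raise ValueError("a consumer group needs at least one consumer")
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


if __name__ == "__main__":
    # 4 partitions, 2 consumers: both consumers are busy.
    print(assign_partitions(4, ["c1", "c2"]))
    # 2 partitions, 3 consumers: the third consumer has nothing to read.
    print(assign_partitions(2, ["c1", "c2", "c3"]))
```

With more consumers than partitions, the extra consumers sit idle, so adding partitions is what raises the ceiling on how many consumers in a group can work in parallel.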

Replicas are pretty much backups of the partitions. This concept exists to prevent data loss. So you never read from or write directly to a backup replica; by default, clients talk to the partition's leader.
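The replica idea above can be pictured with a deliberately simplified model. It assumes one copy of the partition acts as the leader that clients talk to, the other copies are updated synchronously on every write, and a surviving copy takes over when the leader fails. The class `ReplicatedPartition` and its methods are hypothetical, and real replication is more involved than this sketch.

```python
class ReplicatedPartition:
    """A toy partition with one leader copy and several backup copies."""

    def __init__(self, broker_ids: list):
        if not broker_ids:
            raise ValueError("need at least one broker")
        self.logs = {broker: [] for broker in broker_ids}
        self.leader = broker_ids[0]  # clients only ever talk to this copy

    def write(self, message) -> None:
        """Append to the leader, then copy the message to every backup."""
        self.logs[self.leader].append(message)
        for broker, log in self.logs.items():
            if broker != self.leader:
                log.append(message)

    def read(self) -> list:
        """Read the messages from the leader copy."""
        return list(self.logs[self.leader])

    def fail_leader(self) -> None:
        """Drop the leader and promote the next surviving copy."""
        del self.logs[self.leader]
        if not self.logs:
            raise RuntimeError("no replica left, data is lost")
        self.leader = next(iter(self.logs))


if __name__ == "__main__":
    partition = ReplicatedPartition([1, 2, 3])
    partition.write("a")
    partition.write("b")
    partition.fail_leader()
    print(partition.leader, partition.read())
```

After the leader on broker 1 fails, the copy on broker 2 already holds the same messages, which is the sense in which replicas prevent data loss.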

Having Fun with Kafka 

Diego Pacheco
