Kafka for Mere Mortals #2

I recommend to start with the previous part of the series before going ahead unless you have already read it.

Kafka is designed as a distributed system to achieve horizontal scalability, high availability (HA), and data redundancy. Horizontal scalability means we can keep adding more and more nodes to a cluster to handle virtually any imaginable load1. High availability describes a system that is always operational even if some of its components fail or get disconnected. Data redundancy is employed to ensure there is no data loss or corruption.

Although it is possible2 to create a single node Kafka cluster you would not typically do it in production. Cluster size is described by a simple formula3: $$n = 2f + 1$$ where f is the number of node failures a cluster can survive. A simple consequence of that is the minimum viable cluster size is 3.

As we know messages in Kafka are written to and read from topics. Topics can be divided into a number of partitions4 distributed across cluster nodes. Partitioning enables parallel processing within a single topic5. This plus cluster horizontal scalability enables virtually uncapped throughput. There is nothing wrong with single partition topics though. As we know every message in a topic (or more precisely in a partition) has an assigned offset. Order of the offsets is guaranteed for a single partition topic. However, this guarantee can not be held across partitions. As a consequence, single partition topics are a good match when order of processing is more important than scalability.

Loosing access to a partition if a node fails would be unacceptable. Replication makes data loss less likely. It is nothing more than having multiple copies (replicas) of a partition held on different nodes6. The number of replicas is defined by topic’s replication factor. The default value for small clusters is usually equal to the cluster size. In bigger clusters topics are usually replicated into a subset of nodes due to costs of replication (network bandwidth, storage).

Kafka’s depends on ZooKeeper which provides distributed coordination for systems build upon it. Kafka uses it for various tasks like topics configuration, cluster membership, and other beyond the scope of this post. ZooKeeper is also used by several other distributed applications so having its cluster deployed makes adoption of technologies like Hadoop easier.

We’ve barely scratched the surface here but it should be enough for know. In the next post I would like to focus on the practical side of things and develop a simple .NET publisher-subscriber app.

  1. It can be massive. This is what it was at LinkedIn in 2016. ↩︎

  2. And can be used in local development for basic use cases. ↩︎

  3. The formula comes from node majority quorum where a cluster can operate as long as a quorum greater than 50% of its nodes can be formed. ↩︎

  4. The topic of one partition is the simplest case and implies serial processing. ↩︎

  5. Independent topics are processed in parallel regardless ↩︎

  6. If it’s not obvious - it would not make much sense having a partition replicated twice in the same node as if the node fails both copies are lost. ↩︎