Kafka for Mere Mortals #1

This series of posts is an attempt to explain what Apache Kafka is and to show some benefits of using it side by side with a relational database in a .NET (or really any) distributed solution. I am by no means expert in any of those technologies so take all of that with a pinch of salt. I am assuming you will have a basic knowledge of relational databases and some experience in building applications.

All the years spent on working with SQL databases have taught me there are quite a few aspects of data architecture that have a tremendous influence on systems' performance and development process. The following I found important to my further introduction to Apache Kafka:

The key factor here is how a particular piece of data will be utilized. Will it be queried a lot in a real-time like application (so read performance is the key)? Or maybe just used for some offline analysis? In my experience we often would want the same data written in a few different forms solving various business or performance issues. From a developer point of view it was a bit annoying as it required designing schemas for a couple of new tables, adding SQL table creation scripts, some corresponding C# classes, and ORM mappings. This is a lot of overhead and requires an upfront knowledge which bits of data are important for a particular consumer. What if we could postpone these choices and design as we go?

How long do you need to store data in your production databases for? There is no single answer to this one. This may be constrained by legal requirements but often we keep lots of data that otherwise could be archived and it’s not needed for system’s current operational work. Excessive data in the system often cripples performance.

Once raw data is gathered it often needs to be transformed into some other forms involving aggregations which are often (but not only) performance optimizations. Let’s say we derive some metrics for many sets of data. How easy is to reprocess them in case we spotted a logic error that has been affecting those metrics for last two weeks?

Data flow depends a lot on overall system architecture. In the last few decades we’ve seen a lot of various technologies and architectures providing data circulation, storage, and systems' integration: web services of many flavours (RPC, REST), micro services, event buses, queueing systems, relational and No-SQL databases, and endless hybrids of all the above. SQL databases have been well understood, known as a good multi-purpose tool, and used across many industries. However, all the new tools could do what a multi-tool could not. They specialized in their very areas and thanks to targeted optimizations could provide superior performance. Apache Kafka is one of them.

Kafka is a persistent distributed data store which serves as a very fast integration bus using the publisher-subscriber model. Kafka’s data is organized into topics (aka streams) containing messages which usually are events and commands. Now let’s have a bit closer look into it.

Conceptually topic is a very simple data structure:

Offset Data
6364 { }
6365 { }
6366 { }

Figure 1: A conceptual representation of topic

This is pretty similar to a queue with the following assumptions:

  • Identified by a name.
  • Persisted to disk.
  • Offset is a sequence of numbers determining order of messages.
  • Data is an arbitrary1 content of message.
  • Messages are immutable (once inserted can not be updated).
  • Messages can be read multiple times.
  • Messages expire after a common configurable period and then get deleted.

There are a few more important characteristics but we’ll look into them later. Summing it up you can think about topics as ordered append-only logs of messages with a certain retention time.

Now having an idea what topic is imagine all data in Kafka is organized into topics. Topics have several properties with initial values given by server defaults. The defaults can be overridden per topic on creation or altered afterwards. A common one would be to customize default retention time. By default topics must be created before any data can be written into them and once created cannot be deleted. These both behaviors can be changed i.e. server can be configured to enable auto-creation and deletion of topics.

Kafka has its instances and these are called brokers. A few brokers can work together forming a cluster. Brokers in a cluster are called nodes. The main purposes of clustering are scalability, high availability, and data redundancy. It will be discussed more in the next part of the series but for now it can be simplified to the following: the bigger cluster is the more load (writes and reads) it can handle, and the more nodes can fail (or disconnect) without affecting overall cluster performance and data consistency. Due to Kafka design a minimal cluster should have no less than 3 nodes (it will be explained later).

I mentioned Kafka uses the publisher-subscriber model. In Kafka’s lingo the main actors are called producers and consumers. For now we only need to know producers write data into topics and consumers read it. We’ll discuss it further with far more details in the next post.

There are common misconceptions about Kafka. The most popular are:

  1. Kafka is like database. Not exactly. It does not allow querying the way SQL database does.
  2. Kafka is like message queue. Not exactly. Its data is persisted in a reliable manner and messages can be read many times.

Ok, it’s all cute but what you can do with it? Well, surprisingly nothing fancy. You can write and read data. But it can be loads of data and you can do it super fast. Your data is safely persisted on disk and disappears when you no longer need it2. Being a speed demon and supporting publisher-subscriber model Kafka is an excellent fit for near real time, event-driven, and reactive systems. It can become the data pipeline of a whole solution providing a single source of truth (vide Event Sourcing + CQRS patterns) in a raw form that then can be digested and transformed in any arbitrary way (possibly even unknown at the time data is written). Unlike in messaging queues when a message is read from a topic it stays where it was so it can be easily read again many times which gives us a dead-simple way of reprocessing the same data over and over (e.g. experimenting with different versions of code).

Now you should have an idea what Kafka is. This is just a rough sketch though. That’s why there will be a series of posts. In the next one I will focus on the distributed nature of Kafka.

  1. In theory it can be really any content - binary, JSON, YAML, you name it. However, in real systems it’s beneficial to agree upon a single format and organize your blobs with a set of schemas. One of the most common ways to tackle this problem is to have a schema server which keeps all the schemas across the organization. This way both programs and humans can understand and use all data stored in Kafka. ↩︎

  2. It disappears but you - as a sensible person - archive all the data flowing through Kafka into an external store (like SQL Server or Hadoop) where it can be kept, analysed, etc as long as you want. ↩︎