“There were lots of databases and other systems built to store data, but what was missing in our architecture was something that would help us to handle the continuous flow of data”.
We’ve come to think of Kafka as a streaming platform: a system that lets you publish and subscribe to streams of data, store them, and process them, and that is exactly what Apache Kafka is built to be.
Kafka is often compared to a couple of existing technology categories: enterprise messaging systems, big data systems like Hadoop, and data integration or ETL tools. Each of these comparisons has some validity but also falls a little short.
- Kafka is like a messaging system in that it lets you publish and subscribe to streams of messages. However, Kafka has a number of core differences from traditional messaging systems that make it another kind of animal entirely. First, it works as a modern distributed system that runs as a cluster and can scale to handle all the applications in even the most massive of companies. Second, Kafka is a true storage system built to store data as long as you might like. Finally, the world of stream processing raises the level of abstraction quite significantly.
- Another view of Kafka is to think of it as a kind of real-time version of Hadoop. Hadoop lets you store and periodically process file data at a very large scale. Kafka lets you store and continuously process streams of data, also at a large scale. What this comparison misses is that the use cases that continuous, low-latency processing opens up are quite different from those that naturally fall to a batch processing system.
- The final area Kafka gets compared to is ETL or data integration tools.
Publish/Subscribe Messaging
Before discussing the specifics of Kafka, it is important to understand the concept of publish/subscribe messaging and why it matters. Publish/subscribe messaging is a pattern that is characterized by the publisher of a piece of data not specifically directing it to a receiver. Instead, the publisher classifies the message somehow, and the subscriber subscribes to receive certain classes of messages. To facilitate this, pub/sub systems often have a broker, a central point where messages are published.
The unit of data within Kafka is called a message. A message is simply an array of bytes as far as Kafka is concerned, so the data contained within it does not have a specific format or meaning to Kafka. A message can have an optional bit of metadata, which is referred to as a key. The key is also a byte array and, as with the message, has no specific meaning to Kafka. Keys are used when messages are to be written to partitions in a more controlled manner. The simplest such scheme is to generate a consistent hash of the key, and then select the partition number for that message by taking the result of the hash modulo the total number of partitions in the topic. This assures that messages with the same key are always written to the same partition.
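The hash-modulo scheme can be sketched in a few lines. This is an illustration of the idea, not Kafka's actual default partitioner (which uses a murmur2 hash); CRC32 stands in here simply as a deterministic, consistent hash:

```python
# Illustrative sketch of key-based partitioning (not Kafka's real partitioner):
# hash the key consistently, then take the result modulo the partition count.
import zlib

def partition_for(key: bytes, num_partitions: int) -> int:
    """Map a message key to a partition number deterministically."""
    return zlib.crc32(key) % num_partitions

# The same key always maps to the same partition...
assert partition_for(b"customer-42", 6) == partition_for(b"customer-42", 6)
# ...and the result is always a valid partition number.
assert 0 <= partition_for(b"order-7", 4) < 4
```

Note that the mapping only stays stable while the partition count is fixed; adding partitions to a topic changes the modulo result for existing keys.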
For efficiency, messages are written into Kafka in batches. A batch is just a collection of messages, all of which are being produced to the same topic and partition. An individual roundtrip across the network for each message would result in excessive overhead, and collecting messages together into a batch reduces this. Of course, this is a tradeoff between latency and throughput: the larger the batches, the more messages that can be handled per unit of time, but the longer it takes an individual message to propagate. Batches are also typically compressed, providing more efficient data transfer and storage at the cost of some processing power.
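The latency/throughput tradeoff can be made concrete with a toy batcher. This is a hedged sketch, not the Kafka client's actual implementation (which also batches by size in bytes and by a time limit, `linger.ms`): messages accumulate until the batch is full, then are flushed together in one notional round trip.

```python
# Toy batcher illustrating amortized round trips (not the real Kafka producer).
class Batcher:
    def __init__(self, max_batch_size: int):
        self.max_batch_size = max_batch_size
        self.pending = []   # messages waiting to be sent
        self.flushed = []   # batches already "sent over the network"

    def append(self, message: bytes) -> None:
        self.pending.append(message)
        if len(self.pending) >= self.max_batch_size:
            self.flush()

    def flush(self) -> None:
        # One flush = one network round trip, regardless of batch size.
        if self.pending:
            self.flushed.append(self.pending)
            self.pending = []

b = Batcher(max_batch_size=3)
for i in range(7):
    b.append(f"msg-{i}".encode())
b.flush()  # send any stragglers
# Seven messages cost three round trips instead of seven.
assert [len(batch) for batch in b.flushed] == [3, 3, 1]
```

The last, partial batch shows the latency cost: the final message waits until something forces a flush.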
While messages are opaque byte arrays to Kafka itself, it is recommended that additional structure, or schema, be imposed on the message content so that it can be easily understood.
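In practice this means producers and consumers agree on a serialization format out of band. A minimal sketch using JSON from the standard library (Avro and Protobuf are also common choices, typically paired with a schema registry):

```python
# Kafka only ever sees the byte arrays; the schema lives in the clients.
import json

def serialize(event: dict) -> bytes:
    """Encode an event to the bytes that would become the Kafka message value."""
    return json.dumps(event, sort_keys=True).encode("utf-8")

def deserialize(raw: bytes) -> dict:
    """Decode the message bytes back into a structured event."""
    return json.loads(raw.decode("utf-8"))

event = {"user_id": 42, "action": "login"}
assert deserialize(serialize(event)) == event
```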
Messages in Kafka are categorized into topics. The closest analogies for a topic are a database table or a folder in a filesystem. Topics are additionally broken down into a number of partitions. Going back to the “commit log” description, a partition is a single log. Messages are written to it in an append-only fashion, and are read in order from beginning to end. Note that as a topic typically has multiple partitions, there is no guarantee of message time-ordering across the entire topic, just within a single partition. Partitions are also the way that Kafka provides redundancy and scalability. Each partition can be hosted on a different server, which means that a single topic can be scaled horizontally across multiple servers to provide performance far beyond the ability of a single server.
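The append-only log behavior of a single partition can be sketched as follows. This is a conceptual model only (a hypothetical in-memory stand-in, not Kafka's on-disk log format): each append assigns the next offset, and reads return messages in offset order.

```python
# Minimal in-memory model of one partition as an append-only log.
class PartitionLog:
    def __init__(self):
        self.messages = []

    def append(self, message: bytes) -> int:
        """Append a message and return the offset it was assigned."""
        self.messages.append(message)
        return len(self.messages) - 1

    def read_from(self, offset: int) -> list:
        """Read all messages from the given offset onward, in order."""
        return self.messages[offset:]

log = PartitionLog()
assert log.append(b"first") == 0   # offsets start at zero...
assert log.append(b"second") == 1  # ...and only ever increase
assert log.read_from(1) == [b"second"]
```

Nothing is ever overwritten or reordered, which is why a consumer's position in the partition can be captured by a single offset.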
Kafka clients are users of the system, and there are two basic types: producers and consumers. There are also advanced client APIs: the Kafka Connect API for data integration and Kafka Streams for stream processing. The advanced clients use producers and consumers as building blocks and provide higher-level functionality on top.
Producers create new messages. In general, a message will be produced to a specific topic. By default, the producer does not care what partition a specific message is written to and will balance messages over all partitions of a topic evenly. In some cases, the producer will direct messages to specific partitions. This is typically done using the message key and a partitioner that will generate a hash for the key and map it to a specific partition. This assures that all messages produced with a given key will get written to the same partition. The producer could also use a custom partitioner that follows other business rules for mapping messages to partitions.
Consumers read messages. The consumer subscribes to one or more topics and reads the messages in the order in which they were produced. The consumer keeps track of which messages it has already consumed by keeping track of the offset of messages. The offset is another bit of metadata, an integer value that continually increases, that Kafka adds to each message as it is produced. Each message in a given partition has a unique offset. By storing the offset of the last consumed message for each partition, a consumer can stop and restart without losing its place. Consumers work as part of a consumer group, which is one or more consumers that work together to consume a topic. The group assures that each partition is only consumed by one consumer. The mapping of a consumer to a partition is often called ownership of the partition by the consumer. In this way, consumers can horizontally scale to consume topics with a large number of messages. Additionally, if a single consumer fails, the remaining members of the group will rebalance the partitions being consumed to take over for the missing member.
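The ownership and rebalancing idea can be sketched with one possible assignment function, similar in spirit to a round-robin strategy (Kafka ships several pluggable assignment strategies; this is an illustration, not the actual protocol):

```python
# Hedged sketch of group assignment: each partition is owned by exactly one
# consumer, spread as evenly as possible across the group's members.
def assign_partitions(partitions: list, consumers: list) -> dict:
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(sorted(partitions)):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# Four partitions shared by a group of two consumers:
assignment = assign_partitions([0, 1, 2, 3], ["consumer-a", "consumer-b"])
assert assignment == {"consumer-a": [0, 2], "consumer-b": [1, 3]}

# If consumer-b fails, a rebalance hands its partitions to the survivor:
assert assign_partitions([0, 1, 2, 3], ["consumer-a"]) == \
    {"consumer-a": [0, 1, 2, 3]}
```

Because every partition has exactly one owner within the group, each message is processed once by the group as a whole, yet the work scales out with the number of consumers (up to the number of partitions).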
A single Kafka server is called a broker. The broker receives messages from producers, assigns offsets to them, and commits the messages to storage on disk. It also services consumers, responding to fetch requests for partitions and responding with the messages that have been committed to disk. Depending on the specific hardware and its performance characteristics, a single broker can easily handle thousands of partitions and millions of messages per second.
Kafka brokers are designed to operate as part of a cluster. Within a cluster of brokers, one broker will also function as the cluster controller (elected automatically from the live members of the cluster). The controller is responsible for administrative operations, including assigning partitions to brokers and monitoring for broker failures. A partition is owned by a single broker, and that broker is called the leader of the partition. A partition may be assigned to multiple brokers, which will result in the partition being replicated. This provides redundancy of messages in the partition, such that another broker can take over leadership if there is a broker failure. However, all consumers and producers operating on that partition must connect to the leader.
Why Kafka?
Kafka is able to seamlessly handle multiple producers, whether those clients are using many topics or the same topic.
In addition to multiple producers, Kafka is designed for multiple consumers to read any single stream of messages without interfering with each other. This is in contrast to many queuing systems where once a message is consumed by one client, it is not available to any other. Multiple Kafka consumers can choose to operate as part of a group and share a stream, assuring that the entire group processes a given message only once.
Not only can Kafka handle multiple consumers, but durable message retention means that consumers do not always need to work in real time. Messages are committed to disk, and will be stored with configurable retention rules. These options can be selected on a per-topic basis, allowing for different streams of messages to have different amounts of retention depending on the consumer needs.
Kafka’s flexible scalability makes it easy to handle any amount of data. Producers, consumers, and brokers can all be scaled out to handle very large message streams with ease.