Get started with Kafka

Jun 14, 2020 09:08 · 1374 words · 7 minute read stream processing big-data kafka

Today, we’ll do a brief intro to Apache Kafka. Kafka is a message broker (you’ll also often read event streaming system or platform) developped at LinkedIn in 2011 and open sourced soon after.

There were already quite a few message brokers available at the time but none of them were a good fit for LinkedIn’s use cases (we’ll discuss their uses cases in a second) as they were transitioning from offline batch processing to real time event streaming.

Most of the message brokers available at the time (ActiveMQ, RabbitMQ, etc.) were providing quite a few features that proved to be unnecessary for LinkedIn and that, when removed, allowed Kafka to deliver 5 times more messages per second. Apache Kafka grew way beyond LinkedIn and is now used at scale by more than 80% of all Fortune 100 companies.

This post is the first of a series and it will give you an overview of what is Apache Kafka and how it is used. We’ll start by its use cases and then briefly describe its main concepts and its architecture.

Kafka use cases

LinkedIn started working on Kafka for implementing a few real time feedback loops:

  • advertising: showing the best ads based on real time click-through rate
  • content discovery optimisation: optimising search results sorting and show things like similar profiles, jobs, and companies based on your user profile and real time click-through rate of people like you
  • user facing analytics: things like who viewed your profile and who searched for your profile
  • monitoring and security: detecting spammers and people violating the website terms of service
  • analytics: Kafka as a source to feed different offline batch processing jobs.

Kafka’s popularity also increased with the increased popularity of the microservices architecture where applications moved from a single monolithic application to a suite of services that can be deployed and scaled independently. Following this approach, Kafka can be used for facilitating the decoupling of services. One service can produce data to a topic and any other service can consume from the data produced to this topic. The only thing they need to agree on is the schema of the data. An example of this approach is the following,

  1. Service A extracts data from an external data provider and produces data to Topic X
  2. Service B consumes data from Topic X, cleans and dedupes the data and produces the transformed data to Topic Y
  3. Service C consumes data from Topic Y, applies a machine learning model to predict a given behaviour and produces the data to Topic Z

Topic Z can then be used as a data input to many other services that will rely on the generated data or can be used to populate a datastore exposed through an internal or external API.

These two examples give you a quick overview of the differents things that can be achieved with Apache Kafka but there are many more. Kafka documentation lists some of them and gives you a good description.

OK… now that we’ve got a better idea of when to use Kafka, we’ll explore its key concepts and have a quick look at its architecture.

Kafka core concepts

Kafka has been designed to have the lowest overhead possible for data processing. Its design is relatively simple and there are just a few key concepts to grasp. We’ll explore them below.

Topic

Producers send data to topics and consumers read data from topics.

Kafka topics maintain a log. The log is an ordered, immutable list of messages.

Topic
+------------------------------+
|  log                         |
|  +---+---+---+---+---+ - +   |
|  | 0 | 1 | 2 | 3 | 4 | 5 |   |
|  +---+---+---+---+---+ - +   |
|                              |
+------------------------------+

Partitions

The log for a kafka topic is divided into multiple partitions. They serve as a way to balance the load.

Topic
+------------------------------+
|  log                         |
|                              |
|  Partition 1                 |
|  +---+---+---+---+---+ - +   |
|  | 0 | 1 | 2 | 3 | 4 | 5 |   |
|  +---+---+---+---+---+ - +   |
|                              |
|  Partition 2                 |
|  +---+---+ - +               |
|  | 0 | 1 | 2 |               |
|  +---+---+ - +               |
|                              |
|  Partition 3                 |
|  +---+---+---+ - +           |
|  | 0 | 1 | 2 | 3 |           |
|  +---+---+---+ - +           |
|                              |
+------------------------------+
Old -----------------------> New

Each record in a partition has a sequential and unique identifier called offset.

Producers

A Producer is a client (usually an application) that sends messages to the kafka cluster.

The data is written to a kafka topic. For each message,

  • if no key is specified, the producer determines the partition to write to in a round-robin scheduling (circular order). First message goes to partition 1, next message to partition 2, and so on.
  • if a key is specified, the producer determines the partition based on a deterministic hash of the key. If the same key is produced again, it will be sent to the same partition again.

Consumers

A consumer is the application that reads the messages.

Consumers read data from topics partitions and keep track of their progress storing their current offset. You only need to reset a consumer offset to make it reprocess the same messages again.

Messages are read in order (think of a log processing). You can have any number of consumers for a topic and they can all process the same records.

Consumer groups

By default, all consumers will consume all records.

You can place consumers into consumers groups to avoid two instances processing the same record. Each record will be consumed by exactly one consumer per consumer group.

                            Topic
                            +------------------------------+
                            |  log                         |
                            |                              |           Consumer group 1
                            |  Partition 1                 |           +----------------+
                            |  +---+---+---+---+---+ - +   |           | +------------+ |
          +------------------> | 0 | 1 | 2 | 3 | 4 | 5 |  <--------------| consumer 1 | |
          |                 |  +---+---+---+---+---+ - +   |           | +------------+ |
          |                 |                              |           |                |
          |                 |  Partition 2                 |           | +------------+ |
+----------+                |  +---+---+ - +               |     +-------| consumer 2 | |
| producer |-----------------> | 0 | 1 | 2 | <-------------------+     | +------------+ |
+----------+                |  +---+---+ - +               |           +--|-------------+
          |                 |                              |              |
          |                 |  Partition 3                 |              |
          |                 |  +---+---+---+ - +           |              | 
          +------------------> | 0 | 1 | 2 | 3 | <------------------------+ 
                            |  +---+---+---+ - +           |
                            |                              |
                            +------------------------------+

Note that consumer 1 consumes records from the partition 1 and consumer 2 consumes records from the partitions 2 and 3, so each will be consumed only once by consumer group 1.

If you have more consumers than partitions, some of the consumers will be idle.

Kafka architecture

We know have an idea of the main Kafka concepts (and its main APIs). Let’s have a look at its architecture. It will be quite short as there is not much to explain there.

Brokers

Brokers are the servers that make a kafka cluster.

Kafka cluster
+--------------------------------------------------------+
|                                                        |
|  +--------------+  +--------------+  +--------------+  |
|  |   Broker 1   |  |   Broker 2   |  |   Broker 3   |  |
|  | (controller) |  |              |  |              |  |
|  |   zookeeper  |  |   zookeeper  |  |   zookeeper  |  |
|  +--------------+  +--------------+  +--------------+  |
|                                                        |
+--------------------------------------------------------+

The controller

In a kafka cluster, one broker exactly is designated as the controller. The controller coordinates the process of assigning partitions and data replicas to nodes in the cluster.

Zookeeper

Zookeeper is a centralised service for maintaining the cluster configuration information.

Kafka uses Zookeeper for the following tasks:

  1. detecting the addition and the removal of brokers and consumers
  2. triggering a rebalance process in each consumer when the above events happen
  3. maintaining the consumption relationship and keeping track of the consumed offset of each partition.

Conclusion

And… that’s it! I hope you now have a better idea of what Apache Kafka is and when to consider using it. We’ll get our hands dirty in the part 2 of this series. See you there!

References

tweet Share