System Design

How Kafka Works

The log-based distributed system that powers Netflix, LinkedIn, and Uber's data pipelines.

  • 7 trillion messages/day: LinkedIn throughput
  • < 10 ms end-to-end: typical latency
  • 1 GB per segment file: default segment size
  • 1 MB: default max message size (configurable)

The Core Idea: It's a Distributed Log

Kafka is fundamentally an append-only distributed log, not a message queue. Unlike RabbitMQ where messages are deleted after consumption, Kafka retains all messages on disk for a configurable period. Any consumer can read any message, any number of times, as long as it's within the retention window.

💡 Think of Kafka like a database commit log — ordered, immutable, durable. Consumers are just reading positions in the log. The log doesn't care who reads it or how many times.

Data Flow Diagram

[Diagram: a Producer writes to topic "orders" (partitions P0, P1, P2) hosted on the cluster's broker(s); a Consumer Group of consumers C1, C2, C3 reads from those partitions in parallel.]
Step 1: Producer (sends records to a topic)

The producer sends a record (key + value + headers) to a named Topic. Kafka's producer client batches records, compresses them, and sends them to the leader broker for the target partition.

Producers choose which partition to write to — either round-robin (no key), by key hash (consistent routing), or via a custom partitioner.

producer.send(new ProducerRecord<>(
  "orders",        // topic
  orderId,         // key → determines partition
  orderJson        // value
));
Step 2: Topic & Partitions (the unit of parallelism)

A Topic is a named stream, split into N Partitions. Each partition is an ordered, immutable log. Partitions are the unit of parallelism — more partitions = more throughput, but also more overhead.

Records with the same key always go to the same partition, preserving order per key. Partition count is set at topic creation and determines max consumer parallelism.

# Create topic: 6 partitions, 3 replicas
kafka-topics.sh --create \
  --topic orders \
  --partitions 6 \
  --replication-factor 3
Step 3: Broker & Leader (stores & replicates data)

Each partition has one Leader broker that handles all reads/writes, and N-1 Follower brokers that replicate its data. The cluster Controller (coordinated through ZooKeeper historically, or via KRaft in newer versions) manages leader elections.

Followers are kept in sync via In-Sync Replicas (ISR). Producers can configure acks=all to wait for all ISR replicas before acknowledging a write.

# Partition 0 of "orders":
# Leader  → broker-2
# Replica → broker-1
# Replica → broker-4  (ISR)
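A durability-oriented producer configuration can be sketched with plain java.util.Properties. The config key names ("acks", "enable.idempotence", "retries") are Kafka's real producer settings; in practice this Properties object would be handed to a KafkaProducer, which is omitted here since it needs a live cluster.

```java
import java.util.Properties;

// Sketch of producer settings that wait on the full ISR before acking.
public class DurableProducerConfig {
    static Properties build() {
        Properties props = new Properties();
        props.put("acks", "all");                // wait for every in-sync replica
        props.put("enable.idempotence", "true"); // broker de-duplicates retried sends
        props.put("retries", Integer.toString(Integer.MAX_VALUE));
        return props;
    }

    public static void main(String[] args) {
        System.out.println(build());
    }
}
```

With acks=all, a write is only acknowledged once every replica in the ISR has it, trading latency for durability.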
Step 4: Consumer Group (reads in parallel)

Consumers in the same group each get assigned a subset of partitions. Each partition is consumed by exactly ONE member of a group, ensuring each record is processed once per group.

Multiple independent groups can read the same topic simultaneously, each maintaining their own offset. This enables fan-out without data duplication.

# 3 consumers in "payment-service" group
# with 6 partitions:
# consumer-1 → partitions [0, 1]
# consumer-2 → partitions [2, 3]
# consumer-3 → partitions [4, 5]
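The assignment listed above (consumer-1 owning [0, 1], and so on) mirrors Kafka's range-style assignor: contiguous blocks of partitions per consumer, with any remainder going to the first consumers. A minimal in-memory sketch of that division:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of range-style partition assignment: each consumer gets a
// contiguous block of partitions; leftovers go to the first consumers.
public class RangeAssign {
    static List<List<Integer>> assign(int partitions, int consumers) {
        List<List<Integer>> result = new ArrayList<>();
        int base = partitions / consumers;   // block size everyone gets
        int extra = partitions % consumers;  // first `extra` consumers get one more
        int next = 0;
        for (int c = 0; c < consumers; c++) {
            int count = base + (c < extra ? 1 : 0);
            List<Integer> owned = new ArrayList<>();
            for (int i = 0; i < count; i++) owned.add(next++);
            result.add(owned);
        }
        return result;
    }

    public static void main(String[] args) {
        // 6 partitions over 3 consumers -> [[0, 1], [2, 3], [4, 5]]
        System.out.println(assign(6, 3));
    }
}
```

Running assign(6, 10) with this sketch hands one partition each to the first six consumers and empty lists to the last four, which is why extra consumers beyond the partition count sit idle.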
Step 5: Offset Tracking (position in the log)

An offset is a monotonically increasing integer per partition. Consumers commit their offset to the __consumer_offsets internal topic after processing, enabling resume-from-last-position on restart.

Committing before processing → at-most-once. Committing after → at-least-once. Transactional APIs enable exactly-once semantics.

// Manual offset commit (at-least-once)
ConsumerRecords<String, String> records =
    consumer.poll(Duration.ofMillis(100));
records.forEach(record -> process(record));
consumer.commitSync(); // commit only after the batch is processed
Step 6: Retention (log-based, not queue-based)

Unlike traditional queues, Kafka does NOT delete records when consumed. Data is retained for a configurable time (7 days default) or size. Consumers can re-read old data by seeking to any offset.

This is Kafka's superpower: new consumers can replay history, and the same data can power analytics pipelines, caches, and operational systems independently.

# Retention config per topic
retention.ms=604800000       # 7 days
retention.bytes=53687091200  # 50 GB (value is in bytes)
# Or log compaction (keeps latest per key)
cleanup.policy=compact
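Compaction semantics can be modeled without a broker: replay the log left to right and keep only the latest value per key. A minimal sketch, assuming String keys and values:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// In-memory model of log compaction: the latest value per key survives.
public class Compaction {
    static Map<String, String> compact(String[][] log) {
        // LinkedHashMap keeps first-seen key order, like a compacted segment
        Map<String, String> latest = new LinkedHashMap<>();
        for (String[] record : log) {
            latest.put(record[0], record[1]); // later values overwrite earlier ones
        }
        return latest;
    }

    public static void main(String[] args) {
        String[][] log = {
            {"user-1", "addr=A"},
            {"user-2", "addr=B"},
            {"user-1", "addr=C"},  // supersedes the first user-1 record
        };
        System.out.println(compact(log)); // {user-1=addr=C, user-2=addr=B}
    }
}
```

This is why compacted topics make good changelogs: a new consumer can rebuild current state per key without replaying every historical update.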

Log, Not Queue

Records are retained after consumption. Any consumer can replay. This enables event sourcing, audit logs, and CDC pipelines.

Scale via Partitions

More partitions = more parallelism. But more partitions also means more leader elections and higher rebalance cost. Start with 3-6 per topic.

OS Page Cache

Kafka doesn't maintain its own cache. It relies on the OS page cache. This means faster consumers serve from RAM, slower ones hit disk — all transparently.

The Engineering of Apache Kafka: Reimagining the Distributed Log

Traditional message queues (like RabbitMQ or ActiveMQ) treat messages as ephemeral: once a message is consumed and acknowledged, it is deleted from the server. Apache Kafka fundamentally rejected this design. Instead of a transient queue, Kafka is engineered as an immutable, append-only, distributed commit log. Messages are durably written to disk and retained for days, weeks, or even years, allowing consumers to rewind and replay history at will.


Part 1: The Physics of the Disk

Kafka was built by LinkedIn engineers to handle trillions of events per day. To achieve this, it relies on a counterintuitive hardware truth: sequential disk I/O on a modern HDD/SSD is dramatically faster than random disk I/O, and in raw throughput can even rival access to main memory (RAM).

Because Kafka topics are append-only logs, every single write operation is purely sequential. The disk never has to physically seek or fragment data; it just streams bytes to the end of the file. To maximize this efficiency, Kafka leverages the Linux OS Page Cache. It does not maintain its own in-process memory cache (which would trigger brutal Java Garbage Collection pauses). Instead, it leaves all caching to the kernel.

When reading data, Kafka uses the sendfile() system call (zero-copy). Data is DMA-transferred directly from the Linux Page Cache to the network socket without ever being copied into Kafka's user-space application memory. This is why a single broker can saturate a 10-gigabit network link with very little CPU.
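The same primitive is exposed in Java as FileChannel.transferTo, which is what Kafka's broker uses; on Linux it delegates to sendfile() when the target is a socket channel. A minimal sketch (the in-memory sink here is just a stand-in for a socket, so this particular run does not hit the sendfile fast path):

```java
import java.io.ByteArrayOutputStream;
import java.nio.channels.Channels;
import java.nio.channels.FileChannel;
import java.nio.channels.WritableByteChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Demonstrates the zero-copy transfer API: file bytes flow to the
// destination channel without passing through a user-space buffer.
public class ZeroCopy {
    static long transfer(Path file, WritableByteChannel dst) throws Exception {
        try (FileChannel src = FileChannel.open(file, StandardOpenOption.READ)) {
            // With a socket target on Linux, this compiles down to sendfile()
            return src.transferTo(0, src.size(), dst);
        }
    }

    public static void main(String[] args) throws Exception {
        Path log = Files.createTempFile("segment", ".log");
        Files.writeString(log, "record-1\nrecord-2\n");
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        System.out.println(transfer(log, Channels.newChannel(sink)) + " bytes transferred");
        Files.delete(log);
    }
}
```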

Part 2: Partitions (The Unit of Parallelism)

A single log running on a single server is an inevitable bottleneck. Kafka solves this by shattering a Topic into multiple Partitions.

If you create an "orders" topic with 6 partitions, Kafka creates 6 distinct log files distributed across multiple different broker machines. When a Producer sends a message, it includes a "Key" (e.g., the CustomerID). Kafka calculates the hash of this key (Hash(CustomerID) % 6) to determine which exact partition the message lands in.

This routing mechanism guarantees that all events for a specific Customer are strictly ordered within the same physical partition, while simultaneously allowing the overall topic throughput to scale horizontally across the cluster.
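The routing rule above can be sketched in a few lines. Note this is a simplified stand-in: Kafka's default partitioner actually hashes the serialized key bytes with murmur2, not Java's hashCode, but the hash-then-modulo routing is the same idea.

```java
// Simplified key-to-partition routing sketch (real Kafka uses murmur2
// over the serialized key bytes, not hashCode).
public class KeyRouting {
    static int partitionFor(String key, int numPartitions) {
        // Mask off the sign bit so the result is non-negative,
        // then take the hash modulo the partition count.
        return (key.hashCode() & 0x7fffffff) % numPartitions;
    }

    public static void main(String[] args) {
        // The same customer always lands on the same partition,
        // so that customer's events stay ordered within it.
        System.out.println(partitionFor("customer-42", 6));
        System.out.println(partitionFor("customer-42", 6));
    }
}
```

The determinism is the point: adding producers does not break per-key ordering, because every producer computes the same partition for the same key. (Changing the partition count does break it, which is why partition counts are chosen up front.)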

Part 3: Consumer Groups and Rebalancing

Reading from a massive, high-velocity topic requires multiple machines working in parallel. Kafka orchestrates this using Consumer Groups.

The Golden Rule of Consumer Groups is: Every Partition can be read by exactly ONE Consumer within a given group.

  • If you have 6 Partitions and 2 Consumers, each Consumer gets 3 Partitions.
  • If you have 6 Partitions and 6 Consumers, each gets exactly 1 Partition (maximum parallelism).
  • If you deploy 10 Consumers for 6 Partitions, 4 Consumers will sit completely idle.

When a Consumer crashes, Kafka detects a missed heartbeat and triggers a Rebalance. The group briefly pauses while the abandoned partitions are reassigned to the surviving consumers, so no partition is left unread.

Part 4: Offsets (The Cursors of History)

Because Kafka never deletes data when a consumer reads it, the broker does not track what has been read. Instead, the Consumer tracks its own position using an integer called an Offset.

As a consumer successfully processes records, it periodically "commits" its highest processed offset back to the Kafka cluster (stored in an internal topic named __consumer_offsets). If the Consumer crashes and is replaced, the new Consumer reads the last committed offset from this internal topic and resumes processing exactly where the previous instance left off.

Defining Delivery Guarantees

The timing of the offset commit defines the architectural guarantee:

  • At-Most-Once: Commit the offset before processing the block. If you crash during processing, the skipped records are never seen again. (Fastest; data loss possible.)
  • At-Least-Once: Commit the offset after successfully processing the block. If you crash after processing but before committing, the next consumer re-processes the same block. (The standard choice; requires idempotent consumers.)
  • Exactly-Once: Uses Kafka's Transactional API to atomically link the consumer's output writes to Kafka topics with the offset commit in a single transaction; writes to external databases still require idempotency or an outbox-style pattern. (Highest overhead; strongest guarantee.)
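The difference between the first two guarantees can be simulated in memory. A hypothetical sketch: a consumer crashes at offset 2 of 5, a replacement resumes from the last committed offset, and we record which offsets end up processed under each commit strategy.

```java
import java.util.ArrayList;
import java.util.List;

// Simulates one consumer crash under the two commit strategies.
// At-most-once crashes after committing but before processing;
// at-least-once crashes after processing but before committing.
public class DeliverySim {
    static List<Integer> simulate(int total, int crashAt, boolean commitBeforeProcess) {
        List<Integer> processed = new ArrayList<>();
        int committed = 0; // where a replacement consumer will resume
        // First run, ending in a simulated crash at `crashAt`
        for (int offset = 0; offset < total; offset++) {
            if (commitBeforeProcess) {
                committed = offset + 1;       // commit first...
                if (offset == crashAt) break; // ...crash before processing
                processed.add(offset);
            } else {
                processed.add(offset);        // process first...
                if (offset == crashAt) break; // ...crash before committing
                committed = offset + 1;
            }
        }
        // Replacement consumer resumes from the committed offset
        for (int offset = committed; offset < total; offset++) {
            processed.add(offset);
        }
        return processed;
    }

    public static void main(String[] args) {
        System.out.println(simulate(5, 2, true));  // at-most-once:  [0, 1, 3, 4] (offset 2 lost)
        System.out.println(simulate(5, 2, false)); // at-least-once: [0, 1, 2, 2, 3, 4] (offset 2 twice)
    }
}
```

The simulation makes the trade concrete: moving one commit across the process step turns a lost record into a duplicated one, which is why at-least-once pipelines pair with idempotent processing.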

Conclusion: The Central Nervous System

By shifting from ephemeral queues to a durable, partitioned log, Kafka decoupled data production from data consumption. A billing service can read a transaction in real-time, while an analytics pipeline can batch-read the exact same data 12 hours later, and a newly deployed fraud-detection microservice can rewind the log to day zero and replay 6 months of historical transactions to train its machine learning model. Kafka became the central nervous system of the modern enterprise architecture.

Glossary & Concepts

📝 Topic

A named category or feed to which records are published. Think of it as a logical grouping of events, similar to a table in a database, but implemented as an append-only log.

🧱 Partition

A physical sub-division of a topic. Partitions allow a single topic to be scaled across multiple brokers. Messages within a single partition are strictly ordered.

👥 Consumer Group

A group of consumers that cooperate to ingest data from a topic. Kafka assigns each partition to exactly one consumer within the group to ensure parallel, ordered processing.

📍 Offset

A unique, monotonically increasing integer assigned to each message within a partition. It acts as the message's ID and allows consumers to track exactly where they left off.