
Kafka Deep Dive

Distributed streaming platform for high-throughput event processing

Apache Kafka is a distributed streaming platform designed for building real-time data pipelines and streaming applications. Originally developed by LinkedIn and open-sourced in 2011, Kafka has become the de facto standard for event streaming in modern distributed systems.

Kafka operates as a distributed commit log - a persistent, append-only data structure that stores streams of records (events) in topics. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable retention period, allowing multiple consumers to read the same messages at different times and speeds.

This design makes Kafka ideal for scenarios where you need to:

  • Decouple producers and consumers (they don’t need to be active simultaneously)
  • Replay events (consumers can re-read historical data; see the replay sketch after this list)
  • Scale horizontally (add partitions and consumers independently)
  • Guarantee ordering within partitions
  • Ensure durability and fault tolerance through replication
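
As a quick illustration of replay, a single consumer can rewind to the start of a partition and re-read everything still within retention. A minimal sketch with the Java client, assuming a local broker and a hypothetical orders topic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign a partition directly (no consumer group) and rewind to the start.
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));

            // Re-read every retained message from the beginning of the partition.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```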
Kafka's defining properties at a glance:
  1. Distributed - Runs on multiple servers (brokers)
  2. Fault-tolerant - Replicates data across brokers
  3. High-throughput - Millions of messages per second
  4. Scalable - Add brokers/partitions to scale
  5. Durable - Messages persisted to disk
  6. Real-time - Low latency streaming

A topic is a category or stream of messages in Kafka. Think of it as a named channel where producers publish messages and consumers subscribe to read them. Topics are similar to tables in a database or folders in a filesystem - they provide logical organization for related messages.

Key Characteristics:

Topics in Kafka are immutable logs - messages are appended to the end and never modified or deleted (until retention expires). This append-only design provides several benefits: it enables efficient sequential disk I/O, allows multiple consumers to read the same messages independently, and supports event replay for debugging or reprocessing.

Each topic is partitioned for scalability and replicated across multiple brokers for fault tolerance. Messages within a partition are strictly ordered, but there is no global ordering across partitions; if related messages must stay in order, give them the same partitioning key so they all land in the same partition.
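
To make this concrete, here is a minimal sketch of creating such a topic with the Java AdminClient; the broker address, topic name, partition count, replication factor, and retention value are all illustrative choices, not recommendations:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism, replication factor 3 for fault tolerance.
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            // Retain messages for 7 days (in milliseconds) before they expire.
            orders.configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```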


Characteristics:

  • Immutable log - Messages appended, never modified
  • Partitioned - Split into partitions for parallelism
  • Replicated - Copies across brokers for reliability
  • Ordered - Messages ordered within partition

A partition is an ordered, immutable sequence of messages within a topic. Each topic is divided into one or more partitions, which enables Kafka to scale horizontally and process messages in parallel.

Why Partitions Matter:

Partitions are Kafka’s fundamental unit of parallelism. Without partitions, a topic would be processed by only one consumer at a time, creating a bottleneck. With partitions, different consumers can process different partitions simultaneously, dramatically increasing throughput.

Partitioning Strategy:

When a producer sends a message, Kafka determines which partition to use based on:

  • Key-based partitioning: If a message has a key, Kafka uses a hash of the key to determine the partition. This ensures all messages with the same key go to the same partition, maintaining ordering for that key.
  • Round-robin partitioning: If no key is provided, Kafka spreads messages evenly across partitions. Older clients rotate strictly round-robin; newer clients (2.4+) use a "sticky" partitioner that fills a batch for one partition before moving on, which still evens out over time.

Real-World Example: In an e-commerce system, you might partition the orders topic by customer_id. This ensures all orders from the same customer are processed in order, while orders from different customers can be processed in parallel.
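
A minimal sketch of this pattern with the Java producer, assuming a local broker and a hypothetical orders topic keyed by customer ID:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-42";
            // Using customerId as the key hashes both records to the same
            // partition, so this customer's orders stay in order.
            producer.send(new ProducerRecord<>("orders", customerId, "order-1:created"));
            producer.send(new ProducerRecord<>("orders", customerId, "order-1:paid"));
        }
    }
}
```

Because both records share a key, they hash to the same partition and are consumed in the order they were sent; records for other customers can flow through other partitions in parallel.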


Why Partitions?

  • Parallelism - Multiple consumers process different partitions
  • Scalability - Add partitions to scale throughput
  • Ordering - Messages ordered within partition (not globally)

Partitioning Strategy:

  • Key-based - Same key → same partition (ensures ordering)
  • Round-robin - Distribute evenly (no key)

A consumer group is a set of consumers that work together to consume messages from one or more topics. Kafka automatically distributes partitions across consumers in the same group, ensuring each partition is consumed by exactly one consumer in the group.

How Consumer Groups Work:

When consumers in a group subscribe to a topic, Kafka performs partition assignment - it divides the topic’s partitions among the available consumers. If you have 3 partitions and 2 consumers, Kafka might assign partitions 0 and 1 to consumer 1, and partition 2 to consumer 2.

Rebalancing:

When consumers join or leave a consumer group, Kafka automatically rebalances - it redistributes partitions among the remaining consumers. During rebalancing, consumers stop processing messages, which can cause a brief pause. This is why it’s important to design consumers to handle rebalancing gracefully and commit offsets frequently.
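
A minimal sketch of a group member that handles rebalancing gracefully, assuming a local broker, a hypothetical orders topic, and a group named order-processors: it commits its offsets whenever its partitions are about to be revoked.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors"); // members of this group share the partitions
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Commit before giving up partitions so the next owner
                    // resumes exactly where this consumer stopped.
                    consumer.commitSync();
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("Assigned: " + partitions);
                }
            });

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
                }
                consumer.commitSync(); // commit after each processed batch
            }
        }
    }
}
```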

Scaling Pattern:

  • Scale consumption: Add more consumers to the group (beyond the number of partitions, extra consumers sit idle)
  • Scale production: Add more partitions to the topic (requires careful planning, since the partition count can never be reduced; see the sketch after this list)
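
Growing a topic's partition count can be done with the AdminClient. A minimal sketch, again assuming a hypothetical orders topic; note that existing keys will hash to different partitions after the change, so plan for the impact on per-key ordering:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitionsExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow "orders" to 6 partitions. Kafka can only increase the
            // count; shrinking a topic is not supported.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(6))).all().get();
        }
    }
}
```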

Key Rules:

  • One partition → one consumer (in same group)
  • One consumer → multiple partitions (a single consumer can own several)
  • Rebalancing - When consumer joins/leaves, partitions redistributed

Example:

  • Topic has 3 partitions
  • Consumer group has 2 consumers
  • Consumer 1 gets partitions 0, 1
  • Consumer 2 gets partition 2

An offset is a sequential number that uniquely identifies the position of a message within a partition. Offsets start at 0 and increment for each message. They are immutable - once assigned, an offset never changes.

Offset Management:

Consumers track their current offset - the position up to which they have successfully processed a partition. After processing a message, the consumer commits the offset, which tells Kafka "I've processed up to this point." If the consumer crashes and restarts, it resumes from the last committed offset, so no messages are skipped. Avoiding duplicate processing takes more care: a crash between processing a message and committing its offset causes that message to be re-delivered, which is why Kafka's default guarantee is at-least-once rather than exactly-once.

Offset Commit Strategies:

  • Automatic commit: Kafka commits offsets periodically in the background (auto.commit.interval.ms, default 5 seconds). Simple, but a crash between processing and the next auto-commit leads to duplicate processing.
  • Manual commit: The consumer explicitly commits offsets after processing (see the sketch after this list). More control, but requires careful implementation to avoid duplicates or lost messages.
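
A minimal sketch of manual, per-record commits with the Java consumer (broker address, topic, and group name are placeholders). The detail worth noticing is that the committed value is record.offset() + 1, i.e. the next offset to read:

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors");
        props.put("enable.auto.commit", "false"); // we commit explicitly
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println("processing " + record.value()); // stand-in for real work
                    // Commit record.offset() + 1: the offset of the NEXT
                    // message to read, not the one just processed.
                    consumer.commitSync(Map.of(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }
}
```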

Offset Storage:

Kafka stores committed offsets in a special internal topic called __consumer_offsets. This allows Kafka to track consumer progress even if consumers restart or rebalance.


Offset Management:

  • Consumer tracks current offset
  • After processing, commits offset
  • On restart, resumes from committed offset

A broker is a single Kafka server: it stores partition data on disk and serves producer and consumer requests. A cluster is a group of brokers working together, with each topic's partitions and their replicas spread across the brokers.

Replication:

  • Leader - Handles reads/writes for partition
  • Followers - Replicate leader’s data
  • ISR (In-Sync Replicas) - The set of replicas fully caught up with the leader; see the producer sketch below
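
How this plays out for producers: with acks=all, the leader only acknowledges a write once every replica in the ISR has it. A minimal sketch, assuming a local broker and a hypothetical orders topic; pair it with the topic-level min.insync.replicas setting to control how many replicas must be in sync for writes to succeed:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all"); // wait until all in-sync replicas have the write
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // .get() blocks until the leader and its ISR followers have
            // acknowledged, so the message survives a single broker failure.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-1:created")).get();
        }
    }
}
```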



Exactly-once semantics ensures that each message is processed exactly once, with no duplicates and no lost messages. This is the strongest delivery guarantee but also the most complex to implement.

In distributed systems, achieving exactly-once processing is challenging because:

  1. Network failures can cause retries, leading to duplicate messages
  2. Consumer failures can cause rebalancing, leading to duplicate processing
  3. Producer retries can send the same message multiple times
  4. Offset commits can fail, causing messages to be reprocessed

Achieving exactly-once semantics requires coordination across three components (the first two are sketched in code after this list):

  1. Idempotent Producer
    • Prevents duplicate messages from retries
    • Uses a producer ID and per-partition sequence numbers
  2. Transactional Producer
    • Writes atomically across partitions and topics
    • Uses transactions
  3. Idempotent Consumer
    • Tracks processed offsets
    • Deduplicates messages on the consuming side
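
A minimal sketch of the producer side, assuming a local broker and hypothetical orders and payments topics; the transactional ID must be stable across restarts of the same producer instance:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("enable.idempotence", "true");           // dedupe retries via producer ID + sequence numbers
        props.put("transactional.id", "order-pipeline-1"); // stable ID required for transactions
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes commit or abort together, even though they land
                // on different topics (and possibly different partitions).
                producer.send(new ProducerRecord<>("orders", "customer-42", "order-1:created"));
                producer.send(new ProducerRecord<>("payments", "customer-42", "order-1:charged"));
                producer.commitTransaction();
            } catch (Exception e) {
                // Consumers with isolation.level=read_committed never see aborted writes.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```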

Partitions Enable Parallelism

Partitions allow multiple consumers to process a topic in parallel. Ordering is guaranteed only within each partition.

Consumer Groups Scale

Consumer groups distribute partitions across consumers. Add consumers (up to the partition count) to scale throughput.

Offsets Track Progress

Offsets track consumer position. Commit offsets to resume after restart. Critical for reliability.

Exactly-Once is Complex

Exactly-once requires an idempotent producer, transactions, and an idempotent consumer. Use it only when duplicate or lost messages are unacceptable, since it adds complexity and overhead.