
Kafka Deep Dive

Distributed streaming platform for high-throughput event processing

Apache Kafka is a distributed streaming platform designed for building real-time data pipelines and streaming applications. Originally developed by LinkedIn and open-sourced in 2011, Kafka has become the de facto standard for event streaming in modern distributed systems.

Kafka operates as a distributed commit log - a persistent, append-only data structure that stores streams of records (events) in topics. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable retention period, allowing multiple consumers to read the same messages at different times and speeds.

This design makes Kafka ideal for scenarios where you need to:

  • Decouple producers and consumers (they don’t need to be active simultaneously)
  • Replay events (consumers can re-read historical data; see the replay sketch after this list)
  • Scale horizontally (add partitions and consumers independently)
  • Guarantee ordering within partitions
  • Ensure durability and fault tolerance through replication
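
As a quick illustration of replay, a single consumer can rewind to the start of a partition and re-read everything still within retention. A minimal sketch with the Java client, assuming a local broker and a hypothetical orders topic:

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class ReplayExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            // Assign a partition directly (no consumer group) and rewind to the start.
            TopicPartition partition = new TopicPartition("orders", 0);
            consumer.assign(List.of(partition));
            consumer.seekToBeginning(List.of(partition));

            // Re-read every retained message from the beginning of the partition.
            for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(5))) {
                System.out.printf("offset=%d value=%s%n", record.offset(), record.value());
            }
        }
    }
}
```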
Kafka's defining properties at a glance:
  1. Distributed - Runs on multiple servers (brokers)
  2. Fault-tolerant - Replicates data across brokers
  3. High-throughput - Millions of messages per second
  4. Scalable - Add brokers/partitions to scale
  5. Durable - Messages persisted to disk
  6. Real-time - Low latency streaming

A topic is a category or stream of messages in Kafka. Think of it as a named channel where producers publish messages and consumers subscribe to read them. Topics are similar to tables in a database or folders in a filesystem - they provide logical organization for related messages.

Key Characteristics:

Topics in Kafka are immutable logs - messages are appended to the end and never modified or deleted (until retention expires). This append-only design provides several benefits: it enables efficient sequential disk I/O, allows multiple consumers to read the same messages independently, and supports event replay for debugging or reprocessing.

Each topic is partitioned for scalability and replicated across multiple brokers for fault tolerance. Messages within a partition are strictly ordered, but there is no global ordering across partitions; if related messages must stay in order, give them the same partitioning key so they all land in the same partition.
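
To make this concrete, here is a minimal sketch of creating such a topic with the Java AdminClient; the broker address, topic name, partition count, replication factor, and retention value are all illustrative choices, not recommendations:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // 3 partitions for parallelism, replication factor 3 for fault tolerance.
            NewTopic orders = new NewTopic("orders", 3, (short) 3);
            // Retain messages for 7 days (in milliseconds) before they expire.
            orders.configs(Map.of("retention.ms", "604800000"));
            admin.createTopics(List.of(orders)).all().get();
        }
    }
}
```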


Characteristics:

  • Immutable log - Messages appended, never modified
  • Partitioned - Split into partitions for parallelism
  • Replicated - Copies across brokers for reliability
  • Ordered - Messages ordered within partition

A partition is an ordered, immutable sequence of messages within a topic. Each topic is divided into one or more partitions, which enables Kafka to scale horizontally and process messages in parallel.

Why Partitions Matter:

Partitions are Kafka’s fundamental unit of parallelism. Without partitions, a topic would be processed by only one consumer at a time, creating a bottleneck. With partitions, different consumers can process different partitions simultaneously, dramatically increasing throughput.

Partitioning Strategy:

When a producer sends a message, Kafka determines which partition to use based on:

  • Key-based partitioning: If a message has a key, Kafka uses a hash of the key to determine the partition. This ensures all messages with the same key go to the same partition, maintaining ordering for that key.
  • Round-robin partitioning: If no key is provided, Kafka spreads messages evenly across partitions. Older clients rotate strictly round-robin; newer clients (2.4+) use a "sticky" partitioner that fills a batch for one partition before moving on, which still evens out over time.

Real-World Example: In an e-commerce system, you might partition the orders topic by customer_id. This ensures all orders from the same customer are processed in order, while orders from different customers can be processed in parallel.
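
A minimal sketch of this pattern with the Java producer, assuming a local broker and a hypothetical orders topic keyed by customer ID:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class OrderProducerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            String customerId = "customer-42";
            // Using customerId as the key hashes both records to the same
            // partition, so this customer's orders stay in order.
            producer.send(new ProducerRecord<>("orders", customerId, "order-1:created"));
            producer.send(new ProducerRecord<>("orders", customerId, "order-1:paid"));
        }
    }
}
```

Because both records share a key, they hash to the same partition and are consumed in the order they were sent; records for other customers can flow through other partitions in parallel.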


Why Partitions?

  • Parallelism - Multiple consumers process different partitions
  • Scalability - Add partitions to scale throughput
  • Ordering - Messages ordered within partition (not globally)

Partitioning Strategy:

  • Key-based - Same key → same partition (ensures ordering)
  • Round-robin - Distribute evenly (no key)

A consumer group is a set of consumers that work together to consume messages from one or more topics. Kafka automatically distributes partitions across consumers in the same group, ensuring each partition is consumed by exactly one consumer in the group.

How Consumer Groups Work:

When consumers in a group subscribe to a topic, Kafka performs partition assignment - it divides the topic’s partitions among the available consumers. If you have 3 partitions and 2 consumers, Kafka might assign partitions 0 and 1 to consumer 1, and partition 2 to consumer 2.

Rebalancing:

When consumers join or leave a consumer group, Kafka automatically rebalances - it redistributes partitions among the remaining consumers. During rebalancing, consumers stop processing messages, which can cause a brief pause. This is why it’s important to design consumers to handle rebalancing gracefully and commit offsets frequently.
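
A minimal sketch of a group member that handles rebalancing gracefully, assuming a local broker, a hypothetical orders topic, and a group named order-processors: it commits its offsets whenever its partitions are about to be revoked.

```java
import java.time.Duration;
import java.util.Collection;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRebalanceListener;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class GroupConsumerExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors"); // members of this group share the partitions
        props.put("enable.auto.commit", "false");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"), new ConsumerRebalanceListener() {
                @Override
                public void onPartitionsRevoked(Collection<TopicPartition> partitions) {
                    // Commit before giving up partitions so the next owner
                    // resumes exactly where this consumer stopped.
                    consumer.commitSync();
                }

                @Override
                public void onPartitionsAssigned(Collection<TopicPartition> partitions) {
                    System.out.println("Assigned: " + partitions);
                }
            });

            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.printf("partition=%d offset=%d%n", record.partition(), record.offset());
                }
                consumer.commitSync(); // commit after each processed batch
            }
        }
    }
}
```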

Scaling Pattern:

  • Scale consumption: Add more consumers to the group (beyond the number of partitions, extra consumers sit idle)
  • Scale production: Add more partitions to the topic (requires careful planning, since the partition count can never be reduced; see the sketch after this list)
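
Growing a topic's partition count can be done with the AdminClient. A minimal sketch, again assuming a hypothetical orders topic; note that existing keys will hash to different partitions after the change, so plan for the impact on per-key ordering:

```java
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.NewPartitions;

public class AddPartitionsExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");

        try (AdminClient admin = AdminClient.create(props)) {
            // Grow "orders" to 6 partitions. Kafka can only increase the
            // count; shrinking a topic is not supported.
            admin.createPartitions(Map.of("orders", NewPartitions.increaseTo(6))).all().get();
        }
    }
}
```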

Key Rules:

  • One partition → one consumer (in same group)
  • One consumer → multiple partitions (a single consumer can own several)
  • Rebalancing - When consumer joins/leaves, partitions redistributed

Example:

  • Topic has 3 partitions
  • Consumer group has 2 consumers
  • Consumer 1 gets partitions 0, 1
  • Consumer 2 gets partition 2

An offset is a sequential number that uniquely identifies the position of a message within a partition. Offsets start at 0 and increment for each message. They are immutable - once assigned, an offset never changes.

Offset Management:

Consumers track their current offset - the position up to which they have successfully processed a partition. After processing a message, the consumer commits the offset, which tells Kafka "I've processed up to this point." If the consumer crashes and restarts, it resumes from the last committed offset, so no messages are skipped. Avoiding duplicate processing takes more care: a crash between processing a message and committing its offset causes that message to be re-delivered, which is why Kafka's default guarantee is at-least-once rather than exactly-once.

Offset Commit Strategies:

  • Automatic commit: Kafka commits offsets periodically in the background (auto.commit.interval.ms, default 5 seconds). Simple, but a crash between processing and the next auto-commit leads to duplicate processing.
  • Manual commit: The consumer explicitly commits offsets after processing (see the sketch after this list). More control, but requires careful implementation to avoid duplicates or lost messages.
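
A minimal sketch of manual, per-record commits with the Java consumer (broker address, topic, and group name are placeholders). The detail worth noticing is that the committed value is record.offset() + 1, i.e. the next offset to read:

```java
import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.common.TopicPartition;

public class ManualCommitExample {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "order-processors");
        props.put("enable.auto.commit", "false"); // we commit explicitly
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(List.of("orders"));
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofSeconds(1))) {
                    System.out.println("processing " + record.value()); // stand-in for real work
                    // Commit record.offset() + 1: the offset of the NEXT
                    // message to read, not the one just processed.
                    consumer.commitSync(Map.of(
                        new TopicPartition(record.topic(), record.partition()),
                        new OffsetAndMetadata(record.offset() + 1)));
                }
            }
        }
    }
}
```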

Offset Storage:

Kafka stores committed offsets in a special internal topic called __consumer_offsets. This allows Kafka to track consumer progress even if consumers restart or rebalance.


Offset Management:

  • Consumer tracks current offset
  • After processing, commits offset
  • On restart, resumes from committed offset

A broker is a single Kafka server: it stores partition data on disk and serves producer and consumer requests. A cluster is a group of brokers working together, with each topic's partitions and their replicas spread across the brokers.

Replication:

  • Leader - Handles reads/writes for partition
  • Followers - Replicate leader’s data
  • ISR (In-Sync Replicas) - The set of replicas fully caught up with the leader; see the producer sketch below
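
How this plays out for producers: with acks=all, the leader only acknowledges a write once every replica in the ISR has it. A minimal sketch, assuming a local broker and a hypothetical orders topic; pair it with the topic-level min.insync.replicas setting to control how many replicas must be in sync for writes to succeed:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class DurableProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("acks", "all"); // wait until all in-sync replicas have the write
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // .get() blocks until the leader and its ISR followers have
            // acknowledged, so the message survives a single broker failure.
            producer.send(new ProducerRecord<>("orders", "customer-42", "order-1:created")).get();
        }
    }
}
```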



Exactly-once semantics ensures that each message is processed exactly once, with no duplicates and no lost messages. This is the strongest delivery guarantee but also the most complex to implement.

In distributed systems, achieving exactly-once processing is challenging because:

  1. Network failures can cause retries, leading to duplicate messages
  2. Consumer failures can cause rebalancing, leading to duplicate processing
  3. Producer retries can send the same message multiple times
  4. Offset commits can fail, causing messages to be reprocessed

Achieving exactly-once semantics requires coordination across three components (the first two are sketched in code after this list):

  1. Idempotent Producer
    • Prevents duplicate messages from retries
    • Uses a producer ID and per-partition sequence numbers
  2. Transactional Producer
    • Writes atomically across partitions and topics
    • Uses transactions
  3. Idempotent Consumer
    • Tracks processed offsets
    • Deduplicates messages on the consuming side
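
A minimal sketch of the producer side, assuming a local broker and hypothetical orders and payments topics; the transactional ID must be stable across restarts of the same producer instance:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class TransactionalProducerExample {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("enable.idempotence", "true");           // dedupe retries via producer ID + sequence numbers
        props.put("transactional.id", "order-pipeline-1"); // stable ID required for transactions
        props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.initTransactions();
            try {
                producer.beginTransaction();
                // Both writes commit or abort together, even though they land
                // on different topics (and possibly different partitions).
                producer.send(new ProducerRecord<>("orders", "customer-42", "order-1:created"));
                producer.send(new ProducerRecord<>("payments", "customer-42", "order-1:charged"));
                producer.commitTransaction();
            } catch (Exception e) {
                // Consumers with isolation.level=read_committed never see aborted writes.
                producer.abortTransaction();
                throw e;
            }
        }
    }
}
```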

Partitions Enable Parallelism

Partitions allow multiple consumers to process a topic in parallel. Ordering is guaranteed only within each partition.

Consumer Groups Scale

Consumer groups distribute partitions across consumers. Add consumers (up to the partition count) to scale throughput.

Offsets Track Progress

Offsets track consumer position. Commit offsets to resume after restart. Critical for reliability.

Exactly-Once is Complex

Exactly-once requires an idempotent producer, transactions, and an idempotent consumer. Use it only when duplicate or lost messages are unacceptable, since it adds complexity and overhead.