Apache Kafka is a distributed streaming platform designed for building real-time data pipelines and streaming applications. Originally developed by LinkedIn and open-sourced in 2011, Kafka has become the de facto standard for event streaming in modern distributed systems.
Kafka operates as a distributed commit log - a persistent, append-only data structure that stores streams of records (events) in topics. Unlike traditional message queues that delete messages after consumption, Kafka retains messages for a configurable retention period, allowing multiple consumers to read the same messages at different times and speeds.
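To make the commit-log model concrete, here is a toy sketch in plain Python (no real Kafka involved; `CommitLog` is an invented name for illustration): records are only ever appended, and any number of readers can read any retained offset independently.

```python
class CommitLog:
    """Toy append-only log: records are appended, never modified or deleted."""

    def __init__(self):
        self._records = []

    def append(self, record):
        self._records.append(record)
        return len(self._records) - 1  # offset assigned to the new record

    def read(self, offset):
        """Any reader can read any retained offset, any number of times."""
        return self._records[offset]


log = CommitLog()
for event in ["order-created", "order-paid", "order-shipped"]:
    log.append(event)

# Two independent consumers track their own positions in the same log.
print(log.read(0))  # a slow reader re-reads the first record: order-created
print(log.read(2))  # a faster reader is already at the latest: order-shipped
```

Because consumption does not delete anything, the slow reader's re-read has no effect on the fast reader - the key difference from a traditional queue.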
This design makes Kafka ideal for scenarios where you need to replay past events, feed the same stream to multiple independent consumers, or let consumers that process at different speeds read the same data without losing any of it.
A topic is a category or stream of messages in Kafka. Think of it as a named channel where producers publish messages and consumers subscribe to read them. Topics are similar to tables in a database or folders in a filesystem - they provide logical organization for related messages.
Key Characteristics:
Topics in Kafka are immutable logs - messages are appended to the end and never modified or deleted (until retention expires). This append-only design provides several benefits: it enables efficient sequential disk I/O, allows multiple consumers to read the same messages independently, and supports event replay for debugging or reprocessing.
Each topic is partitioned for scalability and replicated across multiple brokers for fault tolerance. Messages within a partition are strictly ordered, but ordering across partitions is not guaranteed unless you use a partitioning key.
A partition is an ordered, immutable sequence of messages within a topic. Each topic is divided into one or more partitions, which enables Kafka to scale horizontally and process messages in parallel.
Why Partitions Matter:
Partitions are Kafka’s fundamental unit of parallelism. Without partitions, a topic would be processed by only one consumer at a time, creating a bottleneck. With partitions, different consumers can process different partitions simultaneously, dramatically increasing throughput.
Partitioning Strategy:
When a producer sends a message, Kafka determines which partition to use as follows: if the producer specifies a partition explicitly, that partition is used; otherwise, if the message has a key, the key is hashed and the hash selects the partition, so messages with the same key always land on the same partition; otherwise, messages are spread across partitions (round-robin in older clients, sticky batching in newer ones).
Real-World Example: In an e-commerce system, you might partition the orders topic by customer_id. This ensures all orders from the same customer are processed in order, while orders from different customers can be processed in parallel.
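The key-based strategy can be sketched in a few lines of Python. This is a simplification: Kafka's default partitioner hashes keys with murmur2, and `zlib.crc32` merely stands in as a stable hash for illustration (Python's built-in `hash()` is salted per process, so it would not be deterministic across runs).

```python
import zlib

NUM_PARTITIONS = 6

def partition_for(key: bytes) -> int:
    # Hash the key and take it modulo the partition count, as Kafka's
    # default partitioner does (with murmur2 instead of crc32).
    return zlib.crc32(key) % NUM_PARTITIONS

# All orders for the same customer land on the same partition...
assert partition_for(b"customer-42") == partition_for(b"customer-42")

# ...while different customers are spread across partitions.
spread = {partition_for(f"customer-{i}".encode()) for i in range(100)}
print(sorted(spread))
```

Same key, same partition - which is exactly why partitioning the orders topic by customer_id preserves per-customer ordering.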
A consumer group is a set of consumers that work together to consume messages from one or more topics. Kafka automatically distributes partitions across consumers in the same group, ensuring each partition is consumed by exactly one consumer in the group.
How Consumer Groups Work:
When consumers in a group subscribe to a topic, Kafka performs partition assignment - it divides the topic’s partitions among the available consumers. If you have 3 partitions and 2 consumers, Kafka might assign partitions 0 and 1 to consumer 1, and partition 2 to consumer 2.
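The range-style split described above can be sketched in plain Python. `assign_range` is an invented helper, not a Kafka API; the real RangeAssignor works per topic on sorted partition and member lists, but the arithmetic is the same idea.

```python
def assign_range(partitions, consumers):
    """Range-style assignment: split the partition list into contiguous
    chunks, with the first consumers taking one extra partition when the
    count doesn't divide evenly."""
    per, extra = divmod(len(partitions), len(consumers))
    assignment, start = {}, 0
    for i, consumer in enumerate(consumers):
        count = per + (1 if i < extra else 0)
        assignment[consumer] = partitions[start:start + count]
        start += count
    return assignment


print(assign_range([0, 1, 2], ["consumer-1", "consumer-2"]))
# {'consumer-1': [0, 1], 'consumer-2': [2]}
```

Note what happens with more consumers than partitions: `assign_range([0, 1, 2], ["a", "b", "c", "d"])` leaves consumer "d" with an empty list - it sits idle, which is why adding consumers beyond the partition count does not increase throughput.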
Rebalancing:
When consumers join or leave a consumer group, Kafka automatically rebalances - it redistributes partitions among the remaining consumers. During rebalancing, consumers stop processing messages, which can cause a brief pause. This is why it’s important to design consumers to handle rebalancing gracefully and commit offsets frequently.
Scaling Pattern:
To increase throughput, add consumers to the group; Kafka assigns each new consumer a share of the partitions, so processing scales horizontally until every partition has its own consumer.
Key Rules:
Each partition is consumed by exactly one consumer within a group, so a group can have at most as many active consumers as the topic has partitions - extra consumers sit idle as standbys. Different consumer groups are independent: each group receives its own copy of every message.
Example:
A topic with 6 partitions supports up to 6 active consumers in one group; a 7th consumer would be assigned no partitions and remain idle until another consumer leaves.
An offset is a sequential number that uniquely identifies the position of a message within a partition. Offsets start at 0 and increment for each message. They are immutable - once assigned, an offset never changes.
Offset Management:
Consumers track their current offset - the position of the last message they’ve successfully processed. After processing a message, the consumer commits the offset, which tells Kafka “I’ve processed up to this point.” If the consumer crashes and restarts, it resumes from the last committed offset, ensuring no messages are lost and no messages are processed twice (assuming proper offset management).
Offset Commit Strategies:
Consumers can commit offsets automatically at a fixed interval (auto-commit) or manually after processing completes. Manual commits can be synchronous (safer, but blocking) or asynchronous (faster, but a failed commit can go unnoticed). Committing after processing yields at-least-once delivery; committing before processing yields at-most-once.
Offset Storage:
Kafka stores committed offsets in a special internal topic called __consumer_offsets. This allows Kafka to track consumer progress even if consumers restart or rebalance.
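The commit-and-resume cycle can be simulated in plain Python. Everything here is invented for illustration - the `committed` dict stands in for the __consumer_offsets topic, and `run_consumer` for a consumer's poll loop.

```python
messages = ["m0", "m1", "m2", "m3", "m4"]  # one partition's log
committed = {}  # stand-in for __consumer_offsets: group -> next offset

def run_consumer(group, crash_after=None):
    """Process messages from the last committed offset, committing after
    each one. Returns the messages processed in this run."""
    processed = []
    offset = committed.get(group, 0)  # resume where we left off
    while offset < len(messages):
        if crash_after is not None and len(processed) == crash_after:
            return processed  # simulated crash: later offsets never committed
        processed.append(messages[offset])
        offset += 1
        committed[group] = offset  # commit: "I've processed up to here"
    return processed


first = run_consumer("analytics", crash_after=2)  # processes m0, m1, then dies
second = run_consumer("analytics")                # resumes at committed offset
print(first, second)  # ['m0', 'm1'] ['m2', 'm3', 'm4']
```

The restarted run picks up exactly where the committed offset points, so nothing is skipped and nothing is reprocessed - provided the crash happened after the commit, which is the "proper offset management" caveat above.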
A broker is a single Kafka server that stores partitions on disk and serves producer and consumer requests. A cluster is a group of brokers working together; a topic's partitions are spread across the cluster's brokers.
Replication:
Each partition is copied to a configurable number of brokers (the replication factor). One replica is the leader, which handles all reads and writes; the others are followers that replicate the leader's log. If the broker hosting the leader fails, an in-sync follower is promoted to leader, keeping the partition available without data loss.
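A toy model of leader/follower replication (plain Python; `Partition` and its methods are invented for illustration and gloss over in-sync-replica tracking):

```python
class Partition:
    """Toy replicated partition: writes go to the leader, followers copy them."""

    def __init__(self, replication_factor=3):
        self.replicas = [[] for _ in range(replication_factor)]
        self.leader = 0  # index of the current leader replica

    def append(self, record):
        for replica in self.replicas:  # leader writes, followers replicate
            replica.append(record)

    def fail_leader(self):
        # The leader's broker dies: promote an in-sync follower.
        self.leader = (self.leader + 1) % len(self.replicas)

    def read_all(self):
        return list(self.replicas[self.leader])


p = Partition(replication_factor=3)
p.append("event-1")
p.append("event-2")
p.fail_leader()          # broker hosting the leader goes down
print(p.read_all())      # ['event-1', 'event-2'] - still available
```

Because the follower had fully replicated the leader's log before the failure, the promoted replica serves the same records with no data loss.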
Exactly-once semantics ensures that each message is processed exactly once, with no duplicates and no lost messages. This is the strongest delivery guarantee but also the most complex to implement.
In distributed systems, achieving exactly-once processing is challenging because failures are ambiguous: a producer that times out cannot tell whether its write actually succeeded, so a retry may create a duplicate, and a consumer that crashes after processing a message but before committing its offset will reprocess that message on restart.
Achieving exactly-once semantics requires coordination across three components:
Idempotent Producer:
With idempotence enabled (enable.idempotence=true), the broker tracks a producer ID and per-partition sequence numbers, discarding duplicate writes caused by producer retries.
Transactional Producer:
Configured with a transactional.id, the producer can write to multiple partitions atomically and commit consumed offsets within the same transaction; consumers using isolation.level=read_committed see only committed messages.
Idempotent Consumer:
The consumer's processing logic must tolerate redelivery - for example, by deduplicating on a message ID or by storing results and offsets in a single atomic operation.
Partitions Enable Parallelism
Partitions allow multiple consumers to process a topic in parallel. Ordering is guaranteed only within a partition.
Consumer Groups Scale
Consumer groups distribute partitions across consumers. Add consumers to scale throughput.
Offsets Track Progress
Offsets track consumer position. Commit offsets to resume after restart. Critical for reliability.
Exactly-Once is Complex
Exactly-once requires an idempotent producer, transactions, and an idempotent consumer. Use it only when duplicates are unacceptable.