Kafka: The Commit Log
CS 6500 — Week 12, Session 1
CS 6500 — Big Data Analytics | Week 12
The 9 AM Problem
"A customer's credit card is compromised at 9 AM. Your fraud detection system runs as a nightly batch job at midnight. How many transactions does a thief complete before your system knows anything is wrong?"
The Answer: A Log
Every system you've built so far — MapReduce, Spark, MongoDB, Cassandra — answers questions about data at rest . Fraud detection, live pricing, and real-time alerts require data in motion .
Today's session builds the answer piece by piece:
Why batch fails for latency-sensitive use cases
What the ideal event system looks like — and why it's a log
How Kafka partitions, replicates, and protects that log
How consumers read without destroying it
What happens when readers crash
After today you can:
Explain Kafka's architecture in terms of the log abstraction
Choose partition count, replication factor, and delivery semantic for a given scenario
Read a topic description from the CLI and know what every field means
By the end of this session, the 9 AM problem has a concrete answer.
Week 11 Recap
Setting the stage for velocity
Data at Rest
Every system we've built operates on stored data — files, documents, rows.
The stack so far:
System           | Stores
HDFS / MapReduce | Batch files at scale
Spark            | In-memory batch analytics
MongoDB          | Flexible documents
Cassandra        | High-write, query-first rows
All four share one constraint: data must stop moving before a pipeline can begin.
The missing dimension:
The 3 Vs of big data include Velocity — data arriving faster than a batch window can capture. None of these systems handle it.
Assignment 3 reminder: NoSQL Design + Query Federation is due Sunday — submit before streaming territory begins.
Today we eliminate the batch-window constraint.
Why Streaming?
The limits of batch and the shape of the solution
Batch Fails
Every MapReduce and Spark job runs on a closed window of data:
Batch timeline:
Events happen all day
Pipeline triggers at midnight
Results available at 1 AM
Data is 0–24 hours stale
Fine for:
Monthly billing summaries
Weekly recommendation updates
Overnight ETL to a warehouse
Fatal for:
Use Case           | Cost of a 24-hour delay
Fraud detection    | Card used 100× before alert
Ride-share pricing | Price is already wrong
IoT safety alert   | Equipment already failed
Stock trading      | Price already moved
Batch processing accumulates data over a window, then processes it once the window closes. The entire window must be complete before the pipeline can begin. For fraud detection, a thief has the full batch interval — potentially 24 hours — to exploit a compromised card before any alert fires. The only solution is to process each event the moment it arrives.
Every production system has a latency question: "How stale can our data be?" For fraud detection, alerting, and dynamic pricing, the answer is "milliseconds" — which rules out every batch architecture you've learned so far. Understanding why streaming exists makes you a better architect when this question lands on your desk.
The Ideal System
If you were designing from scratch to solve the 9 AM problem, you'd want:
Accepts events the instant they arrive — no batch window to wait for
Remembers every event in order — any service can replay what happened
Lets any number of services tap independently — fraud detection and billing and analytics simultaneously
Survives broker failures — durability is non-negotiable for financial data
Scales to millions of events per second — LinkedIn, Uber, Netflix operate at this scale
That architecture is a log: an immutable, ordered, append-only sequence of records that any consumer can read from any point in history, without disturbing other readers.
Apache Kafka implements this architecture; every design decision in Kafka follows from defending the log.
"The log" here is not an application log file — it's the computer science data structure: a totally-ordered, append-only sequence. Jay Kreps (one of Kafka's creators) wrote "The Log: What every software engineer should know about real-time data's unifying abstraction" — this insight is the foundation of Kafka's design. Every concept today (partitions, replication, offsets, consumer groups) is an answer to the question: how do you build and protect a distributed log?
Key Terms — Streaming Fundamentals
Data in motion — events processed as they arrive, before landing in persistent storage
Data at rest — events processed after being stored; batch window must close before processing begins
Distributed commit log — an append-only, ordered, replicated sequence of records; Kafka's core abstraction
Retention window — how long Kafka retains records regardless of whether consumers have read them
The Log
What Kafka's architecture actually is
Topics
A topic is Kafka's implementation of the log.
A topic is:
A named, ordered, immutable sequence of records
Append-only — you never update or delete record 42; you only append record 43
Retained for a configurable window (default: 7 days) regardless of consumption
Each record has a key , value (bytes), and timestamp
Kafka is not a queue.
Queue (RabbitMQ / SQS):
Consume record → record is deleted
Log shrinks
Kafka (log model):
Consume record → record stays
Log only grows
Any consumer can read from any offset, any time. Replaying history is free.
The "read head on a tape" analogy captures the model. A consumer at offset 500 has processed records 0–499. On crash and restart, it resumes at 500. The broker never pushes records — consumers always poll, so each consumer controls its own pace and backpressure naturally. This is what makes adding new consumers safe: the original records are always there.
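The non-destructive read model is easy to sketch in plain Python. This toy in-memory log is an illustration of the abstraction, not Kafka's storage engine — the point is that reading never removes a record:

```python
# Toy append-only log: consuming never deletes; the log only grows.
class ToyLog:
    def __init__(self):
        self.records = []

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record

    def read(self, offset):
        return self.records[offset]    # reading is non-destructive

log = ToyLog()
for event in ["order-1", "order-2", "order-3"]:
    log.append(event)

# Readers at different offsets see the same intact history.
assert log.read(0) == "order-1"
assert log.read(2) == "order-3"
assert len(log.records) == 3           # nothing was deleted
```

This is the property that makes adding a new consumer months later safe: the records are still there.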
Partitions
A single log would bottleneck at one machine. Partitions scale the log across the cluster.
Topic "orders" (3 partitions)
Partition 0: [0] [1] [2] [3] ────────────────── newest
Partition 1: [0] [1] [2] ────────────────── newest
Partition 2: [0] [1] [2] [3] [4] ───────────── newest
Each partition is its own independent log stored on one broker's disk
Within a partition: records are strictly ordered by offset
Across partitions: no global ordering guarantee
Producer key determines partition: hash(key) % num_partitions
Same key → same partition → ordering guaranteed for that key
Partition count is a high-stakes, low-flexibility decision. Increasing partitions later changes the key-to-partition mapping, breaking ordering guarantees for migrated keys. Most teams provision for expected peak consumer count plus 20% headroom. Under-partitioning is far more common than over-partitioning in production systems.
The partition key rule (hash(key) % num_partitions) is how Kafka implements per-entity ordering. If your use case requires "process all events for user X in sequence" — a financial ledger, a user activity stream, a sensor telemetry feed — the partition key is not optional. Choosing the wrong key loses the ordering guarantee silently, with no error.
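The routing rule can be sketched directly. Here zlib.crc32 stands in for Kafka's actual murmur2 hash, but the modulo step — and the reason repartitioning breaks per-key ordering — is the same:

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Deterministic hash (crc32 as a stand-in for Kafka's murmur2),
    # then modulo: the same key always lands on the same partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

# Same key -> same partition -> per-key ordering holds.
assert partition_for("user_123", 3) == partition_for("user_123", 3)

# Changing the partition count changes the mapping for many keys —
# this is why increasing partitions later silently breaks ordering.
moved = sum(
    partition_for(f"user_{i}", 3) != partition_for(f"user_{i}", 4)
    for i in range(1000)
)
print(f"{moved} of 1000 keys would change partition if count went 3 -> 4")
```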
Partitions
Design consequence: partition count = your future peak parallelism ceiling. Plan for 18 months from now, not today.
Demo: Create Topic
docker compose up -d zookeeper kafka-broker
docker exec -it kafka-broker kafka-topics \
--bootstrap-server localhost:9092 --list
docker exec -it kafka-broker kafka-topics \
--bootstrap-server localhost:9092 \
--create --topic orders \
--partitions 3 --replication-factor 1
Key Terms — The Log
Topic — a named, ordered, immutable log; Kafka's primary abstraction
Partition — a sub-log of a topic; the unit of parallelism; stored on one broker
Offset — the integer index of a record within a partition; how consumers track position
Record — key + value + timestamp; the unit of data in Kafka
Producer — writes records to a topic; routes via key hash or round-robin
Demo: Create Topic
docker exec -it kafka-broker kafka-topics \
--bootstrap-server localhost:9092 \
--describe --topic orders
In the --describe output:
Leader: 0 — Broker 0 handles all reads/writes for this partition
Replicas: 0 — only one broker in dev; production shows multiple
Isr: 0 — in-sync replicas (should equal Replicas in a healthy cluster)
Protecting the Log
Replication, ISR, and producer durability
Replication
The log is useless if a broker failure destroys it. Replication is the answer.
Kafka's replication model:
Each partition has 1 leader + (RF − 1) followers on different brokers
All reads and writes go through the leader
Followers copy the leader's log continuously
ISR (In-Sync Replicas): followers within a configurable lag threshold
What failure looks like (RF = 3):
Partition 0:
Broker 0 — Leader
Broker 1 — Follower (ISR)
Broker 2 — Follower (ISR)
Broker 0 fails:
ZooKeeper/KRaft detects (~5 s)
Broker 1 or 2 elected new leader
No data lost; cluster resumes
RF=3 survives one broker failure with no interruption.
ISR membership is dynamic. If a follower falls behind (network issue, slow disk), it is removed from the ISR. If min.insync.replicas=2 and only 1 ISR replica remains, Kafka refuses new writes — trading availability for durability. This is a deliberate design choice: Kafka prefers to stop accepting writes rather than risk data loss.
Producer Durability
Replication exists — but the producer controls how much of it to wait for before proceeding.
acks     | Acknowledged by          | Risk                                   | Throughput
acks=0   | Nobody — fire and forget | Data loss on any failure               | Maximum
acks=1   | Leader only              | Loss if leader dies before replication | High
acks=all | All ISR replicas         | None within ISR                        | Lower
Rule of thumb: acks=all for billing and audit logs; acks=1 for metrics and clickstream.
Common misconception: acks=all is not a synchronous disk flush. Kafka's durability relies on distributing replicas across different racks and AZs — not on fsync. A simultaneous power failure on all ISR brokers would still lose data; that scenario is practically eliminated by rack placement.
The producer's acks setting is a client-side decision — per-producer, not per-topic. A billing service can use acks=all; a metrics pipeline can use acks=1. In practice, acks=all with min.insync.replicas=2 and RF=3 is the production-safe default for any data you cannot afford to lose.
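As a sketch with the kafka-python client (assuming a broker at localhost:9092, matching the demo setup), acks is a per-producer constructor argument:

```python
from kafka import KafkaProducer

# Billing-grade durability: wait for the full ISR before acknowledging.
# Pair with min.insync.replicas=2 and RF=3 on the broker side.
billing_producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks="all",          # full ISR acknowledgment
    retries=5,           # retry transient broker errors
)

# Metrics-grade throughput: leader acknowledgment only.
metrics_producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    acks=1,              # leader only; loss possible if leader dies
)
```

Two producers in the same process can use different settings — durability is chosen per data stream, not per cluster.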
Key Terms — Protecting the Log
Replication factor (RF) — how many broker copies each partition maintains; RF=3 is the production standard
Leader — the single broker handling all reads and writes for a partition
Follower — a broker replica that copies the leader's log; used for failover
ISR (In-Sync Replicas) — replicas confirmed as up-to-date with the leader; only ISR members are eligible for leader election
acks — producer durability setting: 0 = fire-and-forget, 1 = leader only, all = full ISR acknowledgment
Reading the Log
Offsets, consumers, and the fan-out model
The Offset
The offset is how consumers navigate the log without destroying it.
Offset mechanics:
An offset is the integer index of a record within one partition
Consumers commit their last-read offset; on restart, they resume from that point
Stored in the internal __consumer_offsets topic
Replay is free: reset offset to 0 and reprocess the entire history
Offset vs. queue:
Queue (RabbitMQ / SQS):
Consume → message deleted
Two consumers compete
Replay: impossible
Kafka:
Consume → record stays
Each consumer group tracks own offset
Replay: just reset the offset
A new service deployed months later can replay from day one — no backfill needed.
The ability to add a new downstream consumer and give it access to historical events is one of Kafka's most underrated features. A team deploying a new ML model in month 6 can train on all events from month 1. The data was always there; Kafka never deleted it. This is the log abstraction paying dividends beyond real-time processing.
Consumer Groups
A consumer group lets multiple processes share the work of reading a topic.
Topic "orders" — 3 partitions
Consumer Group "billing-service":
Partition 0 ── Consumer A
Partition 1 ── Consumer B
Partition 2 ── Consumer C
Each partition assigned to at most one consumer in the group at a time
If consumers > partitions: extra consumers are idle
Consumer crashes → Kafka rebalances (~10 s) — survivor inherits its partitions
Hard limit: partition count = maximum effective parallelism for a single group
Plan partition count for your future peak consumer count, not today's.
Rebalancing is automatic but not free. During a rebalance, all consumers in the group pause while Kafka redistributes partition assignments. This is acceptable for most workloads but becomes a design constraint in auto-scaling environments where consumers spin up and down frequently — each spin-up triggers a group-wide pause.
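A toy round-robin assignment (Kafka's real assignors are more sophisticated, but the one-consumer-per-partition rule is the same) makes the idle-consumer case concrete:

```python
def assign(partitions: list[int], consumers: list[str]) -> dict[str, list[int]]:
    # Each partition goes to exactly one consumer; extras get nothing.
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment

# 3 partitions, 3 consumers: one partition each.
print(assign([0, 1, 2], ["A", "B", "C"]))

# 3 partitions, 4 consumers: consumer D is idle — partition count
# is the parallelism ceiling for a single group.
print(assign([0, 1, 2], ["A", "B", "C", "D"]))
```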
Fan-Out
Multiple consumer groups tap the same topic independently — each with its own offset cursor.
Topic "clickstream"
Group "real-time-dashboard" → offset 1,247 (fast — real-time UI)
Group "ml-trainer" → offset 892 (slow — large batch retraining)
Group "fraud-detector" → offset 1,251 (fastest — per-event alerts)
A slow group never blocks a fast group. Each advances at its own pace.
Adding a new consumer group requires:
Zero changes to producers
Zero changes to existing groups
Zero schema changes to the topic
Compare to PostgreSQL: a multi-consumer tracking table, polling queries, and cross-team coordination.
Fan-out enables event-driven microservices: one team publishes events; any number of teams subscribe without coordination. The dashboard, ML, and fraud teams each own their consumer group and can be deployed, scaled, and restarted independently. This decoupling is the architectural reason Kafka appears in every major microservices stack.
The fan-out model changes how you design distributed systems. Instead of point-to-point integrations, you publish events once and let any service subscribe. New downstream services require zero upstream coordination. This is why LinkedIn built Kafka in 2011 — they needed a single event backbone that every internal service could tap without coupling to each other.
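The independent cursors can be sketched as one shared list plus an offset per group (a toy model, not the client API):

```python
# Toy fan-out: one shared log, one cursor per consumer group.
log = ["click-1", "click-2", "click-3", "click-4"]
cursors = {"real-time-dashboard": 0, "ml-trainer": 0}

def poll(group: str, max_records: int):
    start = cursors[group]
    batch = log[start:start + max_records]
    cursors[group] = start + len(batch)   # each group advances independently
    return batch

poll("real-time-dashboard", 4)                          # fast group reads everything
assert poll("ml-trainer", 2) == ["click-1", "click-2"]  # slow group is unaffected
assert cursors == {"real-time-dashboard": 4, "ml-trainer": 2}
```

The slow group never blocks the fast one because the only shared state is the immutable log itself.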
Key Terms — Reading the Log
Consumer — reads records from a topic starting at a given offset; consuming does not delete records
Consumer group — a named set of consumers sharing partitions; each partition goes to exactly one member
Offset commit — recording the consumer's current position so it can resume after restart or crash
Rebalance — Kafka's process of reassigning partitions when group membership changes; briefly pauses consumption
Fan-out — multiple consumer groups reading the same topic independently at their own pace
When Readers Crash
Delivery semantics and the offset commit contract
Offset Management
The gap between "processed" and "committed" is where failures hide.
Auto-commit (enable.auto.commit=true)
Commits offset every 5 seconds automatically
Simple — zero application code
Crash after commit, before processing → message skipped → data loss
Crash after processing, before commit → message redelivered → duplicate
Which failure you get depends on when the crash happens
Manual commit (enable.auto.commit=false)
Application calls consumer.commit() after confirming successful processing
Deterministic: if you don't commit, you'll see the message again
Enables batch commits: process 100 messages, then commit once
Combined with idempotent downstream writes → safe for most production use cases
Session 2: you'll simulate a crash and watch this in action.
Auto-commit's failure modes depend on crash timing, making them unpredictable in production. Manual commit converts this uncertainty into a deterministic guarantee: at-least-once delivery. The correct pattern: process message → write result to an idempotent store (INSERT ... ON CONFLICT DO NOTHING with event_id) → commit offset. A reprocessed duplicate event has no effect on the final state.
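A toy simulation of the commit-after-processing contract (plain Python, no broker) shows how a crash between processing and commit produces a duplicate on restart — exactly the at-least-once behavior described above:

```python
# Toy at-least-once consumer: commit only after processing.
log = ["evt-a", "evt-b", "evt-c"]
committed = 0                 # last committed offset
processed = {}                # counts deliveries per event id

def run(crash_before_commit_at=None):
    global committed
    offset = committed        # resume from the last commit, like a restart
    while offset < len(log):
        event = log[offset]
        processed[event] = processed.get(event, 0) + 1   # process
        if offset == crash_before_commit_at:
            return            # crash after processing, before commit
        committed = offset + 1                           # commit
        offset += 1

run(crash_before_commit_at=1)   # crash while handling evt-b
run()                           # restart: evt-b is redelivered
assert processed["evt-b"] == 2  # duplicate delivery — never a lost event
assert committed == 3
```

With an idempotent downstream write, the duplicate delivery of evt-b has no effect on the final state.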
Delivery Semantics
Semantic      | Mechanism                                 | Risk                           | Best For
At-most-once  | Commit before processing                  | Data loss on crash             | Non-critical metrics, logs
At-least-once | Commit after processing                   | Duplicate messages             | Most pipelines — use idempotent handlers
Exactly-once  | Kafka transactions + idempotent producers | Highest complexity and latency | Financial transactions, billing
Practical reality:
At-least-once + idempotent handler (upsert by event_id) covers ~80% of production use cases
Exactly-once adds ~20% latency overhead; Kafka Transactions only cover the Kafka-to-Kafka hop
External systems (PostgreSQL, MongoDB) still require idempotency regardless of Kafka's EOS setting
"Exactly-once" in Kafka is technically possible via the Kafka Transactions API — an atomic multi-partition write and consumer offset commit. But this covers only the Kafka-to-Kafka hop. If your consumer writes to PostgreSQL, you're back to needing idempotency at the database layer. True end-to-end exactly-once remains an application design problem, not a Kafka configuration option.
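The idempotent-handler pattern can be demonstrated with SQLite's upsert syntax; the same INSERT ... ON CONFLICT DO NOTHING shape works in PostgreSQL:

```python
import sqlite3

# Idempotent downstream write: a redelivered duplicate has no effect.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE charges (event_id TEXT PRIMARY KEY, amount REAL)")

def handle(event_id: str, amount: float):
    # ON CONFLICT DO NOTHING: applying the same event twice is a no-op.
    db.execute(
        "INSERT INTO charges VALUES (?, ?) ON CONFLICT(event_id) DO NOTHING",
        (event_id, amount),
    )

handle("evt-42", 19.99)
handle("evt-42", 19.99)       # at-least-once redelivery of the same event
count, = db.execute("SELECT COUNT(*) FROM charges").fetchone()
assert count == 1             # exactly one charge despite two deliveries
```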
Key Terms — Delivery Semantics
At-most-once — commit before processing; fast but messages can be lost on crash
At-least-once — commit after processing; safe default; requires idempotent downstream writes to handle duplicates
Exactly-once semantics (EOS) — Kafka Transactions API; atomic multi-partition commit; high complexity; covers only the Kafka hop
Idempotent handler — a downstream write that produces the same result regardless of how many times it's applied; the practical solution to at-least-once duplicates
Design Activity
Apply the full log model to one scenario
Design Exercise
Work in pairs — 5 minutes.
A ride-share app processes GPS location updates from 10,000 active drivers, each sending 1 update/second (10,000 events/sec). Two services consume this stream:
Map display: shows live driver positions to passengers — real-time; tolerates occasional duplicates
Billing: calculates fare from distance traveled — must not miss events; duplicates create double-charges
Design the Kafka topology. Give an answer and a reason for each:
Topic name?
Partition count? (plan for peak consumer count of each service)
Partition key? (what ordering guarantee do you need?)
Replication factor? (this is production data)
Delivery semantic for billing? For map display?
Solution
Design Item          | Answer                           | Why
Topic name           | driver_location_updates          | Clear, domain-specific name
Partition count      | 20                               | Headroom for scaling; hard limit on group parallelism
Partition key        | driver_id                        | Keeps all updates for one driver in order on one partition
Replication factor   | 3                                | Survives single-broker failure; production standard
Billing semantic     | At-least-once + idempotent write | Never miss a trip segment; dedupe by event_id
Map display semantic | At-most-once or at-least-once    | UI tolerates an occasional skipped or duplicate position
Key trade-off: driver_id distributes load by active driver count — hotspots unlikely. Key = city_id would collapse all NYC drivers into one partition, destroying per-driver ordering and creating a hotspot.
Key Takeaways
Everything today follows from one design decision: Kafka is an immutable, distributed log.
The log:
Batch fails when latency must be ≤ seconds — streaming is the only answer
A topic is an ordered, append-only, durable record of every event
Partitions scale the log; same key → same partition → per-key ordering
Protecting the log:
Replication across brokers; ISR = replicas fully caught up with the leader
acks=all + RF=3 → no data loss on a single broker failure
Reading the log:
Offset is the read head; consuming a record does not delete it
Consumer groups assign each partition to one consumer — partition count = parallelism ceiling
Multiple groups tap the same log independently at their own pace
Crash recovery:
Auto-commit → unpredictable delivery depending on crash timing
Manual commit → at-least-once (the production default)
Exactly-once → available but expensive; at-least-once + idempotency achieves the same outcome
Session 2: build the Python producer and consumer; simulate failures; watch rebalancing live.
What's Missing?
Kafka stores and delivers the log — but it wasn't built for everything
The Gaps
No stream computations — Kafka stores raw events; windowed averages, running totals, and stream-stream joins require a processing layer on top — Kafka has no engine to compute them
No random access — you cannot efficiently query "all events for user_123 in the last hour" without scanning an entire partition; Kafka is optimized for sequential consumption, not arbitrary lookups
No stateful processing — Kafka delivers the events; tracking running totals, session windows, or anomaly scores across events requires an external stateful processor
Exactly-once end-to-end is hard — Kafka Transactions cover only the Kafka hop; true end-to-end correctness across external systems still requires idempotency at the application layer
What Comes Next
Gap                                                 | Solution                                                  | When
Stream aggregations and windowed joins              | Spark Structured Streaming (readStream.format("kafka"))   | Week 13
Stateful operators — running totals, session windows| Spark Structured Streaming — groupBy, window, watermark   | Week 13
Fault-tolerant streaming state                      | Spark checkpointing — restart without replaying the world | Week 13
Kafka is the right tool for durable, high-throughput event ingestion and fan-out — but the raw log needs a processing layer to become answers.
4 minutes. Ask: "What partition key did you use for your Cassandra schema?" Transition: all four systems handle data at rest — today we add velocity.
Ask: "Which failure causes more real-world damage — a delayed recommendation or a delayed fraud alert?" The fraud case makes the cost visceral.
This slide is the pivot. The log isn't a Kafka feature — it's the solution shape that Kafka realizes. Everything else today follows from: how do we build and protect a distributed log?
Draw on the whiteboard: a tape that only grows to the right. Each consumer group has its own read head. The tape doesn't shrink when a head moves forward. This analogy will carry through the whole session.
Draw this. The key insight: ordering is per-partition, not per-topic. "user_123" events always land in the same partition — they stay in order. Two different users may interleave across partitions, but each user's history is intact.
5–7 minutes. Run this live. Single-broker dev: Leader = 0 for every partition. In a 3-broker cluster, leaders spread automatically — show students what that would look like.
This matters: acks=all means "all ISR replicas received the record in memory" — not "written to disk." Kafka bets that correlated hardware failure across racks is so unlikely that distribution provides practical durability.
"Idle consumer" is important. More consumers ≠ more throughput beyond the partition count. Emphasize: to use N consumers effectively, you need N partitions.
Fan-out is what makes Kafka the backbone of event-driven microservices. Each team owns its consumer group. New services subscribe without touching existing ones.
Ask: "For a billing system — which failure is worse: charging twice or not charging at all?" Both are bad. At-least-once + idempotent upsert is the practical answer for ~80% of use cases.
"Why not always exactly-once?" → it's a distributed transaction. Latency + complexity is non-trivial. At-least-once + idempotency achieves the same outcome with much less complexity.
5 minutes pair work, 3 minutes debrief. Expected: key=driver_id; ~20 partitions; RF=3; billing=at-least-once + idempotent fare calc; map display=at-most-once or at-least-once both fine.
Debrief focus: key=city_id → hotspots + no per-driver ordering. 3 partitions → future scaling blocked without repartitioning, which breaks ordering.