Kafka: The Commit Log

CS 6500 — Week 12, Session 1

CS 6500 — Big Data Analytics | Week 12

The 9 AM Problem

"A customer's credit card is compromised at 9 AM. Your fraud detection system runs as a nightly batch job at midnight. How many transactions does a thief complete before your system knows anything is wrong?"

CS 6500 — Big Data Analytics | Week 12

The Answer: A Log

Every system you've built so far — MapReduce, Spark, MongoDB, Cassandra — answers questions about data at rest. Fraud detection, live pricing, and real-time alerts require data in motion.

Today's session builds the answer piece by piece:

  1. Why batch fails for latency-sensitive use cases
  2. What the ideal event system looks like — and why it's a log
  3. How Kafka partitions, replicates, and protects that log
  4. How consumers read without destroying it
  5. What happens when readers crash

After today you can:

  • Explain Kafka's architecture in terms of the log abstraction
  • Choose partition count, replication factor, and delivery semantic for a given scenario
  • Read a topic description from the CLI and know what every field means

By the end of this session, the 9 AM problem has a concrete answer.

CS 6500 — Big Data Analytics | Week 12

Week 11 Recap

Setting the stage for velocity

CS 6500 — Big Data Analytics | Week 12

Data at Rest

Every system we've built operates on stored data — files, documents, rows.

The stack so far:

System             What it stores
HDFS / MapReduce   Batch files at scale
Spark              In-memory batch analytics
MongoDB            Flexible documents
Cassandra          High-write, query-first rows

All four share one constraint: data must stop moving before a pipeline can begin.

The missing dimension:

The 3 Vs of big data include Velocity — data arriving faster than a batch window can capture. None of these systems handle it.

Assignment 3 reminder: NoSQL Design + Query Federation is due Sunday — submit before streaming territory begins.

Today we eliminate the batch-window constraint.

CS 6500 — Big Data Analytics | Week 12

Why Streaming?

The limits of batch and the shape of the solution

CS 6500 — Big Data Analytics | Week 12

Batch Fails

Every MapReduce and Spark job runs on a closed window of data:

Batch timeline:

  1. Events happen all day
  2. Pipeline triggers at midnight
  3. Results available at 1 AM
  4. Data is 0–24 hours stale

Fine for:

  • Monthly billing summaries
  • Weekly recommendation updates
  • Overnight ETL to a warehouse

Fatal for:

Use Case             Cost of a 24-hour delay
Fraud detection      Card used 100× before alert
Ride-share pricing   Price is already wrong
IoT safety alert     Equipment already failed
Stock trading        Price already moved

Batch processing accumulates data over a window, then processes it once the window closes. The entire window must be complete before the pipeline can begin. For fraud detection, a thief has the full batch interval — potentially 24 hours — to exploit a compromised card before any alert fires. The only solution is to process each event the moment it arrives.

Every production system has a latency question: "How stale can our data be?" For fraud detection, alerting, and dynamic pricing, the answer is "milliseconds" — which rules out every batch architecture you've learned so far. Understanding why streaming exists makes you a better architect when this question lands on your desk.

CS 6500 — Big Data Analytics | Week 12

The Ideal System

If you were designing from scratch to solve the 9 AM problem, you'd want:

  • Accepts events the instant they arrive — no batch window to wait for
  • Remembers every event in order — any service can replay what happened
  • Lets any number of services tap independently — fraud detection and billing and analytics simultaneously
  • Survives broker failures — durability is non-negotiable for financial data
  • Scales to millions of events per second — LinkedIn, Uber, Netflix operate at this scale

That architecture is a log. An immutable, ordered, append-only sequence of records that any consumer can read from any point in history — without disturbing other readers.

Apache Kafka implements this architecture, and every design decision in it follows from defending the log.

"The log" here is not an application log file — it's the computer science data structure: a totally-ordered, append-only sequence. Jay Kreps (one of Kafka's creators) wrote "The Log: What every software engineer should know about real-time data's unifying abstraction" — this insight is the foundation of Kafka's design. Every concept today (partitions, replication, offsets, consumer groups) is an answer to the question: how do you build and protect a distributed log?

Key Terms — Streaming Fundamentals

  • Data in motion — events processed as they arrive, before landing in persistent storage
  • Data at rest — events processed after being stored; batch window must close before processing begins
  • Distributed commit log — an append-only, ordered, replicated sequence of records; Kafka's core abstraction
  • Retention window — how long Kafka retains records regardless of whether consumers have read them
CS 6500 — Big Data Analytics | Week 12

The Log

What Kafka's architecture actually is

CS 6500 — Big Data Analytics | Week 12

Topics

A topic is Kafka's implementation of the log.

A topic is:

  • A named, ordered, immutable sequence of records
  • Append-only — you never update or delete record 42; you only append record 43
  • Retained for a configurable window (default: 7 days) regardless of consumption
  • Each record has a key, value (bytes), and timestamp

Kafka is not a queue.

Queue (RabbitMQ / SQS):
  Consume record → record is deleted
  Log shrinks

Kafka (log model):
  Consume record → record stays
  Log only grows

Any consumer can read from any offset, any time. Replaying history is free.

The "read head on a tape" analogy captures the model. A consumer at offset 500 has processed records 0–499. On crash and restart, it resumes at 500. The broker never pushes records — consumers always poll, so each consumer controls its own pace and backpressure naturally. This is what makes adding new consumers safe: the original records are always there.
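
The tape-and-read-head model can be sketched in a few lines of Python. This is a toy invented for this slide (class and method names are ours): real Kafka stores partitions on disk and commits offsets to the __consumer_offsets topic.

```python
# Toy log: an append-only list plus an independent read cursor
# (offset) per consumer group. Consuming never deletes anything.

class MiniLog:
    def __init__(self):
        self.records = []          # the log only grows
        self.offsets = {}          # group name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record

    def poll(self, group):
        """Read the next record for a group; the record stays in the log."""
        pos = self.offsets.get(group, 0)
        if pos >= len(self.records):
            return None                # nothing new yet
        self.offsets[group] = pos + 1  # advance this group's read head
        return self.records[pos]

    def seek(self, group, offset):
        self.offsets[group] = offset   # replay is just a cursor reset

log = MiniLog()
for e in ["order-1", "order-2", "order-3"]:
    log.append(e)

print(log.poll("billing"))   # order-1
print(log.poll("billing"))   # order-2
log.seek("billing", 0)       # crash-and-replay: rewind the cursor
print(log.poll("billing"))   # order-1 again; the log never shrank
```

A second group ("analytics", say) starting later would poll from offset 0 and see the full history — nothing the billing group did affects it.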

CS 6500 — Big Data Analytics | Week 12

Partitions

A single log would bottleneck at one machine. Partitions scale the log across the cluster.

Topic "orders"  (3 partitions)

Partition 0:  [0] [1] [2] [3] ──────────────────▶ newest
Partition 1:  [0] [1] [2]    ──────────────────▶ newest
Partition 2:  [0] [1] [2] [3] [4] ─────────────▶ newest
  • Each partition is its own independent log stored on one broker's disk
  • Within a partition: records are strictly ordered by offset
  • Across partitions: no global ordering guarantee
  • Producer key determines partition: hash(key) % num_partitions
  • Same key → same partition → ordering guaranteed for that key

Partition count is a high-stakes, low-flexibility decision. Increasing partitions later changes the key-to-partition mapping, breaking ordering guarantees for migrated keys. Most teams provision for expected peak consumer count plus 20% headroom. Under-partitioning is far more common than over-partitioning in production systems.

The partition key rule (hash(key) % num_partitions) is how Kafka implements per-entity ordering. If your use case requires "process all events for user X in sequence" — a financial ledger, a user activity stream, a sensor telemetry feed — the partition key is not optional. Choosing the wrong key loses the ordering guarantee silently, with no error.
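
The routing rule can be sketched directly. One caveat: zlib.crc32 stands in here for Kafka's actual default partitioner (murmur2 on the key bytes); the hash-then-modulo structure is what matters.

```python
# Sketch of key -> partition routing: hash(key) % num_partitions.
# zlib.crc32 is a stdlib stand-in for Kafka's murmur2 hash.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Same key always lands on the same partition -> per-key ordering holds.
assert partition_for("user_123") == partition_for("user_123")

# Different keys may spread across partitions (no global ordering).
for k in ["user_123", "user_456", "user_789"]:
    print(k, "->", partition_for(k))
```

This also shows why increasing NUM_PARTITIONS later is dangerous: the modulo changes, so existing keys remap to different partitions and their ordering history is split across two logs.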

CS 6500 — Big Data Analytics | Week 12

Partitions

Design consequence: partition count = your future peak parallelism ceiling. Plan for 18 months from now, not today.


CS 6500 — Big Data Analytics | Week 12

Demo: Create Topic

# Start Kafka + ZooKeeper
docker compose up -d zookeeper kafka-broker

# Verify Kafka is healthy
docker exec -it kafka-broker kafka-topics \
  --bootstrap-server localhost:9092 --list

# Create topic: 3 partitions, replication factor 1 (dev single-broker)
docker exec -it kafka-broker kafka-topics \
  --bootstrap-server localhost:9092 \
  --create --topic orders \
  --partitions 3 --replication-factor 1


Key Terms — The Log

  • Topic — a named, ordered, immutable log; Kafka's primary abstraction
  • Partition — a sub-log of a topic; the unit of parallelism; stored on one broker
  • Offset — the integer index of a record within a partition; how consumers track position
  • Record — key + value + timestamp; the unit of data in Kafka
  • Producer — writes records to a topic; routes via key hash or round-robin
CS 6500 — Big Data Analytics | Week 12

Demo: Describe Topic

# Describe: shows Leader, Replicas, Isr for each partition
docker exec -it kafka-broker kafka-topics \
  --bootstrap-server localhost:9092 \
  --describe --topic orders

In the --describe output:

  • Leader: 0 — Broker 0 handles all reads/writes for this partition
  • Replicas: 0 — only one broker in dev; production shows multiple
  • Isr: 0 — in-sync replicas (should equal Replicas in a healthy cluster)

CS 6500 — Big Data Analytics | Week 12

Protecting the Log

Replication, ISR, and producer durability

CS 6500 — Big Data Analytics | Week 12

Replication

The log is useless if a broker failure destroys it. Replication is the answer.

Kafka's replication model:

  • Each partition has 1 leader + (RF − 1) followers on different brokers
  • All reads and writes go through the leader
  • Followers copy the leader's log continuously
  • ISR (In-Sync Replicas): followers within a configurable lag threshold

What failure looks like (RF = 3):

Partition 0:
  Broker 0 — Leader
  Broker 1 — Follower (ISR)
  Broker 2 — Follower (ISR)

Broker 0 fails:
  ZooKeeper/KRaft detects (~5 s)
  Broker 1 or 2 elected new leader
  No data lost; cluster resumes

RF=3 survives one broker failure with no interruption.

ISR membership is dynamic. If a follower falls behind (network issue, slow disk), it is removed from the ISR. If min.insync.replicas=2 and only 1 ISR replica remains, Kafka refuses new writes — trading availability for durability. This is a deliberate design choice: Kafka prefers to stop accepting writes rather than risk data loss.
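
The availability-for-durability trade can be sketched as a toy predicate. The function name and simplified logic are ours, not Kafka internals; the point is that acks=all writes are refused once the ISR shrinks below the configured minimum.

```python
# Toy model of the min.insync.replicas write gate.

MIN_INSYNC_REPLICAS = 2

def accept_write(isr: set, acks: str) -> bool:
    if acks == "all":
        return len(isr) >= MIN_INSYNC_REPLICAS
    return True  # acks=0 and acks=1 don't consult the ISR size

assert accept_write({"broker-0", "broker-1", "broker-2"}, "all")
assert accept_write({"broker-0", "broker-1"}, "all")   # exactly the minimum
assert not accept_write({"broker-0"}, "all")           # durability over availability
assert accept_write({"broker-0"}, "1")                 # acks=1 still succeeds
```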

CS 6500 — Big Data Analytics | Week 12

Producer Durability

Replication exists — but the producer controls how much of it to wait for before proceeding.

acks      Acknowledged by            Risk                                     Throughput
acks=0    Nobody — fire and forget   Data loss on any failure                 Maximum
acks=1    Leader only                Loss if leader dies before replication   High
acks=all  All ISR replicas           None within ISR                          Lower

Rule of thumb: acks=all for billing and audit logs; acks=1 for metrics and clickstream.

Common misconception: acks=all is not a synchronous disk flush. Kafka's durability relies on distributing replicas across different racks and AZs — not on fsync. A simultaneous power failure on all ISR brokers would still lose data; that scenario is practically eliminated by rack placement.

The producer's acks setting is a client-side decision — per-producer, not per-topic. A billing service can use acks=all; a metrics pipeline can use acks=1. In practice, acks=all with min.insync.replicas=2 and RF=3 is the production-safe default for any data you cannot afford to lose.
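
A toy model of what each level waits for makes the table concrete. Broker names are invented; real acknowledgments happen inside Kafka's replication protocol, not in client code.

```python
# Sketch: which replicas must confirm before a send() is "successful"?

def required_acks(acks: str, leader: str, isr: set) -> set:
    if acks == "0":
        return set()        # fire and forget: wait for nobody
    if acks == "1":
        return {leader}     # the leader's log only
    if acks == "all":
        return set(isr)     # every in-sync replica
    raise ValueError(f"unknown acks setting: {acks}")

isr = {"broker-0", "broker-1", "broker-2"}
print(required_acks("0", "broker-0", isr))    # empty set
print(required_acks("1", "broker-0", isr))    # just the leader
print(required_acks("all", "broker-0", isr))  # all three ISR members
```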

Key Terms — Protecting the Log

  • Replication factor (RF) — how many broker copies each partition maintains; RF=3 is the production standard
  • Leader — the single broker handling all reads and writes for a partition
  • Follower — a broker replica that copies the leader's log; used for failover
  • ISR (In-Sync Replicas) — replicas confirmed as up-to-date with the leader; only ISR members are eligible for leader election
  • acks — producer durability setting: 0 = fire-and-forget, 1 = leader only, all = full ISR acknowledgment
CS 6500 — Big Data Analytics | Week 12

Reading the Log

Offsets, consumers, and the fan-out model

CS 6500 — Big Data Analytics | Week 12

The Offset

The offset is how consumers navigate the log without destroying it.

Offset mechanics:

  • An offset is the integer index of a record within one partition
  • Consumers commit their last-read offset; on restart, they resume from that point
  • Stored in the internal __consumer_offsets topic
  • Replay is free: reset offset to 0 and reprocess the entire history

Offset vs. queue:

Queue (RabbitMQ / SQS):
  Consume → message deleted
  Two consumers compete
  Replay: impossible

Kafka:
  Consume → record stays
  Each consumer group tracks own offset
  Replay: just reset the offset

A new service deployed months later can replay from day one — no backfill needed.

The ability to add a new downstream consumer and give it access to historical events is one of Kafka's most underrated features. A team deploying a new ML model in month 6 can train on all events from month 1. The data was always there; Kafka never deleted it. This is the log abstraction paying dividends beyond real-time processing.

CS 6500 — Big Data Analytics | Week 12

Consumer Groups

A consumer group lets multiple processes share the work of reading a topic.

Topic "orders" — 3 partitions

Consumer Group "billing-service":
  Partition 0 ──▶ Consumer A
  Partition 1 ──▶ Consumer B
  Partition 2 ──▶ Consumer C
  • Each partition assigned to at most one consumer in the group at a time
  • If consumers > partitions: extra consumers are idle
  • Consumer crashes → Kafka rebalances (~10 s) — survivor inherits its partitions
  • Hard limit: partition count = maximum effective parallelism for a single group

Plan partition count for your future peak consumer count, not today's.

Rebalancing is automatic but not free. During a rebalance, all consumers in the group pause while Kafka redistributes partition assignments. This is acceptable for most workloads but becomes a design constraint in auto-scaling environments where consumers spin up and down frequently — each spin-up triggers a group-wide pause.
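
Assignment can be sketched round-robin style to show the idle-consumer effect. This is a simplification: Kafka's real assignors are pluggable strategies (range, round-robin, sticky), but the one-consumer-per-partition invariant holds for all of them.

```python
# Toy round-robin partition assignment within one consumer group.

def assign(partitions: int, consumers: list) -> dict:
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

print(assign(3, ["A", "B", "C"]))        # one partition each
print(assign(3, ["A", "B", "C", "D"]))   # D gets nothing: idle consumer
print(assign(3, ["A", "B"]))             # after C "crashes", A and B share 3
```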

CS 6500 — Big Data Analytics | Week 12

Fan-Out

Multiple consumer groups tap the same topic independently — each with its own offset cursor.

Topic "clickstream"

Group "real-time-dashboard"  → offset 1,247  (fast — real-time UI)
Group "ml-trainer"           → offset   892  (slow — large batch retraining)
Group "fraud-detector"       → offset 1,251  (fastest — per-event alerts)

A slow group never blocks a fast group. Each advances at its own pace.

Adding a new consumer group requires:

  • Zero changes to producers
  • Zero changes to existing groups
  • Zero schema changes to the topic

Compare to PostgreSQL: a multi-consumer tracking table, polling queries, and cross-team coordination.

Fan-out enables event-driven microservices: one team publishes events; any number of teams subscribe without coordination. The dashboard, ML, and fraud teams each own their consumer group and can be deployed, scaled, and restarted independently. This decoupling is the architectural reason Kafka appears in every major microservices stack.

The fan-out model changes how you design distributed systems. Instead of point-to-point integrations, you publish events once and let any service subscribe. New downstream services require zero upstream coordination. This is why LinkedIn built Kafka in 2011 — they needed a single event backbone that every internal service could tap without coupling to each other.

Key Terms — Reading the Log

  • Consumer — reads records from a topic starting at a given offset; consuming does not delete records
  • Consumer group — a named set of consumers sharing partitions; each partition goes to exactly one member
  • Offset commit — recording the consumer's current position so it can resume after restart or crash
  • Rebalance — Kafka's process of reassigning partitions when group membership changes; briefly pauses consumption
  • Fan-out — multiple consumer groups reading the same topic independently at their own pace
CS 6500 — Big Data Analytics | Week 12

When Readers Crash

Delivery semantics and the offset commit contract

CS 6500 — Big Data Analytics | Week 12

Offset Management

The gap between "processed" and "committed" is where failures hide.

Auto-commit (enable.auto.commit=true)

  • Commits offset every 5 seconds automatically
  • Simple — zero application code
  • Crash after commit, before processing → message skipped → data loss
  • Crash after processing, before commit → message redelivered → duplicate
  • Which failure you get depends on when the crash happens

Manual commit (enable.auto.commit=false)

  • Application calls consumer.commit() after confirming successful processing
  • Deterministic: if you don't commit, you'll see the message again
  • Enables batch commits: process 100 messages, then commit once
  • Combined with idempotent downstream writes → safe for most production use cases

Session 2: you'll simulate a crash and watch this in action.

Auto-commit's failure modes depend on crash timing, making them unpredictable in production. Manual commit converts this uncertainty into a deterministic guarantee: at-least-once delivery. The correct pattern: process message → write result to an idempotent store (INSERT ... ON CONFLICT DO NOTHING with event_id) → commit offset. A reprocessed duplicate event has no effect on the final state.
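
The commit-after-processing contract can be simulated in a few lines. This is a toy: the committed dict stands in for __consumer_offsets, and crash_after is our invented knob for simulating a crash mid-batch.

```python
# Toy at-least-once consumer: the committed offset, not the consumer's
# memory, decides where processing restarts after a crash.

committed = {}   # (group, partition) -> next offset to read

def run_consumer(group, partition, records, crash_after=None):
    processed = []
    start = committed.get((group, partition), 0)
    for offset in range(start, len(records)):
        processed.append(records[offset])
        committed[(group, partition)] = offset + 1   # commit AFTER processing
        if crash_after is not None and len(processed) == crash_after:
            return processed                          # simulate a crash here
    return processed

records = ["evt-0", "evt-1", "evt-2", "evt-3"]
first = run_consumer("fraud", 0, records, crash_after=2)
print(first)                                  # ['evt-0', 'evt-1']
resumed = run_consumer("fraud", 0, records)   # restart resumes at offset 2
print(resumed)                                # ['evt-2', 'evt-3']
```

Move the commit line above the processing step and the same crash skips events instead of redelivering them — that single line is the difference between at-least-once and at-most-once.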

CS 6500 — Big Data Analytics | Week 12

Delivery Semantics

Semantic       Mechanism                                   Risk                          Best For
At-most-once   Commit before processing                    Data loss on crash            Non-critical metrics, logs
At-least-once  Commit after processing                     Duplicate messages            Most pipelines — use idempotent handlers
Exactly-once   Kafka transactions + idempotent producers   Highest complexity, latency   Financial transactions, billing

Practical reality:

  • At-least-once + idempotent handler (upsert by event_id) covers ~80% of production use cases
  • Exactly-once adds ~20% latency overhead; Kafka Transactions only cover the Kafka-to-Kafka hop
  • External systems (PostgreSQL, MongoDB) still require idempotency regardless of Kafka's EOS setting

"Exactly-once" in Kafka is technically possible via the Kafka Transactions API — an atomic multi-partition write and consumer offset commit. But this covers only the Kafka-to-Kafka hop. If your consumer writes to PostgreSQL, you're back to needing idempotency at the database layer. True end-to-end exactly-once remains an application design problem, not a Kafka configuration option.
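
The idempotent-store pattern can be sketched with a dict standing in for a database table keyed on event_id (illustrative only; in a real system this would be INSERT ... ON CONFLICT DO NOTHING or an equivalent upsert).

```python
# Toy idempotent handler: a redelivered duplicate has no effect,
# because writes are keyed on event_id.

store = {}   # event_id -> amount, standing in for a database table

def handle(event) -> bool:
    if event["event_id"] in store:
        return False                    # duplicate: no change to state
    store[event["event_id"]] = event["amount"]
    return True

batch = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 25},
    {"event_id": "e1", "amount": 10},   # redelivered after a crash
]
for e in batch:
    handle(e)

print(sum(store.values()))   # 35: the duplicate did not double-charge
```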

Key Terms — Delivery Semantics

  • At-most-once — commit before processing; fast but messages can be lost on crash
  • At-least-once — commit after processing; safe default; requires idempotent downstream writes to handle duplicates
  • Exactly-once semantics (EOS) — Kafka Transactions API; atomic multi-partition commit; high complexity; covers only the Kafka hop
  • Idempotent handler — a downstream write that produces the same result regardless of how many times it's applied; the practical solution to at-least-once duplicates
CS 6500 — Big Data Analytics | Week 12

Design Activity

Apply the full log model to one scenario

CS 6500 — Big Data Analytics | Week 12

Design Exercise

Work in pairs — 5 minutes.

A ride-share app processes GPS location updates from 10,000 active drivers, each sending 1 update/second (10,000 events/sec). Two services consume this stream:

  • Map display: shows live driver positions to passengers — real-time; tolerates occasional duplicates
  • Billing: calculates fare from distance traveled — must not miss events; duplicates create double-charges

Design the Kafka topology. Give an answer and a reason for each:

  1. Topic name?
  2. Partition count? (plan for peak consumer count of each service)
  3. Partition key? (what ordering guarantee do you need?)
  4. Replication factor? (this is production data)
  5. Delivery semantic for billing? For map display?
CS 6500 — Big Data Analytics | Week 12

Solution

Design Item            Answer                             Why
Topic name             driver_location_updates            Clear, domain-specific name
Partition count        20                                 Headroom for scaling; hard limit on group parallelism
Partition key          driver_id                          Keeps all updates for one driver in order on one partition
Replication factor     3                                  Survives single-broker failure; production standard
Billing semantic       At-least-once + idempotent write   Never miss a trip segment; dedupe by event_id
Map display semantic   At-most-once or at-least-once      UI tolerates an occasional skipped or duplicate position

Key trade-off: driver_id distributes load by active driver count — hotspots unlikely. Key = city_id would collapse all NYC drivers into one partition, destroying per-driver ordering and creating a hotspot.
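
The hotspot claim can be checked with a quick simulation, using the driver counts from the exercise. zlib.crc32 again stands in for Kafka's murmur2 partitioner, so the exact spread is illustrative.

```python
# Key choice -> load distribution: per-driver keys spread load,
# a city-level key funnels every record into one partition.
import zlib
from collections import Counter

NUM_PARTITIONS = 20

def partition_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

drivers = [f"driver-{i}" for i in range(10_000)]

by_driver = Counter(partition_for(d) for d in drivers)
by_city = Counter(partition_for("nyc") for _ in drivers)

# Expect all (or nearly all) 20 partitions to carry load with driver_id.
print(len(by_driver), "partitions used with key=driver_id")
print(len(by_city), "partition used with key=city_id,",
      max(by_city.values()), "records on it")
```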

CS 6500 — Big Data Analytics | Week 12

Key Takeaways

Everything today follows from one design decision: Kafka is an immutable, distributed log.

The log:

  • Batch fails when latency must be ≤ seconds — streaming is the only answer
  • A topic is an ordered, append-only, durable record of every event
  • Partitions scale the log; same key → same partition → per-key ordering

Protecting the log:

  • Replication across brokers; ISR = replicas fully caught up with the leader
  • acks=all + RF=3 → no data loss on a single broker failure

Reading the log:

  • Offset is the read head; consuming a record does not delete it
  • Consumer groups assign each partition to one consumer — partition count = parallelism ceiling
  • Multiple groups tap the same log independently at their own pace

Crash recovery:

  • Auto-commit → unpredictable delivery depending on crash timing
  • Manual commit → at-least-once (the production default)
  • Exactly-once → available but expensive; at-least-once + idempotency achieves the same outcome

Session 2: build the Python producer and consumer; simulate failures; watch rebalancing live.

CS 6500 — Big Data Analytics | Week 12

What's Missing?

Kafka stores and delivers the log — but it wasn't built for everything

CS 6500 — Big Data Analytics | Week 12

The Gaps

  • No stream computations — Kafka stores raw events; windowed averages, running totals, and stream-stream joins require a processing layer on top — Kafka has no engine to compute them
  • No random access — you cannot efficiently query "all events for user_123 in the last hour" without scanning an entire partition; Kafka is optimized for sequential consumption, not arbitrary lookups
  • No stateful processing — Kafka delivers the events; tracking running totals, session windows, or anomaly scores across events requires an external stateful processor
  • Exactly-once end-to-end is hard — Kafka Transactions cover only the Kafka hop; true end-to-end correctness across external systems still requires idempotency at the application layer
CS 6500 — Big Data Analytics | Week 12

What Comes Next

Gap                                                    Solution                                                    When
Stream aggregations and windowed joins                 Spark Structured Streaming (readStream.format("kafka"))     Week 13
Stateful operators — running totals, session windows   Spark Structured Streaming: groupBy, window, watermark      Week 13
Fault-tolerant streaming state                         Spark checkpointing — restart without replaying the world   Week 13

Kafka is the right tool for durable, high-throughput event ingestion and fan-out — but the raw log needs a processing layer to become answers.

CS 6500 — Big Data Analytics | Week 12

4 minutes. Ask: "What partition key did you use for your Cassandra schema?" Transition: all four systems handle data at rest — today we add velocity.

Ask: "Which failure causes more real-world damage — a delayed recommendation or a delayed fraud alert?" The fraud case makes the cost visceral.

This slide is the pivot. The log isn't a Kafka feature — it's the solution shape that Kafka realizes. Everything else today follows from: how do we build and protect a distributed log?

Draw on the whiteboard: a tape that only grows to the right. Each consumer group has its own read head. The tape doesn't shrink when a head moves forward. This analogy will carry through the whole session.

Draw this. The key insight: ordering is per-partition, not per-topic. "user_123" events always land in the same partition — they stay in order. Two different users may interleave across partitions, but each user's history is intact.


5–7 minutes. Run this live. Single-broker dev: Leader = 0 for every partition. In a 3-broker cluster, leaders spread automatically — show students what that would look like.


This matters: acks=all means "all ISR replicas received the record in memory" — not "written to disk." Kafka bets that correlated hardware failure across racks is so unlikely that distribution provides practical durability.

"Idle consumer" is important. More consumers ≠ more throughput beyond the partition count. Emphasize: to use N consumers effectively, you need N partitions.

Fan-out is what makes Kafka the backbone of event-driven microservices. Each team owns its consumer group. New services subscribe without touching existing ones.

Ask: "For a billing system — which failure is worse: charging twice or not charging at all?" Both are bad. At-least-once + idempotent upsert is the practical answer for ~80% of use cases.

"Why not always exactly-once?" → it's a distributed transaction. Latency + complexity is non-trivial. At-least-once + idempotency achieves the same outcome with much less complexity.

5 minutes pair work, 3 minutes debrief. Expected: key=driver_id; ~20 partitions; RF=3; billing=at-least-once + idempotent fare calc; map display=at-most-once or at-least-once both fine.

Debrief focus: key=city_id → hotspots + no per-driver ordering. 3 partitions → future scaling blocked without repartitioning, which breaks ordering.