Kafka: The Commit Log

CS 6500 — Week 12, Session 1

CS 6500 — Big Data Analytics | Week 12

The 9 AM Problem

"A customer's credit card is compromised at 9 AM. Your fraud detection system runs as a nightly batch job at midnight. How many transactions does a thief complete before your system knows anything is wrong?"

CS 6500 — Big Data Analytics | Week 12

The Answer: A Log

Every system you've built so far — MapReduce, Spark, MongoDB, Cassandra — answers questions about data at rest. Fraud detection, live pricing, and real-time alerts require data in motion.

Today's session builds the answer piece by piece:

  1. Why batch fails for latency-sensitive use cases
  2. What the ideal event system looks like — and why it's a log
  3. How Kafka partitions, replicates, and protects that log
  4. How consumers read without destroying it
  5. What happens when readers crash

After today you can:

  • Explain Kafka's architecture in terms of the log abstraction
  • Choose partition count, replication factor, and delivery semantic for a given scenario
  • Read a topic description from the CLI and know what every field means

By the end of this session, the 9 AM problem has a concrete answer.

CS 6500 — Big Data Analytics | Week 12

Week 11 Recap

Setting the stage for velocity

CS 6500 — Big Data Analytics | Week 12

Data at Rest

Every system we've built operates on stored data — files, documents, rows.

The stack so far:

System             What it stores
HDFS / MapReduce   Batch files at scale
Spark              In-memory batch analytics
MongoDB            Flexible documents
Cassandra          High-write, query-first rows

All four share one constraint: data must stop moving before a pipeline can begin.

The missing dimension:

The 3 Vs of big data include Velocity — data arriving faster than a batch window can capture. None of these systems handle it.

Assignment 3 reminder: NoSQL Design + Query Federation is due Sunday — submit before streaming territory begins.

Today we eliminate the batch-window constraint.

CS 6500 — Big Data Analytics | Week 12

Why Streaming?

The limits of batch and the shape of the solution

CS 6500 — Big Data Analytics | Week 12

Batch Fails

Every MapReduce and Spark job runs on a closed window of data:

Batch timeline:

  1. Events happen all day
  2. Pipeline triggers at midnight
  3. Results available at 1 AM
  4. Data is 0–24 hours stale

Fine for:

  • Monthly billing summaries
  • Weekly recommendation updates
  • Overnight ETL to a warehouse

Fatal for:

Use Case             Cost of a 24-hour delay
Fraud detection      Card used 100× before alert
Ride-share pricing   Price is already wrong
IoT safety alert     Equipment already failed
Stock trading        Price already moved

Batch processing accumulates data over a window, then processes it once the window closes. The entire window must be complete before the pipeline can begin. For fraud detection, a thief has the full batch interval — potentially 24 hours — to exploit a compromised card before any alert fires. The only solution is to process each event the moment it arrives.

Every production system has a latency question: "How stale can our data be?" For fraud detection, alerting, and dynamic pricing, the answer is "milliseconds" — which rules out every batch architecture you've learned so far. Understanding why streaming exists makes you a better architect when this question lands on your desk.

CS 6500 — Big Data Analytics | Week 12

The Ideal System

If you were designing from scratch to solve the 9 AM problem, you'd want:

  • Accepts events the instant they arrive — no batch window to wait for
  • Remembers every event in order — any service can replay what happened
  • Lets any number of services tap independently — fraud detection and billing and analytics simultaneously
  • Survives broker failures — durability is non-negotiable for financial data
  • Scales to millions of events per second — LinkedIn, Uber, Netflix operate at this scale

That architecture is a log. An immutable, ordered, append-only sequence of records that any consumer can read from any point in history — without disturbing other readers.

Apache Kafka implements this architecture, and every design decision in it follows from defending the log.

"The log" here is not an application log file — it's the computer science data structure: a totally-ordered, append-only sequence. Jay Kreps (one of Kafka's creators) wrote "The Log: What every software engineer should know about real-time data's unifying abstraction" — this insight is the foundation of Kafka's design. Every concept today (partitions, replication, offsets, consumer groups) is an answer to the question: how do you build and protect a distributed log?

Key Terms — Streaming Fundamentals

  • Data in motion — events processed as they arrive, before landing in persistent storage
  • Data at rest — events processed after being stored; batch window must close before processing begins
  • Distributed commit log — an append-only, ordered, replicated sequence of records; Kafka's core abstraction
  • Retention window — how long Kafka retains records regardless of whether consumers have read them
CS 6500 — Big Data Analytics | Week 12

The Log

What Kafka's architecture actually is

CS 6500 — Big Data Analytics | Week 12

Topics

A topic is Kafka's implementation of the log.

A topic is:

  • A named, ordered, immutable sequence of records
  • Append-only — you never update or delete record 42; you only append record 43
  • Retained for a configurable window (default: 7 days) regardless of consumption
  • Each record has a key, value (bytes), and timestamp

Kafka is not a queue.

Queue (RabbitMQ / SQS):
  Consume record → record is deleted
  Log shrinks

Kafka (log model):
  Consume record → record stays
  Log only grows

Any consumer can read from any offset, any time. Replaying history is free.

The "read head on a tape" analogy captures the model. A consumer at offset 500 has processed records 0–499. On crash and restart, it resumes at 500. The broker never pushes records — consumers always poll, so each consumer controls its own pace and backpressure naturally. This is what makes adding new consumers safe: the original records are always there.
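
The tape-and-read-head model can be sketched in a few lines of Python. This is a toy invented for this slide (class and method names are ours): real Kafka stores partitions on disk and commits offsets to the __consumer_offsets topic.

```python
# Toy log: an append-only list plus an independent read cursor
# (offset) per consumer group. Consuming never deletes anything.

class MiniLog:
    def __init__(self):
        self.records = []          # the log only grows
        self.offsets = {}          # group name -> next offset to read

    def append(self, record):
        self.records.append(record)
        return len(self.records) - 1   # offset of the new record

    def poll(self, group):
        """Read the next record for a group; the record stays in the log."""
        pos = self.offsets.get(group, 0)
        if pos >= len(self.records):
            return None                # nothing new yet
        self.offsets[group] = pos + 1  # advance this group's read head
        return self.records[pos]

    def seek(self, group, offset):
        self.offsets[group] = offset   # replay is just a cursor reset

log = MiniLog()
for e in ["order-1", "order-2", "order-3"]:
    log.append(e)

print(log.poll("billing"))   # order-1
print(log.poll("billing"))   # order-2
log.seek("billing", 0)       # crash-and-replay: rewind the cursor
print(log.poll("billing"))   # order-1 again; the log never shrank
```

A second group ("analytics", say) starting later would poll from offset 0 and see the full history — nothing the billing group did affects it.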

CS 6500 — Big Data Analytics | Week 12

Partitions

A single log would bottleneck at one machine. Partitions scale the log across the cluster.

Topic "orders"  (3 partitions)

Partition 0:  [0] [1] [2] [3] ──────────────────▶ newest
Partition 1:  [0] [1] [2]    ──────────────────▶ newest
Partition 2:  [0] [1] [2] [3] [4] ─────────────▶ newest
  • Each partition is its own independent log stored on one broker's disk
  • Within a partition: records are strictly ordered by offset
  • Across partitions: no global ordering guarantee
  • Producer key determines partition: hash(key) % num_partitions
  • Same key → same partition → ordering guaranteed for that key

Partition count is a high-stakes, low-flexibility decision. Increasing partitions later changes the key-to-partition mapping, breaking ordering guarantees for migrated keys. Most teams provision for expected peak consumer count plus 20% headroom. Under-partitioning is far more common than over-partitioning in production systems.

The partition key rule (hash(key) % num_partitions) is how Kafka implements per-entity ordering. If your use case requires "process all events for user X in sequence" — a financial ledger, a user activity stream, a sensor telemetry feed — the partition key is not optional. Choosing the wrong key loses the ordering guarantee silently, with no error.
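
The routing rule can be sketched directly. One caveat: zlib.crc32 stands in here for Kafka's actual default partitioner (murmur2 on the key bytes); the hash-then-modulo structure is what matters.

```python
# Sketch of key -> partition routing: hash(key) % num_partitions.
# zlib.crc32 is a stdlib stand-in for Kafka's murmur2 hash.
import zlib

NUM_PARTITIONS = 3

def partition_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Same key always lands on the same partition -> per-key ordering holds.
assert partition_for("user_123") == partition_for("user_123")

# Different keys may spread across partitions (no global ordering).
for k in ["user_123", "user_456", "user_789"]:
    print(k, "->", partition_for(k))
```

This also shows why increasing NUM_PARTITIONS later is dangerous: the modulo changes, so existing keys remap to different partitions and their ordering history is split across two logs.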

CS 6500 — Big Data Analytics | Week 12

Partitions

Design consequence: partition count = your future peak parallelism ceiling. Plan for 18 months from now, not today.


CS 6500 — Big Data Analytics | Week 12

Demo: Create Topic

# Start Kafka + ZooKeeper
docker compose up -d zookeeper kafka-broker

# Verify Kafka is healthy
docker exec -it kafka-broker kafka-topics \
  --bootstrap-server localhost:9092 --list

# Create topic: 3 partitions, replication factor 1 (dev single-broker)
docker exec -it kafka-broker kafka-topics \
  --bootstrap-server localhost:9092 \
  --create --topic orders \
  --partitions 3 --replication-factor 1


Key Terms — The Log

  • Topic — a named, ordered, immutable log; Kafka's primary abstraction
  • Partition — a sub-log of a topic; the unit of parallelism; stored on one broker
  • Offset — the integer index of a record within a partition; how consumers track position
  • Record — key + value + timestamp; the unit of data in Kafka
  • Producer — writes records to a topic; routes via key hash or round-robin
CS 6500 — Big Data Analytics | Week 12

Demo: Describe Topic

# Describe: shows Leader, Replicas, Isr for each partition
docker exec -it kafka-broker kafka-topics \
  --bootstrap-server localhost:9092 \
  --describe --topic orders

In the --describe output:

  • Leader: 0 — Broker 0 handles all reads/writes for this partition
  • Replicas: 0 — only one broker in dev; production shows multiple
  • Isr: 0 — in-sync replicas (should equal Replicas in a healthy cluster)

CS 6500 — Big Data Analytics | Week 12

Protecting the Log

Replication, ISR, and producer durability

CS 6500 — Big Data Analytics | Week 12

Replication

The log is useless if a broker failure destroys it. Replication is the answer.

Kafka's replication model:

  • Each partition has 1 leader + (RF − 1) followers on different brokers
  • All reads and writes go through the leader
  • Followers copy the leader's log continuously
  • ISR (In-Sync Replicas): followers within a configurable lag threshold

What failure looks like (RF = 3):

Partition 0:
  Broker 0 — Leader
  Broker 1 — Follower (ISR)
  Broker 2 — Follower (ISR)

Broker 0 fails:
  ZooKeeper/KRaft detects (~5 s)
  Broker 1 or 2 elected new leader
  No data lost; cluster resumes

RF=3 survives one broker failure with no interruption.

ISR membership is dynamic. If a follower falls behind (network issue, slow disk), it is removed from the ISR. If min.insync.replicas=2 and only 1 ISR replica remains, Kafka refuses new writes — trading availability for durability. This is a deliberate design choice: Kafka prefers to stop accepting writes rather than risk data loss.
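
The availability-for-durability trade can be sketched as a toy predicate. The function name and simplified logic are ours, not Kafka internals; the point is that acks=all writes are refused once the ISR shrinks below the configured minimum.

```python
# Toy model of the min.insync.replicas write gate.

MIN_INSYNC_REPLICAS = 2

def accept_write(isr: set, acks: str) -> bool:
    if acks == "all":
        return len(isr) >= MIN_INSYNC_REPLICAS
    return True  # acks=0 and acks=1 don't consult the ISR size

assert accept_write({"broker-0", "broker-1", "broker-2"}, "all")
assert accept_write({"broker-0", "broker-1"}, "all")   # exactly the minimum
assert not accept_write({"broker-0"}, "all")           # durability over availability
assert accept_write({"broker-0"}, "1")                 # acks=1 still succeeds
```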

CS 6500 — Big Data Analytics | Week 12

Producer Durability

Replication exists — but the producer controls how much of it to wait for before proceeding.

acks      Acknowledged by            Risk                                     Throughput
acks=0    Nobody — fire and forget   Data loss on any failure                 Maximum
acks=1    Leader only                Loss if leader dies before replication   High
acks=all  All ISR replicas           None within ISR                          Lower

Rule of thumb: acks=all for billing and audit logs; acks=1 for metrics and clickstream.

Common misconception: acks=all is not a synchronous disk flush. Kafka's durability relies on distributing replicas across different racks and AZs — not on fsync. A simultaneous power failure on all ISR brokers would still lose data; that scenario is practically eliminated by rack placement.

The producer's acks setting is a client-side decision — per-producer, not per-topic. A billing service can use acks=all; a metrics pipeline can use acks=1. In practice, acks=all with min.insync.replicas=2 and RF=3 is the production-safe default for any data you cannot afford to lose.
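
A toy model of what each level waits for makes the table concrete. Broker names are invented; real acknowledgments happen inside Kafka's replication protocol, not in client code.

```python
# Sketch: which replicas must confirm before a send() is "successful"?

def required_acks(acks: str, leader: str, isr: set) -> set:
    if acks == "0":
        return set()        # fire and forget: wait for nobody
    if acks == "1":
        return {leader}     # the leader's log only
    if acks == "all":
        return set(isr)     # every in-sync replica
    raise ValueError(f"unknown acks setting: {acks}")

isr = {"broker-0", "broker-1", "broker-2"}
print(required_acks("0", "broker-0", isr))    # empty set
print(required_acks("1", "broker-0", isr))    # just the leader
print(required_acks("all", "broker-0", isr))  # all three ISR members
```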

Key Terms — Protecting the Log

  • Replication factor (RF) — how many broker copies each partition maintains; RF=3 is the production standard
  • Leader — the single broker handling all reads and writes for a partition
  • Follower — a broker replica that copies the leader's log; used for failover
  • ISR (In-Sync Replicas) — replicas confirmed as up-to-date with the leader; only ISR members are eligible for leader election
  • acks — producer durability setting: 0 = fire-and-forget, 1 = leader only, all = full ISR acknowledgment
CS 6500 — Big Data Analytics | Week 12

Reading the Log

Offsets, consumers, and the fan-out model

CS 6500 — Big Data Analytics | Week 12

The Offset

The offset is how consumers navigate the log without destroying it.

Offset mechanics:

  • An offset is the integer index of a record within one partition
  • Consumers commit their last-read offset; on restart, they resume from that point
  • Stored in the internal __consumer_offsets topic
  • Replay is free: reset offset to 0 and reprocess the entire history

Offset vs. queue:

Queue (RabbitMQ / SQS):
  Consume → message deleted
  Two consumers compete
  Replay: impossible

Kafka:
  Consume → record stays
  Each consumer group tracks own offset
  Replay: just reset the offset

A new service deployed months later can replay from day one — no backfill needed.

The ability to add a new downstream consumer and give it access to historical events is one of Kafka's most underrated features. A team deploying a new ML model in month 6 can train on all events from month 1. The data was always there; Kafka never deleted it. This is the log abstraction paying dividends beyond real-time processing.

CS 6500 — Big Data Analytics | Week 12

Consumer Groups

A consumer group lets multiple processes share the work of reading a topic.

Topic "orders" — 3 partitions

Consumer Group "billing-service":
  Partition 0 ──▶ Consumer A
  Partition 1 ──▶ Consumer B
  Partition 2 ──▶ Consumer C
  • Each partition assigned to at most one consumer in the group at a time
  • If consumers > partitions: extra consumers are idle
  • Consumer crashes → Kafka rebalances (~10 s) — survivor inherits its partitions
  • Hard limit: partition count = maximum effective parallelism for a single group

Plan partition count for your future peak consumer count, not today's.

Rebalancing is automatic but not free. During a rebalance, all consumers in the group pause while Kafka redistributes partition assignments. This is acceptable for most workloads but becomes a design constraint in auto-scaling environments where consumers spin up and down frequently — each spin-up triggers a group-wide pause.
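
Assignment can be sketched round-robin style to show the idle-consumer effect. This is a simplification: Kafka's real assignors are pluggable strategies (range, round-robin, sticky), but the one-consumer-per-partition invariant holds for all of them.

```python
# Toy round-robin partition assignment within one consumer group.

def assign(partitions: int, consumers: list) -> dict:
    assignment = {c: [] for c in consumers}
    for p in range(partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment

print(assign(3, ["A", "B", "C"]))        # one partition each
print(assign(3, ["A", "B", "C", "D"]))   # D gets nothing: idle consumer
print(assign(3, ["A", "B"]))             # after C "crashes", A and B share 3
```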

CS 6500 — Big Data Analytics | Week 12

Fan-Out

Multiple consumer groups tap the same topic independently — each with its own offset cursor.

Topic "clickstream"

Group "real-time-dashboard"  → offset 1,247  (fast — real-time UI)
Group "ml-trainer"           → offset   892  (slow — large batch retraining)
Group "fraud-detector"       → offset 1,251  (fastest — per-event alerts)

A slow group never blocks a fast group. Each advances at its own pace.

Adding a new consumer group requires:

  • Zero changes to producers
  • Zero changes to existing groups
  • Zero schema changes to the topic

Compare to PostgreSQL: a multi-consumer tracking table, polling queries, and cross-team coordination.

Fan-out enables event-driven microservices: one team publishes events; any number of teams subscribe without coordination. The dashboard, ML, and fraud teams each own their consumer group and can be deployed, scaled, and restarted independently. This decoupling is the architectural reason Kafka appears in every major microservices stack.

The fan-out model changes how you design distributed systems. Instead of point-to-point integrations, you publish events once and let any service subscribe. New downstream services require zero upstream coordination. This is why LinkedIn built Kafka in 2011 — they needed a single event backbone that every internal service could tap without coupling to each other.

Key Terms — Reading the Log

  • Consumer — reads records from a topic starting at a given offset; consuming does not delete records
  • Consumer group — a named set of consumers sharing partitions; each partition goes to exactly one member
  • Offset commit — recording the consumer's current position so it can resume after restart or crash
  • Rebalance — Kafka's process of reassigning partitions when group membership changes; briefly pauses consumption
  • Fan-out — multiple consumer groups reading the same topic independently at their own pace
CS 6500 — Big Data Analytics | Week 12

When Readers Crash

Delivery semantics and the offset commit contract

CS 6500 — Big Data Analytics | Week 12

Offset Management

The gap between "processed" and "committed" is where failures hide.

Auto-commit (enable.auto.commit=true)

  • Commits offset every 5 seconds automatically
  • Simple — zero application code
  • Crash after commit, before processing → message skipped → data loss
  • Crash after processing, before commit → message redelivered → duplicate
  • Which failure you get depends on when the crash happens

Manual commit (enable.auto.commit=false)

  • Application calls consumer.commit() after confirming successful processing
  • Deterministic: if you don't commit, you'll see the message again
  • Enables batch commits: process 100 messages, then commit once
  • Combined with idempotent downstream writes → safe for most production use cases

Session 2: you'll simulate a crash and watch this in action.

Auto-commit's failure modes depend on crash timing, making them unpredictable in production. Manual commit converts this uncertainty into a deterministic guarantee: at-least-once delivery. The correct pattern: process message → write result to an idempotent store (INSERT ... ON CONFLICT DO NOTHING with event_id) → commit offset. A reprocessed duplicate event has no effect on the final state.
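
The commit-after-processing contract can be simulated in a few lines. This is a toy: the committed dict stands in for __consumer_offsets, and crash_after is our invented knob for simulating a crash mid-batch.

```python
# Toy at-least-once consumer: the committed offset, not the consumer's
# memory, decides where processing restarts after a crash.

committed = {}   # (group, partition) -> next offset to read

def run_consumer(group, partition, records, crash_after=None):
    processed = []
    start = committed.get((group, partition), 0)
    for offset in range(start, len(records)):
        processed.append(records[offset])
        committed[(group, partition)] = offset + 1   # commit AFTER processing
        if crash_after is not None and len(processed) == crash_after:
            return processed                          # simulate a crash here
    return processed

records = ["evt-0", "evt-1", "evt-2", "evt-3"]
first = run_consumer("fraud", 0, records, crash_after=2)
print(first)                                  # ['evt-0', 'evt-1']
resumed = run_consumer("fraud", 0, records)   # restart resumes at offset 2
print(resumed)                                # ['evt-2', 'evt-3']
```

Move the commit line above the processing step and the same crash skips events instead of redelivering them — that single line is the difference between at-least-once and at-most-once.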

CS 6500 — Big Data Analytics | Week 12

Delivery Semantics

Semantic       Mechanism                                   Risk                          Best For
At-most-once   Commit before processing                    Data loss on crash            Non-critical metrics, logs
At-least-once  Commit after processing                     Duplicate messages            Most pipelines — use idempotent handlers
Exactly-once   Kafka transactions + idempotent producers   Highest complexity, latency   Financial transactions, billing

Practical reality:

  • At-least-once + idempotent handler (upsert by event_id) covers ~80% of production use cases
  • Exactly-once adds ~20% latency overhead; Kafka Transactions only cover the Kafka-to-Kafka hop
  • External systems (PostgreSQL, MongoDB) still require idempotency regardless of Kafka's EOS setting

"Exactly-once" in Kafka is technically possible via the Kafka Transactions API — an atomic multi-partition write and consumer offset commit. But this covers only the Kafka-to-Kafka hop. If your consumer writes to PostgreSQL, you're back to needing idempotency at the database layer. True end-to-end exactly-once remains an application design problem, not a Kafka configuration option.
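
The idempotent-store pattern can be sketched with a dict standing in for a database table keyed on event_id (illustrative only; in a real system this would be INSERT ... ON CONFLICT DO NOTHING or an equivalent upsert).

```python
# Toy idempotent handler: a redelivered duplicate has no effect,
# because writes are keyed on event_id.

store = {}   # event_id -> amount, standing in for a database table

def handle(event) -> bool:
    if event["event_id"] in store:
        return False                    # duplicate: no change to state
    store[event["event_id"]] = event["amount"]
    return True

batch = [
    {"event_id": "e1", "amount": 10},
    {"event_id": "e2", "amount": 25},
    {"event_id": "e1", "amount": 10},   # redelivered after a crash
]
for e in batch:
    handle(e)

print(sum(store.values()))   # 35: the duplicate did not double-charge
```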

Key Terms — Delivery Semantics

  • At-most-once — commit before processing; fast but messages can be lost on crash
  • At-least-once — commit after processing; safe default; requires idempotent downstream writes to handle duplicates
  • Exactly-once semantics (EOS) — Kafka Transactions API; atomic multi-partition commit; high complexity; covers only the Kafka hop
  • Idempotent handler — a downstream write that produces the same result regardless of how many times it's applied; the practical solution to at-least-once duplicates
CS 6500 — Big Data Analytics | Week 12

Design Activity

Apply the full log model to one scenario

CS 6500 — Big Data Analytics | Week 12

Design Exercise

Work in pairs — 5 minutes.

A ride-share app processes GPS location updates from 10,000 active drivers, each sending 1 update/second (10,000 events/sec). Two services consume this stream:

  • Map display: shows live driver positions to passengers — real-time; tolerates occasional duplicates
  • Billing: calculates fare from distance traveled — must not miss events; duplicates create double-charges

Design the Kafka topology. Give an answer and a reason for each:

  1. Topic name?
  2. Partition count? (plan for peak consumer count of each service)
  3. Partition key? (what ordering guarantee do you need?)
  4. Replication factor? (this is production data)
  5. Delivery semantic for billing? For map display?
CS 6500 — Big Data Analytics | Week 12

Solution

Design Item            Answer                             Why
Topic name             driver_location_updates            Clear, domain-specific name
Partition count        20                                 Headroom for scaling; hard limit on group parallelism
Partition key          driver_id                          Keeps all updates for one driver in order on one partition
Replication factor     3                                  Survives single-broker failure; production standard
Billing semantic       At-least-once + idempotent write   Never miss a trip segment; dedupe by event_id
Map display semantic   At-most-once or at-least-once      UI tolerates an occasional skipped or duplicate position

Key trade-off: driver_id distributes load by active driver count — hotspots unlikely. Key = city_id would collapse all NYC drivers into one partition, destroying per-driver ordering and creating a hotspot.
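
The hotspot claim can be checked with a quick simulation, using the driver counts from the exercise. zlib.crc32 again stands in for Kafka's murmur2 partitioner, so the exact spread is illustrative.

```python
# Key choice -> load distribution: per-driver keys spread load,
# a city-level key funnels every record into one partition.
import zlib
from collections import Counter

NUM_PARTITIONS = 20

def partition_for(key: str) -> int:
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

drivers = [f"driver-{i}" for i in range(10_000)]

by_driver = Counter(partition_for(d) for d in drivers)
by_city = Counter(partition_for("nyc") for _ in drivers)

# Expect all (or nearly all) 20 partitions to carry load with driver_id.
print(len(by_driver), "partitions used with key=driver_id")
print(len(by_city), "partition used with key=city_id,",
      max(by_city.values()), "records on it")
```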

CS 6500 — Big Data Analytics | Week 12

Key Takeaways

Everything today follows from one design decision: Kafka is an immutable, distributed log.

The log:

  • Batch fails when latency must be ≤ seconds — streaming is the only answer
  • A topic is an ordered, append-only, durable record of every event
  • Partitions scale the log; same key → same partition → per-key ordering

Protecting the log:

  • Replication across brokers; ISR = replicas fully caught up with the leader
  • acks=all + RF=3 → no data loss on a single broker failure

Reading the log:

  • Offset is the read head; consuming a record does not delete it
  • Consumer groups assign each partition to one consumer — partition count = parallelism ceiling
  • Multiple groups tap the same log independently at their own pace

Crash recovery:

  • Auto-commit → unpredictable delivery depending on crash timing
  • Manual commit → at-least-once (the production default)
  • Exactly-once → available but expensive; at-least-once + idempotency achieves the same outcome

Session 2: build the Python producer and consumer; simulate failures; watch rebalancing live.

CS 6500 — Big Data Analytics | Week 12

What's Missing?

Kafka stores and delivers the log — but it wasn't built for everything

CS 6500 — Big Data Analytics | Week 12

The Gaps

  • No stream computations — Kafka stores raw events; windowed averages, running totals, and stream-stream joins require a processing layer on top — Kafka has no engine to compute them
  • No random access — you cannot efficiently query "all events for user_123 in the last hour" without scanning an entire partition; Kafka is optimized for sequential consumption, not arbitrary lookups
  • No stateful processing — Kafka delivers the events; tracking running totals, session windows, or anomaly scores across events requires an external stateful processor
  • Exactly-once end-to-end is hard — Kafka Transactions cover only the Kafka hop; true end-to-end correctness across external systems still requires idempotency at the application layer
CS 6500 — Big Data Analytics | Week 12

What Comes Next

Gap                                                    Solution                                                    When
Stream aggregations and windowed joins                 Spark Structured Streaming (readStream.format("kafka"))     Week 13
Stateful operators — running totals, session windows   Spark Structured Streaming: groupBy, window, watermark      Week 13
Fault-tolerant streaming state                         Spark checkpointing — restart without replaying the world   Week 13

Kafka is the right tool for durable, high-throughput event ingestion and fan-out — but the raw log needs a processing layer to become answers.

CS 6500 — Big Data Analytics | Week 12

4 minutes. Ask: "What partition key did you use for your Cassandra schema?" Transition: all four systems handle data at rest — today we add velocity.

Ask: "Which failure causes more real-world damage — a delayed recommendation or a delayed fraud alert?" The fraud case makes the cost visceral.

This slide is the pivot. The log isn't a Kafka feature — it's the solution shape that Kafka realizes. Everything else today follows from: how do we build and protect a distributed log?

Draw on the whiteboard: a tape that only grows to the right. Each consumer group has its own read head. The tape doesn't shrink when a head moves forward. This analogy will carry through the whole session.

Draw this. The key insight: ordering is per-partition, not per-topic. "user_123" events always land in the same partition — they stay in order. Two different users may interleave across partitions, but each user's history is intact.


5–7 minutes. Run this live. Single-broker dev: Leader = 0 for every partition. In a 3-broker cluster, leaders spread automatically — show students what that would look like.


This matters: acks=all means "all ISR replicas received the record in memory" — not "written to disk." Kafka bets that correlated hardware failure across racks is so unlikely that distribution provides practical durability.

"Idle consumer" is important. More consumers ≠ more throughput beyond the partition count. Emphasize: to use N consumers effectively, you need N partitions.

Fan-out is what makes Kafka the backbone of event-driven microservices. Each team owns its consumer group. New services subscribe without touching existing ones.

Ask: "For a billing system — which failure is worse: charging twice or not charging at all?" Both are bad. At-least-once + idempotent upsert is the practical answer for ~80% of use cases.

"Why not always exactly-once?" → it's a distributed transaction. Latency + complexity is non-trivial. At-least-once + idempotency achieves the same outcome with much less complexity.

5 minutes pair work, 3 minutes debrief. Expected: key=driver_id; ~20 partitions; RF=3; billing=at-least-once + idempotent fare calc; map display=at-most-once or at-least-once both fine.

Debrief focus: key=city_id → hotspots + no per-driver ordering. 3 partitions → future scaling blocked without repartitioning, which breaks ordering.