Flink Stateful Apps

CS 6500 — Week 13, Session 2

CS 6500 — Big Data Analytics | Week 13

Crash and Recover

"Your fraud pipeline has processed 900 million transactions. The cluster crashes mid-job and restarts — claiming exactly-once semantics. No double-blocked cards. No missed fraudulent charges. How is that mathematically possible across distributed workers?"

CS 6500 — Big Data Analytics | Week 13

Checkpoints Deliver

Flink takes periodic consistent snapshots of all operator state and Kafka offsets simultaneously. On recovery, the entire system rewinds to the last successful snapshot — as if the crash never happened.

Today's Topics

  • Keyed state: ValueState, ListState, MapState
  • ProcessFunction: custom logic + timers
  • Tumbling, sliding, and session windows
  • Checkpointing and exactly-once semantics

Today's Lab

  • Fraud detection with per-user ValueState
  • Transaction velocity alerting with timers
  • Kafka source + Flink exactly-once
  • Checkpoint recovery demonstration

Everything runs in Docker against the Kafka + Flink stack from Session 1.

CS 6500 — Big Data Analytics | Week 13

Environment Check

# Verify Flink cluster and Kafka are running
docker-compose ps
curl http://localhost:8081/overview

# Verify the transactions topic exists
docker exec -it kafka kafka-topics.sh --list \
  --bootstrap-server localhost:9092

# If missing, create it
docker exec -it kafka kafka-topics.sh --create \
  --topic transactions --partitions 3 --replication-factor 1 \
  --bootstrap-server localhost:9092

Start the transaction producer:

docker exec -it flink-jobmanager \
  python /datasets/transactions/produce_transactions.py
CS 6500 — Big Data Analytics | Week 13

Stateful Operators

Maintaining per-key context across events

CS 6500 — Big Data Analytics | Week 13

What Is State?

Stateless processing: each record is transformed independently — no memory of past records.

Stateful processing: the output depends on the current record and previously seen records.
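A minimal sketch of the contrast — here stream is assumed to be a parsed transaction stream, and FraudDetector is the stateful operator built later in the lab:

# Stateless: each record is transformed on its own — no memory of past records
taxed = stream.map(lambda tx: {**tx, "amount": tx["amount"] * 1.08})

# Stateful: the operator remembers earlier records for the same key
flagged = stream.key_by(lambda tx: tx["user_id"]).process(FraudDetector())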

Use Case                          State Needed
Count transactions per user       Integer count per user_id
Running average spend             (sum, count) per user_id
Detect 3 failures in 60 seconds   List of recent failure timestamps
Session-level revenue             Map of item → quantity in session

Flink state is always keyed — partitioned by key, local to the subtask that owns that key.

State in Flink is stored in a configurable backend. The default is memory (heap), which is fast but limited. RocksDB is the production choice for large state — it stores state on disk with an in-memory write buffer.
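A minimal configuration sketch of the RocksDB option, assuming the unified backend API of Flink ≥ 1.13:

from pyflink.datastream import StreamExecutionEnvironment, EmbeddedRocksDBStateBackend

env = StreamExecutionEnvironment.get_execution_environment()
# Working state lives on local disk with an in-memory write buffer
env.set_state_backend(EmbeddedRocksDBStateBackend())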

CS 6500 — Big Data Analytics | Week 13

Keyed State Types

Type              Stores                         Use When
ValueState[T]     A single value per key         Counters, flags, last-seen values
ListState[T]      A list of values per key       Collecting events before aggregating
MapState[K, V]    A map per key                  Session item counts, feature vectors
ReducingState[T]  Pre-aggregated value per key   Running sum, min, max

# Declare state in open() — Flink restores it from checkpoint on recovery
from pyflink.datastream.functions import KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor
from pyflink.common.typeinfo import Types

class FraudDetector(KeyedProcessFunction):
    def open(self, runtime_context):
        desc = ValueStateDescriptor("tx_count", Types.INT())
        self.tx_count = runtime_context.get_state(desc)

State descriptors are registered in open(), not in __init__. This allows Flink to restore the state from the last checkpoint before the first process_element call, giving seamless recovery semantics.
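For context, a minimal sketch (not in the lab template) of how the restored state is then read and written once events arrive — value() returns None the first time a key is seen:

    def process_element(self, value, ctx):
        # Read the per-key count (None on first event), increment, write back
        seen = self.tx_count.value() or 0
        self.tx_count.update(seen + 1)
        yield value["user_id"], seen + 1   # emit running count per user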

CS 6500 — Big Data Analytics | Week 13

ProcessFunction

The low-level API: state access + event-time timers + flexible output.

from pyflink.datastream.functions import KeyedProcessFunction
from pyflink.datastream.state import ValueStateDescriptor
from pyflink.common.typeinfo import Types

class VelocityAlert(KeyedProcessFunction):
    """Alert if a user makes > 5 transactions in 60 seconds."""

    def open(self, runtime_context):
        desc = ValueStateDescriptor("count_in_window", Types.INT())
        self.count = runtime_context.get_state(desc)

open() runs once per subtask on startup — and again on checkpoint recovery, restoring state automatically before the first process_element call.

CS 6500 — Big Data Analytics | Week 13

Timer and Alert

    def process_element(self, value, ctx):
        cur = self.count.value() or 0
        cur += 1
        self.count.update(cur)
        if cur == 1:
            # Set ONE timer 60 s from the first tx in this window
            ctx.timer_service().register_event_time_timer(
                ctx.timestamp() + 60_000)
        if cur > 5:
            yield f"ALERT: user {value['user_id']} hit {cur} tx in 60s"

    def on_timer(self, timestamp, ctx):
        self.count.clear()   # window expired — reset for next window
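
Wiring the function into a job — a hedged sketch, where keyed_stream stands in for the key_by(user_id) stream built in the lab:

alerts = keyed_stream.process(VelocityAlert(), output_type=Types.STRING())
alerts.print()   # ALERT lines appear on TaskManager stdout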
CS 6500 — Big Data Analytics | Week 13

Windows in Flink

Grouping a stream into finite, processable chunks

CS 6500 — Big Data Analytics | Week 13

Window Types

Tumbling

stream \
  .key_by(lambda x: x["user_id"]) \
  .window(TumblingEventTimeWindows
          .of(Time.minutes(5))) \
  .aggregate(CountAggregate())

Fixed, non-overlapping 5-minute buckets.

Sliding

.window(SlidingEventTimeWindows
        .of(Time.minutes(10),
            Time.minutes(5)))

10-min window, advances every 5 min.

Session

.window(EventTimeSessionWindows
        .with_gap(Time.minutes(10)))

Window closes after 10 min of inactivity.

Global (no time boundary)

.count_window(100)
# fire every 100 events

Fires when N events accumulate.
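These snippets assume the window helpers are imported; a sketch, with CountAggregate as a hypothetical count-per-window helper (not defined on the slide):

from pyflink.common.time import Time
from pyflink.datastream.window import (
    TumblingEventTimeWindows, SlidingEventTimeWindows, EventTimeSessionWindows)
from pyflink.datastream.functions import AggregateFunction

class CountAggregate(AggregateFunction):
    """Counts elements per window (hypothetical helper assumed above)."""
    def create_accumulator(self):
        return 0
    def add(self, value, accumulator):
        return accumulator + 1
    def get_result(self, accumulator):
        return accumulator
    def merge(self, acc_a, acc_b):
        return acc_a + acc_b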

Flink's window API is nearly identical in concept to Spark Structured Streaming's windows (covered in Week 14). The key difference: Flink fires window results immediately when the watermark passes the window end, with sub-millisecond latency rather than Spark's micro-batch interval.

CS 6500 — Big Data Analytics | Week 13

Exactly-Once Kafka

The checkpoint protocol explained

CS 6500 — Big Data Analytics | Week 13

Checkpoint Protocol

Flink's fault tolerance uses consistent distributed snapshots:

  1. JobManager injects a barrier into each source partition
  2. Barriers flow through the job graph with the data
  3. When an operator receives a barrier on all inputs → it snapshots its state
  4. When all operators have snapshotted → checkpoint is complete
  5. JobManager records the Kafka offsets at this barrier

On recovery:

  • Restore state from the last completed checkpoint
  • Reset Kafka consumer offsets to the checkpoint's committed offset
  • Re-read events since the checkpoint — idempotent sinks ensure no duplicates

The barrier mechanism snapshots state without pausing the stream. Production systems checkpoint every 1–5 minutes with negligible latency impact.

CS 6500 — Big Data Analytics | Week 13

Checkpoint Config

from pyflink.datastream import (
    StreamExecutionEnvironment, CheckpointingMode, HashMapStateBackend)
from pyflink.datastream.checkpoint_storage import (
    FileSystemCheckpointStorage)

env = StreamExecutionEnvironment.get_execution_environment()

# Checkpoint every 60 seconds, exactly-once
env.enable_checkpointing(60_000)
env.get_checkpoint_config().set_checkpointing_mode(
    CheckpointingMode.EXACTLY_ONCE
)
env.get_checkpoint_config().set_min_pause_between_checkpoints(30_000)
env.get_checkpoint_config().set_checkpoint_timeout(120_000)

# State backend (in-memory) + durable checkpoint storage on HDFS
env.set_state_backend(HashMapStateBackend())
env.get_checkpoint_config().set_checkpoint_storage(
    FileSystemCheckpointStorage(
        "hdfs:///flink-checkpoints/fraud-detector/")
)
CS 6500 — Big Data Analytics | Week 13

State Backends

State backend options:

Backend          Storage                        Use When
Memory           JVM heap                       Dev / tiny state
FsStateBackend   Heap + HDFS snapshots          Medium state
RocksDB          Disk (incremental snapshots)   Large state (GBs+)

Since Flink 1.13 the first two are configured as HashMapStateBackend (heap) plus a checkpoint-storage choice, and the third as EmbeddedRocksDBStateBackend — the trade-offs in the table are unchanged.
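A hedged sketch of the RocksDB option with incremental checkpoints — the keyword argument mirrors the Java constructor's boolean flag and is an assumption here:

from pyflink.datastream import EmbeddedRocksDBStateBackend

# Upload only changed files at each checkpoint instead of a full snapshot
env.set_state_backend(
    EmbeddedRocksDBStateBackend(enable_incremental_checkpointing=True))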
CS 6500 — Big Data Analytics | Week 13

Fraud Detection Lab

Build a stateful Flink pipeline end to end

CS 6500 — Big Data Analytics | Week 13

Lab Setup

# In Docker container
docker exec -it flink-jobmanager bash
cd /opt/flink_lab

# Start the transaction producer (keep running)
python /datasets/transactions/produce_transactions.py &

# Transaction format:
# {"user_id": "u42", "amount": 99.95,
#  "merchant": "gas_station", "ts": "2026-03-17T09:00:01.234Z"}

Lab goals:

Activity   What You Build
1          Parse Kafka stream + count per user (ValueState)
2          Alert on 5+ transactions in 60 seconds
3          Add exactly-once checkpointing
4          Kill and restart — verify no duplicate alerts
CS 6500 — Big Data Analytics | Week 13

Lab: Read Kafka

Challenge: The transactions topic is live and filling up. Before writing any fraud logic, you need to prove the data is flowing and formatted correctly — bad assumptions here waste 30 minutes of debugging later.

Think before coding:

  • How do you connect a Flink source to a Kafka topic? What three things does the source need to know?
  • key_by(user_id) routes all records for a user to the same subtask. Why does the fraud rule require this?
  • What does success look like? How many records per second do you expect?

Open fraud_detector_template.py, uncomment the Activity 1 block, run it, and confirm transactions stream continuously before moving on.

CS 6500 — Big Data Analytics | Week 13

Activity 1 — Step 1

# fraud_detector.py — Step 1 of 3: environment
from pyflink.common import WatermarkStrategy
from pyflink.common.serialization import SimpleStringSchema
from pyflink.common.typeinfo import Types
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.datastream.connectors.kafka import (
    KafkaSource, KafkaOffsetsInitializer)
import json

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(2)  # 2 subtasks split the topic's 3 partitions

set_parallelism(2) — the two subtasks split the topic's three partitions between them; parallelism above the partition count would leave subtasks idle. The env object holds the entire job graph.

CS 6500 — Big Data Analytics | Week 13

Activity 1 — Step 2

# Step 2 of 3 — add after the env block ↑
source = (KafkaSource.builder()
    .set_bootstrap_servers("kafka:9092")  # the broker container from Session 1
    .set_topics("transactions")
    .set_group_id("fraud-detector")
    .set_starting_offsets(KafkaOffsetsInitializer.earliest())
    .set_value_only_deserializer(SimpleStringSchema())
    .build())

earliest() reads from offset 0 on first run — on restart from checkpoint, Flink ignores this and resumes from the committed offset instead.
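Other starting-offset choices the builder accepts — a hedged sketch (the timestamp value is just an example):

# Start at the newest records, ignoring history
KafkaOffsetsInitializer.latest()

# Start at the first record at or after an epoch-millisecond timestamp
KafkaOffsetsInitializer.timestamp(1710000000000)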

CS 6500 — Big Data Analytics | Week 13

Activity 1 — Step 3

# Step 3 of 3 — add after the source block ↑
stream = (env
    .from_source(source,
                 WatermarkStrategy.for_monotonous_timestamps(),
                 "Kafka transactions")
    .map(json.loads,
         # amount is a float, so MAP(STRING, STRING) can't hold it —
         # keep the mixed-type dict as a pickled Python object
         output_type=Types.PICKLED_BYTE_ARRAY())
    .key_by(lambda x: x["user_id"]))

stream.print()
env.execute("Fraud Detector v1")

key_by(user_id) routes all records for a user to the same subtask — required for per-key state. Verify transactions are printing before moving to Activity 2.

CS 6500 — Big Data Analytics | Week 13

Lab: Add Alerts

Challenge: The fraud rule is: flag any user who makes more than 5 transactions within a 60-second window. This rule depends on the history of past events — a map or filter can't do it alone.

Design before coding:

  • Which Flink operator gives you access to per-key state and a timestamp?
  • A ValueState(count) would count transactions, but couldn't expire old ones. What state type stores individual timestamps so you can prune the ones outside 60 seconds?
  • Where do you declare state, and why does it have to be in open() instead of __init__?

Implement FraudDetector in the template. Users u901 and u902 burst every 30 seconds — you should see ALERT lines within a minute of running.

CS 6500 — Big Data Analytics | Week 13

Activity 2 — Step 1

# Step 1 of 3 — class definition and state registration
from pyflink.datastream.functions import KeyedProcessFunction
from pyflink.datastream.state import ListStateDescriptor

class FraudDetector(KeyedProcessFunction):
    THRESHOLD = 5       # alert when strictly exceeding this
    WINDOW_MS = 60_000  # 60-second sliding look-back

    def open(self, runtime_context):
        self.ts_list = runtime_context.get_list_state(
            ListStateDescriptor("timestamps", Types.LONG()))

open() runs before the first event — Flink restores ts_list from the last checkpoint here, before process_element is ever called.

CS 6500 — Big Data Analytics | Week 13

Activity 2 — Step 2

    # Step 2 of 3 — add process_element with timestamp pruning
    def process_element(self, tx, ctx):
        now_ms = ctx.timer_service().current_processing_time()

        self.ts_list.add(now_ms)          # record this event

        # Discard timestamps outside the 60-second window
        cutoff = now_ms - self.WINDOW_MS
        recent = [t for t in self.ts_list.get() if t >= cutoff]
        self.ts_list.update(recent)       # overwrite with pruned list

Pruning on every event keeps state bounded — ListState does not auto-expire old entries.
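An alternative to manual pruning is Flink's state TTL, which expires ListState entries individually; a hedged sketch assuming PyFlink's StateTtlConfig API:

from pyflink.common.time import Time
from pyflink.datastream.state import StateTtlConfig, ListStateDescriptor

# Entries expire ~60 s after being written; expiry is refreshed on each write
ttl = (StateTtlConfig.new_builder(Time.minutes(1))
       .set_update_type(StateTtlConfig.UpdateType.OnCreateAndWrite)
       .build())
desc = ListStateDescriptor("timestamps", Types.LONG())
desc.enable_time_to_live(ttl)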

CS 6500 — Big Data Analytics | Week 13

Activity 2 — Step 3

        # Step 3 of 3 — alert check (inside process_element, after pruning ↑)
        if len(recent) > self.THRESHOLD:
            yield (f"ALERT user={tx['user_id']} "
                   f"count={len(recent)} "
                   f"last_amount={tx.get('amount', '?')}")

# Wire into the pipeline (replace stream.print() from Activity 1)
stream.process(
    FraudDetector(), output_type=Types.STRING()).print()
env.execute("Fraud Detector v2")

Users u901 and u902 burst every 30 s — ALERT lines should appear within a minute of running.


CS 6500 — Big Data Analytics | Week 13

Lab: Checkpointing

Challenge: Your fraud detector is running. The cluster crashes after processing 10 million events. When it restarts, does it re-read from offset 0 — re-alerting on transactions it already processed? Or does it pick up exactly where it left off?

Predict before running:

  • Without checkpointing, where does Flink's Kafka consumer start on restart?
  • What two things must Flink snapshot simultaneously to guarantee no duplicates?
  • How long is your worst-case reprocessing window if you checkpoint every 30 seconds?

Add the checkpoint block, run the job, let at least 2 checkpoints complete in the Web UI, then kill and restart. Watch the Kafka offset in the Web UI to verify it resumes mid-stream.

~15 minutes

CS 6500 — Big Data Analytics | Week 13

Activity 3 — Step 1

# Step 1 of 3 — add immediately after env = ...get_execution_environment()
env.enable_checkpointing(30_000)  # checkpoint every 30 seconds
Without this, a restart re-reads from Kafka offset 0 — every ALERT fires a second time for already-processed events.

CS 6500 — Big Data Analytics | Week 13

Activity 3 — Step 2

# Step 2 of 3 — add after enable_checkpointing ↑
from pyflink.datastream import HashMapStateBackend
from pyflink.datastream.checkpoint_storage import (
    FileSystemCheckpointStorage)

env.set_state_backend(HashMapStateBackend())
env.get_checkpoint_config().set_checkpoint_storage(
    FileSystemCheckpointStorage(
        "file:///opt/flink/checkpoints/fraud-detector/"))
env.get_checkpoint_config().set_min_pause_between_checkpoints(10_000)

Swap file:// for hdfs:// in production. HashMapStateBackend keeps working state on the JVM heap; only checkpoint snapshots go to the storage path.

CS 6500 — Big Data Analytics | Week 13

Activity 3 — Step 3

# Step 3 of 3 — let 2+ checkpoints complete, then crash and restart

# Web UI → Running Jobs → Checkpoints tab: wait for "Completed" entries

# Cancel the running job
curl -X PATCH http://localhost:8081/jobs/<job-id>?mode=cancel

# Restart
python fraud_detector.py

# Verify: Kafka consumer offset resumes mid-stream (not from 0)
# No duplicate ALERT messages for already-processed events

The Web UI's consumer offset counter is your proof of exactly-once recovery. Note: a cancelled job only resumes from its checkpoint if checkpoint retention on cancellation is enabled — killing the TaskManager container instead simulates a true crash and lets Flink's restart strategy recover automatically.

CS 6500 — Big Data Analytics | Week 13

Debrief

What did we build and what does it mean?

CS 6500 — Big Data Analytics | Week 13

Debrief Questions

Take 5 minutes — discuss with a neighbor:

  1. State backend: Your fraud model tracks a 30-day window per user — millions of users. Which state backend do you choose and why?

  2. Checkpoint frequency: You checkpoint every 60 seconds. The cluster crashes. What is the worst-case reprocessing cost?

  3. Flink vs. Spark: Your team uses PySpark. A stakeholder wants "real-time" fraud detection. What questions do you ask before deciding whether to add Flink?

  4. Exactly-once: Exactly-once with Kafka requires an idempotent sink. What happens if your downstream database doesn't support idempotent writes?

CS 6500 — Big Data Analytics | Week 13

Key Takeaways

  • Keyed state (ValueState, ListState, MapState) maintains per-key context — always declare it in open() for checkpoint-aware recovery
  • ProcessFunction provides full control: state access + event-time timers + flexible output
  • Windows — tumbling, sliding, session — fire when the watermark passes the window boundary
  • Chandy-Lamport checkpointing: barrier-based consistent snapshots enable exactly-once recovery without pausing the stream

Quick reference:

  • ValueState — single value per key, restored on checkpoint recovery
  • ProcessFunction — custom logic + timers + state
  • Checkpoint barrier — marker that triggers a consistent state snapshot across all operators
  • Exactly-once — checkpointing + idempotent sink + committed Kafka offsets
  • RocksDB — disk-backed state backend for state larger than heap memory
CS 6500 — Big Data Analytics | Week 13

What's Missing?

Apache Flink handles stateful true-streaming — but it wasn't built for everything

CS 6500 — Big Data Analytics | Week 13

Flink's Gaps

  • Micro-batch simplicity — Flink's DataStream/ProcessFunction API is powerful but complex; teams whose latency budgets are looser than ~100 ms usually prefer Spark Structured Streaming's DataFrame API (Week 14)
  • SQL-first data modeling — Flink SQL exists, but it lacks dbt's schema tests, incremental materializations, and versioned lineage documentation
  • Pipeline orchestration — Flink jobs run indefinitely but don't schedule alongside batch ETL, retry across system boundaries, or send failure alerts
  • Historical batch reprocessing — large-scale backfills over months of data are more efficient with Spark's Catalyst optimizer than re-streaming through Flink
CS 6500 — Big Data Analytics | Week 13

What Comes Next

Gap                                   Solution                     When
Simpler micro-batch streaming         Spark Structured Streaming   Week 14
SQL transforms with lineage + tests   dbt                          Week 15
Pipeline scheduling and alerting      Apache Airflow               Week 15

Apache Flink is the right tool for sub-10ms stateful streaming — but production platforms also need Spark for micro-batch (Week 14), dbt for tested batch transforms, and Airflow for orchestration (Week 15).

CS 6500 — Big Data Analytics | Week 13

Homework

Flink Stateful Streaming — due Sunday 11:59 PM

Four tasks:

  1. Velocity counter — click event stream; count per user per 5-minute tumbling window using ValueState; output: (user_id, window_end, count)
  2. Session detector — detect user sessions using EventTimeSessionWindows with 10-minute gap; output session duration in seconds
  3. Checkpoint recovery — add FsStateBackend; stop and restart the job; screenshot the Checkpoints tab before and after; explain what offset Flink resumed from
  4. Flink vs. Spark memo — 300–400 words: given a use case requiring 50ms fraud detection, justify your choice of Flink over Spark Structured Streaming

Submit: 3 Python scripts + 1 PDF with screenshots and memo

CS 6500 — Big Data Analytics | Week 13

Run this before class. Have the Web UI open at http://localhost:8081

Walk through this slowly — it's the most complex code of the week. One timer per window: registering on the FIRST tx (cur==1) means exactly one timer fires 60s later and resets the count. The lab's FraudDetector uses ListState of timestamps for a true sliding window — this single-timer pattern is simpler but counts from the first event in each window, not a rolling 60-second lookback.

~10 minutes

~20 minutes

Key observation: the Kafka consumer offset in the Web UI picks up where it left off, not from 0.

Answers: 1=RocksDB (disk-backed, handles large state); 2=60 seconds of reprocessing; 3=what latency is required? < 100ms → Flink, otherwise Spark may suffice; 4=you get at-least-once; must make the sink idempotent (upsert by id, dedup table)