Welcome. Today introduces true stream processing — the hardest topic in the course conceptually. Pace slowly through architecture; students will need Session 2's lab to cement it.
Let this hang. Ask: "How fast is 50ms? Blink takes about 150ms." Don't answer the question yet — carry it through the whole session.
Tell students the fraud scenario from slide 2 is real — PayPal runs Flink at ~1M events/sec. That's the scale we're building toward in Session 2's lab.
Quick show of hands: "Who got their Kafka producer working in last week's lab?" Use this to gauge where to slow down if needed.
Ask: "Where does your current job's data pipeline fall on this table?" Most students have only seen batch. Use this to make the spectrum feel real.
Common student reaction: "Why not always use Flink?" Answer: operational complexity. Flink requires understanding state backends, watermarks, and checkpoint tuning. Spark Structured Streaming gives 90% of the value with 30% of the complexity.
Walk through the diagram left-to-right: client submits a job → JobManager turns it into a task graph → TaskManagers execute subtasks in slots. Point out that each colored box is an operator, each row is a TaskManager.
Analogy: JobManager is the conductor, TaskManagers are section players, slots are individual musicians. Each slot can play one part of the pipeline simultaneously.
Ask: "If you have 3 TaskManagers each with 4 slots, what's your maximum parallelism?" Answer: 12. Then: "If your source Kafka topic has only 3 partitions, what's the effective source parallelism ceiling?" Answer: 3 — partitions limit source parallelism.
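If students want the arithmetic on the screen, this throwaway snippet mirrors the two questions (toy arithmetic only, not Flink code):

```python
# Toy arithmetic for the in-class questions -- not Flink API.
def max_parallelism(task_managers: int, slots_per_tm: int) -> int:
    """Total slots = upper bound on job parallelism."""
    return task_managers * slots_per_tm

def source_parallelism_ceiling(total_slots: int, kafka_partitions: int) -> int:
    """A Kafka source can't usefully run more parallel instances than partitions."""
    return min(total_slots, kafka_partitions)

print(max_parallelism(3, 4))              # 12 slots total
print(source_parallelism_ceiling(12, 3))  # 3: partitions cap the source
```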
Demo point: during Session 2's lab, open the Web UI backpressure tab while the job is running. Point out the green/yellow/red indicators. This makes an abstract concept visible.
Emphasize the lazy execution model — identical to PySpark. Nothing runs until `execute()`. Ask: "What does this remind you of from Week 8?" Students should recognize the RDD/DataFrame lazy evaluation pattern.
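If the "lazy" idea doesn't land, this pure-Python generator sketch (an analogy, not PyFlink) shows the same pattern: building the pipeline does no work; consuming it does.

```python
# Laziness sketch: like Flink's execute(), nothing runs until output is demanded.
side_effects = []

def transform(stream):
    for x in stream:
        side_effects.append(x)   # proof that execution happened
        yield x * 2

pipeline = transform(iter([1, 2, 3]))   # "job graph" built, nothing executed
print(side_effects)                     # [] -- still lazy
results = list(pipeline)                # the "execute()" moment
print(side_effects)                     # [1, 2, 3]
print(results)                          # [2, 4, 6]
```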
Ask students to match each operation to something from MapReduce or Spark they already know. map=map, flatMap=flatMap, filter=filter, keyBy=groupBy/partitionBy, reduce=reduceByKey. process has no clean Spark equivalent — that's what makes Flink unique.
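To make the mapping concrete, a pure-Python mock of the operator semantics (illustration only, not the PyFlink API) that runs the word-count pipeline students will see in the demo:

```python
# Pure-Python mock of DataStream semantics -- illustration, not PyFlink.
def flat_map(stream, fn):
    """One input -> zero or more outputs (like DataStream.flat_map)."""
    for item in stream:
        yield from fn(item)

def keyed_reduce(stream, key_fn, reduce_fn):
    """key_by + reduce: a running aggregate per key, one output per input."""
    state = {}   # per-key state, the thing keyBy gives you access to
    for item in stream:
        k = key_fn(item)
        state[k] = item if k not in state else reduce_fn(state[k], item)
        yield state[k]

lines = ["big data big data streaming"]
words = flat_map(lines, lambda line: line.split())          # flatMap
pairs = map(lambda w: (w, 1), words)                        # map
counts = list(keyed_reduce(pairs,
                           key_fn=lambda p: p[0],
                           reduce_fn=lambda a, b: (a[0], a[1] + b[1])))
print(counts)   # [('big', 1), ('data', 1), ('big', 2), ('data', 2), ('streaming', 1)]
```

Point out that the running per-key totals are exactly the "counts accumulate across inputs" behavior from the live demo.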
Key point: you cannot access state without first calling keyBy. Flink enforces this at the API level — calling get_state() outside a keyed context throws at runtime. Session 2 will drill this in the lab.
Note: FlinkKafkaConsumer is what students will use in Session 2's lab. It integrates with Flink's checkpoint protocol: the consumer's Kafka offsets are stored in each checkpoint snapshot, and are committed back to Kafka only once the checkpoint completes. That's how exactly-once works end-to-end.
Common mistake: students use print() in their homework and wonder why performance is terrible. print() is synchronous and serialized — it becomes the bottleneck immediately. The autograder accepts it for correctness, but flag this explicitly.
Start this before class. Have the Web UI open on the projector. Walk through each tab briefly — students will need to navigate the UI in Session 2's lab. Point out the "Running Jobs" throughput graph; it updates in real time.
Type: "big data big data streaming" — watch word counts increment live. Then type the same phrase again and point out counts accumulate across inputs: that's the stateful reduce. Ask: "How is this different from a batch word count?" The job never terminates — it waits for the next line.
Ask: "When would processing time be acceptable?" Answer: when you control the producer and events arrive within milliseconds of occurrence — e.g., internal metrics pipelines, not mobile or IoT. Any time there's a network hop that could buffer, use event time.
Draw on the board: a timeline with events arriving out of order. Mark the watermark as a cursor that advances. Show visually how the window fires when the watermark crosses the window boundary. This is the concept students most often get wrong — the watermark is not a timeout, it's a progress signal derived from the data itself.
In an ordered stream the watermark equals max_event_time at every step. This is the ideal case: every window fires the moment its last event arrives. Point out the W(t) label advancing in lockstep with events.
Walk through the diagram: events arrive out of order; the watermark (dashed line) trails behind. Events that arrive BEFORE the watermark line crosses their window boundary are accepted. Events that arrive AFTER the watermark has passed their window are dropped (or sent to a side output). The bound is the gap between the event line and the watermark line.
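A toy simulation of bounded out-of-orderness helps here (pure Python, not Flink API; real Flink judges lateness per window, but the toy captures "behind the watermark = late"):

```python
# Toy watermark generator with bounded out-of-orderness -- not Flink API.
# Watermark W = max_event_time_seen - bound. An event is "late" if its
# timestamp is at or behind the watermark that was current when it arrived.
def run(events, bound):
    max_ts = float("-inf")
    results = []
    for ts in events:                 # event timestamps, in ARRIVAL order
        watermark = max_ts - bound    # watermark before this event arrives
        late = ts <= watermark        # behind the watermark -> drop/side output
        max_ts = max(max_ts, ts)
        results.append((ts, watermark, late))
    return results

# Out-of-order stream with bound=2: ts=3 arrives after ts=7 pushed the
# watermark to 5, so ts=3 is late; everything else is on time.
for ts, w, late in run([1, 2, 7, 3, 8], bound=2):
    print(ts, w, late)
```

Note for the board drawing: with bound=0 on an in-order stream, the watermark tracks max_event_time exactly, which is the ideal case from the previous slide.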
Ask: "If subtask 1 has watermark W=9:01 and subtask 2 has W=8:58, what watermark does the downstream operator see?" Answer: 8:58 — the minimum. This is why a single slow partition can hold up an entire pipeline.
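The min rule is one line of code; showing it with the question's own times (9:01 and 8:58, dates hypothetical) can settle it:

```python
# Downstream operators take the MINIMUM of their input watermarks -- toy sketch.
from datetime import datetime

def downstream_watermark(input_watermarks):
    return min(input_watermarks)

w1 = datetime(2024, 1, 1, 9, 1)    # subtask 1: W = 9:01
w2 = datetime(2024, 1, 1, 8, 58)   # subtask 2: W = 8:58 (the slow partition)
print(downstream_watermark([w1, w2]).strftime("%H:%M"))  # 08:58
```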
Ask: "What happens if you set the bound too small?" → valid late events are dropped. "Too large?" → windows hold state for longer, increasing memory usage and output latency. The tradeoff is memory vs. correctness.
Quick poll: "Which window would you use for hourly revenue totals?" (tumbling 1h). "For a 15-minute moving average updated every 5 minutes?" (sliding 15m/5m). "For counting clicks within a single page visit?" (session with gap = idle timeout). This primes the activity.
Point out: tumbling = no overlap (each event in exactly one bucket); sliding = overlap (event counted in multiple windows); session = variable-width gap-based. The diagram makes the "overlap" property of sliding windows immediately visible.
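The overlap property falls straight out of the window-start math. A toy sketch (mirrors how window starts are computed conceptually; not Flink API, and session windows are omitted since they're data-driven, not formula-driven):

```python
# Window-assignment math for tumbling vs. sliding -- toy sketch, not Flink API.
# Timestamps and sizes in seconds.
def tumbling_window(ts, size):
    start = ts - (ts % size)
    return [(start, start + size)]          # exactly one window per event

def sliding_windows(ts, size, slide):
    last_start = ts - (ts % slide)
    starts = range(last_start, ts - size, -slide)
    return [(s, s + size) for s in starts]  # size/slide windows per event

print(tumbling_window(130, size=60))               # [(120, 180)] -- one bucket
print(sliding_windows(130, size=900, slide=300))   # 15m/5m -> 3 windows
```

The 15m/5m case from the poll lands the event in 900/300 = 3 overlapping windows, which is the "counted in multiple windows" property on the diagram.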
Answers: 1=30 seconds, 2=tumbling 1 minute, 3=when the watermark reaches 9:01:00, i.e. at the processing time when max_event_time ≥ 9:01:30, 4=dropped (45 s of lateness exceeds the 30 s bound). For Q3: walk through the math slowly: max_event_time must reach 9:01:30 before Flink considers the 9:00–9:01 window complete.
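A quick sanity check for the Q3/Q4 math, using the 30 s bound and 1-minute tumbling window from the answer key (times as seconds past 9:00:00; toy arithmetic only):

```python
# Sanity check for quiz Q3/Q4 -- toy arithmetic, not Flink code.
BOUND = 30
WINDOW_END = 60                      # the 9:00-9:01 window ends at 9:01:00

# Q3: the window fires when watermark >= window end,
#     i.e. when max_event_time - BOUND >= WINDOW_END.
fire_at_max_event_time = WINDOW_END + BOUND
print(fire_at_max_event_time)        # 90 -> max_event_time must reach 9:01:30

# Q4: an event 45 s late exceeds the 30 s bound -> dropped (or side output).
lateness = 45
print(lateness > BOUND)              # True -> dropped
```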
Give students 2 minutes to write down one concept they're still unclear on. Collect on a sticky note or poll — use the responses to prioritize what to revisit at the start of Session 2.