Apache Flink

CS 6500 — Week 13, Session 1

CS 6500 — Big Data Analytics | Week 13

50ms or Blocked?

"A stolen card swipes at a point-of-sale terminal. Your fraud model needs a decision — approve or deny — in under 50 milliseconds. Kafka delivered the event. What processor is fast enough to beat the transaction?"


Streaming Answers

Batch jobs answer this question in minutes. Spark micro-batch answers in ~100ms. Apache Flink answers in under 10ms — because it processes each record the moment it arrives.

Session 1 — The Foundation

  • Why micro-batch has a latency floor
  • Flink's runtime: JobManager + TaskManagers
  • DataStream API: sources, transforms, sinks
  • Event time and watermarks
  • Demo: streaming word count

Session 2 — Putting It to Work

  • Keyed state: ValueState, ListState
  • Windows in Flink
  • Exactly-once with Kafka
  • Lab: fraud detection pipeline

Flink is the engine behind real-time fraud detection at PayPal, clickstream analytics at Alibaba, and recommendation updates at Netflix.


Week 12 Recap

Kafka: the pipe

  • Topics, partitions, replication
  • Python producers publishing JSON events
  • Consumer groups and offset management
  • At-least-once delivery by default

The gap Kafka leaves

  • Kafka stores and delivers events
  • Kafka does not compute on them
  • Aggregations, joins, and alerts need a stream processor
  • This week: Apache Flink fills that role

The pattern: Kafka is the durable log that buffers events from producers. Flink reads from that log and processes events with millisecond latency.

Kafka and Flink form the most common true-streaming stack in production. Students should think of Kafka as a durable queue and Flink as the computation engine that consumes from it.


Why Latency Matters

The case for true stream processing


The Latency Spectrum

Latency class  | Range         | Examples
Real-time      | < 10ms        | Fraud block, trading, IoT control
Near real-time | 100ms–1s      | Live dashboards, alerting, recommendations
Micro-batch    | 1s–30s        | Spark Structured Streaming
Mini-batch     | 1–5 min       | Some Flink applications
Batch          | Minutes–hours | Nightly ETL, ML training

The question to ask: What is the business cost of a 100ms delay? A 1-second delay?

Spark Structured Streaming's minimum trigger interval is about 100ms in practice. Flink processes each record as it arrives — latency is bounded by network and deserialization time, typically 1–10ms end-to-end.

For credit card fraud, ~80% of fraud damage is done in the first 10 minutes. A 100ms decision vs. a 500ms decision can be the difference between blocking the first transaction and missing the fifth.


Batch vs Streaming

Micro-batch (Spark)

  • Collect events for trigger interval (e.g., 5s)
  • Run a Spark job on the mini-batch
  • Emit results, advance offsets
  • Minimum latency: ~100ms
  • Simpler: same DataFrame API

True streaming (Flink)

  • Process each record the moment it arrives
  • No batching step — pipelined execution
  • Latency: 1–10ms typical
  • More complex: DataStream API
  • Stateful operators with native checkpointing

Rule of thumb: Use Spark Structured Streaming (Week 14) for near-real-time needs. Use Flink when latency must fall below the ~100ms micro-batch floor or when stateful operator complexity requires fine-grained control.

Micro-batch is easier to reason about and covers ~90% of "streaming" requirements. True streaming with Flink is needed when the business genuinely requires sub-100ms latency or when Flink's richer stateful API (ProcessFunction, timers, RocksDB state) is required.


Flink Architecture

JobManagers, TaskManagers, and data flow


Architecture Diagram

Flink Architecture


Flink's Runtime

JobManager (master)

  • Receives the job graph from the client
  • Schedules tasks onto TaskManagers
  • Coordinates checkpoints
  • High-availability: ZooKeeper or Kubernetes

TaskManagers (workers)

  • Execute operator subtasks
  • Each has a fixed number of slots
  • Slots share network + memory buffers
  • Communicate via network shuffle channels
Client → JobManager → [TaskManager 1: slots 1-4]
                    → [TaskManager 2: slots 1-4]
                    → [TaskManager 3: slots 1-4]

A slot is the unit of resource allocation. One slot can run one pipeline of operators (a "slot-sharing group"). The number of slots per TaskManager is set at startup and determines maximum parallelism.
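A quick back-of-the-envelope sketch of what slots imply. The cluster sizes below are illustrative, matching the 3-TaskManager diagram above; note that the source's Kafka partition count caps effective source parallelism regardless of available slots:

```python
# Slot arithmetic: total slots cap job parallelism.
task_managers = 3
slots_per_tm = 4

max_parallelism = task_managers * slots_per_tm   # total slots in the cluster
assert max_parallelism == 12

# A Kafka source can't usefully run more subtasks than there are
# partitions — extra source subtasks simply sit idle.
kafka_partitions = 3
effective_source_parallelism = min(max_parallelism, kafka_partitions)
assert effective_source_parallelism == 3
```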


Parallelism

Flink scales by running multiple subtask copies of each operator.

Source (p=3)  →  Map (p=3)  →  KeyBy  →  Reduce (p=3)
  [s1][s2][s3]    [m1][m2][m3]            [r1][r2][r3]
  • Parallelism = number of subtask instances per operator
  • Each subtask runs in one slot on one TaskManager
  • Flink redistributes records between operators via network channels
# Set global parallelism
env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(4)

# Override per operator
stream.map(my_fn).set_parallelism(2)

keyBy is a shuffling operation — all records with the same key go to the same subtask. This enables per-key state to be local to a single subtask, which is the foundation of Flink's stateful streaming model.
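The routing guarantee can be sketched in plain Python. Flink actually hashes keys into key groups and maps groups to subtasks; the modulo hashing below is a simplified model that shows the same property, same key, same subtask:

```python
# Simplified model of keyBy routing: every record with the same key
# lands on the same downstream subtask, so per-key state stays local.
def route(key, parallelism):
    """Pick the subtask index for a key (simplified: hash modulo)."""
    return hash(key) % parallelism

parallelism = 3
events = [("user_a", 1), ("user_b", 1), ("user_a", 2), ("user_c", 1)]

subtasks = {i: [] for i in range(parallelism)}
for key, value in events:
    subtasks[route(key, parallelism)].append((key, value))

# Both user_a events landed on the same subtask:
a_slots = {route(k, parallelism) for k, _ in events if k == "user_a"}
assert len(a_slots) == 1
```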


Backpressure

When a downstream operator is slow, Flink signals upstream operators to slow down — rather than drop records or crash.

How it works

  • Each operator pair shares a network buffer
  • When the buffer fills, the upstream operator blocks
  • The signal propagates all the way back to the source
  • Source reads from Kafka slower → no data loss

Why it matters

  • Without it: fast source overwhelms slow operator → OOM
  • With it: the pipeline self-regulates automatically
  • Flink Web UI shows backpressure per operator (red = problem)
  • Fix: increase parallelism on the bottleneck operator

Backpressure is Flink's built-in flow control mechanism. It is implemented via TCP flow control on the network channels between TaskManagers. The Web UI's backpressure tab samples operators periodically — a ratio above 0.5 indicates the operator is a bottleneck and is holding up the entire pipeline.

Backpressure prevents cascading failures. Without it, a spike in traffic or a slow external sink (e.g., a saturated database) would cause the in-memory queue to overflow and the job to crash. Monitoring backpressure in production is the first diagnostic step when pipeline latency increases.
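A toy model of the mechanism, using a bounded queue as the network buffer between an operator pair (the real implementation rides on network-level flow control, as described above):

```python
# Toy backpressure model: a bounded buffer between two operators.
# When the buffer fills, the upstream push fails/blocks — that is the
# signal that propagates back toward the source. No records are dropped.
import queue

channel = queue.Queue(maxsize=2)   # "network buffer" between operators

channel.put("event-1")
channel.put("event-2")             # buffer is now full

try:
    channel.put_nowait("event-3")  # upstream cannot push: backpressure
    overflowed = False
except queue.Full:
    overflowed = True

assert overflowed                  # upstream must slow down, not drop data

# Downstream consumes one record, freeing a slot; upstream proceeds.
channel.get()
channel.put_nowait("event-3")
assert channel.qsize() == 2
```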


The DataStream API

Sources, transformations, and sinks


DataStream Basics

from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common.typeinfo import Types

env = StreamExecutionEnvironment.get_execution_environment()

# 1. Source — read from a collection (demo) or Kafka (production)
stream = env.from_collection(
    [(1, "click"), (2, "view"), (3, "click")],
    type_info=Types.TUPLE([Types.INT(), Types.STRING()])
)

# 2. Transform — pure Python functions
filtered = stream.filter(lambda x: x[1] == "click")
mapped   = filtered.map(lambda x: (x[0], x[1].upper()))

# 3. Sink — print (demo) or Kafka/HDFS (production)
mapped.print()

# 4. Execute — nothing runs until this line
env.execute("My First Flink Job")

PyFlink wraps the Java DataStream API. The execution model is lazy: from_collection, filter, map all just build the job graph. env.execute() submits the graph to the JobManager and blocks until the job finishes (or runs forever for streaming).


Transformations

Operation | Description                      | Example
map       | 1-to-1 record transform          | parse JSON → typed tuple
flatMap   | 1-to-N (or 0)                    | tokenize sentence into words
filter    | Keep matching records            | keep only status == "error"
keyBy     | Partition by key                 | group by user_id
reduce    | Stateful aggregation per key     | running sum per user
process   | Full control with state + timers | custom stateful logic

Stateless vs Stateful

Stateless (map, flatMap, filter)

  • No memory of past records
  • Each record processed independently
  • Scales linearly — no coordination

Stateful (keyBy → reduce/process)

  • Maintains per-key state across records
  • Requires keyBy before state access
  • State backed by memory or RocksDB
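A minimal pure-Python model of the stateful path: because keyBy makes state local to one subtask, a keyed reduce is just a per-key accumulator. In Flink the dict below would be a ValueState in the configured state backend; this sketch only illustrates the semantics:

```python
# Per-key running sum, mimicking keyBy → reduce.
state = {}  # key -> running sum (stand-in for keyed ValueState)

def keyed_reduce(key, value):
    """Fold each record into the state for its key, emit the new total."""
    state[key] = state.get(key, 0) + value
    return key, state[key]

outputs = [keyed_reduce(k, v)
           for k, v in [("user_a", 5), ("user_b", 3), ("user_a", 2)]]

# State persists across records of the same key:
assert outputs[-1] == ("user_a", 7)
```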

Sources and Sinks

Source                    | Description                      | Production?
from_collection           | In-memory list                   | Dev/testing only
KafkaSource (builder API) | Kafka topic with offset tracking | ✅ Yes
FileSystem                | HDFS/S3 file discovery           | ✅ Yes
SocketTextStream          | TCP socket                       | Dev/testing only

Sinks

Sink                 | Description         | Production?
print()              | Console output      | Dev/testing only
FlinkKafkaProducer   | Kafka topic         | ✅ Yes
FileSink (streaming) | Parquet/ORC to HDFS | ✅ Yes
JDBC                 | Relational database | ✅ Yes

Production rule: always Kafka or FileSink as the sink. Never print() at scale.


Demo — Word Count

Building the first Flink pipeline


Demo: Setup

# Start the Flink cluster in Docker
docker-compose up -d flink-jobmanager flink-taskmanager

# Verify cluster is ready
curl http://localhost:8081/overview
# {"taskmanagers":1,"slots-total":4,"slots-available":4,...}

# Open the Flink Web UI
# http://localhost:8081  — shows job list, task manager status, checkpoints

Flink Web UI walkthrough:

  • Job Manager tab: memory, uptime
  • Task Managers: slot count, memory usage
  • Submit New Job: upload a JAR or Python script
  • Running Jobs: live throughput, latency metrics per operator

Demo: Word Count

# word_count_stream.py
from pyflink.datastream import StreamExecutionEnvironment
from pyflink.common.typeinfo import Types

env = StreamExecutionEnvironment.get_execution_environment()
env.set_parallelism(1)

# Source: socket (for live demo — type sentences into terminal)
stream = env.socket_text_stream("localhost", 9999)

# Transform: tokenize → count per word
word_counts = (stream
    .flat_map(lambda line: [(w.lower(), 1) for w in line.split()],
              output_type=Types.TUPLE([Types.STRING(), Types.INT()]))
    .key_by(lambda x: x[0])
    .reduce(lambda a, b: (a[0], a[1] + b[1])))

word_counts.print()
env.execute("Streaming Word Count")

Demo: Running in Docker

# Terminal 1 — open a socket inside the container (data source)
docker exec -it flink-taskmanager nc -lk 9999

# Terminal 2 — submit the Flink job inside the same container
docker exec -it flink-taskmanager \
  python /opt/flink_lab/word_count_stream.py

# Type sentences in Terminal 1; counts appear instantly in Terminal 2

Both commands run inside flink-taskmanager so localhost:9999 resolves correctly. The examples/13_Flink/ directory is mounted read-only at /opt/flink_lab — no copying needed.


Event Time in Flink

Processing records where they belong in time


Event vs Processing

Processing time

  • When Flink receives the record
  • Simplest — no timestamp needed
  • Problem: mobile events buffer offline
    • Click at 9:00 AM, arrives at 9:15 AM
    • Processing time: 9:15 ❌ (wrong window)

Event time

  • When the event actually occurred
  • Timestamp embedded in the payload
  • Click at 9:00 → 9:00–9:05 window ✓
  • Requires watermarks to handle lateness
# Event time is declared per source via WatermarkStrategy
# (global TimeCharacteristic was removed in Flink 1.17)
from pyflink.common import WatermarkStrategy, Duration

strategy = WatermarkStrategy.for_bounded_out_of_orderness(
    Duration.of_seconds(10)
)
stream = env.from_source(my_source, strategy, "my-source")

The same reasoning applies as in Spark Structured Streaming (Week 14). Event time is always preferred when the payload contains a timestamp. Processing time should only be used for benchmarking or synthetic data.


Watermarks in Flink

A watermark is a signal that says: "all events with timestamp ≤ W have arrived."

from pyflink.common import WatermarkStrategy, Duration

# Accept events up to 10 seconds late
strategy = (WatermarkStrategy
    .for_bounded_out_of_orderness(Duration.of_seconds(10))
    .with_timestamp_assigner(MyTimestampAssigner()))

stream = env.from_source(kafka_source, strategy, "Kafka")

How the watermark advances:

  • Flink tracks the maximum event timestamp seen so far
  • Watermark = max_event_time − out_of_orderness_bound
  • When watermark passes a window's end → window fires (results emitted)
  • Events arriving after the window fires are late → configurable behavior

Without watermarks, Flink would hold window state forever waiting for late records. The watermark tells the system when it's safe to finalize a window and release its memory.
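The advancement rule above can be sketched in a few lines of plain Python. Timestamps are in seconds; the 10-second bound and the single [0, 60) window are illustrative:

```python
# Bounded-out-of-orderness watermarking: the watermark trails the max
# event timestamp seen so far by a fixed bound; a window fires once the
# watermark passes the window's end.
BOUND = 10        # allowed out-of-orderness, in seconds
WINDOW_END = 60   # the window covers event times [0, 60)

def watermark(max_event_time):
    return max_event_time - BOUND

events = [5, 42, 58, 55, 71]  # event timestamps, slightly out of order
max_ts = 0
fired_at = None
for ts in events:
    max_ts = max(max_ts, ts)                 # track max event time seen
    if fired_at is None and watermark(max_ts) >= WINDOW_END:
        fired_at = ts                        # window [0, 60) finalizes here

assert watermark(71) == 61
assert fired_at == 71  # only ts=71 pushes the watermark past 60
```

Note that the ts=55 record arrives after ts=58 yet is still accepted into its window, because the watermark (48 at that point) has not passed it.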


Ordered Stream

Watermark in an ordered stream

The watermark advances with every record — no gaps, no waiting.


Late Arrivals

Watermark in an out-of-order stream

The watermark lags behind max_event_time by the out-of-orderness bound — late events land inside the lag and are accepted.


Parallel Watermarks

Watermarks in parallel streams

Each source subtask generates its own watermark. Downstream operators take the minimum watermark across all inputs before advancing.

This is the key reason why low-parallelism sources with slow partitions can stall watermark progress for the entire pipeline. If one Kafka partition goes idle (no new events), its subtask's watermark freezes — and the minimum across all subtasks stops advancing. Always monitor per-subtask watermarks in the Flink Web UI when debugging late firing windows.
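The minimum rule in miniature (the per-partition watermark values are illustrative):

```python
# Downstream watermark = minimum across all input subtasks.
subtask_watermarks = {"partition-0": 901, "partition-1": 858, "partition-2": 903}

downstream_watermark = min(subtask_watermarks.values())
assert downstream_watermark == 858  # the slowest input wins

# If partition-1 goes idle and stops advancing, the downstream watermark
# freezes at 858 no matter how far the other partitions progress.
subtask_watermarks["partition-0"] = 950
assert min(subtask_watermarks.values()) == 858
```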


Choosing the Delay

The out-of-orderness bound = the maximum time an event can arrive late.

Source type                | Typical delay | Reasoning
Internal service (same DC) | 0–1 s         | Sub-millisecond network; rare reordering
Mobile / IoT device        | 10–60 s       | Offline buffering, reconnect bursts
Cross-region pipeline      | 5–30 s        | Variable WAN latency
Kafka (same cluster)       | 0–5 s         | Partition rebalance or broker pause

Rule: set the bound to the 99th-percentile late-arrival time for your source — not the maximum, or you waste memory; not the median, or you drop valid events.

You can measure the actual out-of-orderness of your source by plotting processing_time − event_time per record over several hours. The 99th percentile of that distribution is your bound. Start conservative (10–30 s) and tighten after observing production data.
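A sketch of that measurement with synthetic lateness values. The exponential distribution is an assumption standing in for real processing_time - event_time samples; the point is the percentile computation, not the shape:

```python
# Estimate the watermark bound as the 99th percentile of lateness.
import random

random.seed(42)
# Synthetic lateness in seconds: mostly near zero, occasional stragglers
lateness = [random.expovariate(1.0) for _ in range(10_000)]

samples = sorted(lateness)
p99 = samples[int(0.99 * len(samples))]      # 99th-percentile late arrival
median = samples[len(samples) // 2]

# The p99 sits well above the median — using the median as the bound
# would drop valid events; using the max would waste memory.
assert p99 > median
```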


Window Types

Type     | Shape                                  | Use case
Tumbling | Fixed, non-overlapping                 | "Totals per 5-minute interval"
Sliding  | Fixed size, advances by step           | "5-min avg, updated every 1 min"
Session  | Closes after N minutes of inactivity   | "Events in one user session"
Global   | No time boundary; fires every N events | "Micro-batch by count"
  • Tumbling: every event belongs to exactly one window
  • Sliding: one event can belong to multiple overlapping windows
  • Session: window size is data-driven — no fixed duration

Session 2 covers the PyFlink API for each type with code examples.

Tumbling windows are the most common and the easiest to reason about. Sliding windows are used for moving averages and rolling metrics. Session windows are the right choice when user activity has natural pauses — e-commerce browsing, gaming sessions, support chat.
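Window assignment for the two fixed-size types can be sketched in plain Python (sizes in seconds; session windows are data-driven and need per-key gap tracking, so they are omitted here):

```python
# Which window(s) does an event with timestamp ts belong to?
def tumbling(ts, size):
    """Tumbling: exactly one window per event."""
    start = (ts // size) * size
    return [(start, start + size)]

def sliding(ts, size, step):
    """Sliding: every window of width `size`, advancing by `step`,
    that covers ts — an event lands in size/step windows."""
    windows = []
    start = (ts // step) * step
    while start + size > ts:
        windows.append((start, start + size))
        start -= step
    return [w for w in windows if w[0] <= ts < w[1]]

assert tumbling(130, 60) == [(120, 180)]                 # one bucket
assert sliding(130, 60, 30) == [(120, 180), (90, 150)]   # two overlapping
```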


Window Shapes

Flink window types — tumbling, sliding, session


Watermark Activity

8 minutes | Individual

Consider a stream of sensor readings with timestamps embedded in each event. Answer the following:

  1. A sensor buffers events for up to 30 seconds when offline. What watermark delay do you set?
  2. You want "average temperature per sensor per 1-minute interval." Which window type and size?
  3. With a 30-second watermark and 1-minute tumbling windows, when does the 9:00–9:01 window fire?
  4. A reading arrives 45 seconds late. Is it accepted or dropped?

Write your answers — we'll discuss as a class.


Key Takeaways

  • Flink = true streaming: each record processed as it arrives, 1–10ms latency
  • Architecture: JobManager (scheduler) + TaskManagers (workers with slots)
  • Parallelism: each operator runs as multiple subtasks; keyBy routes records to the correct subtask
  • DataStream API: source → map / flatMap / filter / keyBy / reduce / process → sink
  • Event time is when the event occurred; watermarks signal when late events can be dropped

Session 2: Keyed state (ValueState, ListState), windows, exactly-once with Kafka, and a 60-minute fraud detection lab


What's Missing?

Apache Flink handles stateful streaming at low latency — but it wasn't built for everything


Flink's Gaps

  • Micro-batch simplicity — Flink's DataStream API is more complex than Spark's DataFrame API; for most near-real-time needs (>100ms), Spark Structured Streaming (Week 14) is simpler and sufficient
  • SQL transformation pipelines — Flink SQL exists, but dbt's lineage tracking, schema tests, and version-controlled models are not part of Flink's scope
  • Pipeline orchestration — Flink jobs run continuously but don't schedule themselves alongside batch ETL, handle retries across systems, or coordinate with upstream data landing
  • Historical reprocessing — Flink is optimized for live streams; reprocessing months of data efficiently is Spark's domain

What Comes Next

Gap                               | Solution                   | When
Micro-batch with simpler API      | Spark Structured Streaming | Week 14
SQL transformations with testing  | dbt                        | Week 15
Pipeline scheduling + retry logic | Apache Airflow             | Week 15

Apache Flink is the right tool for sub-10ms stateful streaming — but a production data platform also needs Spark for micro-batch workloads, dbt for tested SQL transforms, and Airflow for end-to-end orchestration. That stack comes together in Weeks 14–15.


Welcome. Today introduces true stream processing — the hardest topic in the course conceptually. Pace slowly through architecture; students will need Session 2's lab to cement it.

Let this hang. Ask: "How fast is 50ms? Blink takes about 150ms." Don't answer the question yet — carry it through the whole session.

Tell students the fraud scenario from slide 2 is real — PayPal runs Flink at ~1M events/sec. That's the scale we're building toward in Session 2's lab.

Quick show of hands: "Who got their Kafka producer working in last week's lab?" Use this to gauge where to slow down if needed.

Ask: "Where does your current job's data pipeline fall on this table?" Most students have only seen batch. Use this to make the spectrum feel real.

Common student reaction: "Why not always use Flink?" Answer: operational complexity. Flink requires understanding state backends, watermarks, and checkpoint tuning. Spark Structured Streaming gives 90% of the value with 30% of the complexity.

Walk through the diagram left-to-right: client submits a job → JobManager turns it into a task graph → TaskManagers execute subtasks in slots. Point out that each colored box is an operator, each row is a TaskManager.

Analogy: JobManager is the conductor, TaskManagers are section players, slots are individual musicians. Each slot can play one part of the pipeline simultaneously.

Ask: "If you have 3 TaskManagers each with 4 slots, what's your maximum parallelism?" Answer: 12. Then: "If your source Kafka topic has only 3 partitions, what's the effective source parallelism ceiling?" Answer: 3 — partitions limit source parallelism.

Demo point: during Session 2's lab, open the Web UI backpressure tab while the job is running. Point out the green/yellow/red indicators. This makes an abstract concept visible.

Emphasize the lazy execution model — identical to PySpark. Nothing runs until `execute()`. Ask: "What does this remind you of from Week 8?" Students should recognize the RDD/DataFrame lazy evaluation pattern.

Ask students to match each operation to something from MapReduce or Spark they already know. map=map, flatMap=flatMap, filter=filter, keyBy=groupBy/partitionBy, reduce=reduceByKey. process has no clean Spark equivalent — that's what makes Flink unique.

Key point: you cannot access state without first calling keyBy. Flink enforces this at the API level — calling get_state() outside a keyed context throws at runtime. Session 2 will drill this in the lab.

Note: FlinkKafkaConsumer is what students will use in Session 2's lab. It integrates with Flink's checkpoint protocol — when a checkpoint completes, committed Kafka offsets are part of the snapshot. That's how exactly-once works end-to-end.

Common mistake: students use print() in their homework and wonder why performance is terrible. print() is synchronous and serialized — it becomes the bottleneck immediately. The autograder accepts it for correctness, but flag this explicitly.

Start this before class. Have the Web UI open on the projector. Walk through each tab briefly — students will need to navigate the UI in Session 2's lab. Point out the "Running Jobs" throughput graph; it updates in real time.

Type: "big data big data streaming" — watch word counts increment live. Then type the same phrase again and point out counts accumulate across inputs: that's the stateful reduce. Ask: "How is this different from a batch word count?" The job never terminates — it waits for the next line.

Ask: "When would processing time be acceptable?" Answer: when you control the producer and events arrive within milliseconds of occurrence — e.g., internal metrics pipelines, not mobile or IoT. Any time there's a network hop that could buffer, use event time.

Draw on the board: a timeline with events arriving out of order. Mark the watermark as a cursor that advances. Show visually how the window fires when the watermark crosses the window boundary. This is the concept students most often get wrong — the watermark is not a timeout, it's a progress signal derived from the data itself.

In an ordered stream the watermark equals max_event_time at every step. This is the ideal case: every window fires the moment its last event arrives. Point out the W(t) label advancing in lockstep with events.

Walk through the diagram: events arrive out of order; the watermark (dashed line) trails behind. Events that arrive BEFORE the watermark line crosses their window boundary are accepted. Events that arrive AFTER the watermark has passed their window are dropped (or sent to a side output). The bound is the gap between the event line and the watermark line.

Ask: "If subtask 1 has watermark W=9:01 and subtask 2 has W=8:58, what watermark does the downstream operator see?" Answer: 8:58 — the minimum. This is why a single slow partition can hold up an entire pipeline.

Ask: "What happens if you set the bound too small?" → valid late events are dropped. "Too large?" → windows hold state for longer, increasing memory usage and output latency. The tradeoff is memory vs. correctness.

Quick poll: "Which window would you use for hourly revenue totals?" (tumbling 1h). "For a 15-minute moving average updated every 5 minutes?" (sliding 15m/5m). "For counting clicks within a single page visit?" (session with gap = idle timeout). This primes the activity.

Point out: tumbling = no overlap (each event in exactly one bucket); sliding = overlap (event counted in multiple windows); session = variable-width gap-based. The diagram makes the "overlap" property of sliding windows immediately visible.

Answers: 1=30 seconds, 2=tumbling 1 minute, 3=when watermark reaches 9:01:00 → at processing time when max_event_time ≥ 9:01:30, 4=dropped (45s > 30s watermark). For Q3: walk through the math slowly — max_event_time must reach 9:01:30 before Flink considers the 9:00–9:01 window complete.

Give students 2 minutes to write down one concept they're still unclear on. Collect on a sticky note or poll — use the responses to prioritize what to revisit at the start of Session 2.