Watermarks, Joins, and Fault Tolerance

CS 6500 — Week 14, Session 2

CS 6500 — Big Data Analytics | Week 14

What Happens to Your 5-Minute Click Count When a Phone Dies?

"A user clicks at 9:02 AM, but their phone loses signal and buffers the event. It arrives at Spark at 9:14 AM — eight minutes late. Your 9:00–9:05 window already closed. Does Spark update the count, drop the event, or keep state forever waiting for stragglers?"


Today's Answer: Watermarks Bound the Trade-Off

You declare a maximum lateness tolerance. Spark uses it to:

What watermarks enable

  • Accept events up to N minutes late
  • Update the correct window when they arrive
  • Evict finalized window state from memory
  • Enable append output mode (emit once, final)

Today's lab

  • Watermark configuration + late event injection
  • Tumbling and sliding window pipelines to HDFS
  • Stream-static join (sensor enrichment)
  • Stream-stream join (ad impression matching)
  • Checkpoint recovery demonstration

Everything runs in Docker against the Kafka + Spark stack you used last week.


Lab Goals + Environment Check

By the end of this session you will have built:

  1. A windowed sensor aggregation pipeline writing Parquet to HDFS
  2. A stream-static join enriching live events with a dimension table
  3. A stream-stream join matching ad impressions to clicks
  4. A recovered streaming query that resumes from a checkpoint

Check your environment:

docker-compose ps   # kafka, spark-master, spark-worker should be Running

# Optional (from docker/): reset Week 14 topics + one smoke-test click event
make week14-reset

# Verify topics
docker exec -it kafka-broker kafka-topics --list --bootstrap-server localhost:9092

Part 1 — Watermarks and Output Modes

15 minutes of theory before the lab


The Late Data Problem

Without watermarks, Spark must keep all window state forever.

Event at 9:02 AM → arrives at 9:14 AM (8 min late)

Tumbling windows:
  [9:00–9:05]  ← should this window update?
  [9:05–9:10]  ← or this one?
  [9:10–9:15]  ← the event's processing time falls here
  • Without watermark: Spark retains all windows indefinitely — unbounded memory growth
  • With watermark: Spark tracks the maximum event time seen, and discards events/state older than max_event_time − watermark_delay

The trade-off: larger delay → more late events accepted, more memory; smaller delay → less memory, more events dropped


How Watermarks Work

# "Accept events up to 10 minutes late. Drop anything older."
watermarked = clicks.withWatermark("event_time", "10 minutes")

counts = watermarked.groupBy(
    window(col("event_time"), "5 minutes"),
    col("page")
).count()

Watermark value = max(event_time seen so far) − watermark_delay

Event arrives Max event_time Watermark value 9:00–9:05 window state
9:08 event 9:08 8:58 Still open
9:12 event 9:12 9:02 Still open
9:16 event 9:16 9:06 Finalized and evicted

After 9:16, any event for 9:00–9:05 is silently dropped.
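The arithmetic in the table can be sketched in a few lines of plain Python — a toy model of Spark's bookkeeping, not its actual implementation:

```python
# Toy model of the watermark arithmetic (plain Python, not Spark internals):
# watermark = max event time seen so far minus the delay, and it only advances.
from datetime import datetime, timedelta

def simulate(events, delay):
    """For each arriving event time, return (current watermark, accepted?)."""
    wm = max_et = None
    out = []
    for et in events:
        max_et = et if max_et is None else max(max_et, et)
        candidate = max_et - delay
        wm = candidate if wm is None else max(wm, candidate)  # never regresses
        out.append((wm, et >= wm))  # events older than the watermark drop
    return out

t = lambda h, m: datetime(2026, 3, 17, h, m)
steps = simulate([t(9, 8), t(9, 12), t(9, 16), t(9, 1)],
                 timedelta(minutes=10))
# Watermark advances 8:58 -> 9:02 -> 9:06; the trailing 9:01 event is dropped.
```

Running the same event sequence as the table reproduces its watermark column, and shows the 9:01 straggler rejected once the watermark has passed it.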


Output Modes

Mode What Gets Written Valid When
Append Only new, finalized rows — written once, never updated Aggregations with watermark; append-only sinks
Update Only rows that changed since the last trigger Any aggregation; sinks that support UPSERT
Complete The entire result table, every trigger Small global aggregations only (total counts, top-N)
# Append: each window result written once after watermark passes
.writeStream.outputMode("append")

# Update: updated rows re-emitted each trigger (good for live dashboards)
.writeStream.outputMode("update")

# Complete: full table every trigger (only for small result sets!)
.writeStream.outputMode("complete")

Output Mode Selection Rules

Query Type Allowed Modes
Select / filter (no aggregation) Append
Aggregation without watermark Complete, Update
Aggregation with watermark Append, Update, Complete
Stream-stream join Append
Stream-static join Append, Update

The most common production choice: aggregation + watermark + append mode + Parquet sink

This writes each finalized result exactly once — making the output idempotent and easy to consume downstream.


Lab Part 1 — Sensor Stream Pipeline

Tumbling and sliding window aggregations to HDFS


Lab Setup — Start the Sensor Producer

# Terminal 1: start continuous sensor event producer
docker exec -it spark-master \
  python /datasets/sensors/produce_sensors.py

Lab Setup — Run Spark

Open a second terminal for your PySpark job:

docker exec -it spark-master pyspark \
  --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.7

Lab Setup — Jupyter Option

Or open a Jupyter Notebook at http://localhost:8888 (token: bigdata) and use the PySpark kernel.

We'll build sensor_pipeline.py step by step.


Activity 1

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, window, avg, max as spark_max
from pyspark.sql.types import StructType, StructField, StringType, DoubleType, TimestampType

spark = SparkSession.builder.appName("SensorPipeline").getOrCreate()
spark.sparkContext.setLogLevel("WARN")

schema = StructType([
    StructField("sensor_id",   StringType()),
    StructField("temperature", DoubleType()),
    StructField("ts",          TimestampType()),
])

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")
    .option("subscribe", "sensors")
    .option("startingOffsets", "earliest")
    .load())

readings = (raw
    .select(from_json(col("value").cast("string"), schema).alias("d"))
    .select("d.*")
    .withWatermark("ts", "2 minutes"))

Activity 1

# Tumbling 5-minute window: avg + max temperature per sensor
tumbling = (readings
    .groupBy(window("ts", "5 minutes"), "sensor_id")
    .agg(avg("temperature").alias("avg_temp"),
         spark_max("temperature").alias("max_temp")))

# Sliding 10-minute window, 5-minute slide
sliding = (readings
    .groupBy(window("ts", "10 minutes", "5 minutes"), "sensor_id")
    .agg(avg("temperature").alias("avg_temp")))

# Query A: show tumbling output in console (easy to verify in class)
q1_view = (tumbling.writeStream
  .outputMode("update")
  .format("console")
  .option("truncate", False)
  .trigger(processingTime="10 seconds")
  .queryName("tumbling_view")
  .start())

# Query B: persist tumbling to HDFS Parquet for checkpoint recovery later
q1 = (tumbling.writeStream
  .outputMode("append")
  .format("parquet")
  .option("checkpointLocation", "hdfs:///checkpoints/tumbling/")
  .option("path", "hdfs:///output/sensor_tumbling/")
  .trigger(processingTime="10 seconds")
  .queryName("tumbling_5min")
  .start())

# Query C: sliding to console (update mode — re-emits changed windows)
q2 = (sliding.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", False)
    .trigger(processingTime="10 seconds")
    .queryName("sliding_10min")
    .start())

spark.streams.awaitAnyTermination()
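To see why the tumbling and sliding aggregations above produce different numbers of rows per event, here is a plain-Python sketch (not Spark's implementation) of which windows a single timestamp lands in:

```python
# Window assignment sketch: a timestamp falls into exactly one tumbling
# window, but into size/slide overlapping sliding windows.
from datetime import datetime, timedelta

def window_starts(ts, size, slide=None):
    slide = slide or size                   # tumbling: slide equals size
    epoch = datetime(1970, 1, 1)
    latest = epoch + ((ts - epoch) // slide) * slide  # last boundary <= ts
    starts, s = [], latest
    while s + size > ts:                    # every window still covering ts
        starts.append(s)
        s -= slide
    return sorted(starts)

ts = datetime(2026, 3, 17, 9, 3)
window_starts(ts, timedelta(minutes=5))        # one 5-min window: [9:00]
window_starts(ts, timedelta(minutes=10),
              timedelta(minutes=5))            # two windows: [8:55, 9:00]
```

A reading at 9:03 updates one tumbling window but two sliding windows — which is exactly why the sliding console output repeats each sensor across overlapping windows.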

Demo: Inject a Late Event

After the pipeline has been running for ~3 minutes:

# Late by 90 seconds — WITHIN the 2-minute watermark → accepted
echo '{"sensor_id":"s1","temperature":99.9,"ts":"2026-03-17T08:58:30"}' | \
  docker exec -i kafka-broker kafka-console-producer \
    --topic sensors --bootstrap-server localhost:9092

Inject Dropped Event

# Late by 5 minutes — PAST the watermark → silently dropped
echo '{"sensor_id":"s1","temperature":99.9,"ts":"2026-03-17T08:55:00"}' | \
  docker exec -i kafka-broker kafka-console-producer \
    --topic sensors --bootstrap-server localhost:9092

Observe Results

  • First event may update an open window.
  • Second event is dropped silently.
  • Check q1.lastProgress["numInputRows"] after each batch.

Checkpoint: Lab Part 1

Before moving to Part 2, verify:

# HDFS output should have Parquet files
docker exec -it spark-master \
  hdfs dfs -ls /output/sensor_tumbling/

# Checkpoint directory should have offsets/ and commits/
docker exec -it spark-master \
  hdfs dfs -ls /checkpoints/tumbling/

Verify Output

  • /output/sensor_tumbling/ should contain .parquet files.
  • /checkpoints/tumbling/offsets/ should contain numbered files.
  • If empty, inspect q1.lastProgress before continuing.

Lab Part 2 — Joins and Fault Tolerance

Stream-static, stream-stream, and checkpoint recovery


Activity 2: Static Join

Enrich the live sensor stream with a static metadata table

# sensor_metadata.csv: sensor_id, location, building, floor
sensor_meta = spark.read.csv(
    "hdfs:///datasets/sensors/sensor_metadata.csv",
    header=True, inferSchema=True)

enriched = readings.join(sensor_meta, on="sensor_id", how="left")

Run Static Join

query = (enriched.writeStream
    .format("console")
    .option("truncate", False)
    .trigger(processingTime="5 seconds")
    .start())
query.awaitTermination()

Key point: The static DataFrame is loaded once and, when small enough, broadcast to the executors. It does not need a watermark. New rows added to the CSV after the query starts will not be picked up automatically.
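The lookup semantics can be modeled in plain Python — a toy sketch with hypothetical metadata, not Spark's join machinery:

```python
# Toy model of the stream-static left join: the static table is loaded once;
# each micro-batch of streaming rows is looked up against it.
STATIC_META = {  # stands in for sensor_metadata.csv (hypothetical values)
    "s1": {"location": "lab-a", "building": "ENG", "floor": 2},
}
NULLS = {"location": None, "building": None, "floor": None}

def enrich(batch):
    """Left-join semantics: unknown sensor_ids keep null metadata columns."""
    return [{**row, **STATIC_META.get(row["sensor_id"], NULLS)}
            for row in batch]

enrich([{"sensor_id": "s1", "temperature": 21.5},
        {"sensor_id": "s9", "temperature": 19.0}])
```

Because `how="left"` keeps unmatched rows, a sensor missing from the metadata still flows through with null columns — the same behavior you should see in the console output.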


Activity 3: Stream-Stream Join

Match ad impressions to click events within a 5-minute window

from pyspark.sql.functions import expr

impression_schema = StructType([
    StructField("ad_id",    StringType()),
    StructField("user_id",  StringType()),
    StructField("shown_at", TimestampType()),
])
click_schema = StructType([
    StructField("ad_id",      StringType()),
    StructField("user_id",    StringType()),
    StructField("clicked_at", TimestampType()),
])

Activity 3: Read Streams

def read_topic(topic, schema):
    return (spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", "kafka-broker:9092")
        .option("subscribe", topic).option("startingOffsets", "latest")
        .load()
        .select(from_json(col("value").cast("string"), schema).alias("d"))
        .select("d.*"))

impressions = read_topic("impressions", impression_schema) \
    .withWatermark("shown_at",   "10 minutes")
clicks      = read_topic("clicks_ads",  click_schema) \
    .withWatermark("clicked_at", "10 minutes")

Activity 3: Join Run

# Join: same ad, same user, click within 5 minutes of impression.
# Alias both sides so the shared column names can be qualified in the
# join condition; the third positional argument of join() is `how`.
matched = impressions.alias("i").join(
    clicks.alias("c"),
    expr("""
        i.ad_id   = c.ad_id   AND
        i.user_id = c.user_id AND
        clicked_at BETWEEN shown_at AND shown_at + INTERVAL 5 MINUTES
    """),
    "leftOuter")

query = (matched.writeStream
    .format("console")
    .option("truncate", False)
    .outputMode("append")
    .option("checkpointLocation", "hdfs:///checkpoints/ad_join/")
    .trigger(processingTime="10 seconds")
    .start())
query.awaitTermination()

Why watermarks are required here: Both streams buffer rows waiting for a match. Without watermarks, Spark retains the full history of both streams — unbounded memory. Watermarks define when a row can no longer find a match and should be evicted.
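The buffering-and-eviction logic this paragraph describes can be sketched as a toy Python class — an illustration of the idea, not Spark's state store:

```python
# One side's join buffer: impressions wait for clicks; the watermark
# bounds how long they wait.
from datetime import datetime, timedelta

JOIN_WINDOW = timedelta(minutes=5)

class ImpressionBuffer:
    def __init__(self):
        self.rows = []  # (ad_id, user_id, shown_at)

    def add(self, impression):
        self.rows.append(impression)

    def match(self, ad_id, user_id, clicked_at):
        """A click matches an impression for the same ad/user within 5 min."""
        return [r for r in self.rows
                if r[0] == ad_id and r[1] == user_id
                and r[2] <= clicked_at <= r[2] + JOIN_WINDOW]

    def evict(self, watermark):
        """Clicks older than the watermark are dropped upstream, so an
        impression with shown_at + JOIN_WINDOW < watermark can never match."""
        keep = [r for r in self.rows if r[2] + JOIN_WINDOW >= watermark]
        evicted, self.rows = len(self.rows) - len(keep), keep
        return evicted
```

Without `evict()`, `rows` grows forever — exactly the unbounded-memory problem the watermark exists to prevent.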


Activity 4: Stop and Check

Stop the tumbling pipeline from Activity 1 and inspect offsets

# Step 1: stop the running query
q1.stop()

# Step 2: check the last committed batch number
# (lastProgress keeps its final value after stop())
print(q1.lastProgress["batchId"])

# Or list the commit files in HDFS:
# hdfs dfs -ls /checkpoints/tumbling/commits/

Activity 4: Restart

# Step 3: restart with the SAME checkpoint location
q1_restart = (tumbling.writeStream
    .outputMode("append")
    .format("parquet")
    .option("checkpointLocation", "hdfs:///checkpoints/tumbling/")  # same!
    .option("path", "hdfs:///output/sensor_tumbling/")
    .trigger(processingTime="10 seconds")
    .start())

# Spark reads the checkpoint, finds the last committed Kafka offset,
# and resumes from the next unconsumed message.
# lastProgress is None until the first restarted batch completes:
import time; time.sleep(15)
print(q1_restart.lastProgress)

Inspect Checkpoint

# Inspect the checkpoint directory structure
docker exec -it spark-master \
  hdfs dfs -ls /checkpoints/tumbling/

# offsets/5 — the Kafka offset record for batch 5
docker exec -it spark-master \
  hdfs dfs -cat /checkpoints/tumbling/offsets/5

Look for numbered files in offsets/ and commits/ — the highest commit number is where Spark will resume.


Checkpoint Anatomy

/checkpoints/tumbling/
  metadata          ← query ID, schema, configuration
  offsets/          ← Kafka partition offsets per batch (what was READ)
    0, 1, 2, 3 ...
  commits/          ← which batches completed successfully (what was WRITTEN)
    0, 1, 2, 3 ...
  state/            ← stateful operator state (window aggregates)

Exactly-once guarantee: On restart, Spark reads the highest commits/N, finds the corresponding offsets/N, and starts reading Kafka from the next offset. Any partial writes from the failed batch are discarded.
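That resume decision can be written as a short hedged sketch (directory names as in the diagram above; this is not Spark's actual recovery code):

```python
# Resume decision from a checkpoint directory laid out as offsets/N, commits/N.
import os

def resume_batch(checkpoint_dir):
    """Next batch to run = highest id in commits/ + 1 (or 0 if none).
    An offsets/N without a matching commits/N marks a batch that was read
    but never committed — it gets re-run from the recorded offsets."""
    def batch_ids(sub):
        path = os.path.join(checkpoint_dir, sub)
        if not os.path.isdir(path):
            return set()
        return {int(name) for name in os.listdir(path) if name.isdigit()}
    commits = batch_ids("commits")
    return max(commits) + 1 if commits else 0
```

So a checkpoint with offsets/0..2 but commits/0..1 resumes at batch 2: the offsets tell Spark what batch 2 should read, and the missing commit tells it the batch never finished.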


Debrief

What did we build and why does it matter?


Discussion Questions

Take 5 minutes — discuss with a neighbor:

  1. Output mode: Your pipeline aggregates clicks per user per hour and writes to a live dashboard. Users are refreshed every 30 seconds. Should you use append, update, or complete mode? Why?

  2. Watermark sizing: Your mobile app can buffer events for up to 30 minutes when offline. What watermark delay do you set? What's the memory cost of that choice?

  3. Stream-stream vs. stream-static: You're joining live transactions to a fraud blocklist. The blocklist is updated every 10 minutes. Which join type do you use and why?

  4. Checkpointing: You upgrade your Spark version. Can you restart the streaming job from the existing checkpoint? What risks exist?


Key Takeaways

  • Watermarks bound stateful memory — declare the maximum lateness you'll tolerate; Spark evicts older state
  • Output modes: Append (once, final, requires watermark) → Update (re-emits changed rows) → Complete (entire table, small cardinality only)
  • Stream-static join: dimension enrichment, broadcast once, no watermark needed
  • Stream-stream join: both sides buffer state; watermarks on both sides are required to prevent unbounded memory
  • Checkpointing = fault tolerance: restart resumes from last committed Kafka offset with no duplicates and no data loss

What's Missing?

Spark Structured Streaming processes live events at scale — but it wasn't built for everything


The Gaps Structured Streaming Leaves Open

  • Pipeline orchestration — who schedules the streaming job, retries it on failure, coordinates it with nightly batch ETL, and pages you at 3 AM when it stalls? → Apache Airflow
  • SQL-based batch transformations with lineage — transforming historical data with documented, tested, versioned SQL models is a batch concern → dbt
  • Production monitoring and SLA enforcement — query lag, state size, throughput, and watermark drift need dashboards and alerts → Prometheus + Grafana (Week 15)
  • True sub-millisecond latency — micro-batch floor is ~100ms; for <10ms requirements, Apache Flink (covered in Week 13) is the answer

What Comes Next

Gap Solution When
Schedule streaming + batch jobs together Apache Airflow DAGs Week 15
SQL-based data transformation with testing dbt models Week 15
Cluster health, query lag, cost visibility Prometheus + Grafana Week 15
SLA contracts on pipeline freshness SLO design patterns (beyond course scope)

Spark Structured Streaming is the right tool for micro-batch real-time processing — but production requires orchestration (Airflow to schedule it), transformation (dbt to model the historical side), and observability (Grafana to know when it breaks). That full stack is Week 15.


Homework Reminder

Streaming Pipeline homework — due Sunday 11:59 PM

Four tasks:

  1. Sensor tumbling pipeline — 1-minute windows to Parquet, 30-second watermark, append mode
  2. Sliding window comparison — 5-minute/1-minute sliding to console; written explanation of why windows repeat
  3. Late data experiment — inject events inside and past the watermark; explain what happened and why
  4. Checkpoint recovery — stop, restart, screenshot lastProgress before and after; explain how Spark knows where to resume

Submit: 4 Python scripts + 1 PDF with screenshots and written explanations


Instructor Notes

If the sensors topic is missing: docker exec -it kafka-broker kafka-topics --create --topic sensors --partitions 3 --replication-factor 1 --bootstrap-server localhost:9092

Draw the timeline on the board: advancing watermark value as events arrive. The watermark is a low-water mark — it advances as new events arrive, never goes backward.

Common mistake: complete mode on a high-cardinality key (millions of user IDs) writes millions of rows every 5 seconds. Always ask: "how many distinct keys are in my result?"

This table is worth memorizing for the quiz. The key insight: Append requires the system to know a row is "final" — which requires a watermark for aggregations. Without a watermark, aggregations can always be updated by late events, so Append is illegal.

Give students 10 minutes to run this and observe output. Then do the late event demo before they move on.

This is the most impactful demo of the session. Students often assume dropped events produce an error. The silent drop surprises them — which is exactly the point: you must choose your watermark carefully.

Stream-static join: supported in append and update output modes, with no watermark requirement. The static side is typically shipped to all executors as a broadcast when it is small enough. If the static table changes, you must restart the query.

The key observation: the batch number in lastProgress picks up where it left off, not from 0. And the Kafka startingOffsets are determined by the checkpoint, not by the query option.

Answers to the discussion prompts:
  1. Update — windows keep changing until finalized; complete would rewrite every user's row every 30 seconds.
  2. A 30-minute watermark; the memory cost scales with key cardinality × watermark duration.
  3. If a 10-minute refresh lag is acceptable, stream-static is simpler (restart to pick up changes); if you need immediate blocklist updates, stream-stream.
  4. Checkpoints are not guaranteed compatible across Spark versions — you may need to restart from scratch.