Spark Structured Streaming

CS 6500 — Week 14, Session 1

CS 6500 — Big Data Analytics | Week 14

A Card Swipes in a Store — 200ms Later, Approved or Denied

"A stolen credit card makes a purchase at 9:00:00 AM. Your fraud detection batch job runs at 9:30 AM. By then, the fraudster has made eleven more transactions. How do you catch the first one before it clears?"


Today's Answer: Process Events as They Arrive

Batch jobs are excellent for historical analysis — but some decisions can't wait 30 minutes.

Session 1 — The Foundation

  • Why batch processing fails for latency-sensitive use cases
  • Micro-batch vs. continuous processing models
  • The Structured Streaming unified API
  • Input sources and output sinks
  • Live demo: Kafka → Spark → console
  • Tumbling, sliding, and session windows

Session 2 — Putting It to Work

  • Watermarks: handling late-arriving data
  • Output modes: append, update, complete
  • Stream-static and stream-stream joins
  • Checkpointing and fault-tolerant restarts
  • Hands-on lab with IoT sensor data

Structured Streaming is the same DataFrame API you already know — extended to an unbounded, continuously arriving dataset.


Week 12 Recap

What you built in Week 12:

  • Kafka topics, partitions, replication
  • Python producers publishing JSON events
  • Python consumers with manual offset control
  • Consumer group rebalancing

The connection to this week:

  • Kafka = the pipe that carries events
  • Structured Streaming = the processor at the end of the pipe
  • Offsets you tracked manually → Spark tracks automatically
  • Consumer groups → Spark manages its own group

Assignment 3 (NoSQL Design + Query Federation) was due at the end of Week 12. Check Canvas if you have not yet submitted.


Why Batch Isn't Enough

Some decisions expire before the job finishes


The Batch Problem

What happens when the answer arrives too late?

Scenario            | Batch lag        | Real cost
Fraud detection     | 30-minute batch  | 11 more fraudulent transactions cleared
Ride-sharing demand | Hourly batch     | Surge prediction arrives after the rush
E-commerce trending | Daily batch      | Viral tweet at 2:45 PM missed until tomorrow
Security intrusion  | Nightly log scan | Attacker has been inside for 8 hours

The fundamental mismatch: Batch processing treats data as a bounded snapshot. The real world produces events continuously.


Streaming vs. Batch — The Core Difference

Batch

  • Bounded dataset (finite file or table)
  • Process → produce output → stop
  • Latency: minutes to hours
  • Best for: nightly ETL, historical reports, ML training

Streaming

  • Unbounded dataset (never stops arriving)
  • Process continuously → produce incremental output
  • Latency: milliseconds to seconds
  • Best for: fraud signals, live dashboards, IoT telemetry

The practical boundary is blurry: Micro-batch at 5-second intervals covers most "near real-time" needs without the complexity of true continuous processing.


Micro-Batch vs. Continuous Processing

Micro-batch (Spark default)

  • Collect events for a trigger interval (e.g., 5 sec)
  • Run a Spark job on the mini-batch
  • Emit results, advance offsets
  • Minimum latency: ~100ms
  • Exactly-once with Kafka + idempotent sinks
  • Supports all aggregations and joins

Continuous processing (experimental)

  • Process each row as it arrives
  • Sub-millisecond latency
  • Supports only stateless map/filter — no aggregations
  • At-least-once semantics only
  • Rarely used in production

Rule of thumb: Use micro-batch for 99% of production workloads. The latency is low enough and the semantics are much stronger.


When to Choose Streaming

Use Structured Streaming when:

  • Sub-minute latency is required
  • Events must trigger real-time actions
  • Data volume is too high to store before processing
  • Windowed aggregations over live event streams

Stick with batch when:

  • Processing historical data
  • Training ML models
  • Full-table joins across large datasets
  • Nightly ETL with no latency requirement

The question to ask: "What is the business cost of a 5-minute delay?" If the answer is "nothing significant" — batch is simpler.


Structured Streaming Architecture

The same DataFrame API — now applied to an endless table


The Unified API

The key insight: A streaming DataFrame is just a table that keeps growing.

Batch

df = spark.read.csv("input/")
result = df.groupBy("user_id").count()
result.write.parquet("output/")

Streaming

df = spark.readStream.format("kafka")...
result = df.groupBy("user_id").count()
result.writeStream.format("console").start()

Same groupBy, same count, same Catalyst optimizer. The only differences: readStream instead of read, writeStream instead of write, and .start() to begin continuous execution.


Input Sources

Source         | format(...)              | Use Case
Kafka          | "kafka"                  | Event streams — the production standard
File (HDFS/S3) | "parquet", "csv", "json" | Incremental file landing (ETL pipelines)
Socket         | "socket"                 | Development and testing only (no fault tolerance)
Rate           | "rate"                   | Benchmarking and demos (generates synthetic rows)

Kafka is the correct answer for production. Socket should never leave your laptop.

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")
    .option("subscribe", "clicks")
    .option("startingOffsets", "latest")
    .load())

Output Sinks

Sink          | format(...)      | Use Case
Console       | "console"        | Debugging — never in production
Parquet / CSV | "parquet", "csv" | Persistent data lake output
Kafka         | "kafka"          | Downstream stream consumers
Memory        | "memory"         | Unit testing only
Delta Lake    | "delta"          | Production ACID streaming sink

query = (result.writeStream
    .format("console")
    .option("truncate", False)
    .trigger(processingTime="5 seconds")
    .start())

The Streaming Query Lifecycle

  1. Driver defines the plan: readStream → transformations → writeStream
  2. query.start() — query runs in a background thread
  3. Each trigger interval:
    • Read new data from source (Kafka offsets, new files)
    • Run an incremental Spark job on the micro-batch
    • Write output to sink
    • Commit offsets and checkpoint state
  4. On failure: restart from last checkpoint, replay Kafka offsets from last committed position — no data lost, no duplicates
query.lastProgress   # stats for the most recent micro-batch
query.status         # "is the query running / waiting / error?"
query.stop()         # graceful shutdown
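Step 4 is easiest to see in miniature. Here is a plain-Python sketch (no Spark; `run_trigger`, `source`, and `sink` are illustrative stand-ins, not Spark APIs) of why the output write happens before the offset commit:

```python
# Miniature of one micro-batch trigger loop (plain Python, no Spark).
# The offset advances only AFTER the sink write succeeds, so a crash
# between "write" and "commit" replays the batch instead of dropping it.

def run_trigger(source, committed_offset, sink, batch_size=3):
    """Process one micro-batch; return the new committed offset."""
    batch = source[committed_offset:committed_offset + batch_size]
    if not batch:
        return committed_offset                # nothing new this interval
    sink.extend(e.upper() for e in batch)      # 1) write output to the sink
    return committed_offset + len(batch)       # 2) only then commit offsets

source = ["a", "b", "c", "d", "e"]             # events already in the topic
sink, offset = [], 0
offset = run_trigger(source, offset, sink)     # batch 0 processes a, b, c
offset = run_trigger(source, offset, sink)     # batch 1 processes d, e
print(sink, offset)
```

If the process died after the sink write but before the commit, a restart with the old offset would rewrite the same rows — which is why exactly-once results also require an idempotent or transactional sink (e.g., Delta).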

Demo — Kafka → Spark

Building the first streaming pipeline


Demo: Setup

# Verify Kafka and Spark are running
docker-compose ps

# From docker/ directory: reset Week 14 topics and produce one smoke-test event
make week14-reset

# Terminal 2: run automated click producer inside Docker
docker exec -it jupyter env KAFKA_BOOTSTRAP=kafka-broker:9092 \
    python /home/jovyan/week14_clicks/produce_clicks.py --rate 2

# Optional: send exactly 100 events then stop
docker exec -it jupyter env KAFKA_BOOTSTRAP=kafka-broker:9092 \
    python /home/jovyan/week14_clicks/produce_clicks.py --rate 5 --count 100

Keep the producer terminal open — Spark will continuously ingest events while it runs.


Demo: Session Setup

# streaming_demo.py — run in PySpark shell
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json, to_timestamp
from pyspark.sql.types import StructType, StructField, StringType

spark = SparkSession.builder \
    .appName("ClickStreamDemo") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.5.7") \
    .getOrCreate()

Run this either in the PySpark shell (pyspark --packages ...) or in a Jupyter Notebook cell at http://localhost:8888 (token: bigdata).


Demo: Define Schema

schema = StructType([
    StructField("user_id",    StringType()),
    StructField("page",       StringType()),
    StructField("event_time", StringType()),
])

raw = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")
    .option("subscribe", "clicks")
    .option("startingOffsets", "latest")
    .load())

Demo: Parse JSON and Start Query

clicks = (raw
    .select(from_json(col("value").cast("string"), schema).alias("d"))
    .select("d.*")
    .withColumn("event_time", to_timestamp("event_time")))

query = (clicks.writeStream
    .format("console")
    .option("truncate", False)
    .trigger(processingTime="5 seconds")
    .start())

query.awaitTermination()

Sample events the producer publishes (or type them yourself via a Kafka console producer):

{"user_id": "u1", "page": "/home",    "event_time": "2026-03-17 09:00:01"}
{"user_id": "u2", "page": "/product", "event_time": "2026-03-17 09:00:03"}
{"user_id": "u1", "page": "/cart",    "event_time": "2026-03-17 09:00:07"}

Output appears ~5 seconds after producing. Each micro-batch shows a batch number.
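Locally, `from_json` and `to_timestamp` do the same work as the Python standard library; a quick sanity check of the first demo event:

```python
import json
from datetime import datetime

line = '{"user_id": "u1", "page": "/home", "event_time": "2026-03-17 09:00:01"}'
event = json.loads(line)                      # dict of fields, like from_json
ts = datetime.strptime(event["event_time"],   # parse string, like to_timestamp
                       "%Y-%m-%d %H:%M:%S")
print(event["user_id"], event["page"], ts)
```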


Windowing on Event Time

Aggregating streams over time intervals


Event Time vs. Processing Time

Processing time

  • When Spark receives the event
  • Easy — just the cluster's wall clock
  • Problem: mobile apps buffer events offline
    • User clicks at 9:00 AM
    • Phone reconnects at 9:15 AM
    • Processing time: 9:15 AM ❌

Event time

  • When the event actually occurred (in the payload)
  • Requires a timestamp field in the data
  • Correctly places the click in the 9:00–9:05 window ✓
  • Requires watermarks to handle late arrivals

Rule: Always window on event time when the timestamp is available. Processing time is only for synthetic/benchmarking streams.
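The mobile-buffering case above, as a plain-Python sketch (`floor_to_window` is a hypothetical helper, not a Spark API):

```python
from datetime import datetime, timedelta

def floor_to_window(ts, minutes=5):
    """Floor a timestamp to the start of its 5-minute tumbling window."""
    return ts - timedelta(minutes=ts.minute % minutes,
                          seconds=ts.second, microseconds=ts.microsecond)

event_time      = datetime(2026, 3, 17, 9, 0, 30)   # user clicked at 9:00:30
processing_time = datetime(2026, 3, 17, 9, 15, 12)  # phone reconnected at 9:15

print(floor_to_window(event_time))       # 9:00-9:05 window — correct
print(floor_to_window(processing_time))  # 9:15-9:20 window — wrong
```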


Tumbling Windows

Fixed-size, non-overlapping intervals. Each event belongs to exactly one window.

|---5min---|---5min---|---5min---|
[9:00–9:05][9:05–9:10][9:10–9:15]
from pyspark.sql.functions import window

windowed_counts = clicks.groupBy(
    window(col("event_time"), "5 minutes"),
    col("page")
).count()

Use case: periodic, non-overlapping summaries (billing intervals, per-minute snapshots).
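A plain-Python mirror of the groupBy above (a `Counter` stands in for Spark's streaming state store; the events are the demo clicks):

```python
from collections import Counter
from datetime import datetime

def bucket(ts, minutes=5):
    """Assign an event to exactly one tumbling window (its start time)."""
    return ts.replace(minute=ts.minute - ts.minute % minutes,
                      second=0, microsecond=0)

clicks = [
    (datetime(2026, 3, 17, 9, 0, 1), "/home"),
    (datetime(2026, 3, 17, 9, 0, 3), "/product"),
    (datetime(2026, 3, 17, 9, 0, 7), "/home"),
    (datetime(2026, 3, 17, 9, 6, 0), "/home"),   # falls in the 9:05 window
]
counts = Counter((bucket(ts), page) for ts, page in clicks)
for (win, page), n in sorted(counts.items()):
    print(win, page, n)
```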


Run Tumbling

tumbling_query = (windowed_counts.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", False)
    .trigger(processingTime="5 seconds")
    .start())

Output appears every trigger interval with updated window counts.


Sliding Windows

Overlapping intervals. Each event may belong to multiple windows.

|---10min--|
    |---10min--|
        |---10min--|
← slide: 5 min →
sliding_counts = clicks.groupBy(
    window(col("event_time"), "10 minutes", "5 minutes"),
    col("page")
).count()

sliding_query = (sliding_counts.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", False)
    .trigger(processingTime="5 seconds")
    .start())
# window(time_column, window_duration, slide_duration)

Use case: "Rolling 10-minute click rate, updated every 5 minutes" — smoothed metrics, trend detection, moving averages. Expect more output rows than tumbling (each event counted in multiple windows).
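Window membership can be sketched in plain Python (`containing_windows` is an illustrative helper, not a Spark API): with a 10-minute window sliding every 5 minutes, each event belongs to 10/5 = 2 windows.

```python
from datetime import datetime, timedelta

def containing_windows(ts, window_min=10, slide_min=5):
    """Start times of every sliding window that contains ts."""
    minute_of_day = ts.hour * 60 + ts.minute
    start = minute_of_day - minute_of_day % slide_min  # newest aligned start
    starts = []
    while start > minute_of_day - window_min:          # window still covers ts
        starts.append(ts.replace(hour=start // 60, minute=start % 60,
                                 second=0, microsecond=0))
        start -= slide_min
    return sorted(starts)

ts = datetime(2026, 3, 17, 9, 7, 0)
for s in containing_windows(ts):
    print(s, "to", s + timedelta(minutes=10))   # 9:00-9:10 and 9:05-9:15
```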


Session Windows

Gap-based intervals — window closes after a period of inactivity.

from pyspark.sql.functions import session_window

session_counts = clicks.groupBy(
    session_window(col("event_time"), "10 minutes"),
    col("user_id")
).count()

session_query = (session_counts.writeStream
    .outputMode("update")
    .format("console")
    .option("truncate", False)
    .trigger(processingTime="5 seconds")
    .start())
# window closes if no event arrives within 10 minutes

Before starting the next example, stop the previous one (tumbling_query.stop(), sliding_query.stop(), or session_query.stop()) to avoid mixed console output.

user1: [9:00]---[9:02]---[9:04]     [gap > 10min]     [9:25]---[9:27]
       |_______ session 1 _________|                   |__ session 2 __|

Use case: User session analytics, variable-length activity windows. Each user gets their own dynamically-sized session.
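The gap rule in the diagram, as a plain-Python sketch (`sessionize` is an illustrative helper; Spark's `session_window` maintains this state per key, incrementally):

```python
from datetime import datetime, timedelta

def sessionize(times, gap=timedelta(minutes=10)):
    """Split a sorted list of event times into sessions at gaps > `gap`."""
    sessions, current = [], [times[0]]
    for t in times[1:]:
        if t - current[-1] > gap:        # inactivity gap closes the session
            sessions.append(current)
            current = [t]
        else:
            current.append(t)
    sessions.append(current)
    return sessions

# user1's clicks from the diagram: 9:00, 9:02, 9:04, then 9:25, 9:27
clicks = [datetime(2026, 3, 17, 9, m) for m in (0, 2, 4, 25, 27)]
print([len(s) for s in sessionize(clicks)])   # two sessions: 3 events, then 2
```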


Choosing the Right Window

Window Type | Shape               | Event Membership     | When to Use
Tumbling    | Fixed, aligned      | Exactly one window   | Periodic reports, billing intervals
Sliding     | Fixed, overlapping  | Multiple windows     | Smoothed metrics, rolling averages
Session     | Variable, gap-based | One session per user | UX analytics, user journey analysis

Quick decision rule:

  • "Every N minutes, give me a fresh count" → Tumbling
  • "Every 5 minutes, give me the last 15 minutes" → Sliding
  • "How long was each user active?" → Session

Activity: Window Design Challenge

Time: 8 minutes | Individual

For each scenario below, choose the best window type and explain why:

  1. A payment processor wants to count transactions per merchant per hour to detect unusual volume spikes
  2. A streaming platform wants a 30-second rolling average of concurrent viewers, updated every 10 seconds
  3. An e-commerce site wants to track each customer's shopping session — from first page view until 20 minutes of inactivity
  4. A data center monitors server CPU every 10 seconds and wants peak CPU per 5-minute interval

Write your answers — we'll discuss as a class.


Key Takeaways

  • Structured Streaming = batch DataFrame API + readStream/writeStream — the same transformations, same Catalyst, same SQL
  • Micro-batch collects events over a trigger interval, runs a Spark job, emits results — ~100ms minimum latency, exactly-once semantics
  • Event time is when the event occurred (in the payload); processing time is when Spark saw it — always prefer event time
  • Tumbling: one window per interval, non-overlapping
  • Sliding: overlapping windows — each event counted in multiple windows
  • Session: gap-based, dynamic per-entity — closes after inactivity

Session 2: Watermarks, output modes, stream joins, checkpointing — and 60 minutes of hands-on lab


What's Missing?

Spark Structured Streaming processes live events at scale — but it wasn't built for everything


The Gaps Structured Streaming Leaves Open

  • Sub-millisecond latency — micro-batch minimum is ~100ms; financial trading and telemetry alerting need true per-record processing (covered in Week 13)
  • Complex batch transformations — SQL-based data modeling, lineage tracking, and test-driven transformation pipelines are not a streaming concern → dbt
  • Pipeline orchestration — who schedules the streaming job, retries it on failure, coordinates it with batch ETL, and alerts on SLA violations? → Airflow
  • Cross-source federated queries — joining a live stream with historical S3 data at query time → Trino / Athena

What Comes Next

Gap                                       | Solution             | When
Batch SQL transformations with lineage    | dbt                  | Week 15
Pipeline scheduling and retry logic       | Apache Airflow       | Week 15
Production monitoring and SLA enforcement | Prometheus + Grafana | (beyond course scope)
Cloud-managed streaming at scale          | AWS Kinesis / MSK    | (reference)

Spark Structured Streaming is the right tool for micro-batch real-time processing — but a production data platform wraps it with orchestration (Airflow), batch transformation (dbt), and observability (Grafana). That full stack is Week 15.


Instructor Notes

Warm-up question: "What's the difference between a Kafka consumer and a streaming processor?" Answer: a consumer reads events and handles them one at a time in application code; a streaming processor applies transformations, aggregations, and joins over a continuous stream using a query engine.

Ask: "Which of these would your company lose money over in a real-time vs. 30-minute scenario?"

Note: continuous processing was added in Spark 2.3 but has seen limited adoption. Flink owns the true-streaming market.

Point out: if students already know the DataFrame API from weeks 5-6, they already know ~80% of Structured Streaming. The new concepts are sources, sinks, triggers, output modes, and watermarks.

Trigger options: processingTime="5 seconds" (most common), once=True (process all available, stop), availableNow=True (Spark 3.3+), continuous="1 second" (experimental).

Emphasize: the checkpoint is what makes streaming reliable. Without it, a restart would re-process from the beginning or skip events.

5 minutes. Get this running before class. If Kafka is slow to start, have a backup socket-source version ready.

Point out: raw.isStreaming == True. The DataFrame API is identical — just readStream instead of read.

Watch for: students not seeing output. Usually: wrong topic name, Kafka not running, or startingOffsets="latest" when data was already there. Switch to "earliest" if needed.

Classic misconception: "why can't I just use when Spark sees it?" — because network delays, mobile buffering, and batch uploads mean events arrive out of order. Event time is ground truth.

Session windows were added in Spark 3.2. Require watermarks in production — otherwise state grows unboundedly.

Answers: 1=Tumbling 1hr, 2=Sliding 30s/10s, 3=Session 20min, 4=Tumbling 5min. Ask students to justify their choice — the "why" matters more than the label.