Spark DataFrames & the Structured API

CS 6500 — Week 6, Session 1

Week 5 Recap

RDD Essentials:

  • Transformations (lazy) vs. Actions (eager)
  • Pair RDDs: prefer reduceByKey over groupByKey
  • Caching for repeated queries
  • Spark UI for debugging

How did the Spark RDD homework go? Any questions on RDD programming?

Project proposals submitted? Feedback coming next week.

The Problem with RDDs

# RDD: Calculate average price per category
rdd.map(lambda l: l.split(",")) \
   .map(lambda p: (p[3], (float(p[5]), 1))) \
   .reduceByKey(lambda a, b: (a[0]+b[0], a[1]+b[1])) \
   .mapValues(lambda x: x[0]/x[1])

What's wrong here?

  • Spark has no idea what's in those tuples
  • Can't optimize — doesn't know column types or intent
  • Errors only caught at runtime (wrong index → crash)
  • Verbose and hard to read

The DataFrame Solution

# DataFrame: Same computation
df.groupBy("category").agg(avg("price"))

What changed?

  • Spark knows the schema (column names, types)
  • Catalyst optimizer rewrites your query for best performance
  • Errors caught at planning time (misspelled column → immediate error)
  • Concise and readable

Why DataFrames?

Dimension         RDD                         DataFrame
Schema            None (opaque objects)       Named, typed columns
Optimization      Manual (you optimize)       Automatic (Catalyst)
Code length       Verbose lambdas             Concise, declarative
Error detection   Runtime                     Planning time
Cross-language    API differs per language    Same API everywhere

Rule of thumb: Use DataFrames by default. Use RDDs when you need low-level control.

Spark's Evolution

2012: RDDs          → Low-level, full control
2015: DataFrames    → Schema + optimization
2016: Datasets      → Type-safe DataFrames (Scala/Java only)
Today: DataFrames   → Recommended for most use cases

DataFrames aren't replacing RDDs — they're built on top of them.

Every DataFrame operation compiles down to optimized RDD code under the hood.

What Is a DataFrame?

A distributed collection of Row objects organized into named columns.

  • Like a SQL table — columns with names and types
  • Like a Pandas DataFrame — but distributed across a cluster
  • Like an RDD — immutable, lazy, partitioned
+-------+-----+-------------+
| name  | age | department  |
+-------+-----+-------------+
| Alice | 28  | Engineering |
| Bob   | 35  | Sales       |
+-------+-----+-------------+

Schema: name: String, age: Integer, department: String

SparkSession: The New Entry Point

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("DataFrame Demo") \
    .getOrCreate()

SparkSession replaces SparkContext for structured APIs.

  • Combines SQLContext + HiveContext + SparkContext
  • One unified entry point for all Spark operations
  • Access SparkContext via spark.sparkContext when needed

Creating DataFrames

From a collection:

data = [("Alice", 28, "Engineering"), ("Bob", 35, "Sales")]
df = spark.createDataFrame(data, ["name", "age", "department"])

From a CSV file:

df = spark.read.csv("hdfs:///datasets/data.csv",
                     header=True, inferSchema=True)

From JSON / Parquet:

df = spark.read.json("hdfs:///datasets/data.json")
df = spark.read.parquet("hdfs:///datasets/data.parquet")

Schema: Inferred vs. Explicit

                  Inferred                    Explicit
How               inferSchema=True            Pass a StructType object
Speed             Slower (extra data pass)    Faster (no scan needed)
Safety            May guess wrong types       Guaranteed correct types
Use for           Exploration                 Production

# Explicit schema (preferred for production)
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

schema = StructType([StructField("name", StringType()),
                     StructField("age", IntegerType())])
df = spark.read.csv("data.csv", header=True, schema=schema)

Inspecting a DataFrame

df.printSchema()     # Column names and types
df.show(5)           # First 5 rows (formatted table)
df.columns           # List of column names
df.count()           # Total row count (action!)
df.describe().show() # Summary statistics

These are your first tools when exploring any dataset.

Live Demo: Loading Transaction Data

transactions = spark.read.csv(
    "hdfs:///datasets/ecommerce/transactions.csv",
    header=True, inferSchema=True)

transactions.printSchema()
transactions.show(5)
print(f"Rows: {transactions.count()}")

Basic Operations: Select & Filter

Select columns:

transactions.select("user_id", "category", "price").show(5)

Filter rows:

# Column object syntax
transactions.filter(transactions["price"] > 100).show(5)

# SQL string syntax (same result)
transactions.filter("price > 100").show(5)

# col() function (preferred for complex expressions)
from pyspark.sql.functions import col
transactions.filter(col("price") > 100).show(5)

Adding & Transforming Columns

# Add / compute a column
df = transactions.withColumn("tax", col("price") * 0.08)
df = df.withColumn("total", col("price") * col("quantity"))

# Rename / drop
df = df.withColumnRenamed("price", "unit_price")
df = df.drop("unnecessary_column")

Key pattern: withColumn adds or replaces; returns a new DataFrame (immutable).

Sorting and Limiting

Sort (ascending):

transactions.orderBy("price").show(5)

Sort (descending):

transactions.orderBy(col("price").desc()).show(5)

Top N results:

transactions.orderBy(col("price").desc()).limit(10).show()

All transformations — lazy until an action triggers execution.

Column Selection: Three Syntaxes

# 1. String (simplest)
df.select("name", "age")

# 2. Bracket notation
df.select(df["name"], df["age"])

# 3. col() function (best for expressions)
from pyspark.sql.functions import col
df.select(col("name"), (col("age") + 1).alias("next_age"))

Recommendation: Use col() for anything beyond simple selects. It handles expressions, aliases, and complex logic cleanly.

Activity: DataFrame Exploration (15 min)

Using the weather dataset from Week 5:

  1. Read hdfs:///datasets/weather/observations.csv into a DataFrame
  2. Display schema and first 10 rows
  3. Filter for temperatures below freezing (< 32°F)
  4. Add a column temp_celsius = (temperature − 32) × 5/9
  5. Count observations with precipitation > 0

Compare: How does this feel vs. the RDD approach from last week?

Deliverable: Working code snippet

Activity Solution (1/2) — Load, Inspect & Filter

from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("Weather Exploration").getOrCreate()

# 1. Read into a DataFrame
weather = spark.read.csv("hdfs:///datasets/weather/observations.csv",
                         header=True, inferSchema=True)

# 2. Display schema and first 10 rows
weather.printSchema()
weather.show(10)

# 3. Filter for temperatures below freezing
freezing = weather.filter(col("temperature") < 32)
freezing.show(5)
print(f"Below-freezing observations: {freezing.count()}")

Activity Solution (2/2) — Transform & Aggregate

# 4. Add Celsius column
weather_c = weather.withColumn("temp_celsius",
    (col("temperature") - 32) * 5 / 9)
weather_c.select("station_id", "date",
                 "temperature", "temp_celsius").show(5)

# 5. Count observations with precipitation > 0
rainy = weather.filter(col("precipitation") > 0).count()
print(f"Observations with precipitation: {rainy}")

Compare with RDDs: No tuples, no lambdas, no manual indexing.

Ask: "How many lines would this be with RDDs?" — much longer and harder to read.

Activity Debrief

Common observations:

  • printSchema() immediately shows column names — no guessing indexes
  • Filter syntax reads like English: col("temp") < 32
  • No tuples, no lambdas for basic operations
  • Errors are more descriptive (column name vs. index out of range)

Key takeaway: DataFrames let you focus on what you want, not how to get it.

RDD vs. DataFrame: Temperature Average

RDD (Week 5):

data.map(lambda l: l.split(',')) \
    .map(lambda p: (p[0], (float(p[2]), 1))) \
    .reduceByKey(lambda a,b: (a[0]+b[0], a[1]+b[1])) \
    .mapValues(lambda x: x[0]/x[1])

DataFrame (today):

df.groupBy("station_id").agg(avg("temperature"))

Same result. One line. Automatically optimized.

Key Takeaways

  1. DataFrames = RDDs + schema + optimizer — not a replacement, an upgrade
  2. SparkSession is the unified entry point for structured APIs
  3. Explicit schemas beat inference in production (faster, safer)
  4. Declarative API — say what you want, Catalyst figures out how
  5. Multiple syntaxes for column access; col() is most versatile

Preview: Session 2

Spark SQL and advanced operations:

  • Register DataFrames as SQL tables — query with familiar SQL
  • Aggregations: groupBy, agg, multi-function summaries
  • Joins: inner, left, right, cross
  • Window functions: ranking, running totals, lag/lead
  • The Catalyst optimizer — how Spark optimizes your queries

Come ready to write SQL on big data!

References

  • Armbrust et al. (2015): "Spark SQL: Relational Data Processing in Spark"
  • Learning Spark, 2nd Edition (Chapter 4: Structured APIs)
  • Apache Spark Documentation: SQL Programming Guide
  • PySpark API: pyspark.sql.DataFrame

Speaker context: Students have spent a week wrestling with RDDs—writing lambdas, managing tuples, and manually optimizing. Now we introduce DataFrames: same distributed engine, but with schema awareness and automatic optimization. The "aha" moment is when they see Catalyst produce better plans than their hand-tuned RDD code. Frame DataFrames not as replacing RDDs but as a higher-level abstraction *built on* RDDs.

Speaker notes: Run live in Jupyter. After printSchema(), point out the inferred types. Ask: "What if price was inferred as String instead of Double? What would happen?" → Aggregations would fail. This motivates explicit schemas.

Speaker notes: Steps 1-3 cover reading data and filtering. Point out how printSchema() immediately tells you the column types — no guessing indexes. The filter reads like plain English.

Speaker notes: Steps 4-5 show withColumn for derived columns and filter+count for aggregation. This is a good moment to reinforce that DataFrames let you focus on *what* you want, not *how* to get it.