Speaker context: Students have spent a week wrestling with RDDs—writing lambdas, managing tuples, and manually optimizing. Now we introduce DataFrames: same distributed engine, but with schema awareness and automatic optimization. The "aha" moment is when they see Catalyst produce better plans than their hand-tuned RDD code. Frame DataFrames not as replacing RDDs but as a higher-level abstraction *built on* RDDs.
Speaker notes: Run live in Jupyter. After printSchema(), point out the inferred types. Ask: "What if price were inferred as String instead of Double? What would happen?" → Numeric aggregations would break or give wrong results. This motivates explicit schemas.
Speaker notes: Steps 1-3 cover reading data and filtering. Point out how printSchema() immediately tells you the column names and types — no guessing at tuple indexes like with RDDs. The filter reads like plain English.
Speaker notes: Steps 4-5 show withColumn for derived columns and filter+count for aggregation. This is a good moment to reinforce that DataFrames let you focus on *what* you want, not *how* to get it.