Speaker context: Students can now create DataFrames, select, filter, and add columns. This session elevates their skills: SQL queries on distributed data, multi-function aggregations, joins, and window functions. The Catalyst optimizer section shows *why* declarative code wins—Spark rewrites their queries automatically. End with RDD↔DataFrame conversion so they know when to drop down to RDDs. Remind: Assignment 2 preview next week.
Speaker notes: Run both versions (SQL string and DataFrame API) live, then show .explain() for each. The physical plans will be identical. "This is Catalyst at work—it doesn't care how you wrote the query, it optimizes the same way."
Speaker notes: Run .explain(True) live and walk through the physical plan. Point out "PushedFilters" in the scan node. "Spark read fewer bytes from disk because it pushed your filter down to the data source. You didn't ask for this—Catalyst did it automatically."
Speaker notes: Poll the class. Typically SQL-experienced students prefer SQL strings; CS students prefer the DataFrame API. Emphasize: there's no wrong answer—both compile to the same optimized plan.