Spark SQL & Advanced DataFrame Operations

CS 6500 — Week 6, Session 2

What Is Spark SQL?

Module for structured data processing with SQL syntax.

  • Write standard SQL against distributed data
  • Same Catalyst optimizer as DataFrame API
  • Supports Hive-compatible queries
  • Unified engine: batch + interactive queries

Key idea: Register a DataFrame as a table → query it with SQL.

Temporary Views

transactions.createOrReplaceTempView("transactions")

result = spark.sql("""
    SELECT category, COUNT(*) as cnt
    FROM transactions
    GROUP BY category
    ORDER BY cnt DESC
""")
result.show()

View scope:

  • createOrReplaceTempView — session-scoped (disappears when session ends)
  • createOrReplaceGlobalTempView — shared across all sessions in the application (query it via the reserved global_temp database)

SQL vs. DataFrame API: Same Result

SQL:

spark.sql("""
    SELECT category, AVG(price) as avg_price
    FROM transactions GROUP BY category
""")

DataFrame API:

transactions.groupBy("category") \
    .agg(avg("price").alias("avg_price"))

Both compile to the same execution plan. Choose whichever reads better for your use case.

Live Demo: Aggregation Queries

# Register view
transactions.createOrReplaceTempView("transactions")

# Revenue by category
spark.sql("""
    SELECT category,
           COUNT(*) as num_txns,
           ROUND(AVG(price), 2) as avg_price,
           SUM(price) as total_revenue
    FROM transactions
    GROUP BY category
    ORDER BY total_revenue DESC
""").show()

If you know SQL, you already know Spark SQL.

Multi-Function Aggregation (DataFrame API)

from pyspark.sql.functions import *

summary = transactions.groupBy("category").agg(
    count("*").alias("count"),
    min("price").alias("min_price"),
    max("price").alias("max_price"),
    avg("price").alias("avg_price"),
    sum("price").alias("total_revenue")
)
summary.show()

Tip: import pyspark.sql.functions as F to access avg, sum, count, etc. The wildcard import above also works, but it shadows Python's built-in min, max, and sum.

Joins in Spark SQL

users.createOrReplaceTempView("users")

spark.sql("""
    SELECT u.name, u.state,
           COUNT(t.transaction_id) as purchases,
           SUM(t.price) as total_spent
    FROM users u
    LEFT JOIN transactions t ON u.user_id = t.user_id
    GROUP BY u.name, u.state
    ORDER BY total_spent DESC
""").show()

Supported join types: inner, left, right, full, cross, semi, anti

Joins with DataFrame API

from pyspark.sql.functions import col, count, sum

result = users.join(transactions, "user_id", "left") \
    .groupBy("name", "state") \
    .agg(
        count("transaction_id").alias("purchases"),
        sum("price").alias("total_spent")
    ).orderBy(col("total_spent").desc())

result.show()

Common pitfall: Ambiguous column names after join.
Fix: use df["col"] or rename before joining.

The Catalyst Optimizer

Four-phase query optimization:

Your Code → Analysis → Logical Optimization → Physical Planning → Code Gen
              ↓              ↓                      ↓               ↓
         Resolve names   Predicate pushdown    Join strategy    JVM bytecode
         & types         Constant folding      Broadcast vs.   Whole-stage
                         Projection pruning    shuffle join     code gen

You write what you want. Catalyst decides how to execute it efficiently.

Catalyst in Action

What you write:

df.filter(col("price") > 100).select("category", "price")

What Catalyst does:

  1. Projection pruning — reads only category and price columns from source
  2. Predicate pushdown — applies filter at data source level (skips irrelevant data)
  3. Reorders operations for minimum data movement

See the plan:

df.filter(col("price") > 100).select("category", "price").explain(True)

Why Catalyst Matters

Approach          Developer effort      Performance
Hand-tuned RDD    High (you optimize)   Good (if you're skilled)
Naive DataFrame   Low (declarative)     Good (Catalyst optimizes)
Tuned DataFrame   Medium                Excellent

Key insight: Declarative code + optimizer often beats hand-optimized imperative code.

Write readable code. Let Catalyst do the heavy lifting.

Window Functions: The Power Tool

Window functions compute values across a "window" of rows related to the current row.

Use cases:

  • Ranking — top N per group
  • Running totals — cumulative sum
  • Moving averages — sliding window stats
  • Lag/Lead — compare to previous/next row

SQL equivalent: OVER (PARTITION BY ... ORDER BY ...)

Window Function Syntax

from pyspark.sql.functions import col, rank
from pyspark.sql.window import Window

# Define the window
w = Window.partitionBy("category") \
           .orderBy(col("price").desc())

# Apply a window function
ranked = transactions.withColumn(
    "price_rank", rank().over(w))

# Top 3 most expensive per category
ranked.filter(col("price_rank") <= 3).show()

Three parts: partition (group), order (sort), function (compute).

Running Total Example

w = Window.partitionBy("user_id") \
           .orderBy("timestamp") \
           .rowsBetween(Window.unboundedPreceding,
                        Window.currentRow)

result = transactions.withColumn(
    "running_total", sum("price").over(w))

user_id  timestamp  price  running_total
U501     01-15      50     50
U501     02-10      30     80
U501     03-05      45     125

Common Window Functions

Function       Description
rank()         Rank with gaps (1, 2, 2, 4)
dense_rank()   Rank without gaps (1, 2, 2, 3)
row_number()   Unique sequential number
lag(col, n)    Value n rows before current
lead(col, n)   Value n rows after current
sum().over(w)  Running/cumulative sum
avg().over(w)  Moving average

All take a window spec; ranking and lag/lead require an orderBy, while partitionBy is optional (omit it to treat the whole dataset as one window).

Converting: RDD ↔ DataFrame

DataFrame → RDD:

rdd = df.rdd                  # RDD of Row objects
rdd.first()                   # Row(name='Alice', age=28)

RDD → DataFrame:

df = spark.createDataFrame(rdd, schema)
# or with column names:
df = rdd.toDF(["name", "age"])

When to drop to RDD: unstructured text, custom partitioning, complex control flow not expressible in DataFrame API.

When to Use DataFrames vs. RDDs

Use DataFrames                            Use RDDs
Structured/semi-structured data           Raw text, binary data
Standard analytics (filter, group, join)  Custom aggregation logic
SQL-like queries                          Fine-grained partition control
Production pipelines                      Legacy RDD codebases
When performance matters most             zipWithIndex, glom, etc.

Default choice: DataFrame. Drop to RDD only when necessary.

Activity: E-Commerce Analytics (10 min)

Using transactions and users views, answer any 2:

  1. Which state generates the most revenue?
  2. Average order value per category?
  3. Users with more than 10 purchases?
  4. Month-over-month revenue growth? (window function)

Use SQL or DataFrame API — your choice.

Activity Debrief

Key observations:

  • SQL syntax felt familiar for aggregation + join queries
  • Window functions are powerful but syntax takes practice
  • Both SQL and DataFrame API produce the same execution plan

Discussion: Which API do you prefer and why?

Activity Solutions (1/4)

Q1 — State with most revenue:

spark.sql("""
    SELECT u.state,
           ROUND(SUM(t.price * t.quantity), 2) AS revenue
    FROM transactions t
    JOIN users u ON t.user_id = u.user_id
    GROUP BY u.state
    ORDER BY revenue DESC
    LIMIT 5
""").show()

Activity Solutions (2/4)

Q2 — Average order value per category:

spark.sql("""
    SELECT category,
           ROUND(AVG(price * quantity), 2) AS avg_order_value
    FROM transactions
    GROUP BY category
    ORDER BY avg_order_value DESC
""").show()

Activity Solutions (3/4)

Q3 — Users with more than 10 purchases:

spark.sql("""
    SELECT u.name, COUNT(*) AS purchases
    FROM transactions t
    JOIN users u ON t.user_id = u.user_id
    GROUP BY u.name
    HAVING COUNT(*) > 10
    ORDER BY purchases DESC
""").show()

Activity Solutions (4/4)

Q4 — Month-over-month revenue growth (window function):

spark.sql("""
    SELECT month, revenue,
           ROUND(revenue - LAG(revenue) OVER (ORDER BY month), 2)
               AS growth
    FROM (
        SELECT MONTH(timestamp) AS month,
               SUM(price * quantity) AS revenue
        FROM transactions
        GROUP BY MONTH(timestamp)
    )
    ORDER BY month
""").show()

Saving Results

Write as Parquet (recommended):

df.write.parquet("hdfs:///user/student/output.parquet")

Write as CSV:

df.write.csv("hdfs:///user/student/output.csv", header=True)

Partitioned writes (for large datasets):

df.write.partitionBy("category") \
    .parquet("hdfs:///user/student/output_partitioned/")

Parquet >> CSV: columnar format, compressed, schema preserved, 10× faster reads.

Key Takeaways

  1. Spark SQL — query DataFrames with familiar SQL syntax
  2. SQL and DataFrame API compile identically — choose for readability
  3. Catalyst optimizer — predicate pushdown, projection pruning, join selection
  4. Window functions — ranking, running totals, lag/lead without self-joins
  5. DataFrames by default, RDDs when needed — structured wins

What's Next: Week 7

Dataflow Languages: Hive & Pig

  • HiveQL for data warehousing on Hadoop
  • Pig Latin for ETL pipelines
  • Comparing Spark SQL vs. Hive vs. Pig — when to use each
  • Assignment 2 released (Spark + Dataflow, due end of Week 7)

To Do:

  • 📖 Read: Hive Language Manual (basics)
  • 📖 Read: "Pig Latin: A Not-So-Foreign Language for Data Processing"
  • 📖 Optional: Review SQL joins and window functions

References

  • Armbrust et al. (2015): "Spark SQL: Relational Data Processing in Spark"
  • Learning Spark, 2nd Edition (Chapter 4: Structured APIs)
  • Apache Spark Documentation: SQL Programming Guide
  • PySpark API: pyspark.sql.functions

Speaker context: Students can now create DataFrames, select, filter, and add columns. This session elevates their skills: SQL queries on distributed data, multi-function aggregations, joins, and window functions. The Catalyst optimizer section shows *why* declarative code wins—Spark rewrites their queries automatically. End with RDD↔DataFrame conversion so they know when to drop down to RDDs. Remind: Assignment 2 preview next week.

Speaker notes: Run both live, then show .explain() for each. The plans will be identical. "This is Catalyst at work—it doesn't care how you wrote the query, it optimizes the same way."

Speaker notes: Run .explain(True) live and walk through the physical plan. Point out "PushedFilters" in the scan node. "Spark read fewer bytes from disk because it pushed your filter down to the data source. You didn't ask for this—Catalyst did it automatically."

Speaker notes: Poll the class. Typically SQL-experienced students prefer SQL strings; CS students prefer DataFrame API. Emphasize: there's no wrong answer, both compile identically.