RDD Programming with PySpark

CS 6500 — Week 5, Session 2

Warm-Up: RDD Lineage & Fault Tolerance

What is lineage?
The complete record of transformations used to build an RDD from its source.

textFile("data.txt") ──→ flatMap(split) ──→ map(lower) ──→ filter(len > 3)
      parent                parent              parent          current

🤔 Think about it:

  • Why does Spark record every transformation instead of just saving intermediate results?
  • How is this different from MapReduce's approach to fault tolerance?

RDD Lineage & Fault Tolerance

How lineage enables fault tolerance:

  • If a partition is lost (node crash), Spark replays transformations for that partition only
  • No replication needed for intermediate data (unlike MapReduce disk writes)
  • Trade-off: recomputation time vs. replication storage

MapReduce vs. Spark fault tolerance:

                    MapReduce                      Spark
Intermediate data   Written to disk (HDFS)         Recorded as lineage graph
Recovery method     Re-read from replicated disk   Recompute lost partitions
Cost                Storage + I/O overhead         CPU time for recomputation
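
The replay idea can be sketched in plain Python (not Spark): treat lineage as a list of transformations that rebuilds a lost partition from its source data.

```python
# Plain-Python simulation: lineage is a replayable list of transformations.
source_partition = ["The Quick fox", "jumped OVER the lazy dog"]

lineage = [
    lambda part: [w for line in part for w in line.split()],  # flatMap(split)
    lambda part: [w.lower() for w in part],                   # map(lower)
    lambda part: [w for w in part if len(w) > 3],             # filter(len > 3)
]

def recompute(source, lineage):
    data = source
    for transform in lineage:
        data = transform(data)
    return data

# If the node holding this partition crashes, replay the lineage:
rebuilt = recompute(source_partition, lineage)
print(rebuilt)  # ['quick', 'jumped', 'over', 'lazy']
```

Only the lost partition's source data is re-read; healthy partitions are untouched.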

Narrow vs. Wide Dependencies

Narrow dependency: Each parent partition → at most one child partition

  • map, filter, flatMap, union
  • Can be pipelined in a single stage
  • No shuffle required

Wide dependency: Each parent partition → multiple child partitions

  • reduceByKey, groupByKey, join, repartition
  • Requires shuffle (data movement across network)
  • Creates a new stage boundary

Narrow: map ──→ filter ──→ map      (same stage, pipelined)
Wide:   map ──→ reduceByKey         (stage boundary at reduceByKey)

Key-Value Pair RDDs

Most real-world Spark programs use Pair RDDs: RDD[(Key, Value)]

Creating Pair RDDs:

# From text: split line into (key, value) tuples
lines = sc.textFile("sales.csv")
pairs = lines.map(lambda l: (l.split(",")[0], float(l.split(",")[2])))
#                            ^^^^ key (product)  ^^^^ value (price)

SQL Analogy:

  • Pair RDD ≈ table with two columns (key, value)
  • Pair operations ≈ GROUP BY, JOIN, ORDER BY

Essential Pair RDD Operations

Operation        Description                  SQL Analogy
reduceByKey(f)   Aggregate values per key     GROUP BY + SUM
groupByKey()     Group all values per key     GROUP BY
mapValues(f)     Transform values only        SELECT key, f(value)
keys()           Extract all keys             SELECT key
values()         Extract all values           SELECT value
sortByKey()      Sort by key                  ORDER BY key
join(other)      Inner join on key            INNER JOIN
countByKey()     Count per key (action!)      GROUP BY + COUNT

Critical: reduceByKey vs. groupByKey

           reduceByKey ✅                    groupByKey ⚠️
Behavior   Combines locally, then shuffles   Shuffles ALL values, then groups
Network    Minimal (pre-aggregated)          Maximal (every value sent)
Memory     Low (only aggregated values)      High (all values in memory)
Use for    Aggregation (sum, count, max)     When you need all values per key

Bottom line: If you're aggregating, always reach for reduceByKey.

reduceByKey vs. groupByKey: Visual

Step            reduceByKey ✅               groupByKey ⚠️
Partition 1     (a,1) (a,1) (b,1)            (a,1) (a,1) (b,1)
Local combine   (a,2) (b,1)                  (no combine)
↓ Shuffle       2 records sent               3 records sent (1.5× more)
Final result    (a, 5)                       (a, [1,1,1,1,1]) → sum → (a, 5)

On large datasets, reduceByKey can be 10× faster
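
The shuffle savings can be simulated in plain Python (not Spark) by counting how many records each strategy would send across the network:

```python
from collections import defaultdict

# Two partitions of (word, 1) pairs, as in the table above.
partitions = [
    [("a", 1), ("a", 1), ("b", 1)],
    [("a", 1), ("a", 1), ("a", 1)],
]

# reduceByKey-style: combine locally inside each partition, then shuffle.
shuffled_reduce = []
for part in partitions:
    local = defaultdict(int)
    for k, v in part:
        local[k] += v
    shuffled_reduce.extend(local.items())

# groupByKey-style: every single record crosses the network.
shuffled_group = [kv for part in partitions for kv in part]

print(len(shuffled_reduce), len(shuffled_group))  # 3 6
```

Three records shuffled instead of six here; with millions of repeated keys per partition, the gap is what makes reduceByKey dramatically faster.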

Live Coding: Word Count in Spark

Compare to your 50+ line MapReduce version from Week 3:

lines = sc.textFile("hdfs://namenode:9000/user/student/pg77849.txt")
words = lines.flatMap(lambda line: line.split())
pairs = words.map(lambda word: (word.lower(), 1))
counts = pairs.reduceByKey(lambda a, b: a + b)
top_20 = counts.sortBy(lambda x: x[1], ascending=False).take(20)

5 lines of logic. Same result, radically less code.

Let's run this live and inspect the Spark UI...

Word Count: Line-by-Line (1/2)

lines = sc.textFile("...")

→ RDD of strings, one per line. Transformation (lazy).

words = lines.flatMap(lambda line: line.split())

→ Split each line → flatten into single RDD of words. Transformation.

pairs = words.map(lambda word: (word.lower(), 1))

→ Create pair RDD: (word, 1). Transformation.

All three are narrow dependencies — pipelined in a single stage.

Word Count: Line-by-Line (2/2)

counts = pairs.reduceByKey(lambda a, b: a + b)

→ Sum counts per word. Transformation (but triggers shuffle — wide dependency).

top_20 = counts.sortBy(lambda x: x[1], ascending=False).take(20)

→ Sort descending, return top 20 to driver. Action — NOW everything executes!

Key takeaway: Nothing runs until an action (take, collect, count) is called.
Spark builds the entire plan first, then optimizes and executes.
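
An everyday Python analogy (not Spark): generators are lazy in exactly this way — building the pipeline does no work, consuming it does.

```python
log = []

def doubled(xs):
    for x in xs:
        log.append(x)         # record that work actually happened
        yield x * 2

pipeline = doubled(range(3))  # like a transformation: nothing runs yet
assert log == []              # no work has been done

result = list(pipeline)       # like an action: consuming triggers the work
print(result)  # [0, 2, 4]
```

Spark's transformations behave like `pipeline` before the `list()` call; `take`/`collect`/`count` play the role of `list()`.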

Spark UI: What Just Happened?

Job 0 (triggered by take):

  • Stage 0: textFile → flatMap → map (narrow transforms, pipelined)
  • Stage 1: reduceByKey → sortBy → take (after shuffle)

Look for:

  • Shuffle Read/Write bytes (how much data moved)
  • Task count per stage (= partition count)
  • Task duration (any skew?)

Demo: Let's look at the Spark UI together...

MapReduce vs. Spark: Side-by-Side

Aspect              MapReduce (mrjob)              Spark (PySpark)
Code lines          ~50                            ~5
Paradigm            Class-based (mapper/reducer)   Functional chain
Intermediate data   Disk                           Memory
Execution           Submit → wait → check          Interactive / notebook
Iteration           New job per iteration          In-memory loop
Debugging           Log files                      Spark UI + REPL

Same computation model, radically different developer experience.

Hands-On: Temperature Analysis Challenge

Task: Calculate average temperature per station using Spark RDDs

Dataset: weather_data.csv (same as Week 3!)

station_id,date,temperature,humidity
KNYC,2024-01-15,32,65
KLAX,2024-01-15,68,45
KORD,2024-01-15,18,72

Your mission (15 minutes):

  1. Load the CSV file as an RDD
  2. Skip the header line
  3. Parse each line and extract (station_id, temperature)
  4. Calculate average temperature per station
  5. Print results

Hint: To compute average, you need both sum and count per key.

Temperature Analysis: Starter Code

lines = sc.textFile("hdfs://namenode:9000/datasets/weather/weather_data.csv")

# TODO:
# 1. Skip header   2. Parse CSV   3. Extract (station, temp)
# 4. Average per station   5. Print results

Work individually or with a partner. Time: 15 minutes ⏱️

Hint: You need (sum, count) per key to compute an average with reduceByKey.

Temperature Analysis: Solution

data = lines.filter(lambda line: not line.startswith("station_id"))

averages = (data
    .map(lambda line: line.split(','))
    .map(lambda p: (p[0], (float(p[2]), 1)))
    .reduceByKey(lambda a, b: (a[0]+b[0], a[1]+b[1]))
    .mapValues(lambda x: x[0] / x[1]))

for station, avg in averages.collect():
    print(f"{station}: {avg:.1f}°F")

Key insight: Pack (sum, count) into value → reduceByKey aggregates both → mapValues computes average.

Solution Walkthrough

Why (temp, count) as value?

  • Average = sum / count
  • reduceByKey can only combine values — needs both sum and count

The reduceByKey trick for averages:

# (sum1, count1) + (sum2, count2) = (sum1+sum2, count1+count2)
lambda a, b: (a[0] + b[0], a[1] + b[1])

This pattern appears everywhere: averages, weighted sums, running statistics.
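
The merge step can be checked in plain Python with functools.reduce (assumed sample temperatures, for illustration):

```python
from functools import reduce

# One station's readings, each packed as (temp, 1) just like in the solution.
values = [(32.0, 1), (28.0, 1), (36.0, 1)]

# Pairwise merge: (sum1, count1) + (sum2, count2) = (sum1+sum2, count1+count2)
total, count = reduce(lambda a, b: (a[0] + b[0], a[1] + b[1]), values)
average = total / count
print(average)  # 32.0
```

The merge function is associative and commutative, which is exactly what reduceByKey requires to combine values in any order across partitions.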

Why Not groupByKey?

The same average computed with groupByKey:

averages = (data
    .map(lambda line: line.split(','))
    .map(lambda p: (p[0], float(p[2])))
    .groupByKey()
    .mapValues(lambda temps: sum(temps) / len(temps)))

Compared to the reduceByKey version:

averages = (data
    .map(lambda line: line.split(','))
    .map(lambda p: (p[0], (float(p[2]), 1)))
    .reduceByKey(lambda a, b: (a[0]+b[0], a[1]+b[1]))
    .mapValues(lambda x: x[0] / x[1]))

⚠️ groupByKey shuffles every temperature value across the network.
✅ reduceByKey sends only (sum, count) per partition — same combiner concept from MapReduce.

The Recomputation Problem

Each action re-executes the entire lineage from scratch

logs = sc.textFile("hdfs:///logs/")     # 500GB
errors = logs.filter(lambda l: "ERROR" in l)

# Each action re-reads 500GB from HDFS!
print(errors.count())          # Read 500GB, filter, count
print(errors.take(10))         # Read 500GB again!

⚠️ With lazy evaluation, Spark has no memory of previous results.
Every count(), take(), or collect() triggers a full re-read.

Caching: The Solution

Cache the RDD in memory to avoid redundant computation

errors = logs.filter(lambda l: "ERROR" in l)
errors.cache()                 # Mark for caching

print(errors.count())          # Read 500GB, filter, cache, count
print(errors.take(10))         # Read from cache! (milliseconds)

Result: ~12 sec first query → ~0.1 sec subsequent queries ⚡

Rule of thumb: If you call 2+ actions on the same RDD, cache it.

Storage Levels

Level             Memory   Disk             Serialized   Copies
MEMORY_ONLY       ✅       ❌               ❌           1
MEMORY_AND_DISK   ✅       ✅ (spillover)   ❌           1
MEMORY_ONLY_SER   ✅       ❌               ✅           1
DISK_ONLY         ❌       ✅               ✅           1
MEMORY_ONLY_2     ✅       ❌               ❌           2

from pyspark import StorageLevel

rdd.cache()                                   # = MEMORY_ONLY
rdd.persist(StorageLevel.MEMORY_AND_DISK)     # Spill to disk if needed
rdd.unpersist()                               # Free memory

Default .cache() = MEMORY_ONLY — good for most cases.

When to Cache (and When Not To)

✅ Cache when:

  • RDD is used in multiple actions (queries, iterations)
  • RDD is expensive to recompute (complex transformations)
  • Interactive analysis (exploring same dataset)

❌ Don't cache when:

  • RDD is used only once (no benefit, wastes memory)
  • RDD is too large for memory (will spill or evict other data)
  • Data is cheap to recompute (simple filter on local file)

Rule of thumb: If you call two or more actions on the same RDD, cache it.

Example: Caching Performance

errors = logs.filter(lambda l: "ERROR" in l)

errors.count()   # ~12 sec (reads from HDFS)
errors.count()   # ~12 sec (reads from HDFS again!)

errors.cache()
errors.count()   # ~12 sec (reads + caches)
errors.count()   # ~0.1 sec ⚡ (from memory!)

Expected: 100× speedup on second query after caching.
Verify: Spark UI → Storage tab shows cached RDD.

Partitioning: Controlling Parallelism

Partitions = units of parallelism. Each partition → one task.

rdd = sc.textFile("data.txt")
print(rdd.getNumPartitions())  # Usually = number of HDFS blocks

Too few partitions: Underutilized cluster (idle cores)
Too many partitions: Excessive task scheduling overhead

Rule of thumb: 2–4 partitions per CPU core
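
The rule of thumb is simple arithmetic; a back-of-the-envelope sizing (cluster numbers below are assumed, purely for illustration):

```python
# Hypothetical cluster: 4 executors x 8 cores each.
executors, cores_per_executor = 4, 8
total_cores = executors * cores_per_executor     # 32 cores

# Rule of thumb: 2-4 partitions per core.
low, high = 2 * total_cores, 4 * total_cores
print(low, high)  # 64 128
```

So for this hypothetical cluster, aim for roughly 64–128 partitions for a CPU-bound job.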

Controlling Partitions

Increase partitions:

rdd2 = rdd.repartition(100)   # Triggers shuffle!

Decrease partitions:

rdd2 = rdd.coalesce(4)        # No shuffle (merge)

Key difference:

  • repartition: shuffle, balanced
  • coalesce: no shuffle, may be unbalanced
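
The difference can be simulated in plain Python (not Spark): coalesce merges existing partitions in place, while repartition rehashes every record.

```python
partitions = [[1], [2], [3], [4], [5], [6], [7], [8]]

def coalesce(parts, n):
    # Merge adjacent partitions; no record leaves its group of machines.
    out = [[] for _ in range(n)]
    for i, part in enumerate(parts):
        out[i * n // len(parts)].extend(part)
    return out

def repartition(records, n):
    # Hash every record into a fresh partition: balanced, but all data moves.
    out = [[] for _ in range(n)]
    for r in records:
        out[hash(r) % n].append(r)
    return out

merged = coalesce(partitions, 4)
print(merged)  # [[1, 2], [3, 4], [5, 6], [7, 8]]

flat = [r for part in partitions for r in part]
rebalanced = repartition(flat, 4)   # every record shuffled to a new home
```

In real Spark, `repartition(n)` is implemented as `coalesce(n, shuffle=True)` — the shuffle flag is the entire difference.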

Partitioning Impact on Shuffles

# Default partitioning: hash(key) % numPartitions
pairs = words.map(lambda w: (w, 1))

# Explicit hash partitioning (no import needed)
partitioned = pairs.partitionBy(10)  # 10 partitions, hash-based

# Custom partitioner: provide number + partition function
partitioned = pairs.partitionBy(10, lambda key: hash(key) % 10)

Why this matters:

  • Join two RDDs with same partitioner → no shuffle (co-located keys)
  • Different partitioners → full shuffle on both sides

Advanced: Custom partitioner for skewed data (same concept as MapReduce)
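
A plain-Python sketch of co-partitioning (hypothetical keys and values, not real Spark code): the same hash partitioner puts a given key at the same partition index in both datasets, so each partition pair joins locally.

```python
def partition_by(pairs, n):
    # Same rule Spark's hash partitioner follows: index = hash(key) % n
    out = [[] for _ in range(n)]
    for k, v in pairs:
        out[hash(k) % n].append((k, v))
    return out

sales = partition_by([("apple", 3), ("pear", 5)], 4)
prices = partition_by([("apple", 0.5), ("pear", 0.8)], 4)

joined = []
for i in range(4):              # join partition i with partition i only
    left = dict(sales[i])
    for k, price in prices[i]:
        if k in left:
            joined.append((k, left[k], price))

print(sorted(joined))  # [('apple', 3, 0.5), ('pear', 5, 0.8)]
```

No record ever needed to move to another partition, which is exactly why Spark skips the shuffle when both sides of a join already share a partitioner.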

Putting It All Together

logs = sc.textFile("hdfs:///user/student/access_logs.txt")
logs.cache()                            # reused below

# Query 1: errors by status code
errors = (logs.filter(lambda l: l.split()[8][0] in "45")   # 4xx/5xx status codes
    .map(lambda l: (l.split()[8], 1)).reduceByKey(lambda a, b: a + b))

# Query 2: top 10 URLs
urls = (logs.map(lambda l: (l.split()[6], 1))
    .reduceByKey(lambda a, b: a + b).sortBy(lambda x: x[1], ascending=False))
top_10 = urls.take(10)

logs.unpersist()                        # free the cache when done

Pattern: Cache → multiple queries on same RDD → unpersist when done.

Common Mistakes to Avoid

Mistake                      ❌ Wrong                           ✅ Right
collect() on large RDD       huge_rdd.collect() → OOM           huge_rdd.take(100) or saveAsTextFile()
groupByKey for aggregation   rdd.groupByKey().mapValues(sum)    rdd.reduceByKey(lambda a, b: a + b)
Forgetting lazy eval         rdd.map(lambda x: x*2) → no output rdd.map(lambda x: x*2).collect()

🎯 Project Proposal — Due Sunday 11:59 PM

Submit: 3–4 page PDF with problem statement, dataset, approach, team roster, timeline

Key Takeaways

  1. Pair RDDs enable key-value operations: reduceByKey, join, sortByKey
  2. reduceByKey >> groupByKey for aggregation (local combine before shuffle)
  3. Caching eliminates redundant computation across multiple actions
  4. Partitioning controls parallelism — 2–4 partitions per core
  5. Spark word count: 5 lines vs. MapReduce: 50+ lines — same result

What's Next: Week 6

Spark DataFrames and SQL

  • Higher-level API with schema (like a SQL table)
  • Catalyst optimizer (automatic query optimization)
  • SparkSQL for SQL queries on big data
  • Moving from RDDs to DataFrames (and why)

To Do This Weekend:

  • ✅ Complete Spark RDD homework
  • ✅ Submit Project Proposal
  • 📖 Optional: Review SQL joins and window functions

References

  • Zaharia et al. (2012): "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"
  • Learning Spark, 2nd Edition (Chapters 3–4)
  • Apache Spark Documentation: RDD Programming Guide
  • PySpark API Reference: pyspark.RDD

Speaker context: This session is 80% hands-on coding. Students have the architectural understanding from Session 1—now they need to build muscle memory with RDD operations. The centerpiece is comparing Spark word count to their MapReduce version: 5 lines vs. 50+. We'll then progress to pair RDDs, caching, and partitioning. By end of session, students should feel confident writing basic Spark programs independently. Remind them: Project Proposal is due Sunday!

Speaker notes: Draw on whiteboard. "Think of narrow as a one-to-one pipe. Wide is a many-to-many shuffle—every partition needs to send data to every other partition. That's why wide dependencies are expensive and create stage boundaries."

Speaker notes: This is the #1 performance mistake Spark beginners make. Hammer this point. "Every time you write groupByKey, ask yourself: can I use reduceByKey instead? The answer is almost always yes."

Speaker notes: Switch to Spark UI at localhost:4040. Walk through Jobs tab, click into the job, show stages. Click into Stage 0 to show tasks. Point out shuffle write in Stage 0 and shuffle read in Stage 1. "This shuffle is the reduceByKey—same concept as MapReduce shuffle, but Spark keeps results in memory."

Speaker notes: Ask if anyone used groupByKey. If so, use this slide to compare. "This is the same combiner concept from MapReduce—pre-aggregate locally before shuffling."

Speaker notes: Run this live with time.time() around each call. Check Spark UI Storage tab to confirm data is cached. The dramatic difference sells the concept.

Speaker notes: This is pure Spark execution mechanics—if you understand this, you control performance. Start with the mental model: a partition is the unit of parallelism. One task per partition. More partitions → more parallel tasks; fewer partitions → fewer tasks.

`repartition(n)`: Spark performs a **full shuffle**, redistributing ALL data across the cluster to create evenly-sized partitions. This is expensive (network + disk I/O) but gives you balanced parallelism. Use when: you need more partitions, you need to fix skew, or you're about to do heavy computation like joins.

`coalesce(n)`: Spark merges existing partitions WITHOUT a shuffle—data stays mostly where it already is. This is cheap but may create uneven partition sizes. Use when: reducing partitions before writing output files, or after a filter that drastically shrank your data.

Critical insight: `coalesce()` cannot increase partitions effectively—if you have 10 partitions and call `coalesce(100)`, you'd just get empty partitions because there's no data movement. That's why Spark requires `repartition()` for expansion. Under the hood, `repartition(n)` is just `coalesce(n, shuffle=True)`.

Draw on whiteboard: show 8 partitions [P1][P2][P3][P4][P5][P6][P7][P8]. With `coalesce(4)`, adjacent partitions merge: [P1+P2][P3+P4][P5+P6][P7+P8]—no data crosses machines. With `repartition(4)`, everything gets shuffled into a balanced mix across 4 new partitions.

Pragmatic rule: most people under-partition and over-coalesce. Keep partitions at 2–4× total cluster cores, repartition before heavy joins/aggregations, and coalesce before writing small output files.