Apache Spark: Architecture & RDD Fundamentals

CS 6500 — Week 5, Session 1

Week 4 Recap & Assignment 1 Check-In

MapReduce Patterns You've Mastered:

  • Mapper → Shuffle/Sort → Reducer pipeline
  • Combiners for local pre-aggregation
  • Custom partitioning and secondary sort
  • mrjob for Pythonic MapReduce development

Assignment 1 Status:

  • Common issues? Office hours available this week
  • Quick poll: Who's finished? Who needs help?

Hadoop Ecosystem Review

The MapReduce Pain Points

Think back to your MapReduce experience:

  1. Disk I/O at every stage — map output → disk → shuffle → disk → reduce
  2. Verbose code — 50+ lines for word count
  3. No interactivity — submit job, wait, check output, repeat
  4. Iterative algorithms are painful — each iteration = full MapReduce job
  5. No in-memory caching — re-read input for every query

Question: What if we could keep data in memory across operations?

Enter Apache Spark

Origin: UC Berkeley AMPLab (2009) → Apache project (2013) → Databricks (2013+)

Core Insight: Keep intermediate data in memory instead of writing to disk

Result:

  • 10–100× faster than MapReduce for iterative algorithms
  • 2–5× faster for one-pass batch jobs
  • Interactive queries in seconds (not minutes)

Adoption: Netflix, Uber, Airbnb, NASA, CERN — virtually every large-scale data org

Activity: Explore Spark and RDDs

Time: 10 Minutes

Task: In pairs, research and answer:

  1. What problem does Spark solve that MapReduce doesn't?
  2. What does "RDD" stand for and what is it?
  3. Find one real-world use case of Spark

Resources:

  • Apache Spark website
  • Your favorite search engine
  • Discuss with your partner

Be ready to share your findings!

The Spark Ecosystem

┌─────────────────────────────────────────────┐
│              Spark Applications             │
├──────────┬──────────┬──────────┬────────────┤
│ Spark SQL│Streaming │  MLlib   │  GraphX    │
│DataFrames│Structured│ Machine  │   Graph    │
│          │Streaming │ Learning │ Processing │
├──────────┴──────────┴──────────┴────────────┤
│              Spark Core (RDDs)              │
├─────────────────────────────────────────────┤
│     YARN  /  Mesos  /  Kubernetes / Local   │
└─────────────────────────────────────────────┘

Today's focus: Spark Core and RDDs — the foundation for everything above

Spark vs. MapReduce: The Numbers

Metric                           | MapReduce     | Spark
Word count (10GB)                | ~8 min        | ~2 min
K-means (10 iterations, 100GB)   | ~180 min      | ~12 min
Interactive query (cached 50GB)  | ~5 min/query  | ~3 sec/query
Lines of code (word count)       | 50+           | 5
Intermediate storage             | Disk (HDFS)   | Memory (RAM)

Key takeaway: The speedup is not magic — it's eliminating disk I/O

When to Use Spark vs. MapReduce

Use Spark When...                 | Stick with MapReduce When...
Iterative algorithms (ML, graph)  | Simple one-pass ETL
Interactive exploration           | Very memory-constrained clusters
Multi-step pipelines              | Legacy Hadoop-only infrastructure
Real-time / near-real-time        | Ultra-reliable batch (proven track record)
You want concise code             | Existing MapReduce codebase

Reality: Most organizations have moved to Spark. MapReduce understanding helps you appreciate why.

Spark Architecture: The Big Picture

┌────────────────────────────────────────────┐
│                DRIVER PROGRAM              │
│  ┌──────────────────────────────────────┐  │
│  │         SparkContext / Session       │  │
│  └──────────────┬───────────────────────┘  │
└─────────────────┼──────────────────────────┘
                  │ Submits tasks
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼────┐   ┌────▼────┐   ┌───▼────┐
│Executor│   │Executor │   │Executor│
│ Task   │   │ Task    │   │ Task   │
│ Task   │   │ Task    │   │ Task   │
│ Cache  │   │ Cache   │   │ Cache  │
└────────┘   └─────────┘   └────────┘
   Worker        Worker        Worker

Architecture Components

Driver Program:

  • Main application (Python script, notebook)
  • Creates SparkContext and builds DAG
  • Schedules tasks, collects results

Executors:

  • Run tasks on worker nodes
  • Cache data in memory

Cluster Manager:

  • Allocates resources (YARN, Mesos, K8s)

SparkContext: Your Entry Point

from pyspark import SparkContext

# Create SparkContext
sc = SparkContext("local[4]", "MyApp")
#                 ^^^^^^^^    ^^^^^^
#                 master URL  app name

Master URL options:

  • "local" — 1 thread (debugging)
  • "local[4]" — 4 threads (local parallelism)
  • "local[*]" — all available cores
  • "yarn" — YARN cluster (production)

Rule: One SparkContext per JVM. Create once, use everywhere.

Job Execution: The Hierarchy

Application (your program)
  └── Job (triggered by each action)
       └── Stage (separated by shuffle boundaries)
            └── Task (one per partition per stage)

Example: sc.textFile(...).flatMap(...).map(...).reduceByKey(...).collect()

  • 1 Job (triggered by collect())
  • 2 Stages (split at reduceByKey — requires shuffle)
  • Stage 1: textFile → flatMap → map (narrow transforms, pipelined)
  • Stage 2: reduceByKey → collect (after shuffle)

The DAG Execution Model

DAG = Directed Acyclic Graph of transformations

textFile ──→ flatMap ──→ map ──→ reduceByKey ──→ collect
  (Stage 1: pipelined)     │      (Stage 2)
                     shuffle boundary

Why DAG matters:

  • Spark sees the entire pipeline before executing
  • Can optimize: pipeline narrow transforms, skip unnecessary computation
  • MapReduce: fixed 2-phase model (map → reduce), no global optimization

Lazy Evaluation: Spark's Secret Weapon

Transformations are lazy — they build a plan but don't execute

rdd1 = sc.textFile("data.txt")      # Nothing happens
rdd2 = rdd1.filter(lambda x: ...)   # Nothing happens
rdd3 = rdd2.map(lambda x: ...)      # Nothing happens
result = rdd3.count()                # NOW everything executes!

Why lazy?

  1. Optimization: Spark can fuse operations, push filters early
  2. Efficiency: Only compute what's needed for the action
  3. Fault tolerance: Lineage graph enables recomputation

Common confusion: "Why doesn't my print show anything?"
→ Because map() is a transformation, not an action!

Data Locality in Spark

Spark inherits HDFS data locality:

Level          | Description                 | Speed
PROCESS_LOCAL  | Data in executor's memory   | Fastest
NODE_LOCAL     | Data on same node's disk    | Fast
RACK_LOCAL     | Data on same rack           | Moderate
ANY            | Data on remote rack         | Slowest

Spark scheduler delays task launch (up to 3 sec) to achieve better locality.

Takeaway: Same principle as MapReduce — move computation to data, not data to computation.

What Is an RDD?

Resilient Distributed Dataset

  • Resilient — fault-tolerant: a lost partition is recomputed from the lineage graph
  • Distributed — data is partitioned across the nodes of the cluster
  • Dataset — a collection of records of any type

Five Properties of Every RDD

Every RDD knows:

  1. Partitions — how data is split across the cluster
  2. Compute function — how to compute each partition
  3. Dependencies — which parent RDD(s) it depends on
  4. Partitioner — how keys are distributed (for pair RDDs)
  5. Preferred locations — where to place computation (data locality)

These five properties are all Spark needs for execution and fault tolerance.

Creating RDDs

From a Python collection:

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, numSlices=4)  # 4 partitions

From a file:

rdd = sc.textFile("hdfs:///user/student/data.txt")
# One record per line

From another RDD (transformation):

rdd2 = rdd.filter(lambda x: x > 3)

Transformations vs. Actions: The Key Distinction

Transformations (Lazy)

Create a new RDD from an existing one. Nothing executes.

Actions (Eager)

Trigger computation and return a result to the driver or write to storage.

Transformations: Build the plan
Actions: Execute the plan

This is the single most important concept in Spark programming.

Common Transformations

Transformation  | Description                    | Example
map(f)          | Apply f to each element        | rdd.map(lambda x: x * 2)
filter(f)       | Keep elements where f is True  | rdd.filter(lambda x: x > 5)
flatMap(f)      | Map + flatten results          | rdd.flatMap(lambda x: x.split())
distinct()      | Remove duplicates              | rdd.distinct()
union(other)    | Combine two RDDs               | rdd1.union(rdd2)
sample(frac)    | Random sample                  | rdd.sample(False, 0.1)

All return new RDDs. None trigger execution.

Common Actions

Action                | Description              | Returns
count()               | Number of elements       | Integer
collect()             | All elements to driver   | List
take(n)               | First n elements         | List
first()               | First element            | Element
reduce(f)             | Aggregate all elements   | Value
saveAsTextFile(path)  | Write to HDFS            | None
foreach(f)            | Apply f (side effects)   | None

⚠️ Warning: collect() brings ALL data to driver. On large RDDs → Out of Memory!

Live Demo: Your First Spark Program

Let's launch PySpark and explore:

from pyspark import SparkContext

sc = SparkContext("local[4]", "FirstDemo")

# Create RDD from a list
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Transformations (lazy — nothing happens yet)
evens = numbers.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x ** 2)

# Action (triggers execution!)
result = squared.collect()
print(result)  # [4, 16, 36, 64, 100]

Live Demo: Text File Processing

# Load text file from HDFS
lines = sc.textFile("hdfs:///user/student/shakespeare.txt")

# How many lines?
print(f"Total lines: {lines.count()}")

# Filter lines containing "love"
love_lines = lines.filter(lambda l: "love" in l.lower())
print(f"Lines with 'love': {love_lines.count()}")

# Find the longest line
longest = lines.reduce(
    lambda a, b: a if len(a) > len(b) else b
)
print(f"Longest line: {longest[:80]}...")

Spark UI: Your Debugging Dashboard

Access: http://localhost:4040 (while SparkContext is active)

Key tabs:

  • Jobs — list of all actions triggered
  • Stages — breakdown of each job into stages
  • Storage — what's cached in memory
  • Executors — resource usage per executor

What to look for:

  • Number of tasks per stage (= number of partitions)
  • Stage boundaries (= shuffle points)
  • Task duration distribution (skew = some tasks much slower)

DAG Visualization in Spark UI

Job 0: First action

textFile → filter → count()
  Stage 0: 2 tasks

Execution:

  • Read file from HDFS
  • Apply filter
  • Count results
  • 2 tasks (2 HDFS blocks = 2 partitions)

DAG Visualization: Multiple Actions

Job 1: Second action

textFile → filter → filter → count()
  ⚠ Re-reads file!

Without caching, Spark re-executes the lineage for each action.

Activity: Transformation vs. Action Quiz

Classify each operation as Transformation (T) or Action (A):

#  | Operation         | T or A?
1  | map()             |
2  | count()           |
3  | filter()          |
4  | collect()         |
5  | reduceByKey()     |
6  | saveAsTextFile()  |

Time: 2 minutes individual, discuss with partner

Activity Answers

#  | Operation         | Answer
1  | map()             | T
2  | count()           | A
3  | filter()          | T
4  | collect()         | A
5  | reduceByKey()     | T
6  | saveAsTextFile()  | A

Heuristic: Returns an RDD → T. Returns data or writes output → A.

Discussion: Why Does Lazy Evaluation Matter?

Question: What if we need to:

  1. Load 1TB of logs
  2. Filter for "ERROR" lines
  3. Filter for "timeout" in those errors
  4. Count them

logs = sc.textFile("hdfs:///logs/")  # 1TB
timeouts = logs.filter(lambda l: "ERROR" in l)\
               .filter(lambda l: "timeout" in l.lower())
result = timeouts.count()

Key Takeaways

  1. Spark keeps data in memory — eliminates MapReduce disk I/O bottleneck
  2. RDDs are immutable, distributed collections — fault-tolerant via lineage
  3. Transformations are lazy — build a DAG; actions trigger execution
  4. DAG optimization — Spark pipelines narrow transforms, optimizes the plan
  5. Spark UI — essential tool for understanding what Spark actually does

Preview: Session 2

Next time — Hands-on RDD programming:

  • Key-Value Pair RDDs and pair operations
  • Word count in Spark (compare to MapReduce!)
  • reduceByKey vs. groupByKey (critical performance choice)
  • Caching and persistence strategies
  • Partitioning for performance
  • Spark RDD homework introduction

Come ready to code!

References

  • Zaharia et al. (2012): "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"
  • Learning Spark, 2nd Edition (Chapters 1–3)
  • Apache Spark Documentation: RDD Programming Guide
  • Spark UI Documentation: Monitoring and Instrumentation

Speaker context: This session marks the pivot from MapReduce to Spark. Students have just spent two weeks writing MapReduce jobs and have felt the pain of disk-based intermediate storage, verbose code, and iterative-algorithm limitations. We'll use that pain to motivate Spark's design decisions. The goal isn't just "Spark is faster"—it's understanding *why* in-memory DAG execution changes what's feasible at scale. By end of session, students should be able to explain Spark's architecture and write basic RDD operations.

Speaker notes: Quick visual reminder of the Hadoop ecosystem we've been working with. Point out HDFS (storage), YARN (resource management), and MapReduce (processing). Today we're adding Spark as an alternative processing engine that also runs on YARN and uses HDFS for storage.

Speaker notes: Let students articulate the pain. Ask: "What was the most frustrating thing about writing MapReduce?" Collect 3-4 answers. Then transition: "Spark was designed to solve exactly these problems."

Speaker notes: This is a discovery activity before formal instruction. Students learn better when they've tried to figure something out first. Circulate and listen to conversations. After 10 minutes, collect 2-3 answers from the class for each question. Their answers will likely be surface-level—that's perfect. Use their responses as hooks: "Great! Now let me show you what's really happening under the hood..."

Speaker notes: Emphasize that Spark doesn't violate physics. "The 100× speedup is specifically for iterative algorithms where MapReduce re-reads from disk every iteration. For single-pass jobs, the speedup is more like 2-5×. Understanding *when* Spark wins is as important as knowing *that* it wins."

Speaker notes: Expand on each component as you present. Driver is like the conductor of an orchestra—it doesn't play the instruments (process data), it coordinates. Executors are the musicians. Cluster manager is the venue that provides the stage and resources. Walk through a simple example: "When you call collect(), the driver sends tasks to executors, they process their partitions, and send results back."

Speaker notes: Emphasize the "local[*]" option for development—it uses all your laptop cores and is great for testing before moving to the cluster. The "one SparkContext per JVM" rule catches people off guard—if you try to create a second one, you'll get an error. In Jupyter notebooks, this means you need to restart the kernel if you want to reconfigure. In Spark 2.0+, SparkSession is preferred and handles this more gracefully, but we're starting with SparkContext to understand the fundamentals.

Speaker notes: This hierarchy is crucial for understanding Spark UI. Walk through the example step by step: "textFile, flatMap, and map can all happen in one stage because they're 'narrow'—each input partition maps to exactly one output partition. But reduceByKey is 'wide'—it needs to shuffle data across the cluster to group by key. That shuffle creates a stage boundary." Draw this on the board: show how Stage 1 outputs are shuffled to Stage 2 inputs. The number of tasks in each stage equals the number of partitions at that point in the DAG.

Speaker notes: Draw on whiteboard. "MapReduce is like a fixed assembly line—two stations only. Spark is like a flexible factory—it designs the assembly line to fit your specific product."

Speaker notes: RDD = Resilient Distributed Dataset. Let's unpack each word:

  • Resilient means fault-tolerant. If a worker node crashes and loses a partition, Spark can recompute just that partition using the lineage graph (the chain of transformations that created it). No need to recompute everything.
  • Distributed means the data is partitioned across multiple nodes in the cluster. A 1TB RDD might have 1000 partitions of 1GB each, spread across 100 nodes.
  • Dataset is just a collection of records. Unlike a database table, records can be any type: strings, integers, tuples, custom Python objects (though RDDs cannot be nested inside other RDDs).

Think of an RDD as an immutable, distributed Python list that remembers how it was created. That "memory" (lineage) is what enables fault tolerance without expensive replication.

Speaker notes: The five properties, spelled out:

  1. Partitions — how data is split across the cluster
  2. Compute function — how to compute each partition
  3. Dependencies — which parent RDD(s) it depends on
  4. Partitioner — how keys are distributed (for pair RDDs)
  5. Preferred locations — where to place computation (data locality)

Speaker notes: Run this live in Jupyter. After collect(), switch to Spark UI (localhost:4040) and show the job, stages, and tasks. Point out that filter and map are in the same stage (narrow transforms, pipelined). Ask: "How many tasks?" → depends on number of partitions.

Speaker notes: After running, go to Spark UI. Show that count() triggered one job, but the second count() triggers another. "Each action = new job. If we wanted to reuse love_lines, we'd cache it—we'll cover that in Session 2."

Speaker notes: The Spark UI at localhost:4040 is your best friend for debugging. Walk through a live demo if possible. Key things to show:

  1. Jobs tab: Each action (count, collect, saveAsTextFile) creates one job. If you see 50 jobs when you expected 5, you're probably missing caching.
  2. Stages tab: Click into a job to see its stages. The DAG visualization shows how stages connect. Hover over stages to see which operations are in each.
  3. Storage tab: Shows what's cached and how much memory it's using. If you called .cache() but nothing shows here, the RDD hasn't been computed yet (remember lazy evaluation!).
  4. Task duration: If you see a few tasks taking 10× longer than others, you have data skew — some partitions are much larger. This is a performance killer we'll address with repartitioning.

The UI only exists while SparkContext is active. Once your program exits, the UI disappears (unless you enable the history server).

Speaker notes: Show this in the live Spark UI. Point out the single stage containing all three operations (textFile, filter, count). They're pipelined because they're all "narrow" transformations—no shuffle required. The number of tasks (2) matches the number of HDFS blocks.

Speaker notes: This is the critical point. Job 1 re-reads the file from HDFS and reapplies all the filters because nothing was cached. Each action triggers a complete re-execution from the source RDD. This is why caching is crucial for iterative algorithms and reused datasets—we'll cover that in Session 2. Students often assume Spark "remembers" intermediate results, but it doesn't unless you explicitly cache.

Speaker notes: Pause here and ask students: "How many times does Spark read the data?" Then reveal:

Without lazy eval, Spark would need 3 separate passes:

  • Pass 1: Read 1TB to apply the first filter
  • Pass 2: Read filtered data to apply the second filter
  • Pass 3: Read double-filtered data to count
  • Total: ~3TB of I/O

With lazy eval, Spark fuses the operations:

  • Single pass: Read 1TB once, apply both filters inline, count
  • Total: 1TB of I/O

Result: 3× less I/O from a simple optimization. This is why transformations are lazy — Spark can see the entire pipeline and optimize it globally.