Speaker context: This session marks the pivot from MapReduce to Spark. Students have just spent two weeks writing MapReduce jobs and have felt the pain of disk-based intermediate storage, verbose code, and iterative-algorithm limitations. We'll use that pain to motivate Spark's design decisions. The goal isn't just "Spark is faster"—it's understanding *why* in-memory DAG execution changes what's feasible at scale. By end of session, students should be able to explain Spark's architecture and write basic RDD operations.
Speaker notes: Quick visual reminder of the Hadoop ecosystem we've been working with. Point out HDFS (storage), YARN (resource management), and MapReduce (processing). Today we're adding Spark as an alternative processing engine that also runs on YARN and uses HDFS for storage.
Speaker notes: Let students articulate the pain. Ask: "What was the most frustrating thing about writing MapReduce?" Collect 3-4 answers. Then transition: "Spark was designed to solve exactly these problems."
Speaker notes: This is a discovery activity before formal instruction. Students learn better when they've tried to figure something out first. Circulate and listen to conversations. After 10 minutes, collect 2-3 answers from the class for each question. Their answers will likely be surface-level—that's perfect. Use their responses as hooks: "Great! Now let me show you what's really happening under the hood..."
Speaker notes: Emphasize that Spark doesn't violate physics. "The 100× speedup is specifically for iterative algorithms where MapReduce re-reads from disk every iteration. For single-pass jobs, the speedup is more like 2-5×. Understanding *when* Spark wins is as important as knowing *that* it wins."
Speaker notes: Expand on each component as you present. Driver is like the conductor of an orchestra—it doesn't play the instruments (process data), it coordinates. Executors are the musicians. Cluster manager is the venue that provides the stage and resources. Walk through a simple example: "When you call collect(), the driver sends tasks to executors, they process their partitions, and send results back."
Speaker notes: Emphasize the "local[*]" option for development—it uses all your laptop cores and is great for testing before moving to the cluster. The "one SparkContext per JVM" rule catches people off guard—if you try to create a second one, you'll get an error. In Jupyter notebooks, this means you need to restart the kernel if you want to reconfigure. In Spark 2.0+, SparkSession is preferred and handles this more gracefully, but we're starting with SparkContext to understand the fundamentals.
Speaker notes: This hierarchy is crucial for understanding Spark UI. Walk through the example step by step: "textFile, flatMap, and map can all happen in one stage because they're 'narrow'—each input partition maps to exactly one output partition. But reduceByKey is 'wide'—it needs to shuffle data across the cluster to group by key. That shuffle creates a stage boundary." Draw this on the board: show how Stage 1 outputs are shuffled to Stage 2 inputs. The number of tasks in each stage equals the number of partitions at that point in the DAG.
Speaker notes: Draw on whiteboard. "MapReduce is like a fixed assembly line—two stations only. Spark is like a flexible factory—it designs the assembly line to fit your specific product."
Speaker notes: RDD = Resilient Distributed Dataset. Let's unpack each word:
**Resilient** means fault-tolerant. If a worker node crashes and loses a partition, Spark can recompute just that partition using the lineage graph (the chain of transformations that created it). No need to recompute everything.
**Distributed** means the data is partitioned across multiple nodes in the cluster. A 1TB RDD might have 1000 partitions of 1GB each, spread across 100 nodes.
**Dataset** is just a collection of records. Unlike a database table, records can be any type: strings, integers, tuples, even custom Python objects. (One thing they can't be is other RDDs—Spark doesn't support nesting.)
Think of an RDD as an immutable, distributed Python list that remembers how it was created. That "memory" (lineage) is what enables fault tolerance without expensive replication.
Formally, every RDD is characterized by five properties:
1. **Partitions** — how data is split across the cluster
2. **Compute function** — how to compute each partition
3. **Dependencies** — which parent RDD(s) it depends on
4. **Partitioner** — how keys are distributed (for pair RDDs)
5. **Preferred locations** — where to place computation (data locality)
Speaker notes: Run this live in Jupyter. After collect(), switch to Spark UI (localhost:4040) and show the job, stages, and tasks. Point out that filter and map are in the same stage (narrow transforms, pipelined). Ask: "How many tasks?" → depends on number of partitions.
Speaker notes: After running, go to Spark UI. Show that count() triggered one job, but the second count() triggers another. "Each action = new job. If we wanted to reuse love_lines, we'd cache it—we'll cover that in Session 2."
Speaker notes: The Spark UI at localhost:4040 is your best friend for debugging. Walk through a live demo if possible. Key things to show:
1. **Jobs tab**: Each action (count, collect, saveAsTextFile) creates one job. If you see 50 jobs when you expected 5, you're calling more actions than you realize; if the same stages re-execute in every job, you're probably missing caching.
2. **Stages tab**: Click into a job to see its stages. The DAG visualization shows how stages connect. Hover over stages to see which operations are in each.
3. **Storage tab**: Shows what's cached and how much memory it's using. If you called .cache() but nothing shows here, the RDD hasn't been computed yet (remember lazy evaluation!).
4. **Task duration**: If you see a few tasks taking 10× longer than others, you have data skew—some partitions are much larger. This is a performance killer we'll address with repartitioning.
The UI only exists while the SparkContext is active. Once your program exits, the UI disappears (unless you enable the Spark history server).
Speaker notes: Show this in the live Spark UI. Point out the single stage containing all three operations (textFile, filter, count). They're pipelined because they're all "narrow" transformations—no shuffle required. The number of tasks (2) matches the number of HDFS blocks.
Speaker notes: This is the critical point. Job 1 (the second job; Spark numbers jobs from 0) re-reads the file from HDFS and reapplies all the filters because nothing was cached. Each action triggers a complete re-execution from the source RDD. This is why caching is crucial for iterative algorithms and reused datasets—we'll cover that in Session 2. Students often assume Spark "remembers" intermediate results, but it doesn't unless you explicitly cache.
Pause here and ask students: "How many times does Spark read the data?" Then reveal:
**Without lazy eval:** Would need 3 separate passes:
- Pass 1: Read 1TB to apply first filter
- Pass 2: Read filtered data to apply second filter
- Pass 3: Read double-filtered data to count
- Total: ~3TB of I/O
**With lazy eval:** Spark fuses the operations:
- Single pass: Read 1TB once, apply both filters inline, count
- Total: 1TB of I/O
**Result:** 3× less I/O from a simple optimization. This is why transformations are lazy—Spark can see the entire pipeline and optimize it globally.