Apache Spark: Architecture & RDD Fundamentals

CS 6500 — Week 5, Session 1

Week 4 Recap & Assignment 1 Check-In

MapReduce Patterns You've Mastered:

  • Mapper → Shuffle/Sort → Reducer pipeline
  • Combiners for local pre-aggregation
  • Custom partitioning and secondary sort
  • mrjob for Pythonic MapReduce development

Assignment 1 Status:

  • Common issues? Office hours available this week
  • Quick poll: Who's finished? Who needs help?

Hadoop Ecosystem Review

The MapReduce Pain Points

Think back to your MapReduce experience:

  1. Disk I/O at every stage — map output → disk → shuffle → disk → reduce
  2. Verbose code — 50+ lines for word count
  3. No interactivity — submit job, wait, check output, repeat
  4. Iterative algorithms are painful — each iteration = full MapReduce job
  5. No in-memory caching — re-read input for every query

Question: What if we could keep data in memory across operations?

Enter Apache Spark

Origin: UC Berkeley AMPLab (2009) → Apache project (2013) → Databricks (2013+)

Core Insight: Keep intermediate data in memory instead of writing to disk

Result:

  • 10–100× faster than MapReduce for iterative algorithms
  • 2–5× faster for one-pass batch jobs
  • Interactive queries in seconds (not minutes)

Adoption: Netflix, Uber, Airbnb, NASA, CERN — virtually every large-scale data org

Activity: Explore Spark and RDDs

Time: 10 Minutes

Task: In pairs, research and answer:

  1. What problem does Spark solve that MapReduce doesn't?
  2. What does "RDD" stand for and what is it?
  3. Find one real-world use case of Spark

Resources:

  • Apache Spark website
  • Your favorite search engine
  • Discuss with your partner

Be ready to share your findings!

The Spark Ecosystem

┌─────────────────────────────────────────────┐
│              Spark Applications             │
├──────────┬──────────┬──────────┬────────────┤
│ Spark SQL│Streaming │  MLlib   │  GraphX    │
│DataFrames│Structured│ Machine  │   Graph    │
│          │Streaming │ Learning │ Processing │
├──────────┴──────────┴──────────┴────────────┤
│              Spark Core (RDDs)              │
├─────────────────────────────────────────────┤
│     YARN  /  Mesos  /  Kubernetes / Local   │
└─────────────────────────────────────────────┘

Today's focus: Spark Core and RDDs — the foundation for everything above

Spark vs. MapReduce: The Numbers

Metric                           | MapReduce     | Spark
Word count (10GB)                | ~8 min        | ~2 min
K-means (10 iterations, 100GB)   | ~180 min      | ~12 min
Interactive query (cached 50GB)  | ~5 min/query  | ~3 sec/query
Lines of code (word count)       | 50+           | 5
Intermediate storage             | Disk (HDFS)   | Memory (RAM)

Key takeaway: The speedup is not magic — it's eliminating disk I/O

When to Use Spark vs. MapReduce

Use Spark When...                 | Stick with MapReduce When...
Iterative algorithms (ML, graph)  | Simple one-pass ETL
Interactive exploration           | Very memory-constrained clusters
Multi-step pipelines              | Legacy Hadoop-only infrastructure
Real-time / near-real-time        | Ultra-reliable batch (proven track record)
You want concise code             | Existing MapReduce codebase

Reality: Most organizations have moved to Spark. MapReduce understanding helps you appreciate why.

Spark Architecture: The Big Picture

┌────────────────────────────────────────────┐
│                DRIVER PROGRAM              │
│  ┌──────────────────────────────────────┐  │
│  │         SparkContext / Session       │  │
│  └──────────────┬───────────────────────┘  │
└─────────────────┼──────────────────────────┘
                  │ Submits tasks
    ┌─────────────┼─────────────┐
    │             │             │
┌───▼────┐   ┌────▼────┐   ┌───▼────┐
│Executor│   │Executor │   │Executor│
│ Task   │   │ Task    │   │ Task   │
│ Task   │   │ Task    │   │ Task   │
│ Cache  │   │ Cache   │   │ Cache  │
└────────┘   └─────────┘   └────────┘
   Worker        Worker        Worker

Architecture Components

Driver Program:

  • Main application (Python script, notebook)
  • Creates SparkContext and builds DAG
  • Schedules tasks, collects results

Executors:

  • Run tasks on worker nodes
  • Cache data in memory

Cluster Manager:

  • Allocates resources (YARN, Mesos, K8s)

SparkContext: Your Entry Point

from pyspark import SparkContext

# Create SparkContext
sc = SparkContext("local[4]", "MyApp")
#                 ^^^^^^^^    ^^^^^^
#                 master URL  app name

Master URL options:

  • "local" — 1 thread (debugging)
  • "local[4]" — 4 threads (local parallelism)
  • "local[*]" — all available cores
  • "yarn" — YARN cluster (production)

Rule: One SparkContext per JVM. Create once, use everywhere.

Job Execution: The Hierarchy

Application (your program)
  └── Job (triggered by each action)
       └── Stage (separated by shuffle boundaries)
            └── Task (one per partition per stage)

Example: sc.textFile(...).flatMap(...).map(...).reduceByKey(...).collect()

  • 1 Job (triggered by collect())
  • 2 Stages (split at reduceByKey — requires shuffle)
  • Stage 1: textFile → flatMap → map (narrow transforms, pipelined)
  • Stage 2: reduceByKey → collect (after shuffle)

The DAG Execution Model

DAG = Directed Acyclic Graph of transformations

textFile ──→ flatMap ──→ map ──→ reduceByKey ──→ collect
  (Stage 1: pipelined)     │      (Stage 2)
                     shuffle boundary

Why DAG matters:

  • Spark sees the entire pipeline before executing
  • Can optimize: pipeline narrow transforms, skip unnecessary computation
  • MapReduce: fixed 2-phase model (map → reduce), no global optimization

Lazy Evaluation: Spark's Secret Weapon

Transformations are lazy — they build a plan but don't execute

rdd1 = sc.textFile("data.txt")      # Nothing happens
rdd2 = rdd1.filter(lambda x: ...)   # Nothing happens
rdd3 = rdd2.map(lambda x: ...)      # Nothing happens
result = rdd3.count()                # NOW everything executes!

Why lazy?

  1. Optimization: Spark can fuse operations, push filters early
  2. Efficiency: Only compute what's needed for the action
  3. Fault tolerance: Lineage graph enables recomputation

Common confusion: "Why doesn't my print show anything?"
→ Because map() is a transformation, not an action!

Data Locality in Spark

Spark inherits HDFS data locality:

Level          | Description                 | Speed
PROCESS_LOCAL  | Data in executor's memory   | Fastest
NODE_LOCAL     | Data on same node's disk    | Fast
RACK_LOCAL     | Data on same rack           | Moderate
ANY            | Data on remote rack         | Slowest

Spark scheduler delays task launch (up to 3 sec) to achieve better locality.

Takeaway: Same principle as MapReduce — move computation to data, not data to computation.

What Is an RDD?

Resilient Distributed Dataset

  • Resilient — fault-tolerant: a lost partition is recomputed from the lineage graph
  • Distributed — data is partitioned across the nodes of the cluster
  • Dataset — a collection of records of any type

Five Properties of Every RDD

Every RDD knows:

  1. Partitions — how data is split across the cluster
  2. Compute function — how to compute each partition
  3. Dependencies — which parent RDD(s) it depends on
  4. Partitioner — how keys are distributed (for pair RDDs)
  5. Preferred locations — where to place computation (data locality)

These five properties are all Spark needs for execution and fault tolerance.

Creating RDDs

From a Python collection:

data = [1, 2, 3, 4, 5]
rdd = sc.parallelize(data, numSlices=4)  # 4 partitions

From a file:

rdd = sc.textFile("hdfs:///user/student/data.txt")
# One record per line

From another RDD (transformation):

rdd2 = rdd.filter(lambda x: x > 3)

Transformations vs. Actions: The Key Distinction

Transformations (Lazy)

Create a new RDD from an existing one. Nothing executes.

Actions (Eager)

Trigger computation and return a result to the driver or write to storage.

Transformations: Build the plan
Actions: Execute the plan

This is the single most important concept in Spark programming.

Common Transformations

Transformation  | Description                    | Example
map(f)          | Apply f to each element        | rdd.map(lambda x: x * 2)
filter(f)       | Keep elements where f is True  | rdd.filter(lambda x: x > 5)
flatMap(f)      | Map + flatten results          | rdd.flatMap(lambda x: x.split())
distinct()      | Remove duplicates              | rdd.distinct()
union(other)    | Combine two RDDs               | rdd1.union(rdd2)
sample(frac)    | Random sample                  | rdd.sample(False, 0.1)

All return new RDDs. None trigger execution.

Common Actions

Action                | Description              | Returns
count()               | Number of elements       | Integer
collect()             | All elements to driver   | List
take(n)               | First n elements         | List
first()               | First element            | Element
reduce(f)             | Aggregate all elements   | Value
saveAsTextFile(path)  | Write to HDFS            | None
foreach(f)            | Apply f (side effects)   | None

⚠️ Warning: collect() brings ALL data to driver. On large RDDs → Out of Memory!

Live Demo: Your First Spark Program

Let's launch PySpark and explore:

from pyspark import SparkContext

sc = SparkContext("local[4]", "FirstDemo")

# Create RDD from a list
numbers = sc.parallelize([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Transformations (lazy — nothing happens yet)
evens = numbers.filter(lambda x: x % 2 == 0)
squared = evens.map(lambda x: x ** 2)

# Action (triggers execution!)
result = squared.collect()
print(result)  # [4, 16, 36, 64, 100]

Live Demo: Text File Processing

# Load text file from HDFS
lines = sc.textFile("hdfs:///user/student/shakespeare.txt")

# How many lines?
print(f"Total lines: {lines.count()}")

# Filter lines containing "love"
love_lines = lines.filter(lambda l: "love" in l.lower())
print(f"Lines with 'love': {love_lines.count()}")

# Find the longest line
longest = lines.reduce(
    lambda a, b: a if len(a) > len(b) else b
)
print(f"Longest line: {longest[:80]}...")

Spark UI: Your Debugging Dashboard

Access: http://localhost:4040 (while SparkContext is active)

Key tabs:

  • Jobs — list of all actions triggered
  • Stages — breakdown of each job into stages
  • Storage — what's cached in memory
  • Executors — resource usage per executor

What to look for:

  • Number of tasks per stage (= number of partitions)
  • Stage boundaries (= shuffle points)
  • Task duration distribution (skew = some tasks much slower)

DAG Visualization in Spark UI

Job 0: First action

textFile → filter → count()
  Stage 0: 2 tasks

Execution:

  • Read file from HDFS
  • Apply filter
  • Count results
  • 2 tasks (2 HDFS blocks = 2 partitions)

DAG Visualization: Multiple Actions

Job 1: Second action

textFile → filter → filter → count()
  ⚠ Re-reads file!

Without caching, Spark re-executes the lineage for each action.

Activity: Transformation vs. Action Quiz

Classify each operation as Transformation (T) or Action (A):

#  | Operation         | T or A?
1  | map()             |
2  | count()           |
3  | filter()          |
4  | collect()         |
5  | reduceByKey()     |
6  | saveAsTextFile()  |

Time: 2 minutes individual, discuss with partner

Activity Answers

#  | Operation         | Answer
1  | map()             | T
2  | count()           | A
3  | filter()          | T
4  | collect()         | A
5  | reduceByKey()     | T
6  | saveAsTextFile()  | A

Heuristic: Returns an RDD → T. Returns data or writes output → A.

Discussion: Why Does Lazy Evaluation Matter?

Question: What if we need to:

  1. Load 1TB of logs
  2. Filter for "ERROR" lines
  3. Filter for "timeout" in those errors
  4. Count them

logs = sc.textFile("hdfs:///logs/")  # 1TB
timeouts = logs.filter(lambda l: "ERROR" in l)\
               .filter(lambda l: "timeout" in l.lower())
result = timeouts.count()

Key Takeaways

  1. Spark keeps data in memory — eliminates MapReduce disk I/O bottleneck
  2. RDDs are immutable, distributed collections — fault-tolerant via lineage
  3. Transformations are lazy — build a DAG; actions trigger execution
  4. DAG optimization — Spark pipelines narrow transforms, optimizes the plan
  5. Spark UI — essential tool for understanding what Spark actually does

Preview: Session 2

Next time — Hands-on RDD programming:

  • Key-Value Pair RDDs and pair operations
  • Word count in Spark (compare to MapReduce!)
  • reduceByKey vs. groupByKey (critical performance choice)
  • Caching and persistence strategies
  • Partitioning for performance
  • Spark RDD homework introduction

Come ready to code!

References

  • Zaharia et al. (2012): "Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing"
  • Learning Spark, 2nd Edition (Chapters 1–3)
  • Apache Spark Documentation: RDD Programming Guide
  • Spark UI Documentation: Monitoring and Instrumentation

Speaker context: This session marks the pivot from MapReduce to Spark. Students have just spent two weeks writing MapReduce jobs and have felt the pain of disk-based intermediate storage, verbose code, and iterative-algorithm limitations. We'll use that pain to motivate Spark's design decisions. The goal isn't just "Spark is faster"—it's understanding *why* in-memory DAG execution changes what's feasible at scale. By end of session, students should be able to explain Spark's architecture and write basic RDD operations.

Speaker notes: Quick visual reminder of the Hadoop ecosystem we've been working with. Point out HDFS (storage), YARN (resource management), and MapReduce (processing). Today we're adding Spark as an alternative processing engine that also runs on YARN and uses HDFS for storage.

Speaker notes: Let students articulate the pain. Ask: "What was the most frustrating thing about writing MapReduce?" Collect 3-4 answers. Then transition: "Spark was designed to solve exactly these problems."

Speaker notes: This is a discovery activity before formal instruction. Students learn better when they've tried to figure something out first. Circulate and listen to conversations. After 10 minutes, collect 2-3 answers from the class for each question. Their answers will likely be surface-level—that's perfect. Use their responses as hooks: "Great! Now let me show you what's really happening under the hood..."

Speaker notes: Emphasize that Spark doesn't violate physics. "The 100× speedup is specifically for iterative algorithms where MapReduce re-reads from disk every iteration. For single-pass jobs, the speedup is more like 2-5×. Understanding *when* Spark wins is as important as knowing *that* it wins."

Speaker notes: Expand on each component as you present. Driver is like the conductor of an orchestra—it doesn't play the instruments (process data), it coordinates. Executors are the musicians. Cluster manager is the venue that provides the stage and resources. Walk through a simple example: "When you call collect(), the driver sends tasks to executors, they process their partitions, and send results back."

Speaker notes: Emphasize the "local[*]" option for development—it uses all your laptop cores and is great for testing before moving to the cluster. The "one SparkContext per JVM" rule catches people off guard—if you try to create a second one, you'll get an error. In Jupyter notebooks, this means you need to restart the kernel if you want to reconfigure. In Spark 2.0+, SparkSession is preferred and handles this more gracefully, but we're starting with SparkContext to understand the fundamentals.

Speaker notes: This hierarchy is crucial for understanding Spark UI. Walk through the example step by step: "textFile, flatMap, and map can all happen in one stage because they're 'narrow'—each input partition maps to exactly one output partition. But reduceByKey is 'wide'—it needs to shuffle data across the cluster to group by key. That shuffle creates a stage boundary." Draw this on the board: show how Stage 1 outputs are shuffled to Stage 2 inputs. The number of tasks in each stage equals the number of partitions at that point in the DAG.

Speaker notes: Draw on whiteboard. "MapReduce is like a fixed assembly line—two stations only. Spark is like a flexible factory—it designs the assembly line to fit your specific product."

Speaker notes: RDD = Resilient Distributed Dataset. Let's unpack each word:

  • Resilient means fault-tolerant. If a worker node crashes and loses a partition, Spark can recompute just that partition using the lineage graph (the chain of transformations that created it). No need to recompute everything.
  • Distributed means the data is partitioned across multiple nodes in the cluster. A 1TB RDD might have 1000 partitions of 1GB each, spread across 100 nodes.
  • Dataset is just a collection of records. Unlike a database table, records can be any type: strings, integers, tuples, custom Python objects (though RDDs cannot be nested inside other RDDs).

Think of an RDD as an immutable, distributed Python list that remembers how it was created. That "memory" (lineage) is what enables fault tolerance without expensive replication.

Speaker notes: The five properties, spelled out:

  1. Partitions — how data is split across the cluster
  2. Compute function — how to compute each partition
  3. Dependencies — which parent RDD(s) it depends on
  4. Partitioner — how keys are distributed (for pair RDDs)
  5. Preferred locations — where to place computation (data locality)

Speaker notes: Run this live in Jupyter. After collect(), switch to Spark UI (localhost:4040) and show the job, stages, and tasks. Point out that filter and map are in the same stage (narrow transforms, pipelined). Ask: "How many tasks?" → depends on number of partitions.

Speaker notes: After running, go to Spark UI. Show that count() triggered one job, but the second count() triggers another. "Each action = new job. If we wanted to reuse love_lines, we'd cache it—we'll cover that in Session 2."

Speaker notes: The Spark UI at localhost:4040 is your best friend for debugging. Walk through a live demo if possible. Key things to show:

  1. Jobs tab: Each action (count, collect, saveAsTextFile) creates one job. If you see 50 jobs when you expected 5, you're probably missing caching.
  2. Stages tab: Click into a job to see its stages. The DAG visualization shows how stages connect. Hover over stages to see which operations are in each.
  3. Storage tab: Shows what's cached and how much memory it's using. If you called .cache() but nothing shows here, the RDD hasn't been computed yet (remember lazy evaluation!).
  4. Task duration: If you see a few tasks taking 10× longer than others, you have data skew — some partitions are much larger. This is a performance killer we'll address with repartitioning.

The UI only exists while SparkContext is active. Once your program exits, the UI disappears (unless you enable the history server).

Speaker notes: Show this in the live Spark UI. Point out the single stage containing all three operations (textFile, filter, count). They're pipelined because they're all "narrow" transformations—no shuffle required. The number of tasks (2) matches the number of HDFS blocks.

Speaker notes: This is the critical point. Job 1 re-reads the file from HDFS and reapplies all the filters because nothing was cached. Each action triggers a complete re-execution from the source RDD. This is why caching is crucial for iterative algorithms and reused datasets—we'll cover that in Session 2. Students often assume Spark "remembers" intermediate results, but it doesn't unless you explicitly cache.

Speaker notes: Pause here and ask students: "How many times does Spark read the data?" Then reveal:

Without lazy eval, Spark would need 3 separate passes:

  • Pass 1: Read 1TB to apply the first filter
  • Pass 2: Read filtered data to apply the second filter
  • Pass 3: Read double-filtered data to count
  • Total: ~3TB of I/O

With lazy eval, Spark fuses the operations:

  • Single pass: Read 1TB once, apply both filters inline, count
  • Total: 1TB of I/O

Result: 3× less I/O from a simple optimization. This is why transformations are lazy — Spark can see the entire pipeline and optimize it globally.