MapReduce: Design Rationale & Implementation Fundamentals

CS 6500 — Week 3, Session 1

MapReduce as Architectural Answer

MapReduce provides:

  • Data-local task scheduling (eliminate network I/O where possible)
  • Deterministic task semantics (enable transparent re-execution on failure)
  • Automatic grouping guarantees (enforce correct aggregation without user code)
  • Composable pipeline abstractions (map, reduce, combiner, partitioner, I/O formats)

These properties work in concert to make large-scale analytics feasible on unreliable infrastructure.

The MapReduce Data Flow (Abstract)

Map phase: parallelizable, stateless transformation per record.

Data Flow: Guarantees & Costs

  • Shuffle+Sort: framework-managed grouping by key (sorted).
  • Reduce: aggregation over grouped values.
  • Critical fact: shuffle dominates network + disk I/O and stress-tests fault tolerance.

Word Count Walkthrough (Input → Map)

Input line:

to be or not to be

Mapper emits:

(to, 1)
(be, 1)
(or, 1)
(not, 1)
(to, 1)
(be, 1)

Word Count Walkthrough (Shuffle → Reduce)

Shuffle groups by key:

be:  [1, 1]
not: [1]
or:  [1]
to:  [1, 1]

Reducer outputs:

be  2
not 1
or  1
to  2
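The whole walkthrough can be simulated locally in a few lines of Python — a sketch of map → shuffle → reduce semantics, not Hadoop API code:

```python
from collections import defaultdict

def map_fn(line):
    # Mapper: emit (word, 1) per token
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    # Framework-managed grouping: collect values by key, sorted by key
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return dict(sorted(groups.items()))

def reduce_fn(key, values):
    # Reducer: sum the grouped counts
    return key, sum(values)

pairs = list(map_fn("to be or not to be"))
grouped = shuffle(pairs)           # {"be": [1, 1], "not": [1], "or": [1], "to": [1, 1]}
result = dict(reduce_fn(k, vs) for k, vs in grouped.items())
# result == {"be": 2, "not": 1, "or": 1, "to": 2}
```

The same three functions, distributed across machines with the shuffle performed over the network, are exactly the word-count job above.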

MapReduce Example: End-to-End Flow

(Diagram: end-to-end pipeline view — map output → shuffle grouping → reducer aggregation)

Locality: The First-Order Optimization

HDFS block replicas + scheduler awareness enable data-local task placement:

  • Mapper launched on a node holding its input block → local disk read (~100 MB/s)
  • Remote read → network transfer (~10 MB/s per task under contention), ~10× slower

Practical consequence: On a 1PB dataset with replication=3:

  • 1000-node cluster, local placement → job completes in ~3 hours
  • Random placement → 30+ hours (network saturated)

This is not a micro-optimization; it's a 10× difference in feasibility.

Data Locality: Scheduling Tiers

The scheduler evaluates task placement in priority order:

  1. PROCESS_LOCAL — data already resident in the running process (same JVM)
  2. NODE_LOCAL — same node holds an input block replica (local disk read)
  3. RACK_LOCAL — same rack, different node
  4. ANY — any available node (last resort, network cost is high)

Default behavior: wait a few seconds (configurable) for a local slot before degrading to the next tier.

Question: When would you disable local scheduling? (Answer: never on commodity clusters; perhaps in cloud with elastic resources and fast networks.)

Shuffle Internals: Map-Side Buffer Management

Each mapper accumulates emitted (K₂, V₂) pairs in a circular buffer (typically 100MB).

When buffer fills to threshold (default 80%):

  1. Partition by key into R buckets (one per reducer)
  2. Sort within each bucket
  3. Optionally combine (if combiner function provided)
  4. Spill to local disk; clear buffer; continue mapping

Multiple spills per mapper → final merge step combines spill files.

Key trade-off: Smaller buffer = more frequent spills = more disk I/O; larger buffer = risk of OOM.

Map-Side Spill Cost (Back-of-the-Envelope)

Example: 10GB intermediate, 100MB buffer, 80% threshold
→ ~125 spill files.
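The spill count follows directly from the buffer parameters; a one-line model (decimal units, as in the example):

```python
import math

def spill_count(intermediate_bytes, buffer_bytes, threshold=0.8):
    # Each spill flushes ~threshold * buffer_bytes to local disk,
    # so the number of spills is intermediate size / flush size
    return math.ceil(intermediate_bytes / (buffer_bytes * threshold))

n = spill_count(10e9, 100e6)  # 10 GB intermediate, 100 MB buffer -> 125 spills
```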

Reduce-Side Shuffle: Pipelined Fetch & Merge

Reducers begin fetching map outputs before all mappers finish:

  • Fetch threads pull spill files from mappers via HTTP
  • Streamed to local disk or (if small) memory
  • External merge-sort across fetched spills
  • Reduce function processes iterator groups as data arrives

Critical parameter: io.sort.factor — max number of files merged in one pass.

Example: 100 spill files, sort.factor=10 → 2 passes (first pass merges 100 files into 10, second merges 10 into 1).
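The number of merge passes can be computed the same way the framework does it — each pass merges up to sort.factor files into one:

```python
import math

def merge_passes(num_files, sort_factor):
    # Each pass reduces the file count by a factor of sort_factor
    passes = 0
    while num_files > 1:
        num_files = math.ceil(num_files / sort_factor)
        passes += 1
    return passes

merge_passes(100, 10)  # -> 2 (100 files -> 10 -> 1)
merge_passes(125, 10)  # -> 3 (125 -> 13 -> 2 -> 1)
```

This is why spill count and sort.factor interact: more spills past a power of the factor means an extra full read/write pass over the data.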

Serialization & Compression: CPU-Network Trade-off

Choice of wire format affects CPU time and bytes moved.

  • Text (key\tvalue): high overhead, easy debug
  • Writable (binary): low overhead, Hadoop-native
  • Avro: schema evolution, interoperable
  • Snappy (map output): fast compression, reduces shuffle bytes

Compression Decision Check

Mapper emits 100GB intermediate.
Snappy compresses to 30GB.
Shuffle transfers 50GB after merge-side dedup.

Question: Is snappy worth the CPU cost if mappers run at 80% utilization?
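One way to structure the answer is a back-of-envelope comparison. The bandwidth and CPU-cost numbers below are illustrative assumptions, not benchmarks — the point is the shape of the trade-off:

```python
def shuffle_seconds(bytes_moved, bandwidth_bps):
    # Time to move the shuffle data at a given aggregate bandwidth
    return bytes_moved / bandwidth_bps

# Assumed numbers for illustration only:
BW = 125e9            # 125 GB/s aggregate cluster bandwidth
uncompressed = 50e9   # shuffle bytes without compression
compressed = 15e9     # same bytes at Snappy's ~30% ratio
cpu_overhead = 0.2    # assumed extra seconds of mapper CPU for compressing

t_plain = shuffle_seconds(uncompressed, BW)            # 0.4 s
t_snappy = shuffle_seconds(compressed, BW) + cpu_overhead  # 0.32 s

# Compression wins only when saved network time > added CPU time;
# at 80% mapper CPU utilization that headroom may not exist.
```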

Partitioning & Grouping: Correctness Guarantees

Hash partitioner (default): partition(K) = hash(K) mod R

  • Distributes keys uniformly if hash function is good
  • No guarantee on key ordering across reducer boundaries
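A minimal hash-partitioner sketch (Python's built-in hash() is salted per process, so a stable digest stands in for Hadoop's hashCode here):

```python
import hashlib

def partition(key, num_reducers):
    # partition(K) = hash(K) mod R; stable digest so the mapping
    # is identical on every node and every run
    h = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return h % num_reducers

# Every occurrence of a key lands on the same reducer; balance
# depends entirely on the hash spreading keys across R buckets
r = partition("hadoop", 4)
```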

Grouping comparator: Defines when two consecutive keys are "equal" for grouping.

  • Separate from sort order
  • Example: (year, month) pairs; group by year, sort by (year, month)

Secondary sort: composite key + grouping comparator
→ sorted streams per key inside each reducer.

Secondary Sort Example (Top-K per User)

# Sort key: (user_id, count DESC)
# Grouping: user_id only
# Reducer sees items sorted by count
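The mechanism above can be simulated locally: sort on the composite key, then group on the user_id prefix only, so each "reducer call" sees that user's items already ordered by count descending:

```python
from itertools import groupby

# records: (user_id, item, count)
records = [
    ("u1", "a", 3), ("u2", "x", 7), ("u1", "b", 9), ("u2", "y", 2),
]

# Composite sort key: (user_id asc, count desc) — mimics framework sort order
records.sort(key=lambda r: (r[0], -r[2]))

# Grouping comparator: user_id only
top_item = {}
for user, group in groupby(records, key=lambda r: r[0]):
    top_item[user] = next(group)[1]  # first record is the user's top item
# top_item == {"u1": "b", "u2": "x"}
```

Top-1 falls out for free; top-K is just taking the first K records of each group instead of one.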

Skew: The Hidden Killer of MapReduce

Problem: Heavy-tailed key distributions cause reducer load imbalance.

Example: Zipf distribution (many queries have a few popular terms):

  • 1% of keys receive 50% of traffic
  • All reducers wait for the stragglers processing hot keys

Skew Mitigation Toolkit

  1. Salting: add random prefix to hot keys (two-stage job)
  2. Custom partitioner: pre-sample to estimate key frequencies
  3. Two-phase aggregation: pre-aggregate per mapper
  4. Adaptive reducers: increase R for known hotspots

When to apply: If key frequency CV > 2, skew is likely dominant.
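A sketch of technique 1 (salting), assuming 10 salt buckets: stage 1 spreads a hot key across salted partial keys, stage 2 strips the salt and merges the partial counts.

```python
import random
from collections import Counter

SALTS = 10  # assumed number of salt buckets for hot keys

def salted_map(tokens, hot_keys):
    # Stage 1 map: hot keys get a random salt suffix so their
    # pairs spread across SALTS reducers instead of one
    for t in tokens:
        if t in hot_keys:
            yield f"{t}#{random.randrange(SALTS)}"
        else:
            yield t

def unsalt_reduce(partials):
    # Stage 2: strip the salt and merge partial counts per real key
    total = Counter()
    for key, count in partials.items():
        total[key.split("#")[0]] += count
    return total

tokens = ["the"] * 1000 + ["rare"] * 3
partials = Counter(salted_map(tokens, hot_keys={"the"}))  # <=10 "the#i" keys
final = unsalt_reduce(partials)
# final == {"the": 1000, "rare": 3}, but no stage-1 reducer saw all 1000 pairs
```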

Fault Tolerance: Design Principle

Core insight: Tasks are deterministic functions of their input split.

→ If a task fails, re-run on same (or replicated) input → identical output.

Requirement: Map and reduce functions must be free of side effects.

Speculative Execution Trade-off

Launch duplicate task if one runs slowly. First to finish wins.

Trade-off: Reduces tail latency ↔ increases network load.

Disable when shuffle is bottleneck.

Cost Model: Reasoning About Performance

T_job ≈ T_map + T_shuffle + T_reduce

Where:

  • T_map ≈ (input bytes ÷ mappers) ÷ local disk read rate
  • T_shuffle ≈ (intermediate bytes) ÷ (aggregate network bandwidth)
  • T_reduce ≈ (intermediate bytes ÷ reducers) ÷ per-reducer processing rate

Combiners reduce intermediate bytes → T_shuffle scales down proportionally.

Cost Model: Example

  • Input: 100GB, 100 mappers
  • Intermediate: 200GB → combiner → 60GB (3× reduction)
  • Cluster: 1000 nodes × 1 Gbps ≈ 125 GB/s aggregate bandwidth
  • T_shuffle: 60 GB ÷ 125 GB/s ≈ 0.5 sec (theoretical) vs ~60 sec (realistic, once contention and stragglers are included)
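The arithmetic behind the theoretical number:

```python
def shuffle_time(intermediate_bytes, aggregate_bw_bps):
    # T_shuffle = intermediate bytes / aggregate network bandwidth
    return intermediate_bytes / aggregate_bw_bps

bw = 1000 * (1e9 / 8)       # 1000 nodes x 1 Gbps = 125 GB/s aggregate
t = shuffle_time(60e9, bw)  # 60 GB post-combiner -> 0.48 s theoretical
```

The ~100× gap to the realistic figure is the point: the model gives a lower bound, and everything above it is contention, stragglers, and disk.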

What to Measure on Your Job

Enable Hadoop metrics:

# In job logs, find lines like:
# Map input records = X
# Map output records = Y (Y > X for fan-out, Y < X for filtering, Y ≈ X for 1:1 transforms)
# Spilled records = Z (if Z >> Y, records are being spilled and re-merged multiple times)
# Combine input records = ...

Compute your own:

  • Intermediate size growth: (output records × avg key/value size in bytes)
  • Actual vs theoretical network time: (intermediate bytes) ÷ (aggregate BW)
  • Skew factor: max_reducer_output ÷ avg_reducer_output (should be <2 for balance)
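The skew factor is a one-liner worth computing from the job counters:

```python
def skew_factor(reducer_outputs):
    # max reducer output / mean reducer output; < 2 suggests balance
    mean = sum(reducer_outputs) / len(reducer_outputs)
    return max(reducer_outputs) / mean

skew_factor([100, 110, 95, 105])  # ~1.07 -> balanced
skew_factor([1000, 10, 10, 10])   # ~3.9  -> one hot reducer dominates
```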

MapReduce is Not a Panacea

When MapReduce excels:

  • Batch analytics on massive datasets with high latency tolerance
  • Workflows with clear map/reduce semantics (aggregation, grouping, joins)
  • Commodity clusters with replication and data locality

When MapReduce is overkill or wrong:

  • Low-latency queries (<1 second)
  • Iterative algorithms (Spark is better)
  • Streaming data with tight timing constraints
  • Complex DAGs of interdependent stages (use Spark/Tez)

The Lesson

Understand the design constraints and trade-offs. MapReduce isn't "big data 101"—it's a specific architecture for a specific problem class.

One-pass batch analytics on commodity clusters. When you have that problem, MapReduce is unbeatable. When you don't, use the right tool.

Example 1: Easy — Word Count (Aggregation)

Problem: Count occurrences of each word in a corpus.

Mapper: (word, 1) per token

(the, 1), (quick, 1), (brown, 1), ...

Reducer: Sum counts per word

(the, 523), (quick, 15), (brown, 8), ...

Why it works: Aggregation is embarrassingly parallel; combiner reduces shuffle volume.

Real-world variant: Count ICD-10 diagnostic codes per patient (CKD example).

Example 2: Medium — Inverted Index (Join-like)

Problem: Build an inverted index (term → documents containing it).

Mapper: Emit (term, docid) for each term in document

doc1: "hadoop distributed computing"
→ (hadoop, doc1), (distributed, doc1), (computing, doc1)

Reducer: Group all docids per term

(hadoop, [doc1, doc3, doc7, ...])
(distributed, [doc1, doc2, ...])

Why it's harder: Shuffle groups values; must handle many values per key (memory efficiency). Combiner doesn't help (no reduction possible).

Complexity: O(n log n) for sort; reducer becomes I/O bottleneck with skewed terms.
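A local sketch of the inverted-index job (toy documents, simulating the map emit and the shuffle's grouping):

```python
from collections import defaultdict

docs = {
    "doc1": "hadoop distributed computing",
    "doc2": "distributed systems",
    "doc3": "hadoop streaming",
}

# Map: emit (term, docid) per term in each document
pairs = [(term, docid)
         for docid, text in docs.items()
         for term in text.split()]

# Shuffle + reduce: group docids per term; sorting gives determinism
index = defaultdict(list)
for term, docid in sorted(pairs):
    index[term].append(docid)
# index["hadoop"] == ["doc1", "doc3"]
```

In the real job the value list per term can be huge, so the reducer must stream it (write postings incrementally) rather than hold it in memory — that is the memory-efficiency point above.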

Example 3: Hard — Graph Algorithms (Iterative)

Problem: PageRank or shortest path on a graph.

MapReduce iteration (single pass):

  • Mapper: For each edge (u, v), emit contribution to neighbor's next rank
    rank(u) / degree(u) → v
    
  • Reducer: Aggregate contributions, compute new rank
    rank'(v) = 0.15 + 0.85 × Σ contributions
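One iteration of this map/reduce pair on a toy three-node graph (using the simplified damping formula shown above):

```python
from collections import defaultdict

# Toy graph: adjacency lists (outgoing links per node)
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {v: 1.0 / len(graph) for v in graph}  # uniform initial rank

def pagerank_round(graph, rank, damping=0.85):
    # Map: each node sends rank(u)/degree(u) to every neighbor
    contrib = defaultdict(float)
    for u, neighbors in graph.items():
        for v in neighbors:
            contrib[v] += rank[u] / len(neighbors)
    # Reduce: rank'(v) = 0.15 + 0.85 * sum(contributions)
    return {v: (1 - damping) + damping * contrib[v] for v in graph}

rank = pagerank_round(graph, rank)  # one full MapReduce job per call
```

Each call to `pagerank_round` corresponds to one complete job — map, shuffle, reduce — which is precisely why iterating to convergence is painful here.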
    

Example 3 (continued): Why It's Hard

Challenges:

  • Requires multiple MapReduce rounds (iterate until convergence)
  • Shuffle and I/O dominate; state is spread across reducers
  • Convergence criterion must be checked externally
  • Order dependencies: rank(t+1) depends on rank(t)

Better approach: Spark or GraphLab (designed for iterative algorithms).

Example 4: Very Hard — K-means Clustering (Multi-Round)

Problem: Cluster n-dimensional points into k clusters (requires iteration).

Multi-round MapReduce (Mahout approach):

Round t:

  • Mapper: Assign each point to nearest centroid
    point(x, y) → (nearest_centroid_id, (x, y))
    
  • Reducer: Compute new centroid from assigned points
    (cluster_id, [points]) → (cluster_id, new_centroid)
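A minimal sketch of one round on 2-D points (the centroid list plays the role of the broadcast state):

```python
from collections import defaultdict

def kmeans_round(points, centroids):
    # Map: assign each point to its nearest centroid
    # (centroids are broadcast to every mapper before the round)
    assigned = defaultdict(list)
    for x, y in points:
        cid = min(range(len(centroids)),
                  key=lambda c: (x - centroids[c][0]) ** 2 +
                                (y - centroids[c][1]) ** 2)
        assigned[cid].append((x, y))
    # Reduce: new centroid = mean of assigned points per cluster
    return [(sum(p[0] for p in pts) / len(pts),
             sum(p[1] for p in pts) / len(pts))
            for cid, pts in sorted(assigned.items())]

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans_round(points, [(0, 0), (10, 10)])
# centroids == [(0.0, 0.5), (10.0, 10.5)]
```

An external driver would call this in a loop, re-broadcasting the new centroids and checking movement against a convergence threshold — each loop iteration being a full MapReduce job.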
    

Why it's "very hard":

  • Requires t rounds (t unknown beforehand, typically 10-50)
  • Each round: full map + shuffle + reduce cycle
  • Centroids must be broadcast to mappers
  • Convergence check requires external driver code

Example 4 (continued): Cost of Iteration

Per-round cost: Full shuffle of all n points, I/O at every stage, network overhead.

Real example: K-means on 1GB with 15 iterations

  • 15 complete MapReduce jobs
  • 15 shuffles, 15 I/O cycles
  • ~150× slower than single-pass word count on same data

Lesson: Iterative algorithms strain MapReduce's one-pass design.

Better approach: Spark MLlib (in-memory RDDs, sub-second convergence checks).

References

  • Dean & Ghemawat (2004): "MapReduce: Simplified Data Processing on Large Clusters"
  • Hadoop official documentation: Scheduler, InputFormat, Partitioner

Speaker context: This slide deck positions MapReduce not as a simplistic "easy way to process big data" but as a principled solution to the distributed systems challenge of scalable analytics on commodity clusters. We'll interrogate design choices, trade-offs, and performance implications. Students should emerge understanding *why* the framework is architected as it is, not merely *how* to use it.

Speaker notes: Use this diagram to reinforce the word-count walkthrough with a full pipeline view. Emphasize the transition from map output to shuffle grouping and reducer aggregation. Point out where data locality matters and where network costs dominate (shuffle).

Speaker notes: Use actual numbers. "Let's compute: 1PB / 1000 mappers = 1TB per mapper. If local reads are 100 MB/sec (disk bound), that's 10,000 seconds = 3 hours. If we force remote reads at 10 MB/sec (network bound), that's 100,000 seconds = 27 hours. Plus network contention means all 1000 mappers compete for bandwidth."

Speaker notes: Explain why YARN has this logic. "The framework makes the trade-off: delay task start by a few seconds to ensure locality, rather than launch immediately on a random node incurring network cost for every input block read."

Speaker notes: Walk through the math. "If your mapper emits 10GB of data but your sort buffer is 100MB at 80% threshold, you get (10GB)/(100MB × 0.8) ≈ 125 spill files. Each spill writes ~80MB (the 80% of the buffer that triggered it), so that's ~10GB of local disk I/O just to buffer data. This is why the combiner matters—it reduces intermediate size."

Expected reasoning: "If mappers are already CPU-bound, snappy adds latency. If shuffle is network-bottleneck, snappy saves time overall." Point out: depends on cluster characteristics (network speed, CPU clock).

Speaker notes: "Salting example: if 'the' is 1M occurrences out of 1B tokens, hash it randomly to 10 buckets, so each receives 100K. Then union results afterward. Cost: 2x the job time, but now parallelism is restored."