Input line:
to be or not to be
Mapper emits:
(to, 1)
(be, 1)
(or, 1)
(not, 1)
(to, 1)
(be, 1)
Shuffle groups by key:
be: [1, 1]
not: [1]
or: [1]
to: [1, 1]
Reducer outputs:
be 2
not 1
or 1
to 2
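The walkthrough above can be reproduced as a single-process sketch (Hadoop Streaming style, not the real Java API; function names are illustrative):

```python
from collections import defaultdict

def mapper(line):
    """Map phase: emit (word, 1) for each token."""
    for word in line.split():
        yield (word, 1)

def shuffle(pairs):
    """Shuffle phase: group values by key, keys in sorted order."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return sorted(groups.items())

def reducer(key, values):
    """Reduce phase: sum the counts for one key."""
    return (key, sum(values))

pairs = list(mapper("to be or not to be"))
grouped = shuffle(pairs)                       # [('be', [1, 1]), ('not', [1]), ...]
counts = [reducer(k, vs) for k, vs in grouped]
# counts == [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

In the real framework the shuffle step is performed by the system between the map and reduce tasks, not by user code.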

HDFS block replicas + scheduler awareness enable data-local task placement:
Practical consequence: On a 1PB dataset with replication=3:
This is not a micro-optimization; it's a 10× difference in feasibility.
The scheduler evaluates task placement in priority order:
Default behavior: wait up to 5 minutes for local placement before degrading to ANY.
Question: When would you disable local scheduling? (Answer: never on commodity clusters; perhaps in cloud with elastic resources and fast networks.)
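The priority order can be sketched as a toy placement function. Node names, racks, and the three locality levels below are illustrative; YARN's actual scheduler also handles delay timers, queues, and container sizing:

```python
def place_task(split_replicas, free_nodes, rack_of):
    """Pick a node for a task in priority order:
    node-local > rack-local > any (illustrative sketch, not YARN's code)."""
    # 1. Node-local: a free node that holds a replica of the input split.
    for node in free_nodes:
        if node in split_replicas:
            return node, "NODE_LOCAL"
    # 2. Rack-local: a free node sharing a rack with some replica.
    replica_racks = {rack_of[n] for n in split_replicas}
    for node in free_nodes:
        if rack_of[node] in replica_racks:
            return node, "RACK_LOCAL"
    # 3. Any free node (pays full network cost for every block read).
    return free_nodes[0], "ANY"

rack_of = {"n1": "r1", "n2": "r1", "n3": "r2", "n4": "r2"}
node, level = place_task({"n3"}, ["n1", "n4"], rack_of)
# n3 is busy, but n4 shares rack r2 with it -> ("n4", "RACK_LOCAL")
```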
Each mapper accumulates emitted (K₂, V₂) pairs in a circular buffer (typically 100MB).
When buffer fills to threshold (default 80%):
Multiple spills per mapper → final merge step combines spill files.
Key trade-off: Smaller buffer = more frequent spills = more disk I/O; larger buffer = risk of OOM.
Example: 10GB intermediate, 100MB buffer, 80% threshold
→ ~125 spill files.
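A quick sanity check of the spill arithmetic (decimal units, matching the ~125 figure above):

```python
import math

def spill_count(intermediate_bytes, buffer_bytes, threshold=0.8):
    """Each spill flushes roughly threshold * buffer bytes to disk."""
    per_spill = buffer_bytes * threshold
    return math.ceil(intermediate_bytes / per_spill)

GB, MB = 10**9, 10**6
n = spill_count(10 * GB, 100 * MB)   # 10 GB / 80 MB per spill = 125 spill files
```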
Reducers begin fetching map outputs before all mappers finish:
Critical parameter: io.sort.factor — max number of files merged in one pass.
Example: 100 spill files, sort.factor=10 → 2 passes (merge 100→10 files, then 10→1).
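The pass count generalizes to any file count. A small helper, assuming each pass merges up to sort.factor files into one (Hadoop's real merge planner is slightly more subtle about the final pass):

```python
import math

def merge_passes(num_files, sort_factor):
    """Passes needed to reduce num_files to one output file
    when at most sort_factor files can be merged per pass."""
    passes = 0
    while num_files > 1:
        num_files = math.ceil(num_files / sort_factor)
        passes += 1
    return passes

merge_passes(100, 10)  # 100 -> 10 -> 1: 2 passes
merge_passes(125, 10)  # 125 -> 13 -> 2 -> 1: 3 passes
```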
Choice of wire format affects CPU time and bytes moved.
Mapper emits 100GB intermediate.
Snappy compresses to 30GB.
Shuffle transfers 50GB after merge-side dedup.
Question: Is snappy worth the CPU cost if mappers run at 80% utilization?
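One way to reason about the question: compare CPU time spent compressing against network time saved. The bandwidth and codec throughput figures below are illustrative assumptions, and the model ignores CPU/network overlap:

```python
def shuffle_time(raw_gb, ratio, net_mb_s, codec_mb_s=None):
    """Back-of-envelope shuffle cost per node, in seconds.
    ratio: compressed/raw size; codec_mb_s: compression throughput
    (None = no compression). All numbers are illustrative."""
    raw_mb = raw_gb * 1000
    if codec_mb_s is None:
        return raw_mb / net_mb_s
    cpu = raw_mb / codec_mb_s          # time to compress
    wire = raw_mb * ratio / net_mb_s   # time to ship the compressed bytes
    return cpu + wire

uncompressed = shuffle_time(100, 1.0, 100)            # 100 GB over a 100 MB/s link
snappy = shuffle_time(100, 0.3, 100, codec_mb_s=400)
# 1000 s vs 250 + 300 = 550 s: compression wins while the network is the bottleneck
```

If the mappers are already near CPU saturation, the 250 s of codec time competes with map work and the comparison tightens, which is the point of the question.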
Hash partitioner (default): partition(K) = hash(K) mod R
Grouping comparator: Defines when two consecutive keys are "equal" for grouping.
Secondary sort: composite key + grouping comparator.
Example: (year, month) pairs; sort by (year, month), group by year
→ sorted streams per key inside each reducer.
# Sort key: (user_id, count DESC)
# Grouping: user_id only
# Reducer sees items sorted by count
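The composite-key setup above can be simulated in plain Python: sort on (user_id, -count) the way the framework sorts the composite key, then group on user_id alone as the grouping comparator would:

```python
from itertools import groupby

# Records: (user_id, item, count). Composite sort key: (user_id, -count),
# grouping comparator: user_id only.
records = [
    ("u1", "a", 3), ("u2", "x", 9), ("u1", "b", 7), ("u2", "y", 1),
]
records.sort(key=lambda r: (r[0], -r[2]))                 # framework sort phase
result = {}
for user, group in groupby(records, key=lambda r: r[0]):  # grouping step
    result[user] = [(item, count) for _, item, count in group]
# result["u1"] == [("b", 7), ("a", 3)]  -- items arrive sorted by count DESC
```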
Problem: Heavy-tailed key distributions cause reducer load imbalance.
Example: Zipf-distributed query terms (a few popular terms account for most occurrences):
When to apply: If key frequency CV > 2, skew is likely dominant.
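One common mitigation is salting: split a hot key across several random buckets, count per bucket, then strip the salt and merge partial counts in a second round. A sketch, where SALTS and the `key#salt` format are arbitrary choices:

```python
import random
from collections import Counter

SALTS = 10  # buckets to split each hot key across (illustrative)

def salted_map(pairs):
    """Stage 1 map: append a random salt so a hot key spreads
    across SALTS reducers instead of landing on one."""
    for key, value in pairs:
        yield (f"{key}#{random.randrange(SALTS)}", value)

def unsalt_reduce(partials):
    """Stage 2: strip the salt and merge the partial counts."""
    totals = Counter()
    for salted_key, count in partials.items():
        totals[salted_key.split("#")[0]] += count
    return totals

tokens = [("the", 1)] * 1_000_000 + [("zipf", 1)]
partials = Counter()
for key, value in salted_map(tokens):   # stage-1 reduce: count per salted key
    partials[key] += value
# No single salted bucket sees the full 1M; totals still recover exactly.
```

The cost is a second aggregation round, which matches the "2x the job time" note in the speaker notes.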
Core insight: Tasks are deterministic functions of their input split.
→ If a task fails, re-run on same (or replicated) input → identical output.
Requirement: Map and reduce functions must be free of side effects.
Launch duplicate task if one runs slowly. First to finish wins.
Trade-off: reduces tail latency, but increases network and cluster load.
Disable when shuffle is bottleneck.
Where:
Combiners reduce shuffle volume by pre-aggregating map output on the mapper side before it is spilled and transferred.
Enable Hadoop metrics:
# In job logs, find lines like:
# Map input records = X
# Map output records = Y (expect Y > X for fan-out, Y ≈ X for filtering)
# Spilled records = Z (if Z >> Y, records are hitting disk multiple times during merges)
# Combine input records = ...
Compute your own:
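One way to turn the counters above into health ratios; the counter values, names, and interpretations below are illustrative:

```python
def shuffle_health(map_input, map_output, spilled, combine_in, combine_out):
    """Derived ratios from the job counters above (values illustrative)."""
    return {
        "fan_out": map_output / map_input,            # >1 = fan-out, <1 = filtering
        "spill_ratio": spilled / max(map_output, 1),  # >1 = records re-spilled in merges
        "combine_reduction": 1 - combine_out / max(combine_in, 1),
    }

stats = shuffle_health(map_input=1_000_000, map_output=5_000_000,
                       spilled=12_500_000, combine_in=5_000_000,
                       combine_out=500_000)
# fan_out 5.0; spill_ratio 2.5 (records written to disk 2.5x);
# combine_reduction 0.9 (combiner cuts shuffle volume by 90%)
```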
When MapReduce excels:
When MapReduce is overkill or wrong:
Understand the design constraints and trade-offs. MapReduce isn't "big data 101"—it's a specific architecture for a specific problem class.
One-pass batch analytics on commodity clusters. When you have that problem, MapReduce is unbeatable. When you don't, use the right tool.
Problem: Count occurrences of each word in a corpus.
Mapper: (word, 1) per token
(the, 1), (quick, 1), (brown, 1), ...
Reducer: Sum counts per word
(the, 523), (quick, 15), (brown, 8), ...
Why it works: Aggregation is embarrassingly parallel; combiner reduces shuffle volume.
Real-world variant: Count ICD-10 diagnostic codes per patient (CKD example).
Problem: Build an inverted index (term → documents containing it).
Mapper: Emit (term, docid) for each term in document
doc1: "hadoop distributed computing"
→ (hadoop, doc1), (distributed, doc1), (computing, doc1)
Reducer: Group all docids per term
(hadoop, [doc1, doc3, doc7, ...])
(distributed, [doc1, doc2, ...])
Why it's harder: Shuffle groups values; must handle many values per key (memory efficiency). Combiner doesn't help (no reduction possible).
Complexity: O(n log n) for sort; reducer becomes I/O bottleneck with skewed terms.
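A single-process sketch of the inverted-index job. Note that a real reducer streams doc ids per term rather than materializing a full posting list in memory, which is exactly the efficiency concern noted above; the toy corpus here is small enough not to care:

```python
from collections import defaultdict

docs = {
    "doc1": "hadoop distributed computing",
    "doc2": "distributed systems",
    "doc3": "hadoop streaming",
}

def mapper(doc_id, text):
    """Emit (term, doc_id) for each term in the document."""
    for term in text.split():
        yield (term, doc_id)

# Shuffle + reduce: group doc ids by term.
index = defaultdict(list)
for doc_id, text in sorted(docs.items()):
    for term, d in mapper(doc_id, text):
        index[term].append(d)

# index["hadoop"]      -> ['doc1', 'doc3']
# index["distributed"] -> ['doc1', 'doc2']
```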
Problem: PageRank or shortest path on a graph.
MapReduce iteration (single pass):
rank(u) / degree(u) → v
rank'(v) = 0.15 + 0.85 × Σ contributions
Challenges:
Better approach: Spark or GraphLab (designed for iterative algorithms).
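One PageRank iteration, written as the map/reduce pair the formulas above describe. The graph is a toy; a real MapReduce job must re-read and re-shuffle the full adjacency data on every iteration, which is the core inefficiency:

```python
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
rank = {"a": 1.0, "b": 1.0, "c": 1.0}

def pagerank_round(graph, rank, damping=0.85):
    # Map: each node u sends rank(u) / degree(u) to every neighbor v.
    contributions = {v: 0.0 for v in graph}
    for u, neighbors in graph.items():
        share = rank[u] / len(neighbors)
        for v in neighbors:
            contributions[v] += share
    # Reduce: rank'(v) = 0.15 + 0.85 * sum of contributions to v.
    return {v: (1 - damping) + damping * c for v, c in contributions.items()}

rank = pagerank_round(graph, rank)
# a: 0.15 + 0.85*1.0 = 1.0; b: 0.15 + 0.85*0.5 = 0.575; c: 0.15 + 0.85*1.5 = 1.425
```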
Problem: Cluster n-dimensional points into k clusters (requires iteration).
Multi-round MapReduce (Mahout approach):
Round t:
point(x, y) → (nearest_centroid_id, (x, y))
(cluster_id, [points]) → (cluster_id, new_centroid)
Why it's "very hard":
Per-round cost: Full shuffle of all n points, I/O at every stage, network overhead.
Real example: K-means on 1GB with 15 iterations
Lesson: Iterative algorithms strain MapReduce's one-pass design.
Better approach: Spark MLlib (in-memory RDDs, sub-second convergence checks).
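One Mahout-style round, as a single-process sketch; the real driver broadcasts centroids to mappers and re-reads all points from HDFS every round, which is the per-round cost cited above:

```python
import math

def kmeans_round(points, centroids):
    """One MapReduce round of k-means.
    Map: assign each point to its nearest centroid.
    Reduce: average each cluster's points into a new centroid."""
    # Map phase: emit (nearest_centroid_id, point) -- a full shuffle of every point.
    assigned = {}
    for x, y in points:
        cid = min(range(len(centroids)),
                  key=lambda i: math.dist((x, y), centroids[i]))
        assigned.setdefault(cid, []).append((x, y))
    # Reduce phase: (cluster_id, [points]) -> (cluster_id, new_centroid).
    return [
        (sum(p[0] for p in pts) / len(pts), sum(p[1] for p in pts) / len(pts))
        for cid, pts in sorted(assigned.items())
    ]

points = [(0, 0), (0, 1), (10, 10), (10, 11)]
centroids = kmeans_round(points, [(0, 0), (10, 10)])
# [(0.0, 0.5), (10.0, 10.5)] -- convergence requires repeating this whole job
```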
Speaker context: This slide deck positions MapReduce not as a simplistic "easy way to process big data" but as a principled solution to the distributed systems challenge of scalable analytics on commodity clusters. We'll interrogate design choices, trade-offs, and performance implications. Students should emerge understanding *why* the framework is architected as it is, not merely *how* to use it.
Speaker notes: Use this diagram to reinforce the word-count walkthrough with a full pipeline view. Emphasize the transition from map output to shuffle grouping and reducer aggregation. Point out where data locality matters and where network costs dominate (shuffle).
Speaker notes: Use actual numbers. "Let's compute: 1PB / 1000 mappers = 1TB per mapper. If local reads are 100 MB/sec (disk bound), that's 10,000 seconds = 3 hours. If we force remote reads at 10 MB/sec (network bound), that's 100,000 seconds = 27 hours. Plus network contention means all 1000 mappers compete for bandwidth."
Speaker notes: Explain why YARN has this logic. "The framework makes the trade-off: delay task start by a few seconds to ensure locality, rather than launch immediately on a random node incurring network cost for every input block read."
Speaker notes: Walk through the math. "If your mapper emits 10GB of data but your sort buffer is 100MB at an 80% threshold, you get (10GB)/(100MB × 0.8) ≈ 125 spill files. Each spill write is ~80MB (the threshold fraction of the buffer), so that's ~10GB of disk I/O just to buffer data. This is why combiner matters—it reduces intermediate size."
Expected reasoning: "If mappers are already CPU-bound, snappy adds latency. If shuffle is network-bottleneck, snappy saves time overall." Point out: depends on cluster characteristics (network speed, CPU clock).
Speaker notes: "Salting example: if 'the' is 1M occurrences out of 1B tokens, hash it randomly to 10 buckets, so each receives 100K. Then union results afterward. Cost: 2x the job time, but now parallelism is restored."