Converting Algorithms to MapReduce

CS 6500 — Week 4, Session 1

Why This Matters

  • MapReduce isn't a magic black box—it's a computational pattern
  • Most aggregation problems map onto: parse → group → aggregate
  • The same problem can be solved different ways with different trade-offs
  • Understanding trade-offs makes you a better engineer

Today's Targets

You will be able to:

  • Translate an algorithm into mapper and reducer logic
  • Identify what gets emitted at each phase
  • Reason about storage and computation trade-offs
  • Implement solutions with increasing efficiency

The Dataset

Gas Sensors for Home Activity Monitoring

Our task: Calculate mean (average) of column R2

Task 1: One-Step Mean

The Simple Approach

Problem: Calculate mean of R2 column

Formula: mean(R2) = (Σ R2ᵢ) / N

Key idea: Emit partial sums and counts, then divide at the end

Task 1: Mapper (mrjob)

from mrjob.job import MRJob

class OneStepMean(MRJob):
    def mapper(self, _, line):
        if line.startswith('id'):        # skip the header row
            return
        try:
            fields = line.split()
            r2_value = float(fields[3])  # R2 at index 3
            yield 'sum', r2_value
            yield 'count', 1
        except (IndexError, ValueError): # skip malformed rows
            pass

Emits: sum → R2_value and count → 1 per record

Task 1: Reducer (mrjob)

    def reducer(self, key, values):
        total = sum(values)   # one call for 'sum', one for 'count'
        yield key, total

The final division (sum / count) happens after the job, since 'sum' and 'count' arrive as separate reducer calls

Storage: O(N) in shuffle
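
The whole Task 1 pipeline can be simulated locally in plain Python (no Hadoop or mrjob required). The sample lines below are invented to mimic the dataset's whitespace-separated layout, with R2 at field index 3:

```python
from collections import defaultdict

def mapper(line):
    if line.startswith('id'):          # skip the header row
        return
    try:
        fields = line.split()
        yield 'sum', float(fields[3])  # R2 at index 3
        yield 'count', 1
    except (IndexError, ValueError):
        pass

def reducer(key, values):
    yield key, sum(values)

lines = [
    "id time R1  R2",                  # made-up header
    "1  0.0  5.1 10.0",
    "2  0.1  5.2 14.0",
]

# Simulated shuffle: group every emitted value by key
grouped = defaultdict(list)
for line in lines:
    for k, v in mapper(line):
        grouped[k].append(v)

totals = dict(kv for k, vs in grouped.items() for kv in reducer(k, vs))
mean = totals['sum'] / totals['count']
print(mean)  # 12.0
```

The defaultdict plays the role of the shuffle phase: it groups every emitted value under its key before any reducer runs.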

Task 2: Two-Step Mean (Efficient)

Partition → Aggregate → Merge

Problem: Same, but we want to reduce shuffle volume

Key insight: Pre-aggregate in mapper, then merge at the end

Strategy:

  1. Step 1: Partition R2 into √N groups, sum each group
  2. Step 2: Merge partial sums to compute final mean

Task 2, Step 1: Mapper (mrjob)

def mapper_step1(self, _, line):
    if line.startswith('id'):
        return
    try:
        fields = line.split()
        r2_value = float(fields[3])   # R2 at index 3
        # TOTAL_RECORDS is a class constant known ahead of time;
        # requires `import math` at the top of the file
        num_partitions = int(math.sqrt(self.TOTAL_RECORDS))
        bucket_id = hash(line) % num_partitions
        yield f'bucket_{bucket_id}', r2_value
    except (IndexError, ValueError):  # skip malformed rows
        pass

Distributes records across √N buckets

Task 2, Step 1: Reducer (mrjob)

def reducer_step1(self, bucket_id, values):
    values_list = list(values)
    bucket_sum = sum(values_list)
    bucket_count = len(values_list)
    yield 'result', (bucket_sum, bucket_count)

Output: √N partial aggregates — O(√N) shuffle

Task 2, Step 2: Mapper (mrjob)

def mapper_step2(self, key, value):
    yield key, value

Pass-through (Step 1 output → Step 2 input)

Task 2, Step 2: Reducer (mrjob)

def reducer_step2(self, key, values):
    total_sum = total_count = 0
    for bucket_sum, bucket_count in values:
        total_sum += bucket_sum
        total_count += bucket_count
    if total_count > 0:
        yield 'mean', total_sum / total_count

Same result, O(√N) shuffle total
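
The two-step flow can also be checked locally. This sketch uses 100 synthetic values and a round-robin bucket assignment in place of hash(line), an assumption made so the example is deterministic:

```python
import math
from collections import defaultdict

values = [float(i) for i in range(1, 101)]    # 100 synthetic R2 readings
num_partitions = int(math.sqrt(len(values)))  # √N = 10 buckets

# Step 1: scatter values across buckets, then pre-aggregate each bucket
buckets = defaultdict(list)
for i, v in enumerate(values):
    buckets[i % num_partitions].append(v)     # round-robin stand-in for hash()
partials = [(sum(vs), len(vs)) for vs in buckets.values()]

# Only √N (sum, count) pairs cross the second shuffle.
# Step 2: merge the partials into the final mean
total_sum = sum(s for s, _ in partials)
total_count = sum(c for _, c in partials)
mean = total_sum / total_count
print(len(partials), mean)  # 10 50.5
```

The key observation: only the 10 (sum, count) pairs move between the two steps, not the 100 raw values.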

Task 2 Storage Comparison

Phase            Task 1   Task 2
Step 1 shuffle   O(N)     O(√N)
Step 2 shuffle   n/a      O(√N)
Total            O(N)     O(√N)

Trade-off: Add one extra job → save network bandwidth

Task 3: All Columns Mean (Stretch Goal)

Efficient Multi-Column Aggregation

Problem: Calculate mean for all sensor columns (R1, R2, ..., R8)

Key idea: Emit a (column_name, value) pair for every column in a single pass

Why efficiency? One job (one shuffle) instead of eight single-column jobs

Task 3: Mapper (mrjob)

class AllColumnsMean(MRJob):
    def mapper(self, _, line):
        if line.startswith('id'):        # skip the header row
            return
        try:
            fields = line.split()
            # R1-R8 are at indices 2-9 (fields 3-10)
            for col_idx in range(2, 10):
                col_value = float(fields[col_idx])
                col_name = f'R{col_idx - 1}'
                yield col_name, col_value
        except (IndexError, ValueError): # skip malformed rows
            pass

Emits (column_name, value) for R1-R8

Task 3: Reducer (mrjob)

    def reducer(self, col_name, values):
        values_list = list(values)
        count = len(values_list)
        total = sum(values_list)
        if count > 0:
            yield col_name, total / count

All 8 sensor means in one job
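
A variant worth showing students: the per-column emissions can be packed into a single key-value pair carrying the whole vector of column values, so each record crosses the shuffle exactly once. A local sketch with a toy 3-column layout (the column positions here are invented for the example):

```python
lines = [
    "id R1  R2  R3",       # toy header: three sensor columns
    "1  1.0 2.0 3.0",
    "2  3.0 4.0 5.0",
]

def mapper(line):
    if line.startswith('id'):              # skip the header row
        return
    fields = line.split()
    # One emission per record: the full vector of column values plus a count
    yield 'stats', ([float(f) for f in fields[1:4]], 1)

def reducer(values):
    sums, count = [0.0] * 3, 0
    for vec, c in values:
        sums = [s + v for s, v in zip(sums, vec)]
        count += c
    return [s / count for s in sums]       # element-wise means

emitted = [kv for line in lines for kv in mapper(line)]
means = reducer(v for _, v in emitted)
print(means)  # [2.0, 3.0, 4.0]
```

Trade-off: one emission per record instead of one per column, at the cost of a reducer that must know the vector width.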

Key Takeaways

  1. Mapper = "what facts do I extract from each input record?"
  2. Reducer = "how do I combine facts to answer the question?"
  3. Shuffle = "what data moves between mapper and reducer?"

Trade-offs:

  • Task 1: Simple, but O(N) shuffle
  • Task 2: Two jobs, but O(√N) shuffle
  • Task 3: One job, handles multiple columns

Activity: Redesign a Problem (15 min)

Format: Pairs

Task: Choose one of these problems and sketch the mapper + reducer:

  • Calculate median of R2 column
  • Count occurrences of each R2 value
  • Find max and min of R2 simultaneously

Deliverable: Pseudocode for mapper and reducer

Debrief Prompts

  • What gets emitted from the mapper?
  • How does the reducer know where one key's values end and the next key's begin?
  • What trade-offs did you make?

Wrap-Up & Next Steps

Next Session: Assignment #1

Speaker context: This session teaches students how to think algorithmically about MapReduce problems. We take a simple task (calculating mean) and explore three different approaches with increasing efficiency. Students learn to translate "what computation do I need?" into "what does my mapper do?" and "what does my reducer do?". Using real data (gas sensors dataset) keeps it grounded.

Speaker notes: Frame this as demystifying MapReduce. Once students see how to break down a real problem (mean calculation), they can apply the pattern to any aggregation, join, or transformation task.

Speaker notes:

  • Emphasize "translation", not "magic". Mapper asks "what facts do I extract?"; reducer asks "how do I combine them?"
  • Trade-offs: Task 1 = simple but big storage; Task 2 = complex but O(√N) storage; Task 3 = all columns efficiently.
  • This is the mental model students apply to Spark, Hive, and beyond.