Writing MapReduce Programs with mrjob

CS 6500 — Week 3, Session 2

Today's Learning Journey

Session 1 Recap: MapReduce architecture, data flow, shuffle/sort mechanics

Today's Goal: Write, test, and deploy your first MapReduce programs

What You'll Master:

  • mrjob framework (Pythonic MapReduce development)
  • Class-based mapper and reducer patterns
  • Local testing strategies (catch bugs before cluster submission)
  • Cluster job submission and monitoring
  • Debugging techniques for distributed failures

Why mrjob?

The Challenge: Raw Hadoop requires Java OR complex stdin/stdout scripts

The Solution: mrjob = Python library for elegant MapReduce

  • Write MapReduce as Python classes (object-oriented, clean)
  • Same code runs locally, on Hadoop, or on AWS EMR
  • Automatic job configuration and dependency management
  • Built-in testing and debugging tools

Perfect for: Python developers, data scientists, analysts

Bonus: Industry-proven (developed by Yelp for production analytics)

mrjob Philosophy

Key Insight: MapReduce job = Python class with methods

from mrjob.job import MRJob

class MyJob(MRJob):
    def mapper(self, key, value):
        # Your map logic here
        yield new_key, new_value
    
    def reducer(self, key, values):
        # Your reduce logic here
        yield key, aggregated_value

SQL Analogy: Like defining a stored procedure or view

  • mapper = SELECT statement
  • reducer = GROUP BY aggregation

Installation

Install mrjob via pip:

pip install mrjob

That's it! No complex Hadoop configuration needed for local testing

Check installation:

python -c "import mrjob; print(mrjob.__version__)"

Live Coding Demo: Word Count with mrjob

Problem: Count word frequency in text files

mrjob Approach:

  1. Create a class that inherits from MRJob
  2. Define mapper() method to process each line
  3. Define reducer() method to aggregate counts

Let's code wordcount.py together...

Word Count: Mapper

from mrjob.job import MRJob
import re

class WordCount(MRJob):
    def mapper(self, _, line):
        words = re.findall(r'\w+', line.lower())
        for word in words:
            yield word, 1

Emits: ("word", 1) for each word

Word Count: Reducer

    def reducer(self, word, counts):
        yield word, sum(counts)

if __name__ == '__main__':
    WordCount.run()

Total: ~15 lines for complete MapReduce!

mrjob Mapper: Key Details

Method signature:

def mapper(self, _, line):
  • _ = key (unused, None for text files)
  • line = input line value
  • yield = Pythonic generator (cleaner than print)

SQL Analogy: SELECT word, 1 FROM input

mrjob Reducer Pattern

def reducer(self, word, counts):
    yield word, sum(counts)
  • word = key (auto-grouped by framework)
  • counts = iterator of all values for key
  • Framework handles sorting and grouping!

SQL Analogy: GROUP BY word
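The sort-and-group step can be simulated in plain Python to see exactly what the reducer receives — a minimal sketch with no mrjob dependency, where an in-memory dict stands in for Hadoop's shuffle:

```python
import re
from collections import defaultdict

def mapper(line):
    # Emits (word, 1) for each word, like the WordCount mapper
    for word in re.findall(r'\w+', line.lower()):
        yield word, 1

def reducer(word, counts):
    yield word, sum(counts)

# Simulated shuffle/sort: group all mapper outputs by key
groups = defaultdict(list)
for line in ["Hello world", "hello MapReduce"]:
    for key, value in mapper(line):
        groups[key].append(value)

# Each key reaches the reducer exactly once, with all its values
results = dict(pair for key in sorted(groups) for pair in reducer(key, groups[key]))
print(results)  # {'hello': 2, 'mapreduce': 1, 'world': 1}
```

On a real cluster the grouping happens across machines, but the contract is the same: the reducer sees each key once, with an iterator of every value emitted for it.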

Local Testing Workflow

Test before cluster submission:

# Single file
python wordcount.py input.txt

# Multiple files
python wordcount.py file1.txt file2.txt

# From stdin
cat shakespeare.txt | python wordcount.py

Why: Local runs finish in seconds and catch most logic bugs before they cost you a cluster round-trip

SQL Analogy: Test query before production

mrjob Output Formats

Default: Tab-separated, with key and value JSON-encoded

"apple"	5
"banana"	3
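You can reproduce what a default output line looks like with the standard json module — a sketch, assuming mrjob's default JSONProtocol (JSON-encoded key, tab, JSON-encoded value):

```python
import json

def json_protocol_line(key, value):
    # Mimics mrjob's default JSONProtocol output: json(key) <TAB> json(value)
    return f"{json.dumps(key)}\t{json.dumps(value)}"

print(json_protocol_line("apple", 5))   # "apple"	5
print(json_protocol_line("banana", 3))  # "banana"	3
```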

JSON output (set a class-level protocol; note JSONValueProtocol writes only the value, so put everything you need in it):

from mrjob.protocol import JSONValueProtocol

class WordCount(MRJob):
    OUTPUT_PROTOCOL = JSONValueProtocol

    def reducer(self, word, counts):
        yield None, {"word": word, "count": sum(counts)}

Common Local Testing Tips

# Test with sample input
echo "hello world hello" | python wordcount.py

# View top results
python wordcount.py input.txt | sort -t $'\t' -k2 -nr | head -10

Debugging: Add print(f"DEBUG: {var}", file=sys.stderr) to mapper/reducer

Pro Tip: Test with small files first, then scale up
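The stderr trick works because mrjob reserves stdout for the job's (key, value) output; anything printed to stderr is just logging. A standalone sketch (the `noisy_mapper` function here is hypothetical, not part of mrjob):

```python
import sys

def noisy_mapper(line):
    words = line.split()
    # stderr keeps debug output out of the job's output stream on stdout
    print(f"DEBUG: {len(words)} words in line", file=sys.stderr)
    for word in words:
        yield word, 1

pairs = list(noisy_mapper("hello world hello"))
print(pairs)  # [('hello', 1), ('world', 1), ('hello', 1)]
```

On a cluster, those DEBUG lines end up in the per-task stderr logs you can open from the YARN UI.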

Running on Hadoop Cluster: Step 1a

Copy job script to Docker:

# Option 1: Use docker cp
docker cp wordcount.py hadoop-namenode:/tmp/

# Option 2: Copy to shared volume
cp wordcount.py /path/to/docker/share/

Enter container:

docker exec -it hadoop-namenode bash
cd /tmp  # or /shared

Running on Hadoop Cluster: Step 1b

Upload input to HDFS:

# Create directory in HDFS
hdfs dfs -mkdir -p /user/student/wordcount/input

# Upload from local/Docker to HDFS
hdfs dfs -put shakespeare.txt \
  /user/student/wordcount/input/

# Verify
hdfs dfs -ls /user/student/wordcount/input

Running on Hadoop Cluster: Step 2

Submit mrjob to Hadoop:

python wordcount.py \
  -r hadoop \
  hdfs:///user/student/wordcount/input/*

That's it! mrjob uploads your script, configures Hadoop Streaming, and (with no --output-dir) streams the results back to your terminal

Running on Hadoop Cluster: Step 2b

Specify output directory and reducers:

python wordcount.py -r hadoop \
  --output-dir hdfs:///user/student/wordcount/output \
  hdfs:///user/student/wordcount/input/*

Control parallelism (via Hadoop job configuration):

python wordcount.py -r hadoop \
  --jobconf mapreduce.job.reduces=4 \
  hdfs:///user/student/wordcount/input/*

mrjob Runner Modes

Mode     Command                                  Use Case
inline   python job.py file.txt                   Default; single-process testing
local    python job.py -r local file.txt          Multi-process local simulation
hadoop   python job.py -r hadoop hdfs://...       Hadoop cluster
emr      python job.py -r emr s3://...            AWS Elastic MapReduce

Same code, different runners = true portability!

Monitoring Jobs: YARN ResourceManager UI

Access: http://localhost:8088

Key Screens:

  • Applications: Running/completed jobs with status
  • Application Details: Click job ID → see tasks, progress, logs
  • Task Logs: Click individual tasks → view stdout/stderr

What to Monitor:

  • Map/reduce tasks completion percentage
  • Counters (input records, output records, shuffle bytes)
  • Failed tasks (red indicators → click for error logs)

Live Demo: End-to-End mrjob Execution

Watch over my shoulder as we:

  1. Write word count job ✅
  2. Test locally with sample file ✅
  3. Upload data to HDFS ✅
  4. Submit to Hadoop cluster with -r hadoop ✅
  5. Monitor in YARN UI ✅
  6. View results in HDFS ✅

Follow along on your laptops!

Viewing Results

# List output files (mrjob creates part-* files)
hdfs dfs -ls /user/student/wordcount/output

# View results
hdfs dfs -cat /user/student/wordcount/output/part-* | head -20

# Sort by frequency (descending)
hdfs dfs -cat /user/student/wordcount/output/part-* \
  | sort -t $'\t' -k2 -nr | head -20

Expected output: Most frequent words with counts

SQL Analogy: SELECT * FROM results ORDER BY count DESC LIMIT 20

Debugging mrjob Jobs

Job fails immediately?

  • Check Python syntax: python wordcount.py --help
  • Verify mrjob installed: pip show mrjob
  • Check imports: python -c "from mrjob.job import MRJob"

Job runs but produces wrong output?

  • Add debug logging (goes to stderr):
    import sys
    print(f"DEBUG: word={word}", file=sys.stderr)
    
  • View task logs in YARN UI

Job hangs or times out?

  • Check HDFS paths are correct
  • Verify input files exist and are readable
  • Look for exceptions in YARN application logs

Common mrjob Debugging Scenarios

Scenario 1: "ImportError: No module named mrjob"

  • Solution: pip install mrjob (or activate correct virtualenv)

Scenario 2: "Output is empty!"

  • Cause: Forgot yield in mapper/reducer (used return instead)
  • Solution: Always use yield, never return

Scenario 3: "TypeError: reducer() takes 2 positional arguments but 3 were given"

  • Cause: Wrong method signature (forgot self parameter)
  • Solution: Use def reducer(self, key, values):
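Scenario 2 is easy to reproduce outside mrjob: a function that uses return produces a single value and stops, while yield builds a generator of (key, value) pairs — which is what the framework iterates over. A quick standalone check (both mapper names are invented for illustration):

```python
def return_mapper(line):
    return line.split()[0], 1      # a single tuple -- not a stream of pairs

def yield_mapper(line):
    for word in line.split():
        yield word, 1              # a generator: zero or more (key, value) pairs

print(return_mapper("hello world"))        # ('hello', 1)
print(list(yield_mapper("hello world")))   # [('hello', 1), ('world', 1)]
```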

Challenge: Convert SQL to MapReduce

SQL Problem:

SELECT customer_id, SUM(amount) AS total
FROM purchases
GROUP BY customer_id;

Question: How would you implement this in mrjob?

(Think about which part becomes mapper, which becomes reducer...)

Solution: Customer Purchase Analysis

mrjob Version:

class CustomerTotal(MRJob):
    def mapper(self, _, line):
        customer_id, amount = line.split(',')
        yield customer_id, float(amount)
    
    def reducer(self, customer_id, amounts):
        yield customer_id, sum(amounts)

Key Insight: MapReduce = distributed SQL GROUP BY!
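The same GROUP BY logic can be sanity-checked locally with itertools.groupby, which — like Hadoop — requires sorted input before grouping (the sample rows below are invented for illustration):

```python
from itertools import groupby
from operator import itemgetter

lines = ["c1,10.00", "c2,5.50", "c1,2.50"]

# Map phase: emit (customer_id, amount) pairs
pairs = []
for line in lines:
    customer_id, amount = line.split(',')
    pairs.append((customer_id, float(amount)))

# Shuffle/sort phase: Hadoop sorts by key; groupby needs sorted input too
pairs.sort(key=itemgetter(0))

# Reduce phase: sum each customer's amounts
totals = {cid: sum(amt for _, amt in grp)
          for cid, grp in groupby(pairs, key=itemgetter(0))}
print(totals)  # {'c1': 12.5, 'c2': 5.5}
```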

Activity: Pair Programming Challenge

Task: Maximum temperature by city using mrjob

Input data (temps.txt):

2024-01-15,NYC,32
2024-01-15,LA,68
2024-01-16,NYC,28

Goal: Find max temperature per city

Create max_temp.py:

from mrjob.job import MRJob

class MaxTempByCity(MRJob):
    def mapper(self, _, line):
        date, city, temp = line.split(',')
        yield city, int(temp)
    
    def reducer(self, city, temps):
        yield city, max(temps)

if __name__ == '__main__':
    MaxTempByCity.run()

Activity: Solution Review

Discussion: How would you find BOTH min and max?

Answer: Return a dictionary with both values:

def reducer(self, city, temps):
    temps_list = list(temps)
    yield city, {
        "min": min(temps_list), 
        "max": max(temps_list)
    }

Key Insight: mrjob handles serialization; you can yield dicts, lists, or complex objects!
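One subtlety worth flagging: temps is a one-shot iterator, so calling min() and then max() on it directly would exhaust it on the first call; list() materializes the values first. A standalone check of the reducer logic (plain function, no MRJob subclass):

```python
def min_max_reducer(city, temps):
    temps_list = list(temps)   # materialize: an iterator can be consumed only once
    yield city, {"min": min(temps_list), "max": max(temps_list)}

result = dict(min_max_reducer("NYC", iter([32, 28, 35])))
print(result)  # {'NYC': {'min': 28, 'max': 35}}
```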

Challenge: Calculate Average Temperature by City

Scenario: You have temperature readings for multiple cities over a year:

NYC,32
LA,68
NYC,28
LA,72
NYC,35

Question: How would you calculate the average temperature for each city using MapReduce?

(Hint: Why can't you just average the averages?)

Solution: Computing Averages in MapReduce

Key Insight: Emit (sum, count) pairs; let reducer combine them

class CalculateAverage(MRJob):
    def mapper(self, _, line):
        city, temp = line.split(',')
        yield city, (int(temp), 1)

    def reducer(self, city, temp_counts):
        # temp_counts is an iterator: consume it in a single pass
        # (two separate sum() generator expressions would exhaust it)
        total_temp = total_count = 0
        for temp, count in temp_counts:
            total_temp += temp
            total_count += count
        yield city, total_temp / total_count

Result: Correct per-city averages!
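The hint on the previous slide — why can't you just average the averages? — comes down to unequal group sizes. Toy numbers (invented readings) make it concrete:

```python
nyc = [32, 28, 35]   # 3 readings, mean ~31.67
la  = [68, 72]       # 2 readings, mean 70.0

# Average of averages weights both cities equally -- wrong for the overall mean:
avg_of_avgs = (sum(nyc) / len(nyc) + sum(la) / len(la)) / 2

# Combining (sum, count) pairs weights each reading equally -- correct:
pairs = [(sum(nyc), len(nyc)), (sum(la), len(la))]
true_mean = sum(s for s, _ in pairs) / sum(c for _, c in pairs)

print(round(avg_of_avgs, 2), true_mean)  # 50.83 47.0
```

This is exactly why the mapper emits (sum, count) pairs rather than partial averages: sums and counts combine safely; averages do not.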

Speaker context: This session transitions from MapReduce theory to hands-on coding practice using mrjob, a Pythonic library that simplifies MapReduce development. The audience includes students from CS, business, and math backgrounds—all with SQL experience but varying programming comfort levels. We'll leverage their SQL intuition (GROUP BY, aggregation) while introducing mrjob's elegant class-based approach. Emphasize local testing workflows to build confidence before cluster submission.

Speaker notes: Many students with business/math backgrounds are more comfortable with Python than Java. mrjob provides a Pythonic API that feels natural while hiding Hadoop complexity. Emphasize portability—write once, run anywhere.

Speaker notes: Show live installation. Mention that Docker environment has mrjob pre-installed. For students' personal machines, they can use pip install --user if they lack admin rights.

Speaker notes: Live code wordcount.py step-by-step. Start with basic class structure, add mapper, then reducer. Emphasize how much cleaner this is than stdin/stdout scripts. For business students, relate class methods to defining functions in Excel VBA or SQL procedures.

Speaker notes: Live demo YARN UI navigation. Show completed job counters. For business students, relate to SQL query execution plans or ETL pipeline dashboards.

Speaker notes: Critical hands-on demo. Go slowly, explain each step before executing. Pause to let students catch up. Point out expected timing (small file = 30-60 seconds). Show mrjob's progress output in terminal.