HDFS Operations and Programming

Gap	Solution	When
Parallel computation on HDFS blocks	MapReduce — data-local map + shuffle + reduce	Week 3
Python-friendly MapReduce	mrjob — write mappers/reducers as Python classes	Week 3, Session 2
Faster iteration, in-memory compute	Apache Spark — DAG execution over HDFS data	Week 5

Speaker context: This is a hands-on session. Minimize lecture, maximize student time at the terminal. The scavenger hunt activity should occupy ~20 minutes of productive struggle. The goal is that every student leaves having successfully uploaded, inspected, and retrieved a file from HDFS, and having run a Python snippet that touches HDFS programmatically. Confidence at the CLI is a prerequisite for everything we do in Weeks 3–8.

Speaker notes: Expected output:
    Found 3 items
    drwxr-xr-x   - hadoop supergroup          0 2026-01-15 09:00 /data
    drwxr-xr-x   - hadoop supergroup          0 2026-01-15 09:00 /tmp
    drwxr-xr-x   - hadoop supergroup          0 2026-01-15 09:00 /user
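For students who want to work with this listing from Python, the following is a minimal sketch that splits `hdfs dfs -ls` output into fields. The `LsEntry` and `parse_ls` names are illustrative, not part of any library, and the sample text is simply the expected output from the note above.

```python
from typing import List, NamedTuple


class LsEntry(NamedTuple):
    permissions: str
    replication: str  # "-" for directories
    owner: str
    group: str
    size: int
    modified: str
    path: str


def parse_ls(output: str) -> List[LsEntry]:
    """Parse the text printed by `hdfs dfs -ls` into one record per entry."""
    entries = []
    for line in output.splitlines():
        # At most 8 fields; the last one (the path) may contain spaces.
        parts = line.split(None, 7)
        if len(parts) != 8:
            continue  # skips the "Found N items" header and blank lines
        perms, repl, owner, group, size, date, time, path = parts
        entries.append(
            LsEntry(perms, repl, owner, group, int(size), f"{date} {time}", path)
        )
    return entries


SAMPLE = """Found 3 items
drwxr-xr-x   - hadoop supergroup          0 2026-01-15 09:00 /data
drwxr-xr-x   - hadoop supergroup          0 2026-01-15 09:00 /tmp
drwxr-xr-x   - hadoop supergroup          0 2026-01-15 09:00 /user
"""

for entry in parse_ls(SAMPLE):
    print(entry.path)
```

A leading `d` in the permissions field marks a directory, which is why the replication column shows `-` for all three entries.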

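Speaker notes: The HDFS shell has no `cd` and no true current working directory. Relative paths resolve against the user's home directory, /user/<username>, so a bare `hadoop` means /user/<username>/hadoop, not /user/hadoop. Use absolute paths (e.g., /user/hadoop) to avoid ambiguity.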
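One way to make sure every path handed to the CLI is absolute is to expand it explicitly first. The sketch below assumes the usual convention that a user's home directory is /user/<username>; `resolve_hdfs_path` is a hypothetical helper, not an HDFS API.

```python
import posixpath


def resolve_hdfs_path(path: str, user: str) -> str:
    """Return an absolute HDFS path, anchoring relative paths at /user/<user>."""
    home = f"/user/{user}"  # assumed home-directory convention
    if not path:
        return home
    # posixpath.join discards `home` when `path` is already absolute.
    return posixpath.normpath(posixpath.join(home, path))


print(resolve_hdfs_path("hw1/input", "hadoop"))  # /user/hadoop/hw1/input
print(resolve_hdfs_path("/data", "hadoop"))      # /data
```

HDFS paths always use forward slashes, so `posixpath` is used instead of `os.path` to keep the result the same on Windows clients.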

Speaker notes: Expected output shows 3 directories: hw1/backup, hw1/input, hw1/output.

Speaker notes: Pro tip — `-put` will fail if the destination already exists. Add `-f` to overwrite: `hdfs dfs -put -f /tmp/hello.txt /user/student/hello.txt`
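The same upload can be driven from Python by shelling out to the CLI. This is a minimal sketch, assuming the `hdfs` binary is on the PATH of the machine running the script; `put_command` and `upload` are illustrative names.

```python
import subprocess
from typing import List


def put_command(local_path: str, hdfs_path: str, overwrite: bool = False) -> List[str]:
    """Build the argv for `hdfs dfs -put`, adding -f only when overwriting."""
    argv = ["hdfs", "dfs", "-put"]
    if overwrite:
        argv.append("-f")  # replace the destination if it already exists
    return argv + [local_path, hdfs_path]


def upload(local_path: str, hdfs_path: str, overwrite: bool = False) -> bool:
    """Run the upload; return True on success, False if the CLI exits non-zero."""
    result = subprocess.run(
        put_command(local_path, hdfs_path, overwrite),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        print(result.stderr.strip())  # e.g. the "File exists" message
    return result.returncode == 0


print(" ".join(put_command("/tmp/hello.txt", "/user/student/hello.txt", overwrite=True)))
```

Keeping `overwrite=False` as the default mirrors the CLI: replacing an existing file has to be asked for explicitly.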

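Speaker notes: On recent Hadoop releases, `-du` prints two numbers per path: the logical file size and the raw disk space consumed across all replicas. The second is 3× the first because a 10MB file with replication=3 uses 30MB of raw disk across the cluster.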
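The arithmetic behind that note, as a short sketch. `raw_usage_bytes` is an illustrative helper, and the replication factor of 3 is the value used in the note, not something read from a cluster.

```python
MB = 1024 * 1024


def raw_usage_bytes(logical_bytes: int, replication: int = 3) -> int:
    """Raw disk consumed across the cluster: one full copy per replica."""
    if replication < 1:
        raise ValueError("replication must be at least 1")
    return logical_bytes * replication


file_size = 10 * MB
print(raw_usage_bytes(file_size) // MB)  # 30
```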

Speaker notes: Also available: `df.write.orc("hdfs://namenode:9000/user/student/output/data.orc")`. ORC is a columnar format, efficient for analytics.
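A sketch of an ORC round trip, for reference during the demo. It assumes an existing `SparkSession` and DataFrame are passed in; `hdfs_uri` and `write_and_read_orc` are hypothetical helpers, and the default host and port are the `namenode:9000` used in the note above.

```python
def hdfs_uri(path: str, host: str = "namenode", port: int = 9000) -> str:
    """Build a fully qualified HDFS URI from an absolute HDFS path."""
    if not path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    return f"hdfs://{host}:{port}{path}"


def write_and_read_orc(spark, df, path: str):
    """Write df as ORC to HDFS, then read it back as a new DataFrame.

    `spark` is a pyspark.sql.SparkSession and `df` a pyspark.sql.DataFrame;
    both are supplied by the caller, so this module imports nothing from pyspark.
    """
    uri = hdfs_uri(path)
    # mode("overwrite") replaces earlier output instead of raising an error.
    df.write.mode("overwrite").orc(uri)
    return spark.read.orc(uri)


print(hdfs_uri("/user/student/output/data.orc"))
```

Spark writes the result as a directory of part files at that location, so `data.orc` ends up being a directory name rather than a single file.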

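Speaker notes: Save results back: `errors.saveAsTextFile("hdfs://namenode:9000/user/student/errors_only")`. This writes a directory of part files, not a single file, and fails if the directory already exists.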
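The note does not show how `errors` is built. One plausible version, as a sketch: it assumes `errors` holds the lines of a log file that contain the string "ERROR", and that a `SparkContext` is passed in as `sc`. The input path in the usage comment is hypothetical.

```python
def is_error(line: str) -> bool:
    """Assumed filter: keep lines that mention ERROR."""
    return "ERROR" in line


def save_errors(sc, input_uri: str, output_uri: str):
    """Filter a text file on HDFS down to its error lines and save them.

    `sc` is a pyspark.SparkContext supplied by the caller.
    """
    lines = sc.textFile(input_uri)      # one RDD element per line
    errors = lines.filter(is_error)     # lazy; nothing runs yet
    errors.saveAsTextFile(output_uri)   # action: triggers the job
    return errors


# Usage inside a PySpark session (input path is hypothetical):
# save_errors(
#     sc,
#     "hdfs://namenode:9000/user/student/logs/app.log",
#     "hdfs://namenode:9000/user/student/errors_only",
# )
```

Defining the filter as a named function rather than a lambda makes it easy to check on plain strings before running it on the cluster.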

Speaker notes: Verify with `hdfs dfs -ls /user/student/hw1` — should show 3 directories.

CS 6500 — Week 2, Session 2

Session 1 Recap

The Question

Today's Answer

HDFS CLI

Connect to Cluster

Navigation Commands

Make Directories

Uploading Files

Downloading Files

Inspecting Files

Block Health Check

File Operations

Space Commands

Admin Commands

HDFS Scavenger Hunt

Hunt Rules

Hunt Tasks

Hunt Tasks (2)

Hunt Debrief

Python HDFS Access

CLI vs. API

Python Library

Read and Write

DataFrame Upload

DataFrame Reads

File Management

File Management (2)

Live Demo

PySpark and HDFS

PySpark + HDFS

Writing DataFrames

RDDs and HDFS

HDFS Path Patterns

Path Patterns (2)

HDFS Best Practices

File Organization

Format Guide

Small Files Problem

Homework Overview

HDFS Homework

HDFS Homework (2)

HW Walkthrough

Work Time

Session 2 Takeaways

Takeaways (2)

What's Missing?

HDFS Operation Gaps

What Comes Next

Next Week Preview