Today is hands-on focused
Bring up Docker environment now
We'll start with live demos, then you practice
Quick check: Any lingering questions on architecture?
Theory → Practice transition
Everything you learned in Session 1 enables what we do today
Much less lecture today (20 min total)
Mostly demos and hands-on practice
Lab 1 starts in class, due Sunday
Very similar to Unix commands (by design)
Most Unix users feel at home quickly
Key difference: HDFS paths are absolute (they start with /)
Unlike a local FS, HDFS has no concept of a "current directory"
Every path is resolved against the HDFS root /
In our Docker setup: default FS is hdfs://namenode:9000
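To make the "absolute paths only" point concrete, a quick sketch (plain Python, no cluster needed) of how a full HDFS URI decomposes: the short form /user/student/... is just the path component, with the default FS (hdfs://namenode:9000 in our Docker setup) implied. The file name is illustrative.

```python
from urllib.parse import urlparse

# Full URI form, as used in our Docker setup (file name illustrative)
uri = urlparse("hdfs://namenode:9000/user/student/data.csv")

print(uri.scheme)   # hdfs
print(uri.netloc)   # namenode:9000  (the default FS authority)
print(uri.path)     # /user/student/data.csv  (always absolute; no "current directory")
```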
Perform live demo with actual Docker container
Narrate what you're doing
Show empty output for new directory
Address any Docker connection issues on the spot
Execute each command, show output
Point out file sizes in ls output
Show that -cat doesn't require download
Students follow along in their terminals
Show actual output after each command
Emphasize: -rm is permanent (no trash by default)
-r for recursive delete (use carefully!)
These operations are metadata-only (NameNode)
Run each command, explain output
-stat format strings: %n name, %b bytes, %r replication, %o block size
-du shows logical size and disk space consumed (all replicas)
-df shows cluster-wide capacity
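The -du numbers click once the arithmetic is spelled out, so a quick sketch (plain Python, illustrative numbers) of logical size vs. space consumed under replication:

```python
# Illustrative numbers: one 128 MiB file stored with replication factor 3
logical_bytes = 128 * 1024 * 1024    # what -du's first column reports
replication = 3                      # per-file setting (changeable via -setrep)

# Space consumed across all DataNodes (what -du's second column reflects)
physical_bytes = logical_bytes * replication
print(physical_bytes // (1024 * 1024), "MiB on disk")  # 384 MiB
```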
Demo setrep and fsck with actual output
-w flag waits for replication to complete
fsck shows which DataNodes have blocks (useful for debugging)
getmerge useful when MapReduce creates many part files
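What getmerge does is easy to demystify locally: it concatenates the part files in sorted name order into one file. A sketch in plain Python (directory and file names hypothetical) of that behavior:

```python
import pathlib
import tempfile

# Simulate a MapReduce output directory with part files (names hypothetical)
out = pathlib.Path(tempfile.mkdtemp())
(out / "part-00000").write_text("alpha\n")
(out / "part-00001").write_text("beta\n")

# getmerge = concatenate parts in sorted order into one local file
merged = "".join(p.read_text() for p in sorted(out.glob("part-*")))
(out / "merged.txt").write_text(merged)
print(merged)
```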
Don't spend time on this (reference only)
Important for operations/DevOps roles
We'll use -report to check cluster health in labs
Set timer for 20 minutes
Circulate to help teams
Hint for #2: Use a for loop or seq to generate lines
Announce winners, briefly review solutions
Ask winning team to share their approach
Highlight creative solutions
Common mistakes: forgetting -r for directories, not using -w with setrep
CLI is great for exploration, bad for automation
Production pipelines use programmatic access
We'll focus on Python (most accessible)
InsecureClient = no Kerberos authentication (okay for development)
Port 9870 = NameNode WebHDFS port
strict=False prevents exception if file doesn't exist
Run in Jupyter notebook or Python REPL
Show actual upload/download happening
overwrite=True replaces existing file
Binary mode (rb, wb) for non-text files
Execute in Jupyter, show DataFrame output
Common pattern: process data locally, upload results
Or: download from HDFS, analyze, re-upload
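That round-trip pattern can be captured in a small helper. A sketch only: the helper name and paths are hypothetical, and `client` is anything exposing the HdfsCLI InsecureClient interface (status/upload), which keeps it testable without a cluster.

```python
def push_results(client, local_path, hdfs_path):
    """Upload a locally produced result file, replacing any previous run.

    `client` follows the HdfsCLI interface: status(path, strict=False)
    returns None for a missing path instead of raising.
    """
    previous = client.status(hdfs_path, strict=False)
    # overwrite=True replaces an existing file instead of raising an error
    client.upload(hdfs_path, local_path, overwrite=True)
    return previous is not None   # True if an earlier result was replaced
```

In the labs, `client` would be InsecureClient("http://namenode:9870", user="student") from the hdfs package.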
For large datasets, use PySpark instead (next demo)
# PySpark HDFS Integration
**Why PySpark for HDFS?**
- Native integration (no manual client setup)
- Distributed processing (can't do that with Pandas!)
- Lazy evaluation (optimized query plans)
- Handles large datasets (TBs) that don't fit in memory
**Basic Spark Read/Write:**
```python
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HDFS Demo").getOrCreate()
# Read CSV from HDFS
df = spark.read.csv("hdfs://namenode:9000/user/student/students.csv",
                    header=True, inferSchema=True)
# Process and write back
df.filter(df.score > 85).write.csv(
    "hdfs://namenode:9000/user/student/high_scores/", mode="overwrite")
```
<!--
Spark automatically detects HDFS from configuration
Can use short path: /user/student/... if default FS configured
write creates directory with part files (distributed output)
Run in Jupyter notebook
Show word count results
Navigate to HDFS to show output directory structure
Preview: Week 3 covers MapReduce (same logic, different API)
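For the Week 3 preview, the word-count logic in plain, dependency-free Python (a sketch of the same map → group → reduce shape Spark runs distributed; sample lines are illustrative):

```python
from collections import Counter

lines = ["hdfs stores big data", "spark reads hdfs data"]  # illustrative input

# map: split each line into words; reduce: count occurrences per word
counts = Counter(word for line in lines for word in line.split())
print(counts["hdfs"])  # 2
print(counts["data"])  # 2
```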
These lessons come from production experience
Small files problem will bite you (NameNode memory!)
Assignment grading: points off for bad naming conventions
Safe mode happens on cluster startup (normal)
Permission errors common in Docker setup (user mismatch)
Block errors are rare (automatic recovery)
Keep this slide handy for lab troubleshooting!
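The small-files warning is worth quantifying. A back-of-the-envelope sketch using the common rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block); exact costs vary by Hadoop version, so treat the numbers as illustrative:

```python
# Rule of thumb (approximate): ~150 bytes of NameNode heap per namespace object
BYTES_PER_OBJECT = 150

def namenode_heap_bytes(num_files, blocks_per_file=1):
    # each file costs one file object plus one object per block
    objects = num_files * (1 + blocks_per_file)
    return objects * BYTES_PER_OBJECT

# 10 million tiny one-block files vs. 10 thousand larger 8-block files
print(namenode_heap_bytes(10_000_000) / 1e9, "GB of heap")   # ~3.0 GB
print(namenode_heap_bytes(10_000, blocks_per_file=8) / 1e6, "MB of heap")  # ~13.5 MB
```

Same data, three orders of magnitude less NameNode memory, which is why consolidating small files matters.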