Week 2, Session 2

HDFS Operations and Programming

CS 6500: Big Data Analytics

Duration: 75 minutes
Format: Live Demonstrations + Hands-On Lab

What We Learned Last Time

Architecture Review:

  • ✓ NameNode: Metadata manager (directory tree, block locations)
  • ✓ DataNodes: Block storage workers
  • ✓ 128MB blocks, replication factor 3
  • ✓ Rack-aware replica placement
  • ✓ Data locality: move computation, not data
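
To make those numbers concrete, here is a quick back-of-the-envelope sketch in plain Python using the defaults we covered (128MB blocks, replication 3):

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024   # HDFS default block size (128 MB)
REPLICATION = 3                  # default replication factor

def block_count(file_bytes, block_size=BLOCK_SIZE):
    """Number of HDFS blocks a file of the given size occupies."""
    return max(1, math.ceil(file_bytes / block_size))

def raw_storage(file_bytes, replication=REPLICATION):
    """Raw cluster bytes consumed once every block is replicated."""
    return file_bytes * replication

one_gb = 1024 ** 3
print(block_count(one_gb))    # 8 blocks
print(raw_storage(one_gb))    # 3221225472 bytes (3 GB of raw cluster storage)
```

A 1GB file therefore costs 3GB of raw disk across the cluster, and even a 1-byte file still occupies one block entry in the NameNode's metadata.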

Today's Focus: How to actually USE this system!

Today's Learning Objectives

By end of class, you will:

  1. Perform basic HDFS file operations via CLI
  2. Navigate HDFS directory structure confidently
  3. Inspect file metadata and block information
  4. Use Python hdfs library for programmatic access
  5. Integrate HDFS with PySpark for data processing
  6. Begin Lab 1 (HDFS operations and analysis)

HDFS Command-Line Interface

Two Command Formats:

  • hadoop fs — Works with any Hadoop-compatible filesystem (HDFS, S3, local)
  • hdfs dfs — HDFS-specific (recommended)

Basic Syntax:

hdfs dfs -<command> <args>

Common Commands (Unix-like):

  • Navigation: -ls, -mkdir, -find (there is no -cd or -pwd: HDFS has no current directory, so always give full paths)
  • File ops: -put, -get, -cp, -mv, -rm
  • Inspection: -cat, -tail, -head, -stat
  • Admin: -du, -df, -setrep, -getmerge
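
Because the syntax is uniform (-&lt;command&gt; &lt;args&gt;), the CLI is easy to script. A minimal Python wrapper, assuming the hdfs binary is on PATH (e.g. inside the namenode container); the helper names are our own:

```python
import subprocess

def hdfs_argv(command, *args):
    """Build the argument vector for an `hdfs dfs` invocation."""
    return ["hdfs", "dfs", f"-{command}", *args]

def hdfs_run(command, *args):
    """Run the command and return its stdout as text.

    Assumes the `hdfs` binary is on PATH (e.g. inside the namenode container).
    """
    result = subprocess.run(hdfs_argv(command, *args),
                            capture_output=True, text=True, check=True)
    return result.stdout

# hdfs_run("ls", "/user/student") mirrors `hdfs dfs -ls /user/student`
print(hdfs_argv("ls", "/user/student"))
```

This is only a convenience for quick scripts; for real applications, use the Python APIs covered later today.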

HDFS Path Conventions

Absolute Paths:

  • Full URI: hdfs://namenode:9000/user/hadoop/data.txt
  • Shorthand: /user/hadoop/data.txt (uses configured default FS)

User Home Directory:

  • Current user home: /user/<username>/
  • Command: hdfs dfs -ls (no path) → lists your home directory

Common Directories:

  • /user/ — User home directories
  • /tmp/ — Temporary files
  • /data/ — Shared datasets (course convention)

Live Demo — Navigating HDFS

# Connect to NameNode container
docker exec -it namenode bash

# List root directory
hdfs dfs -ls /

# Create your directory
hdfs dfs -mkdir -p /user/student

# List your directory
hdfs dfs -ls /user/student

# Check current user
whoami

Expected Output:

  • Root directory shows /tmp, /user, etc.
  • Empty directory for student

Live Demo — File Upload and Download

# Create local test file
echo "Hello HDFS from CS 6500" > /tmp/hello.txt
cat /tmp/hello.txt

# Upload to HDFS
hdfs dfs -put /tmp/hello.txt /user/student/

# Verify upload
hdfs dfs -ls /user/student/
hdfs dfs -cat /user/student/hello.txt

# Download from HDFS
hdfs dfs -get /user/student/hello.txt /tmp/downloaded.txt
cat /tmp/downloaded.txt

Key Points: -put = local → HDFS | -get = HDFS → local | -cat reads from HDFS

Live Demo — File Operations

# Copy within HDFS
hdfs dfs -cp /user/student/hello.txt /user/student/hello_backup.txt

# List to verify
hdfs dfs -ls /user/student/

# Move/rename
hdfs dfs -mv /user/student/hello_backup.txt /user/student/backup.txt

# Remove file
hdfs dfs -rm /user/student/backup.txt

# Remove directory (recursive)
hdfs dfs -mkdir /user/student/temp
hdfs dfs -rm -r /user/student/temp

Live Demo — Inspecting File Metadata

# File stats via format string
hdfs dfs -stat "%n %b %r" /user/student/hello.txt
# Output: filename, size in bytes, replication factor

# Labeled stat output, including block size
hdfs dfs -stat "Name: %n, Size: %b, Replication: %r, Block size: %o" \
  /user/student/hello.txt

# Disk usage
hdfs dfs -du -h /user/student/

# File system capacity
hdfs dfs -df -h

# View the last kilobyte of the file
hdfs dfs -tail /user/student/hello.txt

Live Demo — Advanced Commands

Replication Management:

hdfs dfs -stat "%r" /user/student/hello.txt
hdfs dfs -setrep -w 2 /user/student/hello.txt
hdfs dfs -stat "%r" /user/student/hello.txt

Block Inspection:

hdfs fsck /user/student/hello.txt -files -blocks -locations

Merge Small Files:

hdfs dfs -getmerge /user/student/ /tmp/merged.txt

HDFS Administration Commands

For Reference (Instructor/Admin Use):

  • hdfs dfsadmin -report — Cluster health and capacity
  • hdfs dfsadmin -safemode get — Check if NameNode in safe mode
  • hdfs balancer — Redistribute blocks across DataNodes
  • hdfs fsck / — File system health check

Note: These require admin privileges

In Docker Environment:

docker exec namenode hdfs dfsadmin -report

Activity — HDFS Scavenger Hunt

Pair Challenge (20 minutes)

Work with a partner to complete these tasks:

  1. Discovery: What's the total size of /tmp directory?
  2. Upload: Create a local file with 100 lines, upload to /user/student/yourname/
  3. Inspection: What's the replication factor of any file in /?
  4. Manipulation: Copy a file, change its replication to 2, verify
  5. Cleanup: Remove all files in your directory
  6. Challenge: Use getmerge to combine multiple files
  7. Investigation: Use fsck to find block locations for a file

Prize: First team to complete all → +1% on Lab 1!

Activity Debrief — Solutions

# 1. Total size of /tmp
hdfs dfs -du -s -h /tmp

# 2. Create and upload 100-line file
seq 1 100 > /tmp/lines.txt
hdfs dfs -mkdir -p /user/student/yourname
hdfs dfs -put /tmp/lines.txt /user/student/yourname/

# 3. Check replication
hdfs dfs -stat "%r" /path/to/file

# 4. Copy and change replication
hdfs dfs -cp /user/student/yourname/lines.txt /user/student/yourname/lines_copy.txt
hdfs dfs -setrep -w 2 /user/student/yourname/lines_copy.txt
hdfs dfs -stat "%r" /user/student/yourname/lines_copy.txt

Programmatic HDFS Access

Limitations of CLI:

  • Manual, repetitive operations
  • Hard to integrate with applications
  • No complex logic or workflows

Use Cases for APIs:

  • Automation: Scheduled data ingestion pipelines
  • Integration: Web apps reading/writing HDFS
  • Processing: Complex transformations before upload
  • Monitoring: Custom health checks and alerts

Available APIs: Java, Python (hdfs, pyarrow.hdfs), REST (WebHDFS), PySpark
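
As a taste of the REST option: WebHDFS exposes operations as HTTP requests with the operation passed as a query parameter. A sketch that only builds the request URL (host and port match our Docker setup; the helper name is our own):

```python
from urllib.parse import quote

def webhdfs_url(host, path, op, port=9870, user=None):
    """Build a WebHDFS v1 REST URL (illustrative helper, not a library API)."""
    url = f"http://{host}:{port}/webhdfs/v1{quote(path)}?op={op}"
    if user:
        url += f"&user.name={user}"
    return url

# GET this URL with curl or urllib to list a directory:
print(webhdfs_url("namenode", "/user/student", "LISTSTATUS", user="hadoop"))
# http://namenode:9870/webhdfs/v1/user/student?op=LISTSTATUS&user.name=hadoop
```

Other WebHDFS operations follow the same pattern, e.g. op=GETFILESTATUS and op=OPEN. The Python hdfs library we use next is a thin client over exactly this API.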

Python hdfs Library Overview

Installation:

pip install hdfs

Basic Usage:

from hdfs import InsecureClient

# Connect to NameNode
client = InsecureClient('http://localhost:9870', user='hadoop')

# List directory
files = client.list('/user/student')
print(files)

# Check if file exists
exists = client.status('/user/student/data.txt', strict=False)

# Get file info
info = client.status('/user/student/data.txt')
print(info)  # {'length': 1024, 'replication': 3, ...}

Live Demo — Python HDFS Upload/Download

from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870', user='hadoop')

# Upload file
with open('local_file.txt', 'rb') as local:
    client.write('/user/student/uploaded.txt', local, overwrite=True)

# Download file
with client.read('/user/student/uploaded.txt') as reader:
    content = reader.read()
    print(content.decode('utf-8'))

# Upload from string
data = "Line 1\nLine 2\nLine 3"
client.write('/user/student/text.txt', data.encode('utf-8'), overwrite=True)

# Download to local file
client.download('/user/student/text.txt', 'local_copy.txt', overwrite=True)

Live Demo — Python HDFS with Pandas

from hdfs import InsecureClient
import pandas as pd

client = InsecureClient('http://namenode:9870', user='hadoop')

# Upload DataFrame as CSV
df = pd.DataFrame({
    'id': [1, 2, 3, 4, 5],
    'name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'score': [85, 92, 78, 95, 88]
})

with client.write('/user/student/students.csv', encoding='utf-8') as writer:
    df.to_csv(writer, index=False)

# Read DataFrame from HDFS
with client.read('/user/student/students.csv', encoding='utf-8') as reader:
    df_loaded = pd.read_csv(reader)

print(df_loaded)
print(f"Average score: {df_loaded['score'].mean()}")

Live Demo — PySpark HDFS Operations

from pyspark.sql import SparkSession
from pyspark.sql.functions import split, explode, col

spark = SparkSession.builder.appName("Week2 Demo").master("local[*]").getOrCreate()

# Read text file from HDFS
lines = spark.read.text("/user/student/sample.txt")
print(f"Total lines: {lines.count()}")

# Word count (preview of MapReduce!)
words = lines.select(explode(split(col("value"), " ")).alias("word"))
word_counts = words.groupBy("word").count().orderBy(col("count").desc())

# Show top 10 words
word_counts.show(10)

# Save results to HDFS
word_counts.write.csv("/user/student/word_counts/", mode="overwrite", header=True)
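
For intuition, here is the same word-count pipeline in plain Python on a toy list of lines; Spark distributes exactly these steps across the cluster:

```python
from collections import Counter

lines = ["hello hdfs", "hello spark", "hdfs stores blocks"]  # toy stand-in for the file

# explode(split(col("value"), " ")) -> one word per row
words = [w for line in lines for w in line.split(" ")]

# groupBy("word").count().orderBy(desc) -> Counter + most_common
counts = Counter(words)
for word, n in counts.most_common(3):
    print(word, n)
```

The logic is identical; the difference is that Spark runs the split and count on whichever DataNodes hold the file's blocks (data locality in action).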


HDFS Best Practices

✅ DO:

  • Store large files (100MB+)
  • Use compression (gzip, snappy) for storage efficiency
  • Organize data in directories (e.g., /data/2024/01/logs/)
  • Use consistent naming conventions
  • Set appropriate replication based on importance

❌ DON'T:

  • Store millions of tiny files (< 1MB)
  • Edit files in place (write-once-read-many!)
  • Use HDFS for transactional data
  • Store files without structure (name files meaningfully!)
  • Forget to clean up temporary data
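
Why the small-files rule matters: each file, directory, and block object lives in NameNode heap. A rough estimate using the commonly cited figure of ~150 bytes per object (approximate and version-dependent):

```python
BYTES_PER_OBJECT = 150   # rough rule of thumb, not an exact figure

def namenode_heap_mb(n_files, blocks_per_file=1):
    """Approximate NameNode heap consumed by file metadata, in MB."""
    objects = n_files * (1 + blocks_per_file)   # one file object + its block objects
    return objects * BYTES_PER_OBJECT / (1024 ** 2)

# Ten million tiny 1-block files vs. the same data packed into 1 GB files:
print(round(namenode_heap_mb(10_000_000)))                 # ~2861 MB of heap
print(round(namenode_heap_mb(10_000, blocks_per_file=8)))  # ~13 MB of heap
```

Same data, roughly 200x less NameNode memory: this is why packing small records into large files (or using getmerge) is standard practice.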

Common HDFS Errors and Solutions

Error: "No such file or directory"
→ Check path, use hdfs dfs -ls to verify

Error: "Name node is in safe mode"
→ Wait for cluster initialization, or: hdfs dfsadmin -safemode leave

Error: "Permission denied"
→ Check ownership with -ls, or use docker exec as correct user

Error: "Could not obtain block"
→ DataNode failure, HDFS will retry from replica

Error: "QuotaExceededException"
→ Directory quota exceeded (not common in our setup)

Speaker Notes

Today is hands-on focused. Bring up the Docker environment now. We'll start with live demos, then you practice.

Quick check: any lingering questions on architecture? Theory → practice transition: everything you learned in Session 1 enables what we do today.

Much less lecture today (20 min total); mostly demos and hands-on practice. Lab 1 starts in class, due Sunday.

Very similar to Unix commands (by design), so most Unix users feel at home quickly. Key difference: HDFS paths start with / (absolute paths).

Unlike a local FS, HDFS has no concept of a "current directory"; all paths are relative to the HDFS root /. In our Docker setup, the default FS is hdfs://namenode:9000.

Perform the live demo with an actual Docker container and narrate what you're doing. Show the empty output for the new directory. Address any Docker connection issues on the spot.

Execute each command and show the output. Point out file sizes in the ls output. Show that -cat doesn't require a download. Students follow along in their terminals.

Show actual output after each command. Emphasize: -rm is permanent (no trash by default), and -r is for recursive delete (use carefully!). These operations are metadata-only (NameNode).

Run each command and explain the output. -stat format strings: %n name, %b bytes, %r replication, %o block size. -du shows space used (accounts for replication); -df shows cluster-wide capacity.

Demo setrep and fsck with actual output. The -w flag waits for replication to complete. fsck shows which DataNodes hold the blocks (useful for debugging). getmerge is useful when MapReduce creates many part files.

Don't spend time on this (reference only); it matters for operations/DevOps roles. We'll use -report to check cluster health in labs.

Set a timer for 20 minutes and circulate to help teams. Hint for #2: use a for loop or seq to generate lines. Announce winners, then briefly review solutions.

Ask the winning team to share their approach and highlight creative solutions. Common mistakes: forgetting -r for directories, not using -w with setrep.

The CLI is great for exploration, bad for automation. Production pipelines use programmatic access. We'll focus on Python (most accessible).

InsecureClient = no Kerberos authentication (okay for development). Port 9870 = NameNode WebHDFS port. strict=False prevents an exception if the file doesn't exist.

Run in a Jupyter notebook or Python REPL and show the actual upload/download happening. overwrite=True replaces an existing file. Use binary mode (rb, wb) for non-text files.

Execute in Jupyter and show the DataFrame output. Common pattern: process data locally, upload results; or download from HDFS, analyze, re-upload. For large datasets, use PySpark instead (next demo).

PySpark HDFS Integration

Why PySpark for HDFS?

  • Native integration (no manual client setup)
  • Distributed processing (can't do that with Pandas!)
  • Lazy evaluation (optimized query plans)
  • Handles large datasets (TBs) that don't fit in memory

Basic Spark Read/Write:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HDFS Demo").getOrCreate()

# Read CSV from HDFS
df = spark.read.csv("hdfs://namenode:9000/user/student/students.csv",
                    header=True, inferSchema=True)

# Process and write back
df.filter(df.score > 85).write.csv(
    "hdfs://namenode:9000/user/student/high_scores/", mode="overwrite")

Spark automatically detects HDFS from configuration; you can use the short path /user/student/... if the default FS is configured. write creates a directory containing part files (distributed output).

Run in a Jupyter notebook and show the word count results. Navigate to HDFS to show the output directory structure. Preview: Week 3 covers MapReduce (same logic, different API).

These are lessons learned from production experience. The small-files problem will bite you (NameNode memory!). Assignment grading: points off for bad naming conventions.

Safe mode happens on cluster startup (normal). Permission errors are common in the Docker setup (user mismatch). Block errors are rare (automatic recovery). Keep this slide handy for lab troubleshooting!