HDFS Operations and Programming

CS 6500 — Week 2, Session 2

CS 6500 — Big Data Analytics | Week 2

Session 1 Recap

Architecture foundations we covered:

  • NameNode: metadata master (namespace + block map in RAM)
  • DataNodes: block storage workers, heartbeat to NameNode
  • Block size 128MB: balances NameNode memory and parallelism
  • Replication factor 3 + rack awareness: fault tolerance
  • Data locality: move computation to data, not data to computation
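
To make the block-size and replication numbers concrete, here is a small sketch of the arithmetic. The function name `block_layout` and the 300 MB example file are illustrative assumptions, not part of the course materials; it models only sizes, not where HDFS actually places replicas.

```python
BLOCK_SIZE_MB = 128   # block size from the recap above
REPLICATION = 3       # replication factor from the recap above


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (block sizes in MB, total block replicas, raw MB stored cluster-wide)."""
    full_blocks, remainder = divmod(file_size_mb, block_size_mb)
    sizes = [block_size_mb] * full_blocks
    if remainder:
        # The last block holds only the leftover data, not a full block.
        sizes.append(remainder)
    return sizes, len(sizes) * replication, file_size_mb * replication


sizes, replicas, raw_mb = block_layout(300)
print(sizes)     # [128, 128, 44]
print(replicas)  # 9
print(raw_mb)    # 900
```

Each block is an independent unit of parallel work, and each replica is an independent copy that can be lost without losing data.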

Today: Stop talking about HDFS — start using it.

The Question

"You have 900TB of web server logs on HDFS. A colleague claims you can query any file from any node in the cluster. Another says touching data from the wrong node will make your job ten times slower. Who is right — and how would you know before running the job?"

Today's Answer

Both are right — and the difference is data locality.

HDFS CLI: what you can do
Navigate, upload, download, inspect, replicate, and administer files from the terminal — the foundation for every job you'll run this semester

Programmatic access: why it matters
Automate everything: the Python hdfs library lets you build pipelines that skip the terminal entirely; PySpark reads HDFS paths natively and schedules tasks close to the data where it can

You leave this session with CLI confidence and a notebook that reads and writes HDFS data without leaving Python.

HDFS CLI

hdfs dfs — your primary tool for interacting with the cluster

Connect to Cluster

All HDFS commands go through the NameNode container:

# Shell into the NameNode
docker exec -it namenode bash

# Verify HDFS is running
hdfs dfs -ls /

You should see /data, /tmp, and /user listed.

Note: hdfs dfs vs hadoop fs: both work here; hadoop fs is the generic form for any Hadoop-supported filesystem, while hdfs dfs targets HDFS.

Navigation Commands

# List directory contents
hdfs dfs -ls /
hdfs dfs -ls /user

# List with human-readable sizes
hdfs dfs -ls -h /data

# Recursive listing
hdfs dfs -ls -R /user

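HDFS has no cd and no working directory. Relative paths resolve against your HDFS home directory (/user/<username>), so include the full absolute path every time to avoid surprises.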

Make Directories

# Create a single directory
hdfs dfs -mkdir /user/student

# Create a directory and all parents (like mkdir -p)
hdfs dfs -mkdir -p /user/student/hw1/input
hdfs dfs -mkdir -p /user/student/hw1/output
hdfs dfs -mkdir -p /user/student/hw1/backup

# Verify
hdfs dfs -ls /user/student/hw1

Uploading Files

# Create a test file locally
echo "Hello HDFS" > /tmp/hello.txt

# Upload to HDFS
hdfs dfs -put /tmp/hello.txt /user/student/

# Equivalent: copyFromLocal (more explicit name)
hdfs dfs -copyFromLocal /tmp/hello.txt /user/student/hello2.txt

# Verify upload and view content
hdfs dfs -ls /user/student/
hdfs dfs -cat /user/student/hello.txt

Downloading Files

# Download from HDFS to local filesystem
hdfs dfs -get /user/student/hello.txt /tmp/downloaded.txt

# Verify
cat /tmp/downloaded.txt

# getmerge: combine multiple HDFS files into one local file
# (Very useful for MapReduce output — splits across many part-* files)
hdfs dfs -getmerge /user/student/output/ /tmp/combined_results.txt

Why -getmerge? MapReduce jobs write output as part-00000, part-00001, etc.
-getmerge combines them into a single file in the correct order.
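
As a rough local illustration of what the merge amounts to, the sketch below concatenates part files in sorted filename order. It is a stand-in that runs on the local filesystem only and does not call HDFS; the helper name `merge_parts` and the demo file contents are made up for this example.

```python
import os
import tempfile


def merge_parts(src_dir, dest_path):
    """Concatenate every file in src_dir into dest_path, in sorted filename order."""
    names = sorted(
        n for n in os.listdir(src_dir) if os.path.isfile(os.path.join(src_dir, n))
    )
    with open(dest_path, "wb") as out:
        for name in names:
            with open(os.path.join(src_dir, name), "rb") as part:
                out.write(part.read())
    return names


# Demo: three fake part files, deliberately created out of order.
workdir = tempfile.mkdtemp()
parts_dir = os.path.join(workdir, "output")
os.mkdir(parts_dir)
for name, text in [("part-00002", "c\n"), ("part-00000", "a\n"), ("part-00001", "b\n")]:
    with open(os.path.join(parts_dir, name), "w") as f:
        f.write(text)

merged_path = os.path.join(workdir, "combined_results.txt")
order = merge_parts(parts_dir, merged_path)
with open(merged_path) as f:
    merged_text = f.read()
print(order)
print(repr(merged_text))
```

Because the zero-padded part numbers sort correctly as strings, filename order and logical order coincide.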

Inspecting Files

# Print file contents (like cat)
hdfs dfs -cat /user/student/hello.txt

# Print the last kilobyte of the file (similar to tail, but sized in bytes, not lines)
hdfs dfs -tail /user/student/large_file.txt

# File stats: %n=name, %b=bytes, %r=replication, %o=block size
hdfs dfs -stat "%n  size:%b  repl:%r  block:%o" /user/student/hello.txt

Block Health Check

# Check file integrity and block locations
hdfs fsck /user/student/hello.txt -files -blocks -locations

hdfs fsck output (sample):

/user/student/hello.txt: Under replicated BP...
 Total size: 11 B
 Total blocks: 1 (avg block size 11 B)
 No. of blocks: 1
Status: HEALTHY

File Operations

# Copy within HDFS
hdfs dfs -cp /user/student/hello.txt /user/student/hello_backup.txt

# Move/rename within HDFS
hdfs dfs -mv /user/student/hello_backup.txt /user/student/hw1/backup/

# Remove a file (moved to Trash if trash is enabled; otherwise deleted immediately)
hdfs dfs -rm /user/student/hello.txt

# Remove a directory and all contents (use with care!)
hdfs dfs -rm -r /user/student/old_data/

HDFS Trash: When trash is enabled (fs.trash.interval > 0, in minutes), deleted files move to /user/<name>/.Trash/ and are permanently deleted after that interval (often set to 24h). Add -skipTrash to delete immediately.

Space Commands

# Disk usage (how much space a directory uses)
hdfs dfs -du /user/student/
hdfs dfs -du -h /user/student/         # human-readable
hdfs dfs -du -s /user/student/         # summary (total only)

# Disk free (cluster-wide storage)
hdfs dfs -df -h /

# Change replication factor
hdfs dfs -setrep 2 /user/student/hello.txt
hdfs dfs -setrep -w 2 /user/student/hello.txt   # wait for completion

Admin Commands

# Full cluster health check
hdfs fsck / -files -blocks

# Cluster capacity and per-DataNode status report
hdfs dfsadmin -report

# Safe mode (cluster read-only for maintenance)
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave

Web UI: NameNode dashboard at http://localhost:9870 — block reports, DataNode status, live metrics.

HDFS Scavenger Hunt

20 minutes — hands-on CLI challenge

Hunt Rules

Work in pairs. First team to complete all tasks wins +1% on homework.

Open a shell to the NameNode:

docker exec -it namenode bash

Tasks on the next slide — go!

Hunt Tasks

  1. Discovery: Find the total size of everything under /user/ (hint: -du -s)
  2. Setup: Create /user/student/yourname/ with 3 subdirectories
  3. Upload: Generate a 1000-line file locally (seq 1 1000 > /tmp/numbers.txt) and upload it
  4. Inspect: What is the replication factor of your uploaded file? What DataNodes hold it?
  5. Replication: Change replication to 2, wait for completion, verify with fsck

Hunt Tasks (2)

  6. View: Use -tail to see the end of your file (it prints the last 1KB; pipe it through tail -n 10 for exactly 10 lines)
  7. Merge: Create 3 small files and use -getmerge to combine them locally
  8. Cleanup: Delete everything you created in /user/student/yourname/

Hunt Debrief

Common discoveries:

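  • Small files do not fill a whole 128MB block on disk; each replica uses only the file's actual size. The real cost is NameNode memory: every file and every block is a metadata entry, however small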
  • hdfs fsck shows exactly which DataNodes hold your blocks
  • -setrep -w can take a minute — replication happens in the background
  • -getmerge order is alphabetical by filename — important for part-0000N output files

Discussion: Why did we set replication to 2? When would you do this in production?

(Answer: testing environments, intermediate data, cost reduction when durability is less critical)

Python HDFS Access

Python and PySpark APIs for automation and integration

CLI vs. API

The CLI is great for:

  • Manual exploration and debugging
  • One-off data transfers
  • Administrative tasks

APIs are necessary for:

  • Automated pipelines (ETL workflows)
  • Application integration (upload user data directly)
  • Complex conditional logic (upload only if file doesn't exist)
  • Testing and validation in CI/CD pipelines

Python Library

Already in Docker Jupyter — connect via WebHDFS:

from hdfs import InsecureClient

# Connect via WebHDFS (port 9870 in our Docker setup)
client = InsecureClient('http://namenode:9870', user='hadoop')

# List a directory
print(client.list('/user/student'))
# → ['hello.txt', 'hw1']

The hdfs library uses the WebHDFS REST interface — no Java required.

Read and Write

from hdfs import InsecureClient
client = InsecureClient('http://namenode:9870', user='hadoop')

# Write a file to HDFS (overwrite=True so re-running the cell does not fail on an existing file)
with client.write('/user/student/data.csv', encoding='utf-8', overwrite=True) as writer:
    writer.write("id,value\n1,alpha\n2,beta\n3,gamma\n")

# Read a file from HDFS
with client.read('/user/student/data.csv', encoding='utf-8') as reader:
    content = reader.read()
    print(content)

DataFrame Upload

import pandas as pd
from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870', user='hadoop')

# Upload a Pandas DataFrame directly to HDFS (overwrite=True so the cell can be re-run)
df = pd.DataFrame({'id': [1, 2, 3], 'value': ['alpha', 'beta', 'gamma']})
with client.write('/user/student/df.csv', encoding='utf-8', overwrite=True) as writer:
    df.to_csv(writer, index=False)

DataFrame Reads

from hdfs import InsecureClient
import pandas as pd

client = InsecureClient('http://namenode:9870', user='hadoop')

# Read CSV from HDFS into Pandas DataFrame
with client.read('/user/student/df.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader)

print(df.shape)        # (3, 2)
print(df.columns.tolist())  # ['id', 'value']

File Management

from hdfs import InsecureClient

client = InsecureClient('http://namenode:9870', user='hadoop')

# Create a directory
client.makedirs('/user/student/pipeline_output')

# Check if file exists (strict=False returns None instead of raising when the path is missing)
if client.status('/user/student/data.csv', strict=False):
    print("File exists")
else:
    print("File not found")
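
The existence check above is the building block for "upload only if the file doesn't exist". A minimal sketch follows, assuming a client that exposes `status(path, strict=False)` and `upload(hdfs_path, local_path)` as the hdfs library's clients do. The helper name `upload_if_missing` and the `FakeClient` stand-in are hypothetical, included only so the logic can be tried without a cluster.

```python
def upload_if_missing(client, local_path, hdfs_path):
    """Upload local_path to hdfs_path unless something already exists there.

    Returns True if an upload happened, False if it was skipped.
    """
    # With strict=False, status() returns None for a missing path instead of raising.
    if client.status(hdfs_path, strict=False) is not None:
        return False
    client.upload(hdfs_path, local_path)
    return True


class FakeClient:
    """In-memory stand-in (hypothetical) that mimics the two calls used above."""

    def __init__(self):
        self.files = {}

    def status(self, hdfs_path, strict=True):
        if hdfs_path in self.files:
            return {"type": "FILE"}
        if strict:
            raise FileNotFoundError(hdfs_path)
        return None

    def upload(self, hdfs_path, local_path):
        self.files[hdfs_path] = local_path


fake = FakeClient()
first = upload_if_missing(fake, "/tmp/numbers.txt", "/user/student/numbers.txt")
second = upload_if_missing(fake, "/tmp/numbers.txt", "/user/student/numbers.txt")
print(first, second)  # True False
```

In the notebook you would pass the real `client` instead of `fake`. Note that check-then-upload is not atomic, so two pipelines racing on the same path could still collide.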

File Management (2)

# Get file metadata
status = client.status('/user/student/data.csv')
print(f"Size: {status['length']} bytes")
print(f"Replication: {status['replication']}")
print(f"Block size: {status['blockSize']}")

# Delete a file
client.delete('/user/student/old_data.csv')

Live Demo

Open Jupyter: http://localhost:8888

Demo notebook flow:

  1. Connect to HDFS NameNode
  2. Create a directory structure
  3. Generate sample log data (10,000 lines)
  4. Upload via client.write()
  5. Read back, parse HTTP status codes, count each
  6. Save results as CSV to HDFS
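
Steps 3, 5, and 6 can be sketched in plain Python before any HDFS call is involved. The log format, the helper names, and the status-code mix below are invented for illustration and may differ from the demo notebook; the resulting `csv_text` is what you would hand to client.write().

```python
import random
from collections import Counter


def make_log_lines(n, seed=0):
    """Generate n fake access-log lines in a simplified, made-up format."""
    rng = random.Random(seed)
    statuses = [200, 200, 200, 301, 404, 500]  # skewed toward 200 on purpose
    paths = ["/", "/index.html", "/api/items", "/login"]
    return [
        f"10.0.0.{rng.randint(1, 254)} GET {rng.choice(paths)} {rng.choice(statuses)}"
        for _ in range(n)
    ]


def count_status_codes(lines):
    """Count the last space-separated field of each non-empty line (the status code)."""
    return Counter(line.rsplit(" ", 1)[-1] for line in lines if line.strip())


lines = make_log_lines(10_000)
counts = count_status_codes(lines)

# Build the CSV text for step 6, one row per status code.
csv_text = "status,count\n" + "".join(
    f"{code},{total}\n" for code, total in sorted(counts.items())
)
print(csv_text)
```

Keeping generation and parsing as pure functions makes them easy to test locally; only the upload and read-back steps need the cluster.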

Follow along in your own notebook.

PySpark and HDFS

Reading and writing DataFrames at scale

PySpark + HDFS

PySpark reads/writes HDFS natively — no separate library needed:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("HDFS Demo").getOrCreate()

# Read CSV from HDFS (namenode:9000 = RPC port, not 9870)
df = spark.read.csv("hdfs://namenode:9000/user/student/data.csv",
                    header=True, inferSchema=True)
df.show()

Writing DataFrames

# Write DataFrame back to HDFS as Parquet
df.write.parquet("hdfs://namenode:9000/user/student/output/data.parquet")

# Write as CSV with header
df.write.csv("hdfs://namenode:9000/user/student/output/data_csv",
             header=True, mode="overwrite")

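HDFS output: Each Spark partition writes one file (e.g., part-00000-abc123.csv), so the output path is a directory. For a single file, call coalesce(1) before writing, or use -getmerge for text formats such as CSV (each part repeats the header; concatenating Parquet parts does not produce a valid Parquet file).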

RDDs and HDFS

# Read text file as RDD (one element per line)
rdd = spark.sparkContext.textFile(
    "hdfs://namenode:9000/user/student/logs.txt"
)
print(f"Total lines: {rdd.count()}")

# Filter for ERROR lines
errors = rdd.filter(lambda line: "ERROR" in line)
print(f"Error lines: {errors.count()}")

HDFS Path Patterns

# Read all CSV files in a directory
df = spark.read.csv("hdfs://namenode:9000/user/student/data/*.csv",
                    header=True)

# Read with glob patterns (set the basePath option if you need year/month as columns)
df = spark.read.parquet(
    "hdfs://namenode:9000/data/logs/year=2025/month=*/")

Path Patterns (2)

# Read multiple specific paths
df = spark.read.text([
    "hdfs://namenode:9000/data/jan.log",
    "hdfs://namenode:9000/data/feb.log",
    "hdfs://namenode:9000/data/mar.log"
])

# Default HDFS (configured in core-site.xml)
df = spark.read.csv("/user/student/data.csv")

HDFS Best Practices

Design patterns for efficient use of distributed storage

File Organization

/data/raw/
  access_logs/
    year=2025/month=01/    ← partition directories
    year=2025/month=02/
/user/student/hw1/input/
/tmp/                      ← temporary intermediate data

Partition directories (Hive-style: key=value) enable partition pruning in Spark — huge performance win for filtering.
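
Pruning works because the filter values are encoded in the directory names, so an engine can skip whole directories without opening a file. The sketch below builds and parses such paths; the helper names `partition_dir` and `parse_partition` are hypothetical, and this is an illustration of the naming convention rather than how Spark implements partition discovery.

```python
from datetime import date


def partition_dir(base, day):
    """Build a Hive-style year=/month= directory for the given date."""
    return f"{base.rstrip('/')}/year={day.year}/month={day.month:02d}/"


def parse_partition(path):
    """Extract key=value path segments into a dict of strings."""
    return dict(seg.split("=", 1) for seg in path.split("/") if "=" in seg)


p = partition_dir("/data/raw/access_logs", date(2025, 2, 14))
print(p)                   # /data/raw/access_logs/year=2025/month=02/
print(parse_partition(p))  # {'year': '2025', 'month': '02'}
```

Zero-padding the month keeps directory listings in chronological order when sorted as strings.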

Format Guide

Use for HDFS:

  • Parquet — columnar, compressed, splittable; best for Spark analytics
  • ORC — columnar, optimized for Hive; similar to Parquet
  • Avro — row-based, great for schema evolution; good for streaming ingestion
  • Text/CSV — human-readable, slow; only for small datasets

Avoid on HDFS:

  • ZIP/tar — not splittable (1 mapper processes entire archive)
  • Gzip CSV — not splittable (same problem)
  • Many tiny files — NameNode OOM, poor parallelism

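Splittable compression: Bzip2 is splittable on its own, and LZO is splittable once indexed. Snappy is not splittable as a raw stream, but it is processed in parallel when used inside block-based containers such as Parquet, ORC, Avro, or Sequence Files.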

Small Files Problem

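Problem: 1 million 1KB files ≈ 150MB of NameNode RAM for the file entries alone (about 300MB once each file's block entry is counted, at roughly 150 bytes per object) + slow MapReduce (by default, one map task per file)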
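
The memory figure comes from a widely quoted rule of thumb of roughly 150 bytes of NameNode heap per namespace object (file or block). The sketch below works through that arithmetic; the constant is an approximation, and the function name `namenode_bytes` is made up for this example.

```python
BYTES_PER_OBJECT = 150  # rule-of-thumb estimate, not an exact figure


def namenode_bytes(num_files, blocks_per_file=1, count_blocks=True):
    """Rough NameNode heap estimate: one object per file plus one per block."""
    objects_per_file = 1 + (blocks_per_file if count_blocks else 0)
    return num_files * objects_per_file * BYTES_PER_OBJECT


files_only = namenode_bytes(1_000_000, count_blocks=False)
files_and_blocks = namenode_bytes(1_000_000)
one_big_file = namenode_bytes(1, blocks_per_file=8)  # ~1GB stored as 8 x 128MB blocks

print(files_only / 1e6)        # 150.0 (MB, file entries only)
print(files_and_blocks / 1e6)  # 300.0 (MB, file + block entries)
print(one_big_file)            # 1350 (bytes)
```

The same gigabyte of data costs the NameNode hundreds of megabytes as a million tiny files but only about a kilobyte as one large file, which is why the solutions below all consolidate.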

Solution 1: HAR (Hadoop Archive)

hadoop archive -archiveName logs.har \
  -p /user/student/small_files/ /user/student/

Solution 2: Sequence Files — bundle many key-value pairs into one binary file; splittable and indexed

Solution 3: Consolidate at ingestion — batch-merge into large HDFS files hourly via Kafka/S3

Homework Overview

What you'll build this week

HDFS Homework

Three parts — due Sunday 11:59 PM

Part 1: Command-Line Operations (40 pts)

  • Build a directory structure in HDFS
  • Upload a 100MB log dataset
  • Run a replication experiment (change factor, verify with fsck)
  • File manipulation (copy, rename, tail)
  • Space analysis (why does -du show 3× the file size?)

HDFS Homework (2)

Part 2: Programmatic Access (30 pts)

  • Python hdfs library: list, upload, parse, save results
  • PySpark: load file as RDD, count lines, filter errors, save output

Part 3: Analysis Questions (30 pts)

  • Draw a block diagram for a 512MB file with replication=3
  • Failure scenario analysis
  • Why HDFS is wrong for tiny files and random access

HW Walkthrough

Let's do Part 1 Task 1 together:

# Enter NameNode container
docker exec -it namenode bash

# Create the directory structure
hdfs dfs -mkdir -p /user/student/hw1/input
hdfs dfs -mkdir -p /user/student/hw1/output

Your turn: Complete Tasks 2–5 using the commands from today's session.

Submission: ZIP containing commands.txt, hw1_notebook.ipynb, analysis.pdf

Work Time

Use the remaining time to begin Part 1.

I'm circulating — flag me if you hit:

  • Docker container not running (docker compose up -d)
  • Permission errors (docker commands may need sudo; HDFS permission errors depend on your HDFS user, which sudo does not change)
  • NameNode in safe mode (hdfs dfsadmin -safemode leave)

# Verify all containers are running
docker compose ps

Session 2 Takeaways

HDFS CLI skills you now have:

  • Navigate, create, upload, download, inspect, copy, delete
  • Check replication with fsck, modify with -setrep
  • Measure space with -du and -df

Takeaways (2)

Programmatic access:

  • Python hdfs library: connect via WebHDFS, read/write files, manage directories
  • PySpark: native HDFS paths (hdfs://namenode:9000/...), read/write DataFrames and RDDs

Design principles:

  • Large files > many small files
  • Columnar formats (Parquet, ORC) for analytics
  • Partition directories for query performance

What's Missing?

HDFS stores and serves data — but it cannot perform any computation on that data itself

HDFS Operation Gaps

  • No parallel computation — you can upload 900TB and retrieve any block, but aggregating it requires something that can run simultaneously on all those blocks across all those nodes
  • No structured access — HDFS gives you raw bytes; there's no concept of rows, columns, or schemas; every application must parse the format itself
  • Batch-only paradigm — HDFS is optimized for write-once sequential reads; updating a record, streaming in new events, or running low-latency queries all require additional layers
  • No job orchestration — uploading data and computing on it are completely separate concerns; someone has to coordinate which nodes run which tasks against which blocks

What Comes Next

Gap                                 | Solution                                        | When
Parallel computation on HDFS blocks | MapReduce: data-local map + shuffle + reduce    | Week 3
Python-friendly MapReduce           | mrjob: write mappers/reducers as Python classes | Week 3, Session 2
Faster iteration, in-memory compute | Apache Spark: DAG execution over HDFS data      | Week 5

HDFS is the foundation: every computation engine you'll study this semester reads and writes the same HDFS paths and files you worked with today.

Next Week Preview

Week 3: MapReduce — Design Rationale and Implementation

  • The MapReduce programming model (map, shuffle, reduce)
  • Why MapReduce was a breakthrough (and where it falls short)
  • Writing MapReduce jobs with Python and mrjob
  • Running jobs on the Docker cluster

Before next session:

  • Read the Google MapReduce paper (Dean & Ghemawat, 2004) — especially Sections 1–3
  • Review Python functional programming: map(), filter(), reduce()
  • Think about this: How would you count words in a 1TB text file?

Speaker context: This is a hands-on session. Minimize lecture, maximize student time at the terminal. The scavenger hunt activity should occupy ~20 minutes of productive struggle. The goal is that every student leaves having successfully uploaded, inspected, and retrieved a file from HDFS, and having run a Python snippet that touches HDFS programmatically. Confidence at the CLI is a prerequisite for everything we do in Weeks 3–8.

Speaker notes: Expected output:
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2026-01-15 09:00 /data
drwxr-xr-x - hadoop supergroup 0 2026-01-15 09:00 /tmp
drwxr-xr-x - hadoop supergroup 0 2026-01-15 09:00 /user

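Speaker notes: HDFS doesn't have a true current working directory. Relative paths resolve against the user's home directory (/user/<username>), so a bare hadoop does not mean /user/hadoop; have students use absolute paths throughout.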

Speaker notes: Expected output shows 3 directories: hw1/backup, hw1/input, hw1/output.

Speaker notes: Pro tip — `-put` will fail if the destination already exists. Add `-f` to overwrite: `hdfs dfs -put -f /tmp/hello.txt /user/student/hello.txt`

Speaker notes: The disk-space-consumed column of `-du` shows 3× the file size because a 10MB file with replication=3 uses 30MB of raw disk across the cluster.

Speaker notes: Also available: df.write.orc("hdfs://namenode:9000/user/student/output/data.orc") — columnar, efficient for analytics.

Speaker notes: Save results back: errors.saveAsTextFile("hdfs://namenode:9000/user/student/errors_only")

Speaker notes: Verify with `hdfs dfs -ls /user/student/hw1` — should show 3 directories.