HDFS Operations and Programming
CS 6500 — Week 2, Session 2
CS 6500 — Big Data Analytics | Week 2
Session 1 Recap
Architecture foundations we covered:
NameNode: metadata master (namespace + block map in RAM)
DataNodes: block storage workers, heartbeat to NameNode
Block size 128MB: balances NameNode memory and parallelism
Replication factor 3 + rack awareness: fault tolerance
Data locality: move computation to data, not data to computation
Today: Stop talking about HDFS — start using it.
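The block-size and replication figures in the recap reduce to simple arithmetic. The sketch below is only an illustration, not HDFS code: `block_layout` is a hypothetical helper, and the 300 MB file is an assumed example.

```python
import math

BLOCK_SIZE_MB = 128  # default block size from the recap
REPLICATION = 3      # default replication factor from the recap


def block_layout(file_size_mb, block_size_mb=BLOCK_SIZE_MB, replication=REPLICATION):
    """Return (block sizes in MB, total block replicas, raw storage in MB)."""
    if file_size_mb <= 0:
        return [], 0, 0
    n_blocks = math.ceil(file_size_mb / block_size_mb)
    # Every block is full except possibly the last one.
    sizes = [block_size_mb] * (n_blocks - 1)
    sizes.append(file_size_mb - block_size_mb * (n_blocks - 1))
    return sizes, n_blocks * replication, file_size_mb * replication


sizes, replicas, raw_mb = block_layout(300)
print(sizes)     # [128, 128, 44]
print(replicas)  # 9
print(raw_mb)    # 900
```

The last block holds only the remaining 44 MB, so raw disk usage scales with the file size times the replication factor rather than with the block count.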
The Question
"You have 900TB of web server logs on HDFS. A colleague claims you can query any file from any node in the cluster. Another says touching data from the wrong node will make your job ten times slower. Who is right — and how would you know before running the job?"
Today's Answer
Both are right — and the difference is data locality.
HDFS CLI: what you can do
Navigate, upload, download, inspect, replicate, and administer files from the terminal — the foundation for every job you'll run this semester
Programmatic access: why it matters
Automate everything: the Python hdfs library lets you script pipelines without touching the terminal; PySpark reads HDFS paths natively and schedules tasks close to the data where it can
You leave this session with CLI confidence and a notebook that reads and writes HDFS data without leaving Python.
HDFS CLI
hdfs dfs — your primary tool for interacting with the cluster
Connect to Cluster
All HDFS commands go through the NameNode container:
docker exec -it namenode bash
hdfs dfs -ls /
You should see /data, /tmp, and /user listed.
Note: hdfs dfs vs hadoop fs — both work; hdfs dfs is HDFS-specific.
Navigation Commands
hdfs dfs -ls /
hdfs dfs -ls /user
hdfs dfs -ls -h /data
hdfs dfs -ls -R /user
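There is no cd in HDFS, so spell out the full path every time. Relative paths are accepted, but they resolve against your HDFS home directory (/user/<username>), which is easy to get wrong.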
Make Directories
hdfs dfs -mkdir /user/student
hdfs dfs -mkdir -p /user/student/hw1/input
hdfs dfs -mkdir -p /user/student/hw1/output
hdfs dfs -mkdir -p /user/student/hw1/backup
hdfs dfs -ls /user/student/hw1
Uploading Files
echo "Hello HDFS" > /tmp/hello.txt
hdfs dfs -put /tmp/hello.txt /user/student/
hdfs dfs -copyFromLocal /tmp/hello.txt /user/student/hello2.txt
hdfs dfs -ls /user/student/
hdfs dfs -cat /user/student/hello.txt
Downloading Files
hdfs dfs -get /user/student/hello.txt /tmp/downloaded.txt
cat /tmp/downloaded.txt
hdfs dfs -getmerge /user/student/output/ /tmp/combined_results.txt
Why -getmerge? MapReduce jobs write output as part-00000, part-00001, etc.
-getmerge concatenates them, in filename order, into a single file on the local filesystem.
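To make the merge behaviour concrete, here is a local-only sketch of the same idea: concatenating part files in filename order. It does not call HDFS; `merge_parts` and the temporary directory layout are assumptions made for illustration.

```python
import tempfile
from pathlib import Path


def merge_parts(part_dir, merged_path):
    """Concatenate every file in part_dir, in filename order, into merged_path."""
    parts = sorted(p for p in Path(part_dir).iterdir() if p.is_file())
    with open(merged_path, "wb") as out:
        for part in parts:
            out.write(part.read_bytes())
    return [p.name for p in parts]


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp) / "output"
    out_dir.mkdir()
    # Write the parts out of order to show that sorting restores the sequence.
    (out_dir / "part-00001").write_text("c\nd\n")
    (out_dir / "part-00000").write_text("a\nb\n")
    merged = Path(tmp) / "combined_results.txt"
    order = merge_parts(out_dir, merged)
    merged_text = merged.read_text()

print(order)  # ['part-00000', 'part-00001']
print(merged_text)
```

Zero-padded part numbers are what make filename order match task order.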
Inspecting Files
hdfs dfs -cat /user/student/hello.txt
hdfs dfs -tail /user/student/large_file.txt
hdfs dfs -stat "%n size:%b repl:%r block:%o" /user/student/hello.txt
Block Health Check
hdfs fsck /user/student/hello.txt -files -blocks -locations
hdfs fsck output (sample):
/user/student/hello.txt: Under replicated BP...
Total size: 11 B
Total blocks: 1 (avg block size 11 B)
No. of blocks: 1
Status: HEALTHY
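When a script needs to act on this report, one option is to pull out the Status line. The sketch below parses a few sample lines taken from the slide; `fsck_status` is a hypothetical helper, and real fsck output contains more fields than shown here.

```python
def fsck_status(report):
    """Return the value of the 'Status:' line in fsck output, or None if absent."""
    for line in report.splitlines():
        line = line.strip()
        if line.startswith("Status:"):
            return line.split(":", 1)[1].strip()
    return None


# Sample lines copied from the slide above.
sample = """\
Total size: 11 B
Total blocks: 1 (avg block size 11 B)
Status: HEALTHY
"""

status = fsck_status(sample)
print(status)  # HEALTHY
```

In a pipeline, the report text would typically come from capturing the output of the fsck command rather than from a hard-coded string.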
File Operations
hdfs dfs -cp /user/student/hello.txt /user/student/hello_backup.txt
hdfs dfs -mv /user/student/hello_backup.txt /user/student/hw1/backup/
hdfs dfs -rm /user/student/hello.txt
hdfs dfs -rm -r /user/student/old_data/
HDFS Trash: When trash is enabled (fs.trash.interval > 0), deleted files move to /user/<name>/.Trash/ and are permanently removed after the configured interval (often 24h). With trash disabled, -rm deletes immediately.
Space Commands
hdfs dfs -du /user/student/
hdfs dfs -du -h /user/student/
hdfs dfs -du -s /user/student/
hdfs dfs -df -h /
hdfs dfs -setrep 2 /user/student/hello.txt
hdfs dfs -setrep -w 2 /user/student/hello.txt
Admin Commands
hdfs fsck / -files -blocks
hdfs dfsadmin -report
hdfs dfsadmin -safemode enter
hdfs dfsadmin -safemode leave
Web UI: NameNode dashboard at http://localhost:9870 — block reports, DataNode status, live metrics.
HDFS Scavenger Hunt
20 minutes — hands-on CLI challenge
Hunt Rules
Work in pairs. First team to complete all tasks wins +1% on homework.
Open a shell to the NameNode:
docker exec -it namenode bash
Tasks on the next slide — go!
Hunt Tasks
Discovery: Find the total size of everything under /user/ (hint: -du -s)
Setup: Create /user/student/yourname/ with 3 subdirectories
Upload: Generate a 1000-line file locally (seq 1 1000 > /tmp/numbers.txt) and upload it
Inspect: What is the replication factor of your uploaded file? What DataNodes hold it?
Replication: Change replication to 2, wait for completion, verify with fsck
Hunt Tasks (2)
View: Use -tail to see the end of your file (it prints the last kilobyte, not a fixed number of lines)
Merge: Create 3 small files and use -getmerge to combine them locally
Cleanup: Delete everything you created in /user/student/yourname/
Hunt Debrief
Common discoveries:
Small files do not fill a 128MB block on disk; each replica stores only the file's actual bytes. The real cost is one file entry and one block entry per file in NameNode memory
hdfs fsck shows exactly which DataNodes hold your blocks
-setrep -w can take a minute — replication happens in the background
-getmerge order is alphabetical by filename — important for part-0000N output files
Discussion: Why did we set replication to 2? When would you do this in production?
(Answer: testing environments, intermediate data, cost reduction when durability is less critical)
Python HDFS Access
Python and PySpark APIs for automation and integration
CLI vs. API
The CLI is great for:
Manual exploration and debugging
One-off data transfers
Administrative tasks
APIs are necessary for:
Automated pipelines (ETL workflows)
Application integration (upload user data directly)
Complex conditional logic (upload only if file doesn't exist)
Testing and validation in CI/CD pipelines
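The "upload only if file doesn't exist" case from the list above might look like the sketch below. `upload_if_missing` is a hypothetical helper; it assumes a client that behaves like `hdfs.InsecureClient`, where `status(path, strict=False)` returns None for a missing path and `upload(hdfs_path, local_path)` copies a local file into HDFS. `InMemoryClient` is a stand-in so the example runs without a cluster.

```python
def upload_if_missing(client, local_path, hdfs_path):
    """Upload local_path to hdfs_path unless the HDFS path already exists.

    Returns True if an upload happened, False if it was skipped.
    """
    if client.status(hdfs_path, strict=False) is not None:
        return False
    client.upload(hdfs_path, local_path)
    return True


class InMemoryClient:
    """Stand-in for a real HDFS client; it only remembers which paths exist."""

    def __init__(self):
        self.files = set()

    def status(self, hdfs_path, strict=True):
        return {"type": "FILE"} if hdfs_path in self.files else None

    def upload(self, hdfs_path, local_path):
        self.files.add(hdfs_path)


demo = InMemoryClient()
first = upload_if_missing(demo, "/tmp/hello.txt", "/user/student/hello.txt")
second = upload_if_missing(demo, "/tmp/hello.txt", "/user/student/hello.txt")
print(first, second)  # True False
```

This kind of check-then-act logic is awkward in a shell one-liner but straightforward in a function, which is the main argument for the API.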
Python Library
Already in Docker Jupyter — connect via WebHDFS:
from hdfs import InsecureClient
client = InsecureClient('http://namenode:9870', user='hadoop')
print(client.list('/user/student'))
The hdfs library uses the WebHDFS REST interface — no Java required.
Read and Write
from hdfs import InsecureClient
client = InsecureClient('http://namenode:9870', user='hadoop')
# overwrite=True lets this cell be rerun; without it, write() fails if the file exists
with client.write('/user/student/data.csv', encoding='utf-8', overwrite=True) as writer:
    writer.write("id,value\n1,alpha\n2,beta\n3,gamma\n")
# Read the file back as text
with client.read('/user/student/data.csv', encoding='utf-8') as reader:
    content = reader.read()
print(content)
DataFrame Upload
import pandas as pd
from hdfs import InsecureClient
client = InsecureClient('http://namenode:9870', user='hadoop')
df = pd.DataFrame({'id': [1, 2, 3], 'value': ['alpha', 'beta', 'gamma']})
with client.write('/user/student/df.csv', encoding='utf-8') as writer:
    df.to_csv(writer, index=False)
DataFrame Reads
from hdfs import InsecureClient
import pandas as pd
client = InsecureClient('http://namenode:9870', user='hadoop')
with client.read('/user/student/df.csv', encoding='utf-8') as reader:
    df = pd.read_csv(reader)
print(df.shape)
print(df.columns.tolist())
File Management
from hdfs import InsecureClient
client = InsecureClient('http://namenode:9870', user='hadoop')
client.makedirs('/user/student/pipeline_output')
# With strict=False, status() returns None for a missing path instead of raising
if client.status('/user/student/data.csv', strict=False):
    print("File exists")
else:
    print("File not found")
File Management (2)
status = client.status('/user/student/data.csv')
print(f"Size: {status['length']} bytes")
print(f"Replication: {status['replication']}")
print(f"Block size: {status['blockSize']}")
client.delete('/user/student/old_data.csv')
Live Demo
Open Jupyter: http://localhost:8888
Demo notebook flow:
Connect to HDFS NameNode
Create a directory structure
Generate sample log data (10,000 lines)
Upload via client.write()
Read back, parse HTTP status codes, count each
Save results as CSV to HDFS
Follow along in your own notebook.
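The parsing and counting steps of the demo can be rehearsed locally before any HDFS call is involved. The sketch below is one possible shape for them, not the demo notebook itself: the log line layout, the list of status codes, and all function names are assumptions.

```python
import csv
import io
import random
from collections import Counter

random.seed(0)  # fixed seed so reruns produce the same sample

STATUS_CODES = [200, 301, 404, 500]  # assumed mix of codes for the sample data


def make_log_lines(n):
    """Generate n fake access-log lines in an assumed, simplified layout."""
    lines = []
    for i in range(n):
        code = random.choice(STATUS_CODES)
        lines.append(f'10.0.0.{i % 255} "GET /page/{i} HTTP/1.1" {code} 512')
    return lines


def count_status_codes(lines):
    """Count HTTP status codes; in this layout the code is the second-to-last field."""
    counts = Counter()
    for line in lines:
        counts[line.split()[-2]] += 1
    return counts


def counts_to_csv(counts):
    """Render the counts as CSV text with a header row."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")
    writer.writerow(["status", "count"])
    for status, count in sorted(counts.items()):
        writer.writerow([status, count])
    return buf.getvalue()


lines = make_log_lines(10_000)
counts = count_status_codes(lines)
csv_text = counts_to_csv(counts)
print(csv_text)
```

In the notebook, `csv_text` is the kind of string that could then be passed to `writer.write()` inside a `client.write(...)` block.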
PySpark and HDFS
Reading and writing DataFrames at scale
PySpark + HDFS
PySpark reads/writes HDFS natively — no separate library needed:
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("HDFS Demo").getOrCreate()
df = spark.read.csv("hdfs://namenode:9000/user/student/data.csv",
                    header=True, inferSchema=True)
df.show()
Writing DataFrames
df.write.parquet("hdfs://namenode:9000/user/student/output/data.parquet" )
df.write.csv("hdfs://namenode:9000/user/student/output/data_csv",
             header=True, mode="overwrite")
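HDFS output: Each Spark partition writes its own file (e.g., part-00000-abc123.csv). For a single output file, call coalesce(1) before writing, or run -getmerge on text output. Note that with header=True every part file repeats the header row, and -getmerge is not suitable for Parquet.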
RDDs and HDFS
rdd = spark.sparkContext.textFile(
    "hdfs://namenode:9000/user/student/logs.txt"
)
print(f"Total lines: {rdd.count()}")
errors = rdd.filter(lambda line: "ERROR" in line)
print(f"Error lines: {errors.count()}")
HDFS Path Patterns
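# Glob pattern: every CSV file in the directory
df = spark.read.csv("hdfs://namenode:9000/user/student/data/*.csv",
                    header=True)
# Wildcard over partition directories: all months of 2025
df = spark.read.parquet(
    "hdfs://namenode:9000/data/logs/year=2025/month=*/")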
Path Patterns (2)
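# A list of explicit paths is read into one DataFrame
df = spark.read.text([
    "hdfs://namenode:9000/data/jan.log",
    "hdfs://namenode:9000/data/feb.log",
    "hdfs://namenode:9000/data/mar.log"
])
# No scheme: the path resolves against the cluster's default filesystem (fs.defaultFS)
df = spark.read.csv("/user/student/data.csv")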
HDFS Best Practices
Design patterns for efficient use of distributed storage
File Organization
/data/raw/
    access_logs/
        year=2025/month=01/ ← partition directories
        year=2025/month=02/
/user/student/hw1/input/
/tmp/ ← temporary intermediate data
Partition directories (Hive-style: key=value) enable partition pruning in Spark — huge performance win for filtering.
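A small sketch of the key=value convention follows. It shows how partition values can be read straight from directory names, which is what lets an engine skip whole directories when filtering. The helper names are hypothetical, and this is plain string handling rather than Spark's own implementation.

```python
def partition_path(base, year, month):
    """Build a Hive-style partition directory such as base/year=2025/month=01."""
    return f"{base.rstrip('/')}/year={year:04d}/month={month:02d}"


def parse_partitions(path):
    """Extract key=value segments from a path into a dict of strings."""
    parts = {}
    for segment in path.strip("/").split("/"):
        if "=" in segment:
            key, value = segment.split("=", 1)
            parts[key] = value
    return parts


paths = [partition_path("/data/raw/access_logs", 2025, m) for m in (1, 2)]
# "Pruning": keep only directories whose partition values match the filter,
# so files for other months are never opened.
january = [p for p in paths if parse_partitions(p).get("month") == "01"]

print(paths)
print(january)  # ['/data/raw/access_logs/year=2025/month=01']
```

Because the filter is answered from the path alone, no data in the month=02 directory has to be read.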
Use for HDFS:
Parquet — columnar, compressed, splittable; best for Spark analytics
ORC — columnar, optimized for Hive; similar to Parquet
Avro — row-based, great for schema evolution; good for streaming ingestion
Text/CSV — human-readable, slow; only for small datasets
Avoid on HDFS:
ZIP/tar — not splittable (1 mapper processes entire archive)
Gzip CSV — not splittable (same problem)
Many tiny files — NameNode OOM, poor parallelism
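Splittable compression: Bzip2 is splittable as-is; LZO is splittable once indexed; Snappy is splittable only inside a container format such as Parquet, ORC, Avro, or SequenceFile.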
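The cost of a non-splittable file shows up as lost parallelism. The estimate below is a simplification that assumes one task per block-sized split; `max_parallel_tasks` is a hypothetical helper and the 1024 MB file is an assumed example.

```python
import math

BLOCK_SIZE_MB = 128  # assumed split size, equal to the default block size


def max_parallel_tasks(file_size_mb, splittable, block_size_mb=BLOCK_SIZE_MB):
    """Estimate how many tasks can read one file at the same time."""
    if file_size_mb <= 0:
        return 0
    if not splittable:
        # A non-splittable file must be read start to finish by one task.
        return 1
    return math.ceil(file_size_mb / block_size_mb)


print(max_parallel_tasks(1024, splittable=True))   # 8
print(max_parallel_tasks(1024, splittable=False))  # 1
```

Under these assumptions, a 1 GB gzip CSV is read by a single task while the same data in a splittable format can be spread over eight.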
Small Files Problem
Problem: 1 million 1KB files ≈ 150MB of NameNode RAM for the file entries alone (roughly double once each file's block entry is counted) + slow MapReduce
Solution 1: HAR (Hadoop Archive)
hadoop archive -archiveName logs.har \
    -p /user/student/small_files/ /user/student/
Solution 2: Sequence Files — bundle many key-value pairs into one binary file; splittable and indexed
Solution 3: Consolidate at ingestion — batch-merge into large HDFS files hourly via Kafka/S3
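The memory figure in the problem statement can be reproduced with a rule-of-thumb estimate. The sketch assumes roughly 150 bytes of NameNode heap per namespace object (a file entry or a block entry); that constant is an approximation, and `heap_bytes` is a hypothetical helper.

```python
import math

BYTES_PER_OBJECT = 150  # assumed rule-of-thumb heap cost per file or block entry


def heap_bytes(n_files, blocks_per_file, bytes_per_object=BYTES_PER_OBJECT):
    """Estimate NameNode heap for n_files, each stored in blocks_per_file blocks."""
    objects = n_files * (1 + blocks_per_file)  # one file entry plus its blocks
    return objects * bytes_per_object


# One million 1KB files: counting file entries only, then files plus blocks.
file_entries_only = 1_000_000 * BYTES_PER_OBJECT
small_files = heap_bytes(1_000_000, blocks_per_file=1)

# The same data consolidated into one file of 128MB blocks.
total_kb = 1_000_000 * 1
merged_blocks = math.ceil(total_kb / (128 * 1024))
merged_file = heap_bytes(1, blocks_per_file=merged_blocks)

print(file_entries_only / 1e6)  # 150.0 (MB)
print(small_files / 1e6)        # 300.0 (MB)
print(merged_file)              # 1350 (bytes)
```

Whichever way the objects are counted, consolidation shrinks the metadata footprint by several orders of magnitude, which is the point of all three solutions.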
Homework Overview
What you'll build this week
HDFS Homework
Three parts — due Sunday 11:59 PM
Part 1: Command-Line Operations (40 pts)
Build a directory structure in HDFS
Upload a 100MB log dataset
Run a replication experiment (change factor, verify with fsck)
File manipulation (copy, rename, tail)
Space analysis (why does -du show 3× the file size?)
HDFS Homework (2)
Part 2: Programmatic Access (30 pts)
Python hdfs library: list, upload, parse, save results
PySpark: load file as RDD, count lines, filter errors, save output
Part 3: Analysis Questions (30 pts)
Draw a block diagram for a 512MB file with replication=3
Failure scenario analysis
Why HDFS is wrong for tiny files and random access
HW Walkthrough
Let's do Part 1 Task 1 together:
docker exec -it namenode bash
hdfs dfs -mkdir -p /user/student/hw1/input
hdfs dfs -mkdir -p /user/student/hw1/output
Your turn: Complete Tasks 2–5 using the commands from today's session.
Submission: ZIP containing commands.txt, hw1_notebook.ipynb, analysis.pdf
Work Time
Use the remaining time to begin Part 1.
I'm circulating — flag me if you hit:
Docker container not running (docker compose up -d)
Permission errors (you may need sudo in some commands)
NameNode in safe mode (hdfs dfsadmin -safemode leave)
docker compose ps
Session 2 Takeaways
HDFS CLI skills you now have:
Navigate, create, upload, download, inspect, copy, delete
Check replication with fsck, modify with -setrep
Measure space with -du and -df
Takeaways (2)
Programmatic access:
Python hdfs library: connect via WebHDFS, read/write files, manage directories
PySpark: native HDFS paths (hdfs://namenode:9000/...), read/write DataFrames and RDDs
Design principles:
Large files > many small files
Columnar formats (Parquet, ORC) for analytics
Partition directories for query performance
What's Missing?
HDFS stores and serves data — but it cannot perform any computation on that data itself
HDFS Operation Gaps
No parallel computation — you can upload 900TB and retrieve any block, but aggregating it requires something that can run simultaneously on all those blocks across all those nodes
No structured access — HDFS gives you raw bytes; there's no concept of rows, columns, or schemas; every application must parse the format itself
Batch-only paradigm — HDFS is optimized for write-once sequential reads; updating a record, streaming in new events, or running low-latency queries all require additional layers
No job orchestration — uploading data and computing on it are completely separate concerns; someone has to coordinate which nodes run which tasks against which blocks
What Comes Next
Gap | Solution | When
Parallel computation on HDFS blocks | MapReduce: data-local map + shuffle + reduce | Week 3
Python-friendly MapReduce | mrjob: write mappers/reducers as Python classes | Week 3, Session 2
Faster iteration, in-memory compute | Apache Spark: DAG execution over HDFS data | Week 5
HDFS is the foundation: every computation engine you'll study this semester reads from and writes to the same HDFS paths you worked with today.
Next Week Preview
Week 3: MapReduce — Design Rationale and Implementation
The MapReduce programming model (map, shuffle, reduce)
Why MapReduce was a breakthrough (and where it falls short)
Writing MapReduce jobs with Python and mrjob
Running jobs on the Docker cluster
Before next session:
Read the Google MapReduce paper (Dean & Ghemawat, 2004) — especially Sections 1–3
Review Python functional programming: map(), filter(), reduce()
Think about this: How would you count words in a 1TB text file?
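As a warm-up for the functional-programming review, here is a tiny single-machine word count built from `filter`, `map`, and `reduce`. It is only a refresher on the three functions with made-up input lines, not a MapReduce job, and it would not scale to a 1TB file.

```python
from functools import reduce

lines = ["big data big ideas", "", "data locality matters"]

# filter: drop empty lines; map: split each remaining line into words.
non_empty = filter(None, lines)
word_lists = map(str.split, non_empty)


def merge_counts(counts, words):
    """Fold one line's words into the running dictionary of counts."""
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    return counts


# reduce: combine the per-line word lists into one result.
word_counts = reduce(merge_counts, word_lists, {})
print(word_counts)
# {'big': 2, 'data': 2, 'ideas': 1, 'locality': 1, 'matters': 1}
```

Everything here runs in one process over one list, so the open question for next week is how the same three steps could be spread across many machines.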
Speaker context: This is a hands-on session. Minimize lecture, maximize student time at the terminal. The scavenger hunt activity should occupy ~20 minutes of productive struggle. The goal is that every student leaves having successfully uploaded, inspected, and retrieved a file from HDFS, and having run a Python snippet that touches HDFS programmatically. Confidence at the CLI is a prerequisite for everything we do in Weeks 3–8.
Speaker notes: Expected output:
Found 3 items
drwxr-xr-x - hadoop supergroup 0 2026-01-15 09:00 /data
drwxr-xr-x - hadoop supergroup 0 2026-01-15 09:00 /tmp
drwxr-xr-x - hadoop supergroup 0 2026-01-15 09:00 /user
Speaker notes: HDFS doesn't have a true current working directory. Relative paths resolve against the user's home directory (/user/<username>), so have students use absolute paths (e.g., /user/hadoop, not just hadoop).
Speaker notes: Expected output shows 3 directories: hw1/backup, hw1/input, hw1/output.
Speaker notes: Pro tip — `-put` will fail if the destination already exists. Add `-f` to overwrite: `hdfs dfs -put -f /tmp/hello.txt /user/student/hello.txt`
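Speaker notes: In recent Hadoop versions, `-du` prints two numbers per path: the file size and the disk space consumed across all replicas. The second is 3× the first because a 10MB file with replication=3 uses 30MB of raw disk across the cluster.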
Speaker notes: Also available: df.write.orc("hdfs://namenode:9000/user/student/output/data.orc") — columnar, efficient for analytics.
Speaker notes: Save results back: errors.saveAsTextFile("hdfs://namenode:9000/user/student/errors_only")
Speaker notes: Verify with `hdfs dfs -ls /user/student/hw1` — should show 3 directories.