Previous slide Next slide Toggle fullscreen Toggle overview view Open presenter view
Distributed File Systems and HDFS
CS 6500 — Week 2, Session 1
CS 6500 — Big Data Analytics | Week 2
Week 1 Recap
What we covered:
Big data landscape and the V's (Volume, Velocity, Variety)
The Hadoop ecosystem overview
Docker environment setup and first contact with the cluster
Key insight: Traditional databases hit walls at scale — we need new architectures designed for distributed commodity hardware.
Any environment setup issues? Quick show of hands — we'll troubleshoot during break.
CS 6500 — Big Data Analytics | Week 2
The Question
"Google had petabytes of web crawl data in 2003 and a cluster of 1,000 commodity servers. They couldn't afford enterprise storage. Why didn't they just use NFS — and what did they have to invent instead?"
CS 6500 — Big Data Analytics | Week 2
Today's Path
HDFS is the answer — and understanding why means understanding what NFS got wrong.
Why traditional file systems fail at scale
Single machines, random-access assumptions, and failure rates that make enterprise storage economics impossible at petabyte scale
How HDFS was designed instead
NameNode, DataNodes, 128MB blocks, 3× replication, rack awareness — each design choice is a deliberate response to a specific failure of traditional storage
By end of session you'll be able to trace a read request through the cluster and explain every design decision along the way.
CS 6500 — Big Data Analytics | Week 2
Why FS Can't Scale
The wall that every growing system eventually hits
CS 6500 — Big Data Analytics | Week 2
The Scale Problem
A typical enterprise challenge:
10TB of web server logs generated per day
Need to analyze 90 days of history (900TB)
Query: "Find all sessions from users who later converted to paid"
On a single machine:
900TB doesn't fit on one disk
Even if it did: reading at 500MB/sec → 21 days to scan once
RAID adds capacity but not parallel throughput
The fundamental constraint: I/O bandwidth is physical, not a software problem.
CS 6500 — Big Data Analytics | Week 2
FS Assumptions
Local file systems (ext4, NTFS, APFS) assume:
One machine owns the storage
Random access is common (seek + read small records)
Files change frequently (overwrite, append, truncate)
Concurrency is managed with locks
Failure is rare — hardware is enterprise-grade
These assumptions break completely at petabyte scale on commodity clusters.
CS 6500 — Big Data Analytics | Week 2
DFS Requirements
What a file system for big data must provide:
Requirement
Why It Matters
Horizontal scalability
Add capacity by adding nodes, not replacing hardware
Fault tolerance
Commodity nodes fail daily on 1000-node clusters
High aggregate throughput
Stream data fast across many disks in parallel
Large file support
Analytics workloads operate on huge datasets
Data locality
Move computation to data, not data to computation
CS 6500 — Big Data Analytics | Week 2
GFS (2003)
GFS Paper (Ghemawat, Gobioff, Leung) — the blueprint for HDFS
Key observations Google made:
Component failures are the norm, not the exception
Files are mostly large (multi-GB), not millions of tiny files
Files are written once (or appended), then read many times
Co-designing the file system and applications unlocks performance
The insight: Don't try to make distributed storage look like local storage. Embrace the difference and design accordingly.
CS 6500 — Big Data Analytics | Week 2
Design Principles
HDFS is a direct implementation of GFS ideas in open-source Java:
Write-once, read-many: Files are immutable after creation (append possible but limited)
Large files: Optimized for 100MB–TB files, not millions of KB files
Streaming access: Sequential reads dominate; random access is slow by design
Commodity hardware: Runs on standard x86 servers; failures expected and handled
Simple consistency: No POSIX semantics; relaxed model enables throughput
The trade-off: HDFS is spectacular at what it does and terrible at what it doesn't do.
CS 6500 — Big Data Analytics | Week 2
HDFS vs. Local FS
Local FS (ext4) / NFS
Single machine or file server
Random read/write optimized
POSIX-compliant (seek, mmap, etc.)
Small files: fine
Latency: milliseconds
Fault tolerance: RAID or nothing
Capacity: one server's worth
HDFS
Distributed across 100s–1000s of nodes
Sequential streaming optimized
Write-once semantics
Small files: actively bad
Latency: seconds (for metadata ops)
Fault tolerance: built-in replication
Capacity: add nodes as needed
CS 6500 — Big Data Analytics | Week 2
HDFS Architecture
NameNode, DataNodes, and how they work together
CS 6500 — Big Data Analytics | Week 2
Architecture
┌──────────────┐
│ NameNode │ ← metadata only
└──────┬───────┘
┌───────┼───────┐
▼ ▼ ▼
┌────────┐┌────────┐┌────────┐
│DataNode││DataNode││DataNode│
└────────┘└────────┘└────────┘
Clients contact NameNode once for metadata, then read/write DataNodes directly.
CS 6500 — Big Data Analytics | Week 2
The NameNode
Responsibilities:
Maintains the namespace tree (directories and files)
Stores metadata for every file (permissions, timestamps, block list)
Tracks block locations on DataNodes (in memory — not on disk)
Authorizes all filesystem operations
What NameNode does NOT do:
Store any actual file data
Directly participate in reads or writes (after the initial handshake)
Critical detail: All metadata lives in RAM. A cluster with billions of files needs a very large NameNode.
CS 6500 — Big Data Analytics | Week 2
NameNode Memory
Each file requires ~150 bytes of NameNode RAM (name + block list + locations)
Scenario
Files
NameNode RAM
1,000 large log files (1GB each)
1,000
~150 KB
1,000,000 compressed tweets (1KB each)
1,000,000
~150 MB
1,000,000,000 sensor readings (100B each)
1B
~150 GB
Same data volume, wildly different metadata cost.
CS 6500 — Big Data Analytics | Week 2
The DataNode
Responsibilities:
Store actual data blocks on local disk
Send heartbeats to NameNode every 3 seconds (I'm alive!)
Send block reports every hour (here's what I have)
Serve read requests from clients directly
Participate in replication pipeline for writes
On failure: NameNode detects missed heartbeat after 10 min, schedules re-replication automatically — transparent to users.
CS 6500 — Big Data Analytics | Week 2
Block-Based Storage
HDFS stores files as fixed-size blocks distributed across DataNodes
Default block size: 128 MB
A 400MB file becomes:
Block 1: 128MB → DataNode 2, 5, 7
Block 2: 128MB → DataNode 1, 3, 8
Block 3: 128MB → DataNode 4, 6, 2
Block 4: 16MB → DataNode 3, 7, 1
Why 128MB? Not arbitrary — it's a careful trade-off.
CS 6500 — Big Data Analytics | Week 2
Block Size
The trade-off between block size and system behavior:
Block Size
NameNode Metadata
Parallelism
Seek Overhead
1MB
1,000 blocks per GB
High
Negligible
64MB
16 blocks per GB
Moderate
Low
128MB
8 blocks per GB
Good
Very low
512MB
2 blocks per GB
Low
None
The winner: 128MB minimizes NameNode memory while preserving enough parallelism for large analytical jobs.
Rule of thumb: Each block can be processed by one mapper. Too few blocks = too little parallelism.
CS 6500 — Big Data Analytics | Week 2
Replication
Default: 3 copies of every block on different DataNodes:
Block A → [DataNode 1, DataNode 4, DataNode 7]
Block B → [DataNode 2, DataNode 5, DataNode 8]
Block C → [DataNode 3, DataNode 6, DataNode 9]
Up to 2 DataNodes can fail simultaneously without data loss
Any surviving DataNode can serve read requests
Re-replication happens automatically when nodes fail
Storage cost: 3× raw storage, but commodity disks are cheap.
CS 6500 — Big Data Analytics | Week 2
Rack Awareness
The problem: If all 3 replicas land in the same rack, a power failure kills all copies.
HDFS rack-aware placement strategy (default):
Replica 1: Writer's node (or random node in writer's rack)
Replica 2: Different rack from replica 1
Replica 3: Same rack as replica 2, different node
Why this works:
Survives individual node failures (replicas on different nodes)
Survives entire rack failure (replicas on 2+ racks)
Keeps replica 2 and 3 on same rack to reduce cross-rack write traffic
CS 6500 — Big Data Analytics | Week 2
Read Operation
Client reads /data/logs/access.log:
Client → NameNode: "I want to read /data/logs/access.log"
NameNode → Client: "Here are the block IDs and their DataNode locations"
Client → DataNode (nearest with Block 1): "Send me Block 1"
DataNode → Client: streams block data
Client → DataNode (nearest with Block 2): "Send me Block 2"
... (repeat for all blocks, potentially from different DataNodes)
Key insight: NameNode is consulted once for metadata, then the client talks directly to DataNodes. NameNode never becomes a data bottleneck.
CS 6500 — Big Data Analytics | Week 2
Write Operation
Client writes a new file:
Client → NameNode: "I want to create /data/new_file.txt"
NameNode → Client: "Write Block 1 to DN2; DN2 will forward to DN5; DN5 to DN8"
Client → DN2: streams data (DN2 simultaneously forwards to DN5, DN5 to DN8)
DN8 → DN5 → DN2 → Client: acknowledgments flow back through the pipeline
Client → NameNode: "Block 1 written successfully"
Repeat for each block
Pipeline replication: Data flows in one direction; acknowledgments flow back. Efficient use of network.
CS 6500 — Big Data Analytics | Week 2
Trace HDFS Ops
Apply the architecture — trace a real read request
CS 6500 — Big Data Analytics | Week 2
HDFS Read Trace
Pair up. Scenario: Client reads /data/logs/access.log
512MB file, 128MB blocks, replication=3, 8 DataNodes across 2 racks
Tasks:
How many blocks does this file have?
Draw the Client NameNode DataNode message sequence
Which messages carry metadata? Which carry data?
Bonus: DataNode 3 fails mid-read — what happens?
One pair presents to the class.
CS 6500 — Big Data Analytics | Week 2
Discussion Points
Expected answers:
Blocks: ⌈512 / 128⌉ = 4 blocks
Metadata flow: Client → NameNode (once), then client reads directly from DataNodes
Failure recovery: Client retries on a different DataNode holding a replica — transparent to the application
The big insight: NameNode handles metadata only — the metadata layer never becomes a data throughput bottleneck.
CS 6500 — Big Data Analytics | Week 2
Data Locality
The principle that makes distributed computing practical
CS 6500 — Big Data Analytics | Week 2
Move Computation
The core insight of data-local scheduling:
Traditional approach (move data):
Data lives on storage servers
Compute jobs run on compute servers
Every job requires moving data over the network
Network becomes the bottleneck
HDFS approach (move computation):
Data stored on the same nodes that do computation
Scheduler launches tasks on nodes that hold the data
Data read from local disk (GBps) not network (Mbps)
Network used only for results, not raw data
CS 6500 — Big Data Analytics | Week 2
Locality Tiers
The scheduler tries to place tasks in this priority order:
Node-local — task runs on the exact DataNode holding the block
Read from local disk: ~500 MB/sec
Rack-local — task runs on a different node in the same rack
Read via rack switch: ~1 Gbps = ~125 MB/sec
Off-rack — task runs on any available node
Read via core switch: ~1 Gbps shared among all rack traffic
Real impact: Node-local vs. off-rack can be a 4× throughput difference on a busy cluster.
CS 6500 — Big Data Analytics | Week 2
MapReduce Locality
Why data locality matters for MapReduce:
Each mapper processes one HDFS block
YARN scheduler checks which nodes have that block
Launches the mapper task on a node-local DataNode when possible
If all local nodes are busy: waits briefly, then falls back to rack-local
The payoff: On a 1,000-node cluster analyzing 1PB of data with good locality:
Local reads at 500 MB/sec: ~24 hours of total I/O
Off-rack reads at 50 MB/sec: 240 hours of total I/O
Locality turns an infeasible job into a practical one.
CS 6500 — Big Data Analytics | Week 2
Session Preview
What's coming in Session 2
CS 6500 — Big Data Analytics | Week 2
HDFS CLI Preview
Session 2 focus: Hands-on HDFS operations in the Docker cluster
hdfs dfs -ls /
hdfs dfs -mkdir -p /user/student/data
CS 6500 — Big Data Analytics | Week 2
CLI Preview (2)
hdfs dfs -put local_file.txt /user/student/
hdfs dfs -get /user/student/file.txt ./
hdfs dfs -cat /user/student/file.txt
hdfs dfs -stat "%n %b %r" /user/student/file.txt
hdfs dfs -setrep -w 2 /user/student/file.txt
hdfs fsck /user/student/ -files -blocks
CS 6500 — Big Data Analytics | Week 2
Session 1 Takeaways
What you should now understand:
Why HDFS exists: Traditional file systems can't scale horizontally; HDFS was designed from scratch for distributed commodity hardware
NameNode: Metadata only, in RAM — never touches data; single point of coordination (and historical SPOF)
DataNodes: Store blocks, send heartbeats, serve clients directly after metadata lookup
CS 6500 — Big Data Analytics | Week 2
Takeaways (2)
Block size (128MB): Balances NameNode memory, parallelism, and seek overhead
Replication (3×) with rack awareness: Tolerates node and rack failures automatically
Data locality: Schedule computation where data lives — turns network bottleneck into disk throughput
CS 6500 — Big Data Analytics | Week 2
What's Missing?
HDFS stores data reliably at scale — but it cannot answer a single question about that data on its own
CS 6500 — Big Data Analytics | Week 2
HDFS Gaps
No computation model — HDFS is a file system, not a processor; knowing where your 10PB of log data lives tells you nothing about how to count, filter, or aggregate it
No query interface — there's no SQL, no API for "give me all rows where status=500"; data locality is valuable only when paired with a computation engine that exploits it
The small files problem — pipelines that stream many small files per second will saturate NameNode RAM even if total data volume is modest; you need an ingestion strategy
Write-once limitations — HDFS is optimized for immutable files; updating a single record requires rewriting the entire file, making it unsuitable for transactional workloads
CS 6500 — Big Data Analytics | Week 2
What Comes Next
Gap
Solution
When
How to run computation where data lives
MapReduce — data-local parallel processing
Week 3
SQL queries on HDFS data
Hive — SQL interface over HDFS
Week 7
Interactive, iterative processing
Apache Spark — in-memory DAG on top of HDFS
Week 5–7
HDFS is the right tool for distributed, fault-tolerant storage of large files — but the moment you want to compute anything, you need a processing engine that understands data locality.
CS 6500 — Big Data Analytics | Week 2
Next Session
Session 2: HDFS Operations and Programming
HDFS CLI: navigate, upload, download, inspect, administer
Scavenger hunt: hands-on command practice in the cluster
Python hdfs library: programmatic access from Jupyter
PySpark + HDFS: reading/writing DataFrames
Homework walkthrough and work time
Before Session 2: Make sure Docker cluster is running — docker compose up -d then verify with hdfs dfs -ls /.
CS 6500 — Big Data Analytics | Week 2
Speaker context: This session builds the architectural foundation students need before writing any MapReduce or Spark code. HDFS is not just a storage layer—it encodes fundamental distributed systems trade-offs (fault tolerance vs. consistency, throughput vs. latency, metadata cost vs. parallelism). Push students to reason about design choices rather than memorize commands. The key question to keep returning to: "Why did they make this decision instead of the obvious obvious alternative?"
Speaker notes: The GFS paper is in your required readings. Section 2 (Design Overview) is the essential part. The authors explicitly say "we have re-examined traditional choices and explored radically different points in the design space." That's the intellectual move we want students to make throughout this course.
Speaker notes: The NameNode memory constraint is real. Each file/block consumes ~150 bytes of NameNode RAM. 1 billion files = 150GB RAM just for metadata. This is why "millions of small files" is an HDFS anti-pattern — it's a NameNode OOM problem, not a disk problem.
Speaker notes: Draw this on the board. "Imagine a 3-rack cluster. Replica 1 goes to Rack A, Node 5. Replica 2 goes to Rack B, Node 12. Replica 3 goes to Rack B, Node 18. Now if Rack A's switch dies, you still have 2 copies. If Rack B dies, you still have 1 copy. Only losing both racks simultaneously loses data — and that's very unlikely."
Speaker notes: Design question — Why not stream all blocks through the NameNode? Answer: NameNode would immediately become the bottleneck — all bandwidth through one machine.
Speaker notes: docker compose up -d && docker exec -it namenode hdfs dfs -ls /