Distributed File Systems and HDFS

CS 6500 — Week 2, Session 1

CS 6500 — Big Data Analytics | Week 2

Week 1 Recap

What we covered:

  • Big data landscape and the V's (Volume, Velocity, Variety)
  • The Hadoop ecosystem overview
  • Docker environment setup and first contact with the cluster

Key insight: Traditional databases hit walls at scale — we need new architectures designed for distributed commodity hardware.

Any environment setup issues? Quick show of hands — we'll troubleshoot during break.

The Question

"Google had petabytes of web crawl data in 2003 and a cluster of 1,000 commodity servers. They couldn't afford enterprise storage. Why didn't they just use NFS — and what did they have to invent instead?"

Today's Path

HDFS is the answer — and understanding why means understanding what NFS got wrong.

Why traditional file systems fail at scale
Single machines, random-access assumptions, and failure rates that make enterprise storage economics impossible at petabyte scale

How HDFS was designed instead
NameNode, DataNodes, 128MB blocks, 3× replication, rack awareness — each design choice is a deliberate response to a specific failure of traditional storage

By end of session you'll be able to trace a read request through the cluster and explain every design decision along the way.

Why FS Can't Scale

The wall that every growing system eventually hits

The Scale Problem

A typical enterprise challenge:

  • 10TB of web server logs generated per day
  • Need to analyze 90 days of history (900TB)
  • Query: "Find all sessions from users who later converted to paid"

On a single machine:

  • 900TB doesn't fit on one disk
  • Even if it did: reading at 500MB/sec → 21 days to scan once
  • RAID adds capacity but not parallel throughput

The fundamental constraint: I/O bandwidth is physical, not a software problem.
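
Worked arithmetic (a minimal sketch): the snippet below reproduces the 21-day figure and shows what happens when the same scan is spread across many disks reading in parallel. The 1,000-disk count is an illustrative assumption, not a number from the scenario.

```python
TB = 10**12
MB = 10**6

def scan_days(total_bytes, bytes_per_sec, disks=1):
    """Days needed to read total_bytes once when `disks` disks stream in parallel."""
    seconds = total_bytes / (bytes_per_sec * disks)
    return seconds / 86_400

single = scan_days(900 * TB, 500 * MB)                  # one disk at 500 MB/sec
parallel = scan_days(900 * TB, 500 * MB, disks=1_000)   # assumed: 1,000 disks

print(f"1 disk:      {single:.1f} days")
print(f"1,000 disks: {parallel * 24 * 60:.0f} minutes")
```

The per-disk bandwidth never changes; only the number of disks reading at once does.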

FS Assumptions

Local file systems (ext4, NTFS, APFS) assume:

  • One machine owns the storage
  • Random access is common (seek + read small records)
  • Files change frequently (overwrite, append, truncate)
  • Concurrency is managed with locks
  • Failure is rare — hardware is enterprise-grade

These assumptions break completely at petabyte scale on commodity clusters.

DFS Requirements

What a file system for big data must provide:

Requirement               | Why It Matters
Horizontal scalability    | Add capacity by adding nodes, not replacing hardware
Fault tolerance           | Commodity nodes fail daily on 1000-node clusters
High aggregate throughput | Stream data fast across many disks in parallel
Large file support        | Analytics workloads operate on huge datasets
Data locality             | Move computation to data, not data to computation

GFS (2003)

GFS Paper (Ghemawat, Gobioff, Leung) — the blueprint for HDFS

Key observations Google made:

  1. Component failures are the norm, not the exception
  2. Files are mostly large (multi-GB), not millions of tiny files
  3. Files are written once (or appended), then read many times
  4. Co-designing the file system and applications unlocks performance

The insight: Don't try to make distributed storage look like local storage. Embrace the difference and design accordingly.

Design Principles

HDFS is a direct implementation of GFS ideas in open-source Java:

  • Write-once, read-many: Files are immutable after creation (append possible but limited)
  • Large files: Optimized for 100MB–TB files, not millions of KB files
  • Streaming access: Sequential reads dominate; random access is slow by design
  • Commodity hardware: Runs on standard x86 servers; failures expected and handled
  • Simple consistency: Relaxes POSIX requirements (single writer per file, no random writes) to enable throughput

The trade-off: HDFS is spectacular at what it does and terrible at what it doesn't do.

HDFS vs. Local FS

Local FS (ext4) / NFS

  • Single machine or file server
  • Random read/write optimized
  • POSIX-compliant (seek, mmap, etc.)
  • Small files: fine
  • Latency: milliseconds
  • Fault tolerance: RAID or nothing
  • Capacity: one server's worth

HDFS

  • Distributed across 100s–1000s of nodes
  • Sequential streaming optimized
  • Write-once semantics
  • Small files: actively bad
  • Latency: high per operation; tuned for throughput, not low-latency access
  • Fault tolerance: built-in replication
  • Capacity: add nodes as needed

HDFS Architecture

NameNode, DataNodes, and how they work together

Architecture

     ┌──────────────┐
     │  NameNode    │  ← metadata only
     └──────┬───────┘
    ┌───────┼───────┐
    ▼       ▼       ▼
┌────────┐┌────────┐┌────────┐
│DataNode││DataNode││DataNode│
└────────┘└────────┘└────────┘

Clients contact NameNode once for metadata, then read/write DataNodes directly.

The NameNode

Responsibilities:

  • Maintains the namespace tree (directories and files)
  • Stores metadata for every file (permissions, timestamps, block list)
  • Tracks block locations on DataNodes (in memory — not on disk)
  • Authorizes all filesystem operations

What NameNode does NOT do:

  • Store any actual file data
  • Directly participate in reads or writes (after the initial handshake)

Critical detail: All metadata lives in RAM. A cluster with billions of files needs a very large NameNode.

NameNode Memory

Each file requires ~150 bytes of NameNode RAM (name + block list + locations)

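Scenario                                  | Files         | NameNode RAM
1,000 large log files (1GB each)          | 1,000         | ~150 KB
1,000,000 compressed tweets (1KB each)    | 1,000,000     | ~150 MB
1,000,000,000 sensor readings (100B each) | 1,000,000,000 | ~150 GB

The 100 GB of sensor readings needs a million times more NameNode RAM than the 1 TB of log files: metadata cost follows file count, not data volume.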
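
Quick estimator (a sketch using the slide's ~150 bytes-per-file rule of thumb; real NameNode heap usage also depends on block counts and path lengths, which are ignored here):

```python
BYTES_PER_FILE = 150  # rule of thumb from the slide, not a measured value

def namenode_ram_bytes(num_files):
    """Approximate NameNode heap consumed by file metadata alone."""
    return num_files * BYTES_PER_FILE

def human(n):
    """Format a byte count with decimal units."""
    for unit in ("B", "KB", "MB", "GB", "TB"):
        if n < 1000:
            return f"{n:.0f} {unit}"
        n /= 1000
    return f"{n:.0f} PB"

for label, files in [("log files", 1_000),
                     ("tweets", 1_000_000),
                     ("sensor readings", 1_000_000_000)]:
    print(f"{label:16s} {files:>13,d} files -> {human(namenode_ram_bytes(files))}")
```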

The DataNode

Responsibilities:

  • Store actual data blocks on local disk
  • Send heartbeats to NameNode every 3 seconds (I'm alive!)
  • Send full block reports periodically: every 6 hours by default in Hadoop 2 and later, every hour in Hadoop 1 (here's what I have)
  • Serve read requests from clients directly
  • Participate in replication pipeline for writes

On failure: the NameNode marks a DataNode dead after about 10 minutes without heartbeats (10 min 30 s with default settings), then schedules re-replication of its blocks automatically, transparent to users.
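
Failure detection sketch: a toy monitor showing how "roughly ten minutes of silence" can be derived from the heartbeat interval. The timeout formula and the 5-minute recheck interval are stated here as assumptions about stock Hadoop defaults; the class and method names are hypothetical, not Hadoop APIs.

```python
HEARTBEAT_INTERVAL_S = 3    # from the slide
RECHECK_INTERVAL_S = 300    # assumed default recheck interval (5 minutes)

# Assumed formula: 2 x recheck interval + 10 x heartbeat interval = 630 s
DEAD_TIMEOUT_S = 2 * RECHECK_INTERVAL_S + 10 * HEARTBEAT_INTERVAL_S

class HeartbeatMonitor:
    """Hypothetical stand-in for the NameNode's liveness tracking."""

    def __init__(self, timeout_s=DEAD_TIMEOUT_S):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of last heartbeat (seconds)

    def heartbeat(self, node, now):
        self.last_seen[node] = now

    def dead_nodes(self, now):
        """Nodes silent for longer than the timeout."""
        return sorted(n for n, t in self.last_seen.items()
                      if now - t > self.timeout_s)

monitor = HeartbeatMonitor()
monitor.heartbeat("dn1", now=0)
monitor.heartbeat("dn2", now=0)
monitor.heartbeat("dn1", now=600)   # dn1 keeps reporting, dn2 goes silent
print(monitor.dead_nodes(now=640))  # dn2 has been silent for 640 s
```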

Block-Based Storage

HDFS stores files as fixed-size blocks distributed across DataNodes

Default block size: 128 MB

A 400MB file becomes:

Block 1: 128MB → DataNode 2, 5, 7
Block 2: 128MB → DataNode 1, 3, 8
Block 3: 128MB → DataNode 4, 6, 2
Block 4:  16MB → DataNode 3, 7, 1

Why 128MB? Not arbitrary — it's a careful trade-off.
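
Block splitting sketch: a small helper that reproduces the 400MB example above. The function name is hypothetical; only the last block may be smaller than the block size.

```python
MB = 1024**2

def split_into_blocks(file_size, block_size=128 * MB):
    """Return the list of block sizes a file of file_size bytes is split into."""
    full, rest = divmod(file_size, block_size)
    return [block_size] * full + ([rest] if rest else [])

blocks = split_into_blocks(400 * MB)
print([b // MB for b in blocks])  # sizes in MB
```

A partial final block occupies only as much disk as it holds, so the 16MB tail does not waste a full 128MB.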

Block Size

The trade-off between block size and system behavior:

Block Size | NameNode Metadata    | Parallelism | Seek Overhead
1MB        | ~1,000 blocks per GB | High        | High
64MB       | 16 blocks per GB     | Moderate    | Low
128MB      | 8 blocks per GB      | Good        | Very low
512MB      | 2 blocks per GB      | Low         | Negligible

The winner: 128MB keeps NameNode metadata small while preserving enough parallelism for large analytical jobs.

Rule of thumb: Each block can be processed by one mapper. Too few blocks = too little parallelism.

Replication

Default: 3 copies of every block on different DataNodes:

Block A → [DataNode 1, DataNode 4, DataNode 7]
Block B → [DataNode 2, DataNode 5, DataNode 8]
Block C → [DataNode 3, DataNode 6, DataNode 9]
  • Up to 2 of the 3 DataNodes holding a block can fail simultaneously without losing that block
  • Any DataNode that still holds a replica can serve read requests
  • Re-replication happens automatically when nodes fail

Storage cost: 3× raw storage, but commodity disks are cheap.
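
Cost check (a trivial sketch): raw disk needed for a given logical dataset, applied to the 900TB log example from earlier in the session. Function names are hypothetical.

```python
def raw_storage_tb(logical_tb, replication=3):
    """Raw disk capacity needed to hold logical_tb at the given replication factor."""
    return logical_tb * replication

def replica_losses_tolerated(replication=3):
    """How many replicas of one block can be lost before the block is gone."""
    return replication - 1

print(raw_storage_tb(900))          # TB of raw disk for 900 TB of logs
print(replica_losses_tolerated())   # simultaneous replica losses survived
```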

Rack Awareness

The problem: If all 3 replicas land in the same rack, a power failure kills all copies.

HDFS rack-aware placement strategy (default):

Replica 1: Writer's node (or random node in writer's rack)
Replica 2: Different rack from replica 1
Replica 3: Same rack as replica 2, different node

Why this works:

  • Survives individual node failures (replicas on different nodes)
  • Survives entire rack failure (replicas on 2+ racks)
  • Keeps replica 2 and 3 on same rack to reduce cross-rack write traffic
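
Placement sketch: a simplified version of the three-replica policy above. It assumes the writer runs on a DataNode, that there are at least two racks, and that the remote rack has at least two nodes. The topology and function name are hypothetical, and real HDFS also weighs load and free space.

```python
import random

def place_replicas(writer, rack_of, rng=random):
    """Pick 3 nodes: writer's node, a node on another rack, a second node on that rack."""
    first = writer
    remote = [n for n in rack_of if rack_of[n] != rack_of[first]]
    second = rng.choice(remote)
    same_remote_rack = [n for n in remote
                        if rack_of[n] == rack_of[second] and n != second]
    third = rng.choice(same_remote_rack)
    return [first, second, third]

# Hypothetical cluster: 8 DataNodes across 2 racks
rack_of = {f"dn{i}": ("rackA" if i <= 4 else "rackB") for i in range(1, 9)}

replicas = place_replicas("dn2", rack_of, rng=random.Random(0))
print(replicas, [rack_of[n] for n in replicas])
```

Only one copy crosses the rack boundary from the writer; the third copy stays inside the remote rack.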

Read Operation

Client reads /data/logs/access.log:

  1. Client → NameNode: "I want to read /data/logs/access.log"
  2. NameNode → Client: "Here are the block IDs and their DataNode locations"
  3. Client → DataNode (nearest with Block 1): "Send me Block 1"
  4. DataNode → Client: streams block data
  5. Client → DataNode (nearest with Block 2): "Send me Block 2"
  6. ... (repeat for all blocks, potentially from different DataNodes)

Key insight: NameNode is consulted once for metadata, then the client talks directly to DataNodes. NameNode never becomes a data bottleneck.
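
Read path sketch: an in-memory toy model of the steps above, with one metadata lookup followed by direct block fetches. The class and method names are hypothetical stand-ins, not the Hadoop client API, and "nearest" is simplified to "first replica in the list".

```python
class NameNode:
    """Holds metadata only: path -> ordered list of (block_id, replica holders)."""
    def __init__(self, block_map):
        self.block_map = block_map

    def get_block_locations(self, path):
        return self.block_map[path]

class DataNode:
    """Holds block contents only."""
    def __init__(self, blocks):
        self.blocks = blocks

    def read_block(self, block_id):
        return self.blocks[block_id]

def read_file(path, namenode, datanodes):
    log = ["metadata: NameNode lookup"]
    data = b""
    for block_id, holders in namenode.get_block_locations(path):
        node = holders[0]  # simplification: treat the first replica as nearest
        data += datanodes[node].read_block(block_id)
        log.append(f"data: {block_id} from {node}")
    return data, log

namenode = NameNode({"/data/logs/access.log": [("blk_1", ["dn1", "dn2"]),
                                                ("blk_2", ["dn2", "dn1"])]})
datanodes = {"dn1": DataNode({"blk_1": b"hello ", "blk_2": b"world"}),
             "dn2": DataNode({"blk_1": b"hello ", "blk_2": b"world"})}

data, log = read_file("/data/logs/access.log", namenode, datanodes)
print(data)
print(log)
```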

Write Operation

Client writes a new file:

  1. Client → NameNode: "I want to create /data/new_file.txt"
  2. NameNode → Client: "Write Block 1 to DN2; DN2 will forward to DN5; DN5 to DN8"
  3. Client → DN2: streams data (DN2 simultaneously forwards to DN5, DN5 to DN8)
  4. DN8 → DN5 → DN2 → Client: acknowledgments flow back through the pipeline
  5. Client → NameNode: "Block 1 written successfully"
  6. Repeat for each block

Pipeline replication: Data flows in one direction; acknowledgments flow back. Efficient use of network.
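
Pipeline sketch: a toy model of steps 3 and 4. Each node stores the block, forwards it downstream, and acknowledges only after everything downstream has acknowledged. Real HDFS streams small packets rather than whole blocks; the function name is hypothetical.

```python
def pipeline_write(block, pipeline, stores):
    """Write block through the pipeline; return the order in which acks arrive."""
    if not pipeline:
        return []
    head, *rest = pipeline
    stores[head].append(block)                   # store locally
    acks = pipeline_write(block, rest, stores)   # forward to the next node
    return acks + [head]                         # ack after downstream nodes

pipeline = ["DN2", "DN5", "DN8"]
stores = {node: [] for node in pipeline}

ack_order = pipeline_write(b"block-1", pipeline, stores)
print(ack_order)  # last node in the pipeline acknowledges first
```

The client sends each byte once, and every DataNode forwards it once, so no single link carries three copies.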

Trace HDFS Ops

Apply the architecture — trace a real read request

HDFS Read Trace

Pair up. Scenario: Client reads /data/logs/access.log

  • 512MB file, 128MB blocks, replication=3, 8 DataNodes across 2 racks

Tasks:

  1. How many blocks does this file have?
  2. Draw the Client ↔ NameNode ↔ DataNode message sequence
  3. Which messages carry metadata? Which carry data?
  4. Bonus: DataNode 3 fails mid-read — what happens?

One pair presents to the class.

Discussion Points

Expected answers:

  • Blocks: ⌈512 / 128⌉ = 4 blocks
  • Metadata flow: Client → NameNode (once), then client reads directly from DataNodes
  • Failure recovery: Client retries on a different DataNode holding a replica — transparent to the application

The big insight: NameNode handles metadata only — the metadata layer never becomes a data throughput bottleneck.
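
Failover sketch for the bonus question: the client walks each block's replica list and skips holders that are down. The block placement below is made up for illustration, and a failure mid-read is simplified to "DN3 is already unreachable".

```python
import math

def read_with_failover(block_locations, alive):
    """Return (block_id, node) pairs showing which replica served each block."""
    trace = []
    for block_id, holders in block_locations:
        for node in holders:
            if node in alive:
                trace.append((block_id, node))
                break
        else:
            raise IOError(f"all replicas of {block_id} are unavailable")
    return trace

num_blocks = math.ceil(512 / 128)

# Hypothetical placement of the 4 blocks, replication = 3
locations = [("blk_1", ["DN3", "DN5", "DN6"]),
             ("blk_2", ["DN1", "DN6", "DN7"]),
             ("blk_3", ["DN3", "DN7", "DN8"]),
             ("blk_4", ["DN2", "DN5", "DN8"])]
alive = {f"DN{i}" for i in range(1, 9)} - {"DN3"}

trace = read_with_failover(locations, alive)
print(num_blocks, trace)
```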

Data Locality

The principle that makes distributed computing practical

Move Computation

The core insight of data-local scheduling:

Traditional approach (move data):

  1. Data lives on storage servers
  2. Compute jobs run on compute servers
  3. Every job requires moving data over the network
  4. Network becomes the bottleneck

HDFS approach (move computation):

  1. Data stored on the same nodes that do computation
  2. Scheduler launches tasks on nodes that hold the data
  3. Data read from local disk (hundreds of MB/sec) instead of over a shared network link (~125 MB/sec at 1 Gbps)
  4. Network used only for results, not raw data

Locality Tiers

The scheduler tries to place tasks in this priority order:

  1. Node-local: task runs on the exact DataNode holding the block
    • Read from local disk: ~500 MB/sec
  2. Rack-local: task runs on a different node in the same rack
    • Read via rack switch: ~1 Gbps = ~125 MB/sec
  3. Off-rack: task runs on any available node
    • Read via core switch: ~1 Gbps shared among all rack traffic

Real impact: Node-local vs. rack-local is already a 4× throughput difference (500 vs. 125 MB/sec), and off-rack reads on a busy cluster are slower still because the core link is shared.
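
Tier classification sketch: given where a task would run and where a block's replicas live, decide which tier applies. The rack map and names are hypothetical; a real scheduler gets this from the cluster's topology configuration.

```python
def locality_tier(task_node, replica_nodes, rack_of):
    """Classify a task placement as node-local, rack-local, or off-rack."""
    if task_node in replica_nodes:
        return "node-local"
    replica_racks = {rack_of[n] for n in replica_nodes}
    if rack_of[task_node] in replica_racks:
        return "rack-local"
    return "off-rack"

# Hypothetical 3-rack cluster
rack_of = {"dn1": "rackA", "dn2": "rackA",
           "dn3": "rackB", "dn4": "rackB",
           "dn5": "rackC"}
replicas = ["dn1", "dn3", "dn4"]  # one block's replica holders

for node in ("dn1", "dn2", "dn5"):
    print(node, locality_tier(node, replicas, rack_of))
```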

MapReduce Locality

Why data locality matters for MapReduce:

  • Each mapper processes one HDFS block
  • YARN scheduler checks which nodes have that block
  • Launches the mapper task on a node-local DataNode when possible
  • If all local nodes are busy: waits briefly, then falls back to rack-local

The payoff: On a 1,000-node cluster analyzing 1PB of data, each node reads about 1TB:

  • Local reads at 500 MB/sec: ~33 minutes of I/O per node
  • Off-rack reads at 50 MB/sec: ~5.6 hours of I/O per node

Locality cuts the same scan by 10×: from most of a working day to about half an hour.

Session Preview

What's coming in Session 2

HDFS CLI Preview

Session 2 focus: Hands-on HDFS operations in the Docker cluster

# Navigation
hdfs dfs -ls /                          # list the HDFS root directory
hdfs dfs -mkdir -p /user/student/data   # create the directory and any missing parents

CLI Preview (2)

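# File operations
hdfs dfs -put local_file.txt /user/student/file.txt   # upload, stored in HDFS as file.txt
hdfs dfs -get /user/student/file.txt ./               # download into the current directory

# Inspection
hdfs dfs -cat /user/student/file.txt                  # print file contents
hdfs dfs -stat "%n %b %r" /user/student/file.txt      # name, size in bytes, replication factor

# Administration
hdfs dfs -setrep -w 2 /user/student/file.txt          # set replication to 2 and wait until it takes effect
hdfs fsck /user/student/ -files -blocks               # health check listing files and their blocks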

Session 1 Takeaways

What you should now understand:

  1. Why HDFS exists: Traditional file systems can't scale horizontally; HDFS was designed from scratch for distributed commodity hardware

  2. NameNode: Metadata only, in RAM — never touches data; single point of coordination (and historical SPOF)

  3. DataNodes: Store blocks, send heartbeats, serve clients directly after metadata lookup

Takeaways (2)

  4. Block size (128MB): Balances NameNode memory, parallelism, and seek overhead

  5. Replication (3×) with rack awareness: Tolerates node and rack failures automatically

  6. Data locality: Schedule computation where data lives — turns network bottleneck into disk throughput

What's Missing?

HDFS stores data reliably at scale — but it cannot answer a single question about that data on its own

HDFS Gaps

  • No computation model — HDFS is a file system, not a processor; knowing where your 10PB of log data lives tells you nothing about how to count, filter, or aggregate it
  • No query interface — there's no SQL, no API for "give me all rows where status=500"; data locality is valuable only when paired with a computation engine that exploits it
  • The small files problem — pipelines that stream many small files per second will saturate NameNode RAM even if total data volume is modest; you need an ingestion strategy
  • Write-once limitations — HDFS is optimized for immutable files; updating a single record requires rewriting the entire file, making it unsuitable for transactional workloads

What Comes Next

Gap                                     | Solution                                     | When
How to run computation where data lives | MapReduce: data-local parallel processing    | Week 3
SQL queries on HDFS data                | Hive: SQL interface over HDFS                | Week 7
Interactive, iterative processing       | Apache Spark: in-memory DAG on top of HDFS   | Week 5–7

HDFS is the right tool for distributed, fault-tolerant storage of large files — but the moment you want to compute anything, you need a processing engine that understands data locality.

Next Session

Session 2: HDFS Operations and Programming

  • HDFS CLI: navigate, upload, download, inspect, administer
  • Scavenger hunt: hands-on command practice in the cluster
  • Python hdfs library: programmatic access from Jupyter
  • PySpark + HDFS: reading/writing DataFrames
  • Homework walkthrough and work time

Before Session 2: Make sure Docker cluster is running — docker compose up -d then verify with hdfs dfs -ls /.

Speaker context: This session builds the architectural foundation students need before writing any MapReduce or Spark code. HDFS is not just a storage layer; it encodes fundamental distributed systems trade-offs (fault tolerance vs. consistency, throughput vs. latency, metadata cost vs. parallelism). Push students to reason about design choices rather than memorize commands. The key question to keep returning to: "Why did they make this decision instead of the obvious alternative?"

Speaker notes: The GFS paper is in your required readings. Section 2 (Design Overview) is the essential part. The authors explicitly say "we have re-examined traditional choices and explored radically different points in the design space." That's the intellectual move we want students to make throughout this course.

Speaker notes: The NameNode memory constraint is real. Each file/block consumes ~150 bytes of NameNode RAM. 1 billion files = 150GB RAM just for metadata. This is why "millions of small files" is an HDFS anti-pattern — it's a NameNode OOM problem, not a disk problem.

Speaker notes: Draw this on the board. "Imagine a 3-rack cluster. Replica 1 goes to Rack A, Node 5. Replica 2 goes to Rack B, Node 12. Replica 3 goes to Rack B, Node 18. Now if Rack A's switch dies, you still have 2 copies. If Rack B dies, you still have 1 copy. Only losing both racks simultaneously loses data — and that's very unlikely."

15 minutes

Speaker notes: Design question — Why not stream all blocks through the NameNode? Answer: NameNode would immediately become the bottleneck — all bandwidth through one machine.

Speaker notes: docker compose up -d && docker exec -it namenode hdfs dfs -ls /