Distributed File Systems and HDFS

Speaker context: This session builds the architectural foundation students need before writing any MapReduce or Spark code. HDFS is not just a storage layer; it encodes fundamental distributed systems trade-offs (fault tolerance vs. consistency, throughput vs. latency, metadata cost vs. parallelism). Push students to reason about design choices rather than memorize commands. The key question to keep returning to: "Why did they make this decision instead of the obvious alternative?"

Speaker notes: The GFS paper is in your required readings. Section 2 (Design Overview) is the essential part. The authors explicitly say "we have re-examined traditional choices and explored radically different points in the design space." That's the intellectual move we want students to make throughout this course.

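Speaker notes: The NameNode memory constraint is real. As a rule of thumb, each namespace object (file, directory, or block) consumes ~150 bytes of NameNode RAM, so 1 billion objects = 150GB of RAM just for metadata. A small file that fits in one block still costs two objects (the file plus its block), so the bill grows with the number of files, not the number of bytes stored. This is why "millions of small files" is an HDFS anti-pattern: it's a NameNode OOM problem, not a disk problem.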
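
Speaker notes: If students want to see the arithmetic, the sketch below estimates NameNode heap for the same 1 TiB of data stored as small files versus large files. It is a back-of-the-envelope model, not how the NameNode actually accounts for memory. Assumptions: the ~150-byte rule of thumb applies to each file object and each block object, directories are ignored, and the block size is 128 MiB. The function names are made up for this illustration.

```python
BYTES_PER_OBJECT = 150          # rule-of-thumb figure from the notes (approximate)
BLOCK_SIZE = 128 * 1024**2      # assumed block size: 128 MiB


def ceil_div(a, b):
    """Integer ceiling division, avoiding float rounding."""
    return -(-a // b)


def namespace_objects(total_bytes, file_size, block_size=BLOCK_SIZE):
    """Count file objects plus block objects needed to hold total_bytes.

    Simplification: every file has the same size; directories are ignored.
    """
    num_files = ceil_div(total_bytes, file_size)
    blocks_per_file = ceil_div(file_size, block_size)
    return num_files * (1 + blocks_per_file)


def namenode_heap_bytes(total_bytes, file_size):
    """Estimated NameNode heap (bytes) under the 150-bytes-per-object rule."""
    return namespace_objects(total_bytes, file_size) * BYTES_PER_OBJECT


TIB = 1024**4
small = namenode_heap_bytes(TIB, file_size=1024**2)   # 1 TiB as 1 MiB files
large = namenode_heap_bytes(TIB, file_size=1024**3)   # 1 TiB as 1 GiB files

print(f"1 MiB files: {small / 1024**2:.1f} MiB of heap")
print(f"1 GiB files: {large / 1024**2:.1f} MiB of heap")
print(f"ratio: {small / large:.0f}x")
```

Under these assumptions the small-file layout needs a few hundred times more metadata for exactly the same bytes on disk, which is the point to land with students.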

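Speaker notes: Draw this on the board. "Imagine a 3-rack cluster. Replica 1 goes to Rack A, Node 5. Replica 2 goes to Rack B, Node 12. Replica 3 goes to Rack B, Node 18. Now if Rack A's switch dies, you still have 2 copies. If Rack B dies, you still have 1 copy. Only losing both racks simultaneously loses data, and that's very unlikely." This mirrors the default HDFS placement policy for replication factor 3: one replica on the writer's rack, and two on different nodes of a single remote rack. Ask students why the third replica shares a rack with the second: it keeps cross-rack write traffic down to one transfer while still surviving any single rack failure.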
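
Speaker notes: The board example can also be run as a small simulation. This is a simplified sketch of rack-aware placement for replication factor 3, not the actual HDFS placement code. Assumptions: the writer runs on a DataNode, every rack has at least two nodes, and the remote rack and its nodes are picked uniformly at random. The topology, node names, and function names are hypothetical.

```python
import random


def place_replicas(topology, writer_rack, writer_node, rng):
    """Pick 3 replica locations as (rack, node) pairs.

    Simplified policy: replica 1 on the writer's node, replicas 2 and 3
    on two different nodes of one randomly chosen remote rack.
    """
    first = (writer_rack, writer_node)
    remote_rack = rng.choice([r for r in topology if r != writer_rack])
    second_node, third_node = rng.sample(topology[remote_rack], 2)
    return [first, (remote_rack, second_node), (remote_rack, third_node)]


def surviving_replicas(replicas, failed_rack):
    """Replicas still reachable after an entire rack goes offline."""
    return [r for r in replicas if r[0] != failed_rack]


# Hypothetical 3-rack cluster with 10 nodes per rack.
topology = {
    "A": [f"node{i}" for i in range(1, 11)],
    "B": [f"node{i}" for i in range(11, 21)],
    "C": [f"node{i}" for i in range(21, 31)],
}

rng = random.Random(0)  # fixed seed so the demo is repeatable
replicas = place_replicas(topology, "A", "node5", rng)
print("placement:", replicas)

for rack in topology:
    left = surviving_replicas(replicas, rack)
    print(f"rack {rack} fails -> {len(left)} replica(s) left")
```

Whatever the random choice, losing the writer's rack leaves two copies and losing the remote rack leaves one, which matches the board drawing.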

15 minutes

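Speaker notes: Design question: Why not stream all blocks through the NameNode? Answer: the NameNode would immediately become the bottleneck, because every byte read or written in the cluster would have to cross one machine's network link. Instead, the NameNode answers only metadata requests (which DataNodes hold which blocks), and clients transfer block data directly to and from the DataNodes.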
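
Speaker notes: A toy throughput model makes the bottleneck concrete. This is a deliberately crude sketch: it assumes every machine has the same network link speed, the network core is never the limit, and load spreads evenly across DataNodes. The function name and the cluster numbers are invented for illustration.

```python
def aggregate_read_gbps(num_clients, num_datanodes, nic_gbps, via_namenode):
    """Upper bound on total read throughput across all clients, in Gbps.

    via_namenode=True  : every byte is relayed through the single NameNode link.
    via_namenode=False : clients read directly from DataNodes.
    """
    client_demand = num_clients * nic_gbps
    if via_namenode:
        server_capacity = nic_gbps                  # one machine, one link
    else:
        server_capacity = num_datanodes * nic_gbps  # one link per DataNode
    return min(client_demand, server_capacity)


# Hypothetical cluster: 50 clients, 100 DataNodes, 10 Gbps links everywhere.
relayed = aggregate_read_gbps(50, 100, 10, via_namenode=True)
direct = aggregate_read_gbps(50, 100, 10, via_namenode=False)
print(f"through NameNode: {relayed} Gbps, direct to DataNodes: {direct} Gbps")
```

In the relayed design, adding DataNodes adds no throughput at all; in the direct design, capacity grows with the cluster. That scaling difference is the answer students should arrive at.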

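Speaker notes: Live demo (assumes the compose setup defines a container named namenode; give it a few seconds to start before listing): docker compose up -d && docker exec -it namenode hdfs dfs -ls /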