Activity	Estimated Time
Data collection, cleaning, validation	40–60%
Exploratory analysis and visualization	15–20%
Feature engineering	10–15%
Model training and tuning	5–10%
Communication and documentation	10–15%

Docker Component	AWS Equivalent	Purpose
HDFS	Amazon S3	Distributed data storage
YARN + Spark	Amazon EMR	Managed cluster computing
Hive	AWS Glue / Amazon Athena	SQL queries over stored data
Airflow	Amazon MWAA / AWS Step Functions	Workflow orchestration
Jupyter	Amazon SageMaker	Managed notebook environment

Service	URL	What You Should See
HDFS NameNode UI	`http://localhost:9870`	Cluster summary, 3 DataNodes
YARN ResourceManager	`http://localhost:8088`	Cluster metrics, no running apps
Spark Master UI	`http://localhost:8080`	1 worker, resources available
Jupyter Notebook	`http://localhost:8888`	Notebook browser (token: `bigdata`)
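
The table above can also be checked from a script before you take screenshots. The following is a minimal sketch, assuming the stack is running on your own machine with the default ports listed in the table; the names `SERVICES`, `http_status`, and `check_services` are illustrative and not part of the course repository. It only tells you that something answers on each URL, so you still need to open each page and confirm the contents.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

# URLs copied from the service table; adjust them if you remapped ports.
SERVICES = {
    "HDFS NameNode UI": "http://localhost:9870",
    "YARN ResourceManager": "http://localhost:8088",
    "Spark Master UI": "http://localhost:8080",
    "Jupyter Notebook": "http://localhost:8888",
}


def http_status(url, timeout=3.0):
    """Return the HTTP status code for url, or None if nothing answers."""
    try:
        with urlopen(url, timeout=timeout) as response:
            return response.status
    except HTTPError as err:
        # The server answered, but with an error status (4xx or 5xx).
        return err.code
    except (URLError, OSError):
        # Connection refused, timeout, DNS failure, and similar.
        return None


def check_services(services=SERVICES, probe=http_status):
    """Map each service name to True if its URL answered without a server error."""
    results = {}
    for name, url in services.items():
        status = probe(url)
        results[name] = status is not None and status < 500
    return results

# Example use (requires the stack to be running):
#   for name, ok in check_services().items():
#       print("OK  " if ok else "DOWN", name)
```

The `probe` parameter exists so the check can be exercised without a running cluster, for example by passing a stub that returns fixed status codes.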

Troubleshooting

Container won't start:

# Check for port conflicts
lsof -i :9870
lsof -i :8088
# Kill conflicting processes or change ports

# Restart fresh
docker compose down
docker compose up -d
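
If `lsof` is not available (for example on Windows), the same port check can be approximated in Python. This is a sketch under the assumption that the stack publishes its web UIs on the ports from the service table; `port_in_use`, `busy_ports`, and `STACK_PORTS` are hypothetical helper names. Run it after `docker compose down`: any port still reported busy is held by some other process.

```python
import socket

# Web UI ports from the service table in this section (assumed defaults).
STACK_PORTS = (9870, 8088, 8080, 8888)


def port_in_use(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def busy_ports(ports=STACK_PORTS):
    """Return the subset of ports that already have a listener."""
    return [port for port in ports if port_in_use(port)]

# Example use:
#   print(busy_ports())
```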

NameNode in safe mode:

docker exec namenode \
  hdfs dfsadmin -safemode leave
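
Before forcing the NameNode out of safe mode, it can help to confirm that it is actually in safe mode. The sketch below wraps `hdfs dfsadmin -safemode get` and assumes the container is named `namenode`, as in the command above, and that the command prints a line of the form "Safe mode is ON" or "Safe mode is OFF", as stock Hadoop does. The function names are illustrative.

```python
import subprocess


def parse_safemode(output):
    """Return True if the dfsadmin output reports that safe mode is ON."""
    return "safe mode is on" in output.lower()


def namenode_in_safemode(container="namenode"):
    """Ask the NameNode container for its current safe mode state."""
    result = subprocess.run(
        ["docker", "exec", container, "hdfs", "dfsadmin", "-safemode", "get"],
        capture_output=True,
        text=True,
        check=True,  # raise if the container or command is unavailable
    )
    return parse_safemode(result.stdout)

# Example use (requires the stack to be running):
#   if namenode_in_safemode():
#       print("NameNode is in safe mode")
```

On a fresh cluster the NameNode usually leaves safe mode by itself once the DataNodes have reported in, so waiting a minute and checking again is a reasonable first step.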

Not enough memory:

# Docker Desktop → Settings → Resources
# Set memory to at least 8 GB
# Set swap to 4 GB
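
To see how much memory Docker actually has, you can read the total from `docker info`. This is a rough sketch: the 8 GB figure comes from the note above, while the half-gigabyte slack is an assumption added here because Docker Desktop typically reports slightly less than the configured amount. The helper names are hypothetical.

```python
import subprocess

REQUIRED_GIB = 8   # minimum from the troubleshooting note
SLACK_GIB = 0.5    # assumed allowance for VM overhead


def bytes_to_gib(num_bytes):
    """Convert a byte count to gibibytes."""
    return num_bytes / 1024**3


def has_enough_memory(mem_total_bytes, required_gib=REQUIRED_GIB, slack_gib=SLACK_GIB):
    """Return True if the reported memory is within slack_gib of the requirement."""
    return bytes_to_gib(mem_total_bytes) >= required_gib - slack_gib


def docker_memory_bytes():
    """Read the total memory available to Docker, in bytes."""
    result = subprocess.run(
        ["docker", "info", "--format", "{{.MemTotal}}"],
        capture_output=True,
        text=True,
        check=True,
    )
    return int(result.stdout.strip())

# Example use (requires Docker to be running):
#   print(has_enough_memory(docker_memory_bytes()))
```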

Apple Silicon (M1/M2) Mac issues:

# In docker-compose.yml, add to each service:
platform: linux/amd64

Jupyter won't connect to Spark:

# Check that the Spark master container is running
docker compose ps spark-master
# Restart it if it is stopped or unhealthy
docker compose restart spark-master
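
After a restart, you can confirm that the master is alive and has a registered worker without opening the browser. The sketch below assumes the standalone Spark master serves a JSON view of its state at `/json/` on the web UI port (8080 here) and that the response includes `status`, `aliveworkers`, `cores`, and `coresused` fields; treat both the URL and the field names as assumptions to verify against your Spark version. The function names are illustrative.

```python
import json
from urllib.request import urlopen


def summarize_master(state):
    """Reduce the master's JSON state to the fields worth checking."""
    return {
        "status": state.get("status"),
        "alive_workers": state.get("aliveworkers", 0),
        "free_cores": state.get("cores", 0) - state.get("coresused", 0),
    }


def master_ready(state, min_workers=1):
    """Return True if the master is ALIVE with at least min_workers workers."""
    summary = summarize_master(state)
    return summary["status"] == "ALIVE" and summary["alive_workers"] >= min_workers


def fetch_master_state(url="http://localhost:8080/json/", timeout=3.0):
    """Download and decode the master's JSON state (assumed endpoint)."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)

# Example use (requires the stack to be running):
#   print(master_ready(fetch_master_state()))
```

If the master reports ALIVE with one worker and the notebook still cannot connect, the problem is more likely on the Jupyter side than in the Spark containers.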

Deliverable	Due	Points
Environment verification (PDF screenshots)	Sunday 11:59 PM	Ungraded
Team preference survey (link on Canvas)	Sunday 11:59 PM	Ungraded

Gap	Solution	When
How distributed storage actually works	HDFS architecture — NameNode, DataNodes, blocks, replication	Week 2, Session 1
CLI and programmatic HDFS access	`hdfs dfs` commands + Python `hdfs` library	Week 2, Session 2
Your first parallel computation	MapReduce — process HDFS files across the cluster	Week 3

Roles & Setup

CS 6500 — Week 1, Session 2

Driving Question

Two Answers

Session 1 Recap

Data Scientist Role

The DS Reality

The Skills Triangle

Roles Compared

Data Ethics

Big Data Ecosystem

Hadoop Ecosystem

What YARN Does

Processing Layers

The Docker Stack

Cloud Platforms

Docker Workshop

Setup Checklist

Step 1: Get Repo

Step 2: Launch

Step 3: Verify

Step 4: HDFS CLI

Step 5: PySpark

Troubleshooting

Env Verification

What to Submit

The HDFS UI

The YARN UI

Week 1 Deliverables

Next Week Preview

Before Next Session

Week 1 Takeaways

What's Missing?

Cluster Gaps

What Comes Next