Roles & Setup

CS 6500 — Week 1, Session 2

CS 6500 — Big Data Analytics | Week 1

Driving Question

"A company hires ten people with 'Data Scientist' in their job title. Six months later, no model has shipped. Who is actually responsible for making data science work — and what does each person actually do all day?"

Two Answers

Understanding roles is the prerequisite for understanding architecture.

Part 1: The Data Team Landscape
What each role does, where they overlap, and why the Data Engineer role is the unsung foundation that makes everything else possible

Part 2: Your Environment
The Hadoop ecosystem as a coherent map — and getting every service running on your laptop so that next week you can write real code against real distributed infrastructure

Without a functioning environment, everything else in this course is theory. This session closes that gap.

Session 1 Recap

What we established:

  • Big data = problems that require rethinking your architecture, not just a size threshold
  • 5 Vs: Volume, Velocity, Variety, Veracity, Value
  • Data science lifecycle: Problem → Data → EDA → Model → Validate → Deploy (iterative)
  • Key roles: Data Engineer, Data Scientist, Analyst, ML Engineer

Today:

  • Deep dive on what data scientists/engineers actually do day-to-day
  • Map the Hadoop ecosystem and understand how the pieces fit together
  • Get your Docker cluster running — you leave with a working environment

Data Scientist Role

What you'll actually spend your time on

The DS Reality

The glamorous version (what gets the headlines):
Build machine learning models that predict the future, recommend products, and detect fraud.

The realistic version (what fills the calendar):

Activity                                  Estimated Time
Data collection, cleaning, validation     40–60%
Exploratory analysis and visualization    15–20%
Feature engineering                       10–15%
Model training and tuning                 5–10%
Communication and documentation           10–15%

Source: CrowdFlower/Figure Eight Data Scientist Report; Anaconda State of Data Science

The Skills Triangle

Every data scientist needs skills from three domains, and mastery of even two is rare:

[Figure: Skills Triangle diagram]

Solo domains fall short:

  • Math/Stats only: Can model, but can't implement or interpret business impact
  • CS only: Can build pipelines, but models are black boxes
  • Domain only: Understands the problem, can't build the solution

The intersections:

  • Math + CS = Machine Learning Engineer
  • Math + Domain = Research Scientist
  • CS + Domain = Data Engineer

The sweet spot: All three = Data Scientist

Roles Compared

Data Engineer

  • Builds data infrastructure (pipelines, warehouses, lakes)
  • Languages: SQL, Python, Scala, Java
  • Tools: Spark, Airflow, Kafka, HDFS, S3
  • Output: reliable, clean data available to others
  • This course is heavily DE-focused

Data Scientist

  • Extracts insight and builds predictive models
  • Languages: Python, R, SQL
  • Tools: scikit-learn, PyTorch, Spark MLlib
  • Output: models, experiments, recommendations

Data Analyst

  • Answers specific business questions
  • Languages: SQL primarily, Python/R for viz
  • Tools: Tableau, Power BI, dbt, Excel
  • Output: dashboards, reports, ad-hoc analyses

ML Engineer

  • Takes a data scientist's model to production
  • Languages: Python, Go, C++
  • Tools: Docker, Kubernetes, MLflow, TensorFlow Serving
  • Output: real-time inference APIs, monitoring systems

Data Ethics

Data scientists have real power — and real responsibility.

Common ethical pitfalls:

  • Algorithmic bias: Models trained on historical data perpetuate past discrimination (hiring, lending, criminal justice)
  • Privacy violations: Re-identification of "anonymized" data; unauthorized use of personal data
  • Misrepresentation: Cherry-picking metrics, misleading visualizations, overstating model confidence
  • Unintended consequences: A model optimized for engagement may maximize outrage

Graduate-level expectation: You will be asked to flag ethical issues before they become problems. This is part of your job.

Frameworks: GDPR (EU), CCPA (California), HIPAA (healthcare) — we'll touch on these in Weeks 13–14.

Big Data Ecosystem

A map of the tools we'll use — and how they fit together

Hadoop Ecosystem

[Figure: Hadoop ecosystem diagram]

What YARN Does

YARN (Yet Another Resource Negotiator) — the cluster operating system

  • ResourceManager: Tracks available CPU/RAM across all nodes; accepts job submissions
  • NodeManager: Per-node agent; reports resources; launches task containers
  • ApplicationMaster: Per-job agent; negotiates resources; monitors task progress

The key abstraction: YARN separates resource management from computation. MapReduce, Spark, and Flink all run as YARN applications on the same cluster, and Hive queries run on YARN through an execution engine such as MapReduce or Tez.

Web UI: http://localhost:8088 — see running jobs, resource usage, logs
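
To make the three YARN roles concrete, here is a toy Python model of the negotiation. It is a sketch for intuition only, not YARN's real API: the classes, the run_application helper, and the "most free memory wins" policy are all invented for illustration, and real YARN schedulers are considerably more involved.

```python
from dataclasses import dataclass, field


@dataclass
class NodeManager:
    """Per-node agent: knows how much memory is still free on its node."""
    name: str
    free_mb: int
    containers: list = field(default_factory=list)

    def launch(self, app_id: str, mb: int) -> None:
        # Reserve memory for one task container.
        self.free_mb -= mb
        self.containers.append((app_id, mb))


class ResourceManager:
    """Cluster-wide view: decides which node hosts each container."""

    def __init__(self, nodes):
        self.nodes = list(nodes)

    def allocate(self, app_id: str, mb: int):
        # Toy policy (an assumption): pick the node with the most free memory.
        candidates = [n for n in self.nodes if n.free_mb >= mb]
        if not candidates:
            return None  # request stays pending until resources free up
        best = max(candidates, key=lambda n: n.free_mb)
        best.launch(app_id, mb)
        return best.name


def run_application(rm: ResourceManager, app_id: str, tasks: int, mb: int):
    """Plays the ApplicationMaster: asks the RM for one container per task."""
    return [rm.allocate(app_id, mb) for _ in range(tasks)]


rm = ResourceManager([NodeManager("node1", 4096), NodeManager("node2", 3000)])
placements = run_application(rm, "spark-job-1", tasks=4, mb=1024)
print(placements)
```

The point of the sketch is the separation of concerns: the application only asks for resources, and the cluster-wide manager decides where the work lands.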

Processing Layers

Batch Processing

  • MapReduce — disk-based, reliable, high latency (Weeks 3–4)
  • Apache Spark — memory-optimized, up to 10–100× faster than MapReduce on workloads that fit in memory, the modern standard (Weeks 5–7)
  • Apache Hive — SQL interface over HDFS/Spark (Week 7)
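
The MapReduce model named above can be sketched in a few lines of plain Python. This is a single-process illustration of the map, shuffle, and reduce phases, not Hadoop's API, and the function names are invented here. On a real cluster each phase runs in parallel across nodes and intermediate results are written to disk, which is a large part of the latency.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit a (word, 1) pair for every word in one input line."""
    return [(word.lower(), 1) for word in line.split()]


def shuffle(pairs):
    """Shuffle: group all values by key, as the framework does between phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single result."""
    return key, sum(values)


lines = ["big data big cluster", "data moves to compute"]
mapped = [pair for line in lines for pair in map_phase(line)]
counts = dict(reduce_phase(k, v) for k, v in shuffle(mapped).items())
print(counts)
```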

Stream Processing

  • Apache Flink — stateful stream processing
  • Spark Streaming — micro-batch streaming
  • (covered briefly in Week 15)

Query Engines

  • Trino (formerly Presto) — federated SQL across HDFS, S3, databases
  • DuckDB — in-process analytical SQL, excellent for medium-scale (Week 12)
  • Apache Pig — dataflow scripting (Week 7)

Storage

  • HDFS — distributed file system (Week 2)
  • HBase — NoSQL on HDFS (mentioned in Week 11)
  • MongoDB, Cassandra — standalone NoSQL (Weeks 10–11)

The Docker Stack

Everything runs in Docker — no cloud account needed for Weeks 1–12.

┌──────────────────────────────────────────────────────────────┐
│  Hadoop:  namenode:9870 · datanode1/2/3 · historyserver:8188 │
│           resourcemanager:8088 · nodemanager:8042            │
│  Spark:   spark-master:8080 · spark-worker:8082              │
│  Dev:     jupyter:8888 (token: bigdata) · app-ui:4040        │
│  SQL:     hiveserver2:10002 · hbase-master:16010             │
│  NoSQL:   mongodb:27017 · cassandra:9042 · redis:6379        │
│  Stream:  kafka-broker:9092 · flink-jobmanager:8081          │
│  Tools:   zookeeper:2181 · pig · nifi:18080                  │
└──────────────────────────────────────────────────────────────┘

~20 containers total. Week 1 uses Hadoop + Spark + Jupyter. Others introduced as needed.
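
A quick way to see which Week 1 services are reachable is to try a TCP connection to each port from the diagram. The sketch below uses only Python's standard library. The WEEK1_PORTS dictionary and the helper names are illustrative, and it assumes these ports are published on localhost as shown above.

```python
import socket

# Host-side ports for the Week 1 services, copied from the diagram above.
WEEK1_PORTS = {
    "namenode": 9870,
    "resourcemanager": 8088,
    "spark-master": 8080,
    "jupyter": 8888,
}


def port_open(host: str, port: int, timeout: float = 1.0) -> bool:
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


def stack_status(ports=WEEK1_PORTS, host="localhost"):
    """Map each service name to whether its port is reachable."""
    return {name: port_open(host, port) for name, port in ports.items()}


for name, up in stack_status().items():
    print(f"{name:16s} {'up' if up else 'DOWN'}")
```

An open port only shows that something is listening; the browser checks in the workshop are still the real test that the right service answered.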

Cloud Platforms

In industry, each component of the Docker stack maps to a managed cloud service:

Docker Component    AWS Equivalent                      Purpose
HDFS                Amazon S3                           Distributed storage
YARN + Spark        Amazon EMR                          Managed cluster computing
Hive                AWS Glue / Amazon Athena            Serverless SQL on S3
Airflow             Amazon MWAA / AWS Step Functions    Workflow orchestration
Jupyter             Amazon SageMaker                    Managed notebook environment

Our philosophy: Master the fundamentals locally, then understand the managed versions in Week 13. Cloud services abstract the complexity we'll have already learned.

Docker Workshop

Let's get your cluster running

Setup Checklist

Verify these on your machine now:

# Check Docker is installed and running
docker --version        # Should show Docker 24.x or higher
docker compose version  # Should show Docker Compose 2.x

# Check available memory
# Mac: Apple menu → About This Mac → Memory
# Windows: Task Manager → Performance → Memory
# You need at least 8GB free for Docker

If Docker isn't installed: Download from docker.com/products/docker-desktop

If memory is tight:

  • Quit other applications
  • In Docker Desktop → Settings → Resources → set Memory to 8GB minimum
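
The version check above can also be scripted, which helps when collecting setup details for a troubleshooting email. This is a minimal sketch using the standard library. The function names are made up for illustration, and the parser assumes the usual "Docker version X.Y.Z, build ..." output shape.

```python
import re
import shutil
import subprocess


def parse_major(version_output: str):
    """Extract the major version from text like 'Docker version 24.0.7, build afdd53b'."""
    match = re.search(r"(\d+)\.\d+(?:\.\d+)?", version_output)
    return int(match.group(1)) if match else None


def docker_major():
    """Return Docker's major version, or None if docker is missing or not responding."""
    if shutil.which("docker") is None:
        return None
    try:
        result = subprocess.run(
            ["docker", "--version"], capture_output=True, text=True, timeout=10
        )
    except (OSError, subprocess.TimeoutExpired):
        return None
    return parse_major(result.stdout)


major = docker_major()
if major is None:
    print("Docker not found or not responding")
elif major < 24:
    print(f"Docker {major}.x found; the course expects 24.x or higher")
else:
    print(f"Docker {major}.x OK")
```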

Step 1: Get Repo

# Clone the course repository (link on Canvas)
git clone <course-repo-url>
cd cs6500-bigdata

# Verify the docker directory exists
ls docker/
# You should see: docker-compose.yml, Dockerfile, etc.

cd docker/

Alternatively: Download the ZIP from Canvas and extract it.

Step 2: Launch

# Start all services in detached mode (background)
docker compose up -d

# This downloads ~4GB of images on first run — be patient
# Subsequent starts take ~30 seconds

# Check all containers are running
docker compose ps

Expected output (abridged — ~20 containers total):

NAME                STATUS    PORTS
namenode            running   0.0.0.0:9870->9870/tcp
datanode1           running
datanode2           running
datanode3           running
resourcemanager     running   0.0.0.0:8088->8088/tcp
nodemanager         running   0.0.0.0:8042->8042/tcp
spark-master        running   0.0.0.0:8080->8080/tcp
spark-worker        running   0.0.0.0:8082->8081/tcp
jupyter             running   0.0.0.0:8888->8888/tcp
...
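
With ~20 containers, scanning the STATUS column by eye is error-prone. Docker Compose can also emit machine-readable output via docker compose ps --format json, and the sketch below filters that output for anything not running. The field names Name and State are an assumption based on recent Compose v2 releases, and the sample text is hypothetical, so check the JSON your own version prints.

```python
import json


def parse_compose_ps(raw: str):
    """Parse `docker compose ps --format json` output into a list of dicts.

    Handles both shapes seen across Compose v2 releases: a single JSON
    array, or one JSON object per line.
    """
    raw = raw.strip()
    if not raw:
        return []
    if raw.startswith("["):
        return json.loads(raw)
    return [json.loads(line) for line in raw.splitlines() if line.strip()]


def not_running(containers):
    """Return the names of containers whose State is not 'running'."""
    return [c.get("Name", "?") for c in containers if c.get("State") != "running"]


# Hypothetical sample output, trimmed to the two fields used here.
sample = (
    '{"Name": "namenode", "State": "running"}\n'
    '{"Name": "spark-worker", "State": "exited"}\n'
)
print(not_running(parse_compose_ps(sample)))
```

To use it against the live stack, capture the command's output (for example with subprocess.run) and pass the text to parse_compose_ps.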

Step 3: Verify

Open these in your browser — all should load:

Service                 URL                      What You Should See
HDFS NameNode UI        http://localhost:9870    Cluster summary, 3 DataNodes
YARN ResourceManager    http://localhost:8088    Cluster metrics, no running apps
Spark Master UI         http://localhost:8080    1 worker, resources available
Jupyter Notebook        http://localhost:8888    Notebook browser (token: bigdata)

HDFS quick test:

docker exec -it namenode bash
hdfs dfs -ls /
# Should show: /tmp, /user, /data directories
exit
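
The same listing is available over HTTP through WebHDFS, the REST interface served on the NameNode's web port. The sketch below builds the LISTSTATUS URL and extracts entry names from the response. It assumes WebHDFS is enabled in this cluster (it is by default in Hadoop), the helper names are illustrative, and the sample payload is hypothetical and trimmed.

```python
import json
from urllib.request import urlopen


def list_status_url(path: str, host: str = "localhost", port: int = 9870) -> str:
    """Build the WebHDFS URL that lists an HDFS directory."""
    return f"http://{host}:{port}/webhdfs/v1{path}?op=LISTSTATUS"


def entry_names(payload: dict):
    """Pull (name, type) pairs out of a LISTSTATUS response."""
    statuses = payload["FileStatuses"]["FileStatus"]
    return [(s["pathSuffix"], s["type"]) for s in statuses]


def list_hdfs(path: str = "/"):
    """Call the NameNode over HTTP; requires the cluster to be running."""
    with urlopen(list_status_url(path), timeout=5) as resp:
        return entry_names(json.load(resp))


# Hypothetical response for "/", trimmed to the fields used above.
sample = {"FileStatuses": {"FileStatus": [
    {"pathSuffix": "tmp", "type": "DIRECTORY"},
    {"pathSuffix": "user", "type": "DIRECTORY"},
]}}
print(list_status_url("/"))
print(entry_names(sample))
```

Once the containers are up, calling list_hdfs("/") from the host should return the same directories as hdfs dfs -ls /.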

Step 4: HDFS CLI

# Shell into the NameNode container
docker exec -it namenode bash

# Create a test file locally and upload it
echo "Hello Big Data! CS 6500 Spring 2026" > /tmp/hello.txt
hdfs dfs -mkdir -p /user/student
hdfs dfs -put /tmp/hello.txt /user/student/

# Verify
hdfs dfs -ls /user/student/
hdfs dfs -cat /user/student/hello.txt

exit

Expected output:

Found 1 items
-rw-r--r--   3 root supergroup  36 ... /user/student/hello.txt
Hello Big Data! CS 6500 Spring 2026
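
Each column in the hdfs dfs -ls line carries meaning: permissions, replication factor, owner, group, size in bytes, modification date and time, and path. The sketch below splits one such line into named fields. The LsEntry type and the sample line (including its file name, size, and timestamp) are made up for illustration.

```python
from typing import NamedTuple


class LsEntry(NamedTuple):
    permissions: str
    replication: str  # "-" for directories
    owner: str
    group: str
    size_bytes: int
    modified: str
    path: str


def parse_ls_line(line: str) -> LsEntry:
    """Split one `hdfs dfs -ls` line into its eight whitespace-separated fields."""
    perms, repl, owner, group, size, date, time, path = line.split(None, 7)
    return LsEntry(perms, repl, owner, group, int(size), f"{date} {time}", path)


# Hypothetical listing line; the file, size, and timestamp are invented.
line = "-rw-r--r--   3 root supergroup  1024 2026-01-15 10:00 /user/student/sample.csv"
entry = parse_ls_line(line)
print(entry.replication, entry.size_bytes, entry.path)
```

The replication value of 3 is the column to notice: it says HDFS keeps three copies of the file's data, one per DataNode in this stack.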

Step 5: PySpark

Open http://localhost:8888 → New Notebook → Python 3

from pyspark.sql import SparkSession

spark = SparkSession.builder \
    .appName("Week1Test") \
    .getOrCreate()

# Create a simple DataFrame
data = [("Alice", 34), ("Bob", 45), ("Cathy", 29)]
df = spark.createDataFrame(data, ["Name", "Age"])
df.show()

# Read from HDFS
text_df = spark.read.text("hdfs://namenode:9000/user/student/hello.txt")
text_df.show()

spark.stop()
print("PySpark test: PASSED")
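
The HDFS path in the read above packs four pieces of information into one URI. The sketch below pulls them apart with Python's standard urllib.parse. The comments assume the usual Docker Compose setup, in which the notebook container reaches the NameNode by its service name rather than by localhost.

```python
from urllib.parse import urlparse

uri = "hdfs://namenode:9000/user/student/hello.txt"
parts = urlparse(uri)

print(parts.scheme)    # file system type
print(parts.hostname)  # NameNode host, resolved on Docker's internal network
print(parts.port)      # NameNode RPC port, not the 9870 web UI port
print(parts.path)      # path inside HDFS
```

If the read fails with a connection error, the host and port in this URI are the first things to compare against the cluster's configuration.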

Troubleshooting

Container won't start:

# Check for port conflicts (macOS/Linux)
lsof -i :9870
lsof -i :8088
# Windows: netstat -ano | findstr :9870
# Kill conflicting processes or change ports

# Restart fresh
docker compose down
docker compose up -d

NameNode in safe mode:

docker exec namenode \
  hdfs dfsadmin -safemode leave

Not enough memory:

# Docker Desktop → Settings → Resources
# Increase memory to 8GB minimum
# Increase swap to 4GB

Apple Silicon (M1/M2 and later) Mac issues:

# In docker-compose.yml, add to each service:
platform: linux/amd64

Jupyter won't connect to Spark:

# Check spark master is running
docker compose ps spark-master
docker compose restart spark-master

Env Verification

Completing this week's deliverable

What to Submit

Take a screenshot of each item below and submit them together as a single PDF on Canvas.

Required screenshots:

  1. docker compose ps output showing containers running
  2. HDFS NameNode UI (http://localhost:9870) showing 3 DataNodes
  3. YARN ResourceManager UI (http://localhost:8088) cluster overview
  4. Spark Master UI (http://localhost:8080) showing 1 worker
  5. Jupyter Notebook running the PySpark test with df.show() output visible

Optional (for troubleshooting documentation):

  • Output of docker compose logs namenode, if you had any issues

Submission: Upload PDF to Canvas → Week 1 → Environment Verification
Due: Sunday 11:59 PM (ungraded — but required to confirm your setup works)

The HDFS UI

http://localhost:9870 — walk through together:

  • Overview tab: Cluster health, configured capacity, used capacity, replication factor
  • Datanodes tab: List of 3 connected DataNodes with heartbeat status
  • Browse the filesystem: Navigate directories, view file metadata
  • Logs: Access NameNode logs for debugging

Live demo: Browse to /user/student/hello.txt and inspect its block locations.
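
The block and replication numbers in the UI follow from simple arithmetic. The sketch below assumes Hadoop's default 128 MB block size and the replication factor of 3 seen in this cluster's listings. The hdfs_footprint function is illustrative, and the real values come from the cluster's configuration.

```python
import math

BLOCK_SIZE = 128 * 1024 * 1024  # Hadoop's default dfs.blocksize (128 MB), assumed here
REPLICATION = 3                 # matches the 3 shown in hdfs dfs -ls output


def hdfs_footprint(file_bytes: int, block_size: int = BLOCK_SIZE,
                   replication: int = REPLICATION):
    """Return (blocks, replicas, raw_bytes) for one file."""
    blocks = math.ceil(file_bytes / block_size)
    replicas = blocks * replication
    raw_bytes = file_bytes * replication  # a partial last block is not padded
    return blocks, replicas, raw_bytes


print(hdfs_footprint(1024))                # a tiny file
print(hdfs_footprint(1024 * 1024 * 1024))  # a 1 GB file
```

A tiny file such as hello.txt therefore occupies a single block with three replicas, while a 1 GB file is split into eight blocks with 24 replicas spread across the DataNodes.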

The YARN UI

http://localhost:8088 — walk through together:

  • Cluster → About: Cluster metrics (memory, cores, nodes)
  • Applications → Running: Active Spark/MapReduce jobs
  • Applications → Finished: Historical jobs with status and logs
  • Scheduler: Resource allocation policy

When you run a PySpark job next week, you'll see it appear here in real time. The application log links are invaluable for debugging job failures.
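
The information in the Applications view is also exposed by the ResourceManager's REST API on the same port. The sketch below summarizes running applications from that JSON. The helper names are illustrative and the sample payload is hypothetical and trimmed. An idle cluster typically returns an empty "apps" value, which the parser treats as no applications.

```python
import json
from urllib.request import urlopen

APPS_URL = "http://localhost:8088/ws/v1/cluster/apps?states=RUNNING"


def summarize_apps(payload: dict):
    """Return (id, name, state) for each app; an idle cluster returns no app list."""
    apps = (payload.get("apps") or {}).get("app", [])
    return [(a["id"], a["name"], a["state"]) for a in apps]


def running_apps(url: str = APPS_URL):
    """Query the ResourceManager; requires the cluster to be running."""
    with urlopen(url, timeout=5) as resp:
        return summarize_apps(json.load(resp))


# Hypothetical response, trimmed to the fields used above.
sample = {"apps": {"app": [
    {"id": "application_0000000000000_0001", "name": "Week1Test", "state": "RUNNING"},
]}}
print(summarize_apps(sample))
print(summarize_apps({"apps": None}))
```

Calling running_apps() while a job is active gives a scriptable alternative to refreshing the browser tab.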

Week 1 Deliverables

Due this week:

Deliverable                                   Due                Points
Environment verification (PDF screenshots)    Sunday 11:59 PM    Ungraded
Team preference survey (link on Canvas)       Sunday 11:59 PM    Ungraded

No graded assignment this week — but failing to get your environment working will hurt you starting Week 2.

If you can't get Docker running by Friday: Email me immediately with your error output. Do not wait until Sunday.

Next Week Preview

Week 2: Distributed File Systems and HDFS

Session 1 — HDFS Architecture:

  • Why traditional file systems fail at scale
  • NameNode, DataNodes, block storage, replication
  • Rack awareness and fault tolerance
  • Read/write operation walkthroughs

Session 2 — HDFS Operations:

  • Full CLI walkthrough (hdfs dfs -ls, -put, -get, -stat, -setrep)
  • Scavenger hunt activity
  • Python hdfs library and PySpark HDFS integration

Before Next Session

  • Read: Google File System paper (Sections 1–3) — linked on Canvas
  • Ensure your Docker stack starts reliably (docker compose up -d)
  • Explore the HDFS UI and browse the default directory structure

Week 1 Takeaways

Big Data:

  • 5 Vs framework — but remember it's about architectural problems, not just size
  • Different tools for different problems: batch vs. stream, SQL vs. NoSQL, structured vs. unstructured

Roles:

  • Data Engineer builds the infrastructure that makes Data Science possible
  • This course gives you DE skills with DS context

The Ecosystem:

  • YARN orchestrates resources; HDFS stores data; Spark/MapReduce processes it
  • All of this runs locally in Docker for the first 12 weeks

Your setup:

  • You should have a running cluster, a passing PySpark test, and 4 browser tabs bookmarked

What's Missing?

You have a running cluster, but files on HDFS are just bytes until you understand how the storage system actually works.

Cluster Gaps

  • No storage architecture understanding — you can docker compose up, but you don't yet know why NameNode and DataNode are separate containers, or what happens when one dies
  • No file system semantics — uploading a file to HDFS today is a black box; you don't yet know how it's split, replicated, or located across nodes
  • No data locality intuition — the cluster is running, but you don't yet know why the location of your data on the cluster determines how fast your jobs run
  • No first job — you have infrastructure but no computation; a cluster with nothing to process is an expensive heating system

What Comes Next

Gap                                       Solution                                                       When
How distributed storage actually works    HDFS architecture: NameNode, DataNodes, blocks, replication    Week 2, Session 1
CLI and programmatic HDFS access          hdfs dfs commands + Python hdfs library                        Week 2, Session 2
Your first parallel computation           MapReduce: process HDFS files across the cluster               Week 3

The Docker cluster is your lab bench — it's ready; now you need to learn what to do with it.

Speaker context: This session has two very distinct halves. The first 30 minutes are conceptual — roles, the ecosystem, and framing — but keep it moving. Students are eager to get to the hands-on Docker setup. The second half is a live lab workshop; expect 20% of students to hit snags. Have the troubleshooting slide ready to share. Aim for "everyone has running containers" as the success criterion — that's the deliverable.

Speaker notes (Hadoop Ecosystem slide): Walk through the image layer by layer. Bottom: storage (HDFS, HBase). Middle: resource management (YARN). Top: processing and query (MapReduce, Spark, Hive, Pig). Right side: ingestion (Kafka, Sqoop, Flume). This course covers the shaded areas.