Apache Hive — SQL on Hadoop

CS 6500 — Week 7, Session 1

Week 6 Recap + Assignment 2

Quick recap: Spark DataFrames & SQL

  • DataFrames = distributed tables with schema
  • Spark SQL: spark.sql("SELECT ...") on registered views
  • Query optimizer (Catalyst) handles physical planning

The Data Warehousing Problem

Why can't we just use a relational database?

                  Traditional RDBMS     Hadoop
Scale             GBs–TBs               TBs–PBs
Cost              Expensive hardware    Commodity clusters
Users             DBAs / developers     Data analysts (SQL!)
Access pattern    OLTP (row-level)      OLAP (batch scans)

The gap: Analysts know SQL. HDFS is a filesystem. Who bridges them?

What Is Apache Hive?

"SQL on Hadoop" — brings the data warehouse to HDFS

  • Translates HiveQL (SQL dialect) → MapReduce / Tez / Spark jobs
  • Created at Facebook (2007) to let analysts query petabytes without writing Java
  • Now a core Apache project

HiveQL is mostly standard SQL:

SELECT category, SUM(price * quantity) AS revenue
FROM transactions
GROUP BY category
ORDER BY revenue DESC;

This runs as a distributed job across the cluster.

Hive vs. Traditional RDBMS

Feature           RDBMS              Hive
Query language    SQL                HiveQL (SQL-like)
Data size         GBs–TBs            TBs–PBs
Latency           Milliseconds       Seconds–minutes
Updates           Row-level ACID     Append-mostly
Use case          OLTP               OLAP / analytics
Storage           Local disk         HDFS

Rule of thumb: If it fits in Postgres, use Postgres. If it's terabytes of append-only logs, use Hive.

Hive vs. Spark SQL

                  Hive                      Spark SQL
Best for          Batch ETL, warehousing    Interactive analysis, iteration
Latency           Higher (MapReduce/Tez)    Lower (in-memory)
Metastore         Manages it                Reads it
ACID              Yes (v3.0+)               Limited
ML integration    No                        MLlib, pandas UDFs

Key insight: They share the same metastore. Spark SQL reads "Hive tables" constantly in production.

Hive Architecture

Client (Beeline / JDBC / Thrift)
         │
         ▼
    ┌─────────────────────────────┐
    │          Driver             │
    │ Parser → Planner → Optimizer│
    └──────────┬──────────────────┘
               │
    ┌──────────▼──────────┐
    │    Metastore        │  ← Schema, partitions, locations
    │  (MySQL / Postgres) │
    └─────────────────────┘
               │
    ┌──────────▼──────────┐
    │ Execution Engine    │  MapReduce | Tez | Spark
    └──────────┬──────────┘
               │
              HDFS

The Hive Metastore — What It Stores

A relational database (MySQL/Postgres) that knows everything about your tables

Category      What's stored
Schema        Table names, column names, data types
Partitions    Partition keys + their HDFS directories
Location      HDFS path for each table's data
SerDe         Serializer/Deserializer class (how rows are parsed)
Statistics    Row counts, column histograms (used by the optimizer)

The metastore is a service, not a file — it runs independently and every query engine connects to it over Thrift.
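
Because the metastore is just a relational database, you can inspect it with plain SQL. A sketch against the standard metastore schema (run on the backing MySQL/Postgres instance, not through Hive — table and column names below are from the stock metastore schema; Postgres installations quote these identifiers):

```sql
-- List every Hive table the metastore knows about, with its type.
-- TBL_TYPE distinguishes MANAGED_TABLE from EXTERNAL_TABLE.
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;
```

In practice you rarely query these tables directly — `SHOW TABLES` and `DESCRIBE FORMATTED` surface the same information through Hive — but seeing them demystifies what "the metastore" actually is.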

The Metastore — Historical Significance

Before Hive (2007): Schema lived in engineers' heads, wiki pages, and spreadsheets.

What Hive invented: A shared, queryable catalog for distributed data — decoupling what data means from where it lives.

The Hive Metastore Protocol became an industry standard:

  • Apache Spark reads it natively via enableHiveSupport()
  • Presto and Trino connect to it
  • Apache Impala uses it
  • AWS Glue Data Catalog is API-compatible with it

Every modern data platform still speaks "Hive Metastore" — even when Hive itself is not running. This is Hive's most enduring contribution.

How a Hive Query Executes

  1. Parse — HiveQL → Abstract Syntax Tree
  2. Semantic check — Validate columns/types against metastore
  3. Logical plan — Optimize: predicate pushdown, column pruning
  4. Physical plan — Generate MapReduce / Tez / Spark DAG
  5. Execute — YARN submits jobs to cluster
  6. Return — Results back to client

The metastore is consulted at step 2 — every query needs it to know what columns exist and where the data lives.
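
You can watch steps 3–4 happen without executing anything: `EXPLAIN` prints the plan Hive generates. A quick sketch using the demo table:

```sql
-- Show the logical/physical plan without running the job.
EXPLAIN
SELECT category, SUM(price * quantity) AS revenue
FROM transactions
GROUP BY category;

-- EXPLAIN EXTENDED adds metastore detail, including HDFS paths.
```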

Managed vs. External Tables

Managed (internal) table

  • Hive owns the data lifecycle
  • DROP TABLE deletes both metadata and HDFS data
  • Good for: derived/intermediate tables you own end-to-end

External table

  • Hive manages only metadata
  • DROP TABLE keeps HDFS data intact
  • Good for: shared source data used by multiple tools

Rule of thumb: Source data → external. Derived output → managed.
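
The difference shows up at `DROP TABLE` time. A minimal sketch (table names are illustrative):

```sql
-- Managed: dropping removes the metadata AND the files under the
-- table's warehouse directory.
CREATE TABLE staging_counts (category STRING, n BIGINT);
DROP TABLE staging_counts;   -- HDFS data is deleted too

-- External: dropping removes only the metastore entry.
CREATE EXTERNAL TABLE shared_logs (line STRING)
LOCATION '/data/logs/';
DROP TABLE shared_logs;      -- files under /data/logs/ remain untouched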

Schema on Read vs. Schema on Write

Schema on Write (traditional RDBMS)

  • Schema enforced at INSERT time — bad data is rejected immediately
  • Data is always consistent once stored
  • Cost: ETL required before data is queryable

Schema on Read (Hive / HDFS)

  • Schema applied at query time — the file is just bytes until you query it
  • Bad data doesn't fail at load; it returns NULLs or errors at query time
  • Cost: validation happens repeatedly, per-query

Schema on Read — In Practice

-- This CSV already exists on HDFS.
-- Hive maps a schema onto it at query time — no data moved, no data validated.
CREATE EXTERNAL TABLE raw_events (
    user_id    INT,
    event_type STRING,
    ts         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/events/';

-- Schema is applied on the first SELECT — not at table creation.
SELECT * FROM raw_events LIMIT 5;

If a row has the wrong number of fields or a non-numeric user_id, the affected fields come back as NULL — not an error.
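
This means data-quality checks become queries you run yourself. A sketch against the raw_events table above:

```sql
-- Count rows where user_id failed to parse as INT.
-- Caveat: this also counts rows where the field was genuinely empty,
-- so it is an upper bound on parse failures, not an exact count.
SELECT COUNT(*) AS bad_rows
FROM raw_events
WHERE user_id IS NULL;
```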

CSVs Directly from HDFS — Trade-offs

Pros

  • Zero ETL: point an external table at existing files and query immediately
  • Schema flexibility: update the Hive table definition without touching data
  • Universal: any tool can produce/consume CSVs — no proprietary dependencies

Cons

  • Silent type failures: bad data parses to NULL, not an error — hard to debug
  • Parsing overhead: CSV parsing is CPU-intensive and repeated on every query
  • No predicate pushdown: Hive must read every row before filtering (no skipping)
  • No compression index: reading 1 matching row still reads the entire file
  • Delimiter collisions: commas inside string values break parsing silently

Production recommendation: Use CSVs for raw ingestion. Convert to ORC or Parquet immediately after — queries can be 5–20× faster and storage shrinks 3–5×.
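
The conversion is a one-liner with CREATE TABLE AS SELECT. A sketch using the demo table (the new table name is our choice):

```sql
-- One-time conversion: materialize an ORC copy of the CSV-backed table.
CREATE TABLE transactions_orc
STORED AS ORC
AS SELECT * FROM transactions;

-- Point all subsequent analytics queries at the ORC table.
SELECT COUNT(*) FROM transactions_orc;
```

ORC stores column statistics and indexes per stripe, which is what enables the row-skipping and predicate pushdown that plain CSV cannot offer.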

Demo: Connecting to Hive (Beeline)

# Connect to HiveServer2 via Beeline CLI
docker exec -it hiveserver2 beeline
!connect jdbc:hive2://localhost:10000
# Username: hive  |  Password: (blank, just press Enter)
-- Verify connection
SHOW DATABASES;

-- Create a new database
CREATE DATABASE IF NOT EXISTS ecommerce;
USE ecommerce;

Demo: Creating Tables

-- Managed table: Hive owns data lifecycle
CREATE TABLE IF NOT EXISTS products (
    product_id STRING,
    name       STRING,
    category   STRING,
    price      DECIMAL(10,2),
    stock_quantity INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

DESCRIBE products;           -- Column types
DESCRIBE FORMATTED products; -- Full metadata including HDFS location

Demo: External Table + Load Data

-- External table: Hive just maps schema onto existing HDFS data
CREATE EXTERNAL TABLE IF NOT EXISTS transactions (
    transaction_id STRING,
    ts             STRING,
    user_id        STRING,
    product_id     STRING,   -- STRING to match products.product_id for joins
    category       STRING,
    quantity       INT,
    price          DECIMAL(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/datasets/ecommerce/transactions/';

-- Load products CSV into managed table
-- (note: LOAD DATA INPATH *moves* the file into the table's directory — it does not copy)
LOAD DATA INPATH '/datasets/ecommerce/products.csv' INTO TABLE products;

SELECT COUNT(*) FROM transactions;

Demo: Basic Analytics Queries

-- Revenue by category
SELECT category,
       COUNT(*) AS num_sales,
       SUM(price * quantity) AS total_revenue
FROM transactions
GROUP BY category
ORDER BY total_revenue DESC;
-- Join: top products with names
SELECT p.name, p.category,
       COUNT(t.transaction_id) AS times_purchased,
       SUM(t.quantity)         AS total_units
FROM products p
JOIN transactions t ON p.product_id = t.product_id
GROUP BY p.name, p.category
ORDER BY total_units DESC
LIMIT 20;

Why Is Hive Slow?

We're using the MapReduce execution engine (legacy)

MapReduce query lifecycle:

Parse SQL → Generate MR jobs → YARN scheduling
→ Map phase (disk read) → Shuffle (network) → Reduce (disk write)
→ Read output for next job stage → ...

Production options for faster Hive:

  • Hive on Tez: 3–10× faster (DAG execution, avoids disk between stages)
  • Hive on Spark: Spark as execution backend
  • Spark SQL + Hive metastore: Full Spark speed, Hive schema catalog

Bottom line: Hive is for batch jobs, not dashboards. Set expectations.

Switching the Execution Engine

Pass --hiveconf at connect time — no config file edits needed:

docker exec -it hiveserver2 beeline -u jdbc:hive2://localhost:10000 \
  --hiveconf hive.execution.engine=spark

Or switch mid-session:

SET hive.execution.engine=spark;

-- Confirm the change took effect
SET hive.execution.engine;

Switching the Execution Engine

Engine    When to use
mr        Default; reliable, slow batch jobs
spark     Faster interactive queries (Spark must be running)
tez       Fastest Hive-native option (not in our Docker stack)

Try it: Re-run the revenue query with engine=spark and compare wall-clock time.

Activity: Hive Query Challenge

Pairs | 15 minutes

Using the transactions and products tables from the demo, write HiveQL queries to answer:

  1. Average transaction value (price × quantity) per category
  2. Number of unique users per category
  3. Products that have never been purchased (Hint: LEFT JOIN, then filter WHERE ... IS NULL)
  4. Total revenue per day (Hint: SUBSTR(ts, 1, 10))

Save working queries in a text file — you'll reuse these in homework.

Activity Debrief

Share solutions for queries 3 and 4 — the tricky ones.

Query 3: Products never purchased

SELECT p.product_id, p.name
FROM products p
LEFT JOIN transactions t ON p.product_id = t.product_id
WHERE t.transaction_id IS NULL;

Query 4: Revenue per day

SELECT SUBSTR(ts, 1, 10) AS sale_date,
       SUM(price * quantity) AS daily_revenue
FROM transactions
GROUP BY SUBSTR(ts, 1, 10)
ORDER BY sale_date;
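
For reference, queries 1 and 2 are straightforward single-table aggregations — one possible form:

```sql
-- Query 1: average transaction value per category
SELECT category, AVG(price * quantity) AS avg_value
FROM transactions
GROUP BY category;

-- Query 2: unique users per category
SELECT category, COUNT(DISTINCT user_id) AS unique_users
FROM transactions
GROUP BY category;
```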

The Bleeding Edge: Storage / Compute / Metadata as Separate Layers

Hive's original insight (2007): Decouple SQL from storage.

The problem Hive didn't fully solve: Still tightly coupled — HiveQL assumes HDFS; the metastore assumes specific SerDe formats; schema evolution is painful.

The modern endpoint: fully swappable, independent layers

Layer       Hive era             Modern
Storage     HDFS                 S3 / GCS / Azure Blob / HDFS
Metadata    Hive Metastore       Iceberg catalog, AWS Glue, Project Nessie
Compute     MapReduce → Spark    Spark, Trino, Flink, DuckDB — any engine

Apache Iceberg (2018) is Hive's logical successor:

  • Open table format readable by any compute engine
  • First-class ACID transactions, time travel, and schema evolution
  • The metastore just stores a pointer; Iceberg manages its own metadata files

The principle: Data should outlive any single query engine. Design for interchange — not lock-in.
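
To make this concrete, here is what an Iceberg table looks like in Spark SQL (a sketch — it assumes a Spark session with an Iceberg catalog named `demo` already configured; catalog, database, and snapshot id are illustrative):

```sql
-- Create an Iceberg table: any Iceberg-aware engine can now read it.
CREATE TABLE demo.db.events (
    user_id    BIGINT,
    event_type STRING,
    ts         TIMESTAMP
) USING iceberg;

-- Time travel: query the table as of an earlier snapshot
-- (12345 is a hypothetical snapshot id).
SELECT * FROM demo.db.events VERSION AS OF 12345;
```

The same table could then be queried from Trino or Flink with no Hive, and no Spark, in the picture — the format, not the engine, owns the data.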

Session 1 Key Takeaways

  • Hive = SQL interface for data warehousing on HDFS
  • Translates HiveQL → distributed execution (MR / Tez / Spark)
  • Metastore stores all schema info — shared across the ecosystem
  • External tables protect source data; managed tables for derived data
  • Hive is batch-oriented — slow compared to Spark SQL, but mature and SQL-standard

Next session: Partitioning, bucketing, file formats, and Apache Pig

Don't forget: Assignment 2 is due Sunday at 11:59 PM

Looking Ahead: Week 8

Midterm Exam (Session 1) + Project Design Reviews (Session 2)

  • Midterm covers Weeks 1–7
  • 75 minutes, closed book, 1-page cheat sheet allowed (both sides)
  • Question types: multiple choice, short answer, code writing, system design

Start preparing now:

  • Review slides and homework solutions from Weeks 1–6
  • Practice writing MapReduce, Spark, and Hive code from memory
  • Study guide posted on Canvas by Friday

Speaker context: Students just finished a week of Spark DataFrames and SQL. Today we introduce Hive—the original SQL-on-Hadoop story. The key conceptual shift: Hive is for batch data warehousing, not interactive analysis. Emphasize the Hive metastore as a shared catalog that Spark, Presto, and Impala all rely on. Assignment 2 is released today—give the 2-minute overview near the start, then dive in. Demo-first, explain-after works well here.

Demo tip: Run this live. Students often have connection issues — if HiveServer2 is slow to start, have them run: docker restart hiveserver2, then wait 30 seconds.

Point out: these queries may take 30-60 seconds on MapReduce backend. That's expected and worth discussing — it's a batch system, not interactive.

Circulate: join syntax, date extraction (SUBSTR or FROM_UNIXTIME), and null-check patterns are common sticking points. Slow query time is expected — reassure students.