Apache Hive — SQL on Hadoop

CS 6500 — Week 7, Session 1

Week 6 Recap + Assignment 2

Quick recap: Spark DataFrames & SQL

  • DataFrames = distributed tables with schema
  • Spark SQL: spark.sql("SELECT ...") on registered views
  • Query optimizer (Catalyst) handles physical planning

The Data Warehousing Problem

Why can't we just use a relational database?

                  Traditional RDBMS     Hadoop
Scale             GBs–TBs               TBs–PBs
Cost              Expensive hardware    Commodity clusters
Users             DBAs / developers     Data analysts (SQL!)
Access pattern    OLTP (row-level)      OLAP (batch scans)

The gap: Analysts know SQL. HDFS is a filesystem. Who bridges them?

What Is Apache Hive?

"SQL on Hadoop" — brings the data warehouse to HDFS

  • Translates HiveQL (SQL dialect) → MapReduce / Tez / Spark jobs
  • Created at Facebook (2007) to let analysts query petabytes without writing Java
  • Now a core Apache project

HiveQL is mostly standard SQL:

SELECT category, SUM(price * quantity) AS revenue
FROM transactions
GROUP BY category
ORDER BY revenue DESC;

This runs as a distributed job across the cluster.

Hive vs. Traditional RDBMS

Feature           RDBMS              Hive
Query language    SQL                HiveQL (SQL-like)
Data size         GBs–TBs            TBs–PBs
Latency           Milliseconds       Seconds–minutes
Updates           Row-level ACID     Append-mostly
Use case          OLTP               OLAP / analytics
Storage           Local disk         HDFS

Rule of thumb: If it fits in Postgres, use Postgres. If it's terabytes of append-only logs, use Hive.

Hive vs. Spark SQL

                  Hive                      Spark SQL
Best for          Batch ETL, warehousing    Interactive analysis, iteration
Latency           Higher (MapReduce/Tez)    Lower (in-memory)
Metastore         Manages it                Reads it
ACID              Yes (v3.0+)               Limited
ML integration    No                        MLlib, pandas UDFs

Key insight: They share the same metastore. Spark SQL reads "Hive tables" constantly in production.

Hive Architecture

Client (Beeline / JDBC / Thrift)
         │
         ▼
    ┌─────────────────────────────┐
    │          Driver             │
    │ Parser → Planner → Optimizer│
    └──────────┬──────────────────┘
               │
    ┌──────────▼──────────┐
    │    Metastore        │  ← Schema, partitions, locations
    │  (MySQL / Postgres) │
    └─────────────────────┘
               │
    ┌──────────▼──────────┐
    │ Execution Engine    │  MapReduce | Tez | Spark
    └──────────┬──────────┘
               │
              HDFS

The Hive Metastore — What It Stores

A relational database (MySQL/Postgres) that knows everything about your tables

Category      What's stored
Schema        Table names, column names, data types
Partitions    Partition keys + their HDFS directories
Location      HDFS path for each table's data
SerDe         Serializer/Deserializer class (how rows are parsed)
Statistics    Row counts, column histograms (used by the optimizer)

The metastore is a service, not a file — it runs independently and every query engine connects to it over Thrift.
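
Because the metastore is just a relational database, you can inspect it with plain SQL. A sketch against the standard metastore schema (run on the backing MySQL/Postgres instance, not through Hive — table and column names below are from the stock metastore schema; Postgres installations quote these identifiers):

```sql
-- List every Hive table the metastore knows about, with its type.
-- TBL_TYPE distinguishes MANAGED_TABLE from EXTERNAL_TABLE.
SELECT d.NAME AS db_name, t.TBL_NAME, t.TBL_TYPE
FROM TBLS t
JOIN DBS d ON t.DB_ID = d.DB_ID;
```

In practice you rarely query these tables directly — `SHOW TABLES` and `DESCRIBE FORMATTED` surface the same information through Hive — but seeing them demystifies what "the metastore" actually is.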

The Metastore — Historical Significance

Before Hive (2007): Schema lived in engineers' heads, wiki pages, and spreadsheets.

What Hive invented: A shared, queryable catalog for distributed data — decoupling what data means from where it lives.

The Hive Metastore Protocol became an industry standard:

  • Apache Spark reads it natively via enableHiveSupport()
  • Presto and Trino connect to it
  • Apache Impala uses it
  • AWS Glue Data Catalog is API-compatible with it

Every modern data platform still speaks "Hive Metastore" — even when Hive itself is not running. This is Hive's most enduring contribution.

How a Hive Query Executes

  1. Parse — HiveQL → Abstract Syntax Tree
  2. Semantic check — Validate columns/types against metastore
  3. Logical plan — Optimize: predicate pushdown, column pruning
  4. Physical plan — Generate MapReduce / Tez / Spark DAG
  5. Execute — YARN submits jobs to cluster
  6. Return — Results back to client

The metastore is consulted at step 2 — every query needs it to know what columns exist and where the data lives.
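
You can watch steps 3–4 happen without executing anything: `EXPLAIN` prints the plan Hive generates. A quick sketch using the demo table:

```sql
-- Show the logical/physical plan without running the job.
EXPLAIN
SELECT category, SUM(price * quantity) AS revenue
FROM transactions
GROUP BY category;

-- EXPLAIN EXTENDED adds metastore detail, including HDFS paths.
```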

Managed vs. External Tables

Managed (internal) table

  • Hive owns the data lifecycle
  • DROP TABLE deletes both metadata and HDFS data
  • Good for: derived/intermediate tables you own end-to-end

External table

  • Hive manages only metadata
  • DROP TABLE keeps HDFS data intact
  • Good for: shared source data used by multiple tools

Rule of thumb: Source data → external. Derived output → managed.
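
The difference shows up at `DROP TABLE` time. A minimal sketch (table names are illustrative):

```sql
-- Managed: dropping removes the metadata AND the files under the
-- table's warehouse directory.
CREATE TABLE staging_counts (category STRING, n BIGINT);
DROP TABLE staging_counts;   -- HDFS data is deleted too

-- External: dropping removes only the metastore entry.
CREATE EXTERNAL TABLE shared_logs (line STRING)
LOCATION '/data/logs/';
DROP TABLE shared_logs;      -- files under /data/logs/ remain untouched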

Schema on Read vs. Schema on Write

Schema on Write (traditional RDBMS)

  • Schema enforced at INSERT time — bad data is rejected immediately
  • Data is always consistent once stored
  • Cost: ETL required before data is queryable

Schema on Read (Hive / HDFS)

  • Schema applied at query time — the file is just bytes until you query it
  • Bad data doesn't fail at load; it returns NULLs or errors at query time
  • Cost: validation happens repeatedly, per-query

Schema on Read — In Practice

-- This CSV already exists on HDFS.
-- Hive maps a schema onto it at query time — no data moved, no data validated.
CREATE EXTERNAL TABLE raw_events (
    user_id    INT,
    event_type STRING,
    ts         STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION '/data/events/';

-- Schema is applied on the first SELECT — not at table creation.
SELECT * FROM raw_events LIMIT 5;

If a row has the wrong number of fields or a non-numeric user_id, the affected fields come back as NULL — not an error.
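
This means data-quality checks become queries you run yourself. A sketch against the raw_events table above:

```sql
-- Count rows where user_id failed to parse as INT.
-- Caveat: this also counts rows where the field was genuinely empty,
-- so it is an upper bound on parse failures, not an exact count.
SELECT COUNT(*) AS bad_rows
FROM raw_events
WHERE user_id IS NULL;
```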

CSVs Directly from HDFS — Trade-offs

Pros

  • Zero ETL: point an external table at existing files and query immediately
  • Schema flexibility: update the Hive table definition without touching data
  • Universal: any tool can produce/consume CSVs — no proprietary dependencies

Cons

  • Silent type failures: bad data parses to NULL, not an error — hard to debug
  • Parsing overhead: CSV parsing is CPU-intensive and repeated on every query
  • No predicate pushdown: Hive must read every row before filtering (no skipping)
  • No compression index: reading 1 matching row still reads the entire file
  • Delimiter collisions: commas inside string values break parsing silently

Production recommendation: Use CSVs for raw ingestion. Convert to ORC or Parquet immediately after — queries can be 5–20× faster and storage shrinks 3–5×.
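
The conversion is a one-liner with CREATE TABLE AS SELECT. A sketch using the demo table (the new table name is our choice):

```sql
-- One-time conversion: materialize an ORC copy of the CSV-backed table.
CREATE TABLE transactions_orc
STORED AS ORC
AS SELECT * FROM transactions;

-- Point all subsequent analytics queries at the ORC table.
SELECT COUNT(*) FROM transactions_orc;
```

ORC stores column statistics and indexes per stripe, which is what enables the row-skipping and predicate pushdown that plain CSV cannot offer.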

Demo: Connecting to Hive (Beeline)

# Connect to HiveServer2 via Beeline CLI
docker exec -it hiveserver2 beeline
!connect jdbc:hive2://localhost:10000
# Username: hive  |  Password: (blank, just press Enter)
-- Verify connection
SHOW DATABASES;

-- Create a new database
CREATE DATABASE IF NOT EXISTS ecommerce;
USE ecommerce;

Demo: Creating Tables

-- Managed table: Hive owns data lifecycle
CREATE TABLE IF NOT EXISTS products (
    product_id STRING,
    name       STRING,
    category   STRING,
    price      DECIMAL(10,2),
    stock_quantity INT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE;

DESCRIBE products;           -- Column types
DESCRIBE FORMATTED products; -- Full metadata including HDFS location

Demo: External Table + Load Data

-- External table: Hive just maps schema onto existing HDFS data
CREATE EXTERNAL TABLE IF NOT EXISTS transactions (
    transaction_id STRING,
    ts             STRING,
    user_id        STRING,
    product_id     STRING,   -- STRING to match products.product_id for joins
    category       STRING,
    quantity       INT,
    price          DECIMAL(10,2)
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
LOCATION '/datasets/ecommerce/transactions/';

-- Load products CSV into managed table
-- (note: LOAD DATA INPATH *moves* the file into the table's directory — it does not copy)
LOAD DATA INPATH '/datasets/ecommerce/products.csv' INTO TABLE products;

SELECT COUNT(*) FROM transactions;

Demo: Basic Analytics Queries

-- Revenue by category
SELECT category,
       COUNT(*) AS num_sales,
       SUM(price * quantity) AS total_revenue
FROM transactions
GROUP BY category
ORDER BY total_revenue DESC;
-- Join: top products with names
SELECT p.name, p.category,
       COUNT(t.transaction_id) AS times_purchased,
       SUM(t.quantity)         AS total_units
FROM products p
JOIN transactions t ON p.product_id = t.product_id
GROUP BY p.name, p.category
ORDER BY total_units DESC
LIMIT 20;

Why Is Hive Slow?

We're using the MapReduce execution engine (legacy)

MapReduce query lifecycle:

Parse SQL → Generate MR jobs → YARN scheduling
→ Map phase (disk read) → Shuffle (network) → Reduce (disk write)
→ Read output for next job stage → ...

Production options for faster Hive:

  • Hive on Tez: 3–10× faster (DAG execution, avoids disk between stages)
  • Hive on Spark: Spark as execution backend
  • Spark SQL + Hive metastore: Full Spark speed, Hive schema catalog

Bottom line: Hive is for batch jobs, not dashboards. Set expectations.

Switching the Execution Engine

Pass --hiveconf at connect time — no config file edits needed:

docker exec -it hiveserver2 beeline -u jdbc:hive2://localhost:10000 \
  --hiveconf hive.execution.engine=spark

Or switch mid-session:

SET hive.execution.engine=spark;

-- Confirm the change took effect
SET hive.execution.engine;

Switching the Execution Engine

Engine    When to use
mr        Default; reliable, slow batch jobs
spark     Faster interactive queries (Spark must be running)
tez       Fastest Hive-native option (not in our Docker stack)

Try it: Re-run the revenue query with engine=spark and compare wall-clock time.

Activity: Hive Query Challenge

Pairs | 15 minutes

Using the transactions and products tables from the demo, write HiveQL queries to answer:

  1. Average transaction value (price × quantity) per category
  2. Number of unique users per category
  3. Products that have never been purchased (Hint: LEFT JOIN, then filter WHERE ... IS NULL)
  4. Total revenue per day (Hint: SUBSTR(ts, 1, 10))

Save working queries in a text file — you'll reuse these in homework.

Activity Debrief

Share solutions for queries 3 and 4 — the tricky ones.

Query 3: Products never purchased

SELECT p.product_id, p.name
FROM products p
LEFT JOIN transactions t ON p.product_id = t.product_id
WHERE t.transaction_id IS NULL;

Query 4: Revenue per day

SELECT SUBSTR(ts, 1, 10) AS sale_date,
       SUM(price * quantity) AS daily_revenue
FROM transactions
GROUP BY SUBSTR(ts, 1, 10)
ORDER BY sale_date;
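
For reference, queries 1 and 2 are straightforward single-table aggregations — one possible form:

```sql
-- Query 1: average transaction value per category
SELECT category, AVG(price * quantity) AS avg_value
FROM transactions
GROUP BY category;

-- Query 2: unique users per category
SELECT category, COUNT(DISTINCT user_id) AS unique_users
FROM transactions
GROUP BY category;
```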

The Bleeding Edge: Storage / Compute / Metadata as Separate Layers

Hive's original insight (2007): Decouple SQL from storage.

The problem Hive didn't fully solve: Still tightly coupled — HiveQL assumes HDFS; the metastore assumes specific SerDe formats; schema evolution is painful.

The modern endpoint: fully swappable, independent layers

Layer       Hive era             Modern
Storage     HDFS                 S3 / GCS / Azure Blob / HDFS
Metadata    Hive Metastore       Iceberg catalog, AWS Glue, Project Nessie
Compute     MapReduce → Spark    Spark, Trino, Flink, DuckDB — any engine

Apache Iceberg (2018) is Hive's logical successor:

  • Open table format readable by any compute engine
  • First-class ACID transactions, time travel, and schema evolution
  • The metastore just stores a pointer; Iceberg manages its own metadata files

The principle: Data should outlive any single query engine. Design for interchange — not lock-in.
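
To make this concrete, here is what an Iceberg table looks like in Spark SQL (a sketch — it assumes a Spark session with an Iceberg catalog named `demo` already configured; catalog, database, and snapshot id are illustrative):

```sql
-- Create an Iceberg table: any Iceberg-aware engine can now read it.
CREATE TABLE demo.db.events (
    user_id    BIGINT,
    event_type STRING,
    ts         TIMESTAMP
) USING iceberg;

-- Time travel: query the table as of an earlier snapshot
-- (12345 is a hypothetical snapshot id).
SELECT * FROM demo.db.events VERSION AS OF 12345;
```

The same table could then be queried from Trino or Flink with no Hive, and no Spark, in the picture — the format, not the engine, owns the data.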

Session 1 Key Takeaways

  • Hive = SQL interface for data warehousing on HDFS
  • Translates HiveQL → distributed execution (MR / Tez / Spark)
  • Metastore stores all schema info — shared across the ecosystem
  • External tables protect source data; managed tables for derived data
  • Hive is batch-oriented — slow compared to Spark SQL, but mature and SQL-standard

Next session: Partitioning, bucketing, file formats, and Apache Pig

Don't forget: Assignment 2 is due Sunday at 11:59 PM

Looking Ahead: Week 8

Midterm Exam (Session 1) + Project Design Reviews (Session 2)

  • Midterm covers Weeks 1–7
  • 75 minutes, closed book, 1-page cheat sheet allowed (both sides)
  • Question types: multiple choice, short answer, code writing, system design

Start preparing now:

  • Review slides and homework solutions from Weeks 1–6
  • Practice writing MapReduce, Spark, and Hive code from memory
  • Study guide posted on Canvas by Friday

Speaker context: Students just finished a week of Spark DataFrames and SQL. Today we introduce Hive—the original SQL-on-Hadoop story. The key conceptual shift: Hive is for batch data warehousing, not interactive analysis. Emphasize the Hive metastore as a shared catalog that Spark, Presto, and Impala all rely on. Assignment 2 is released today—give the 2-minute overview near the start, then dive in. Demo-first, explain-after works well here.

Demo tip: Run this live. Students often have connection issues — if HiveServer2 is slow to start, have them run: docker restart hiveserver2, then wait 30 seconds.

Point out: these queries may take 30-60 seconds on MapReduce backend. That's expected and worth discussing — it's a batch system, not interactive.

Circulate: join syntax, date extraction (SUBSTR or FROM_UNIXTIME), and null-check patterns are common sticking points. Slow query time is expected — reassure students.