Hive Optimization & Apache Pig

CS 6500 — Week 7, Session 2

Session 1 Recap

Hive in one sentence: SQL interface that translates HiveQL to distributed jobs running on HDFS data.

Key concepts from Session 1:

  • Metastore — centralized schema catalog shared by Hive, Spark SQL, Presto
  • Managed vs. External tables — who owns the data lifecycle
  • Execution engines — MapReduce (slow), Tez, Spark (fast)

Today: Make Hive faster, then meet Apache Pig.

The Performance Problem

Without optimization: full table scan every query

transactions table: 5 years × 365 days × 10M rows/day = 18 billion rows

You want last month's revenue. Hive reads all 18 billion rows unless you tell it otherwise.

Two solutions:

  1. Partitioning — split data into subdirectories by column value
  2. Bucketing — hash-distribute data into fixed files for joins

Partitioning

Organize HDFS data into subdirectories by column value

/user/hive/warehouse/transactions/
    year=2024/month=1/data.parquet   ← 80 MB
    year=2024/month=2/data.parquet   ← 75 MB
    ...
    year=2023/month=12/data.parquet  ← 90 MB

Query with partition filter → scan only matching directories:

SELECT * FROM transactions WHERE year = 2024 AND month = 1;
-- Hive reads ONLY year=2024/month=1/ — skips everything else

Best columns to partition on: date/time dimensions, low-cardinality categoricals (country, status)
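The pruning mechanism can be sketched in a few lines of Python. The partition keys and row counts below are invented for illustration, but the idea is exactly what Hive does: skip every directory whose key fails the partition filter.

```python
# Toy model of partition pruning: data lives in per-partition "directories",
# and a query with a partition filter only opens the matching ones.
partitions = {
    (2023, 12): ["row"] * 900,   # hypothetical row counts per partition
    (2024, 1):  ["row"] * 800,
    (2024, 2):  ["row"] * 750,
}

def scan(parts, year=None, month=None):
    """Return the rows read; directories whose key fails the filter are skipped."""
    rows_read = []
    for (y, m), rows in parts.items():
        if year is not None and y != year:
            continue  # pruned: this directory is never opened
        if month is not None and m != month:
            continue
        rows_read.extend(rows)
    return rows_read

full = scan(partitions)                        # no filter: reads all 2,450 rows
pruned = scan(partitions, year=2024, month=1)  # reads one directory: 800 rows
print(len(full), len(pruned))
```

Without the partition filter every "directory" is read; with it, only one is touched, which is why the WHERE clause on partition columns matters so much.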

Bucketing

Hash-distribute rows into a fixed number of files

CLUSTERED BY (user_id) INTO 32 BUCKETS

Hive places each user_id in a deterministic bucket: hash(user_id) % 32

Benefits:

  • Efficient sampling: Read 1 bucket = random 3% sample
  • Map-side joins: If both tables bucketed on join key, skip shuffle
  • Works alongside partitioning

Best columns to bucket on: High-cardinality join keys (user_id, product_id)
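A minimal sketch of deterministic bucket assignment, in Python. Hive's real hash function is type-specific and engine-defined; `crc32` here is only a stand-in to show the `hash(key) % num_buckets` mechanism and the sampling property.

```python
import zlib

NUM_BUCKETS = 32

def bucket_for(user_id: str) -> int:
    # Deterministic: the same user_id always lands in the same bucket.
    # (Hive's actual hash is type-specific; crc32 is just a stand-in here.)
    return zlib.crc32(user_id.encode()) % NUM_BUCKETS

# Same key, same bucket: this is what makes bucketed map-side joins possible.
assert bucket_for("u1001") == bucket_for("u1001")

# Reading one bucket yields roughly a 1/32 (~3%) sample of the rows.
users = [f"u{i}" for i in range(10_000)]
sample = [u for u in users if bucket_for(u) == 0]
print(f"bucket 0 holds {len(sample) / len(users):.3f} of all rows")
```

Because both sides of a join hash the key the same way, matching rows are guaranteed to sit in corresponding bucket files, so the join needs no shuffle.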

Partitioning Best Practices

Do | Don't
Partition by date dimensions | Partition by user_id (too many dirs)
100–1,000 partitions total | 100,000+ partitions (metadata overhead)
100 MB–1 GB per partition | Thousands of tiny files
Low-cardinality columns | Columns you never filter on

Over-partitioning is a real problem:

  • NameNode stores one metadata entry per partition directory
  • Millions of tiny files = slow metadata operations cluster-wide
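The scale difference is easy to see with back-of-the-envelope arithmetic; the user cardinality below is an assumed figure, not from the transactions table.

```python
# Hypothetical sizing: why partitioning by user_id explodes NameNode metadata.
years, months_per_year = 5, 12
distinct_users = 2_000_000          # assumed cardinality for illustration

by_date = years * months_per_year   # year=.../month=... directories
by_user = distinct_users            # one directory per distinct user_id

print(by_date)   # 60 directories: the NameNode barely notices
print(by_user)   # 2,000,000 directories: serious metadata pressure
```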

Demo: Creating a Partitioned Table

-- Partitioned + stored as Parquet (columnar, compressed)
CREATE TABLE IF NOT EXISTS transactions_partitioned (
    transaction_id STRING, timestamp STRING, user_id STRING,
    product_id INT, category STRING, quantity INT, price DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS PARQUET;

-- Enable dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

Demo: Loading and Querying Partitions

-- Populate from unpartitioned table (dynamic partitioning)
INSERT OVERWRITE TABLE transactions_partitioned PARTITION(year, month)
SELECT transaction_id, timestamp, user_id, product_id, category,
       quantity, price,
       YEAR(FROM_UNIXTIME(UNIX_TIMESTAMP(timestamp,'yyyy-MM-dd HH:mm:ss'))),
       MONTH(FROM_UNIXTIME(UNIX_TIMESTAMP(timestamp,'yyyy-MM-dd HH:mm:ss')))
FROM transactions;

SHOW PARTITIONS transactions_partitioned;

-- Query one month — only reads that directory
SELECT category, SUM(price * quantity) AS revenue
FROM transactions_partitioned
WHERE year = 2024 AND month = 1
GROUP BY category;

EXPLAIN SELECT * FROM transactions_partitioned WHERE year = 2024 AND month = 1;

File Formats: Text vs. Columnar

-- Create ORC and Parquet versions
CREATE TABLE transactions_orc    STORED AS ORC     AS SELECT * FROM transactions;
CREATE TABLE transactions_parquet STORED AS PARQUET AS SELECT * FROM transactions;

-- Compare sizes on HDFS
!hdfs dfs -du -h /user/hive/warehouse/ecommerce.db/

Format | Orientation | Compression | Best for
Text/CSV | Row | None | Human-readable, debugging
ORC | Columnar | Yes | Hive-optimized workloads
Parquet | Columnar | Yes | Cross-tool (Hive + Spark + Impala)

Recommendation: Use Parquet for new tables — works everywhere.
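Why columnar wins for analytic queries can be shown with a toy contrast; this is not how Parquet or ORC are actually encoded, just the row-vs-column access pattern for a query that needs one column out of four.

```python
# Toy contrast: row store vs. column store for a query like SUM(price).
rows = [
    {"txn_id": i, "category": "Books", "quantity": 1, "price": 9.99}
    for i in range(1000)
]

# Row orientation: scanning means touching every field of every row.
row_values_touched = sum(len(r) for r in rows)

# Columnar orientation: each column is stored contiguously, so the
# query reads only the "price" column and skips the other three.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_values_touched = len(columns["price"])

print(row_values_touched, col_values_touched)  # 4000 vs. 1000 values read
```

Columnar layouts also compress better, since each column holds values of one type with repeating patterns.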

What Is Apache Pig?

High-level platform for ETL and data transformation pipelines

  • Pig Latin: Procedural dataflow language (not SQL)
  • Compiles to MapReduce / Tez / Spark — same as Hive
  • Created at Yahoo! Research (2006) for complex multi-step transforms
  • Schema is optional — can process unstructured data

-- "Find total revenue per category for expensive items"
txns     = LOAD '/datasets/ecommerce/transactions.csv' USING PigStorage(',')
               AS (txn_id:chararray, ts:chararray, user_id:chararray,
                   product_id:int, category:chararray, qty:int, price:float);
expensive = FILTER txns BY price > 100;
by_cat    = GROUP expensive BY category;
result    = FOREACH by_cat GENERATE group AS category,
                SUM(expensive.price * expensive.qty) AS revenue;
DUMP result;

Pig vs. Hive: When to Use Which

Aspect | Hive | Pig
Language style | Declarative (SQL) | Procedural (dataflow steps)
Best for | Ad hoc analytics, warehousing | Complex ETL, multi-step transforms
Schema | Required at table creation | Optional
Debugging | EXPLAIN plan | ILLUSTRATE (sample trace)
Users | SQL analysts | ETL engineers
Custom logic | UDFs (Java/Python) | UDFs (Java/Python)

Modern reality: New projects use Spark. But millions of lines of Pig scripts still run in production.

Pig Latin Core Operations (1 of 3)

Data input / output:

Operation | Purpose | SQL equivalent
LOAD | Read data from HDFS | FROM clause
STORE | Write results to HDFS | (INSERT INTO)
DUMP | Print to console — triggers execution | (none)

Pig Latin Core Operations (2 of 3)

Filtering and projection:

Operation | Purpose | SQL equivalent
FILTER | Remove rows by condition | WHERE
FOREACH ... GENERATE | Transform / select columns | SELECT
LIMIT | Cap number of output rows | LIMIT

Pig Latin Core Operations (3 of 3)

Grouping, joining, and ordering:

Operation | Purpose | SQL equivalent
GROUP | Group rows by key | GROUP BY
JOIN | Combine two relations | JOIN
ORDER | Sort output | ORDER BY

Execution model:

  • Pig uses lazy evaluation — transformations build a logical plan
  • Nothing actually runs until DUMP or STORE is reached
  • Similar to Spark's transformation → action model
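The plan-then-execute model can be sketched with a tiny invented `Relation` class; this is a minimal illustration of lazy evaluation, not Pig's internals.

```python
# Minimal lazy-evaluation sketch: operators only append steps to a logical
# plan; nothing executes until dump() (Pig's DUMP/STORE) is called.
class Relation:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # the logical plan: a list of steps

    def filter(self, pred):
        return Relation(self._data, self._plan + [("filter", pred)])

    def foreach(self, fn):
        return Relation(self._data, self._plan + [("foreach", fn)])

    def dump(self):                      # the "action": run the whole plan
        out = self._data
        for op, fn in self._plan:
            if op == "filter":
                out = [r for r in out if fn(r)]
            else:
                out = [fn(r) for r in out]
        return out

txns = Relation([{"price": 150, "qty": 2}, {"price": 20, "qty": 1}])
pipeline = (txns
            .filter(lambda r: r["price"] > 100)
            .foreach(lambda r: r["price"] * r["qty"]))
# No work has happened yet: the plan is just two recorded steps.
print(pipeline.dump())  # [300]
```

Because the whole plan is visible before execution, the compiler can reorder and fuse steps, which is exactly what Pig (and Spark) exploit.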

Demo: Starting Pig (Grunt Shell)

# Launch Pig interactive shell
docker exec -it pig pig

-- Load transactions
transactions = LOAD '/datasets/ecommerce/transactions.csv'
    USING PigStorage(',')
    AS (txn_id:chararray, ts:chararray, user_id:chararray,
        product_id:int, category:chararray, quantity:int, price:float);

-- Inspect schema
DESCRIBE transactions;

-- Sample the data without full execution
ILLUSTRATE transactions;

Demo: Filter, Transform, Aggregate

-- Filter: only expensive items (price > 100)
expensive = FILTER transactions BY price > 100;

-- Project: compute total per line
with_total = FOREACH expensive GENERATE
    category, price, (price * quantity) AS line_total;

-- Group by category
by_cat = GROUP with_total BY category;

-- Aggregate per group
cat_revenue = FOREACH by_cat GENERATE
    group              AS category,
    COUNT(with_total)  AS num_sales,
    SUM(with_total.line_total) AS total_revenue,
    AVG(with_total.price)      AS avg_price;

-- Sort and display
sorted = ORDER cat_revenue BY total_revenue DESC;
DUMP sorted;

Demo: Joins in Pig

-- Load products
products = LOAD '/datasets/ecommerce/products.csv' USING PigStorage(',')
    AS (product_id:int, name:chararray, category:chararray,
        price:float, stock:int);

-- Join on product_id
joined = JOIN transactions BY product_id, products BY product_id;

-- Project fields from both sides (note :: namespace)
product_sales = FOREACH joined GENERATE
    products::name AS product_name,
    transactions::quantity AS qty;

-- Aggregate and find top 10
by_product = GROUP product_sales BY product_name;
totals     = FOREACH by_product GENERATE
    group AS product_name, SUM(product_sales.qty) AS total_sold;
ordered    = ORDER totals BY total_sold DESC;
top_10     = LIMIT ordered 10;

DUMP top_10;

Storing Results

-- Write results to HDFS (triggers execution like DUMP)
STORE sorted INTO '/user/student/pig_output/category_revenue'
    USING PigStorage(',');

# View output from outside Pig
docker exec -it namenode bash
hdfs dfs -ls /user/student/pig_output/category_revenue/
hdfs dfs -cat /user/student/pig_output/category_revenue/part-r-00000

Pig creates part-r-NNNNN output files (one per reducer) — same as MapReduce.

Activity: Pig Script Challenge

Individual or pairs | 7 minutes

Write a Pig Latin script that:

  1. Loads transactions.csv
  2. Filters for Electronics category only
  3. Computes total spending per user_id (SUM(price × quantity))
  4. Finds the top 5 spenders in Electronics
  5. Stores results in /user/student/pig_output/top_electronics_buyers

Write the script in a text editor (not Grunt) — we'll run it together.

Activity Solution

txns = LOAD '/datasets/ecommerce/transactions.csv' USING PigStorage(',')
    AS (txn_id:chararray, ts:chararray, user_id:chararray,
        product_id:int, category:chararray, quantity:int, price:float);

electronics = FILTER txns BY category == 'Electronics';

with_spend = FOREACH electronics GENERATE
    user_id, (price * quantity) AS spend;

by_user  = GROUP with_spend BY user_id;
totals   = FOREACH by_user GENERATE
    group AS user_id, SUM(with_spend.spend) AS total_spend;
ordered  = ORDER totals BY total_spend DESC;
top5     = LIMIT ordered 5;

STORE top5 INTO '/user/student/pig_output/top_electronics_buyers'
    USING PigStorage(',');

The Three-Way Comparison (1 of 2)

Choose your tool based on the task:

Scenario | Best tool
Analyst needs ad hoc SQL reports | Hive
10-step ETL: clean → join → enrich → aggregate | Pig (or Spark)
Interactive exploration + ML pipeline | Spark SQL
Legacy pipeline already in Hive/Pig | Keep it!

The Three-Way Comparison (2 of 2)

Why each wins in its domain:

Tool | Strength | Weakness
Hive | SQL familiarity, mature warehouse features, ACID | Slow (batch); not for interactive queries
Pig | Procedural multi-step ETL, optional schema, flexible UDFs | Verbose; mostly legacy
Spark SQL | In-memory speed, MLlib integration, iterative algorithms | Higher cluster memory requirements

Real world today: Hive metastore + Spark SQL is the dominant combo — Spark executes, Hive catalogs.

The Dataflow Spectrum

Two extremes — and the middle ground where Pig lived:

Declarative (SQL/Hive)          Procedural (MapReduce)
   "Tell me WHAT"       ←→        "Tell me HOW"
  optimizer decides               developer controls
  every physical step             every physical step

Position | Tool | Trade-off
Fully declarative | SQL / HiveQL | Optimizer has freedom; you have little control
Dataflow middle | Pig, Spark (transforms) | You define steps; compiler optimizes the DAG
Fully procedural | Raw MapReduce | Full control; full responsibility

For distributed systems: declarative enables optimization but limits control. Procedural gives control but requires expertise.

The Pattern That Persists

Pig's real contribution wasn't the tool — it was proving that dataflow programming works at scale.

Modern tool | What it inherited from Pig
Spark | Lazy transformation DAG — build a plan, execute on action
Apache Beam | Unified pipeline model (same API for batch and streaming)
SQL query optimizers | Build dataflow graphs internally to plan and reorder operations

The idea: express computation as a directed acyclic graph of named data steps, let the runtime optimize and execute.

What Is Apache Beam? (1 of 2)

A unified programming model for batch and streaming pipelines — Pig's spiritual successor

  • Google-originated (2016), now Apache top-level project
  • Write once → run on multiple runners: Spark, Flink, Google Dataflow

What Is Apache Beam? (2 of 2)

import apache_beam as beam

# parse_line (assumed defined elsewhere) turns a CSV line into a dict
with beam.Pipeline() as p:
    revenue = (
        p
        | 'Read'    >> beam.io.ReadFromText('transactions.csv')
        | 'Parse'   >> beam.Map(parse_line)
        | 'Filter'  >> beam.Filter(lambda r: r['price'] > 100)
        | 'Revenue' >> beam.Map(lambda r: (r['category'],
                                           r['price'] * r['qty']))
        | 'Total'   >> beam.CombinePerKey(sum)
    )

Notice the | chain — this is a DAG of named steps, exactly like Pig Latin.
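The `|` and `>>` chaining can be re-created in plain Python with operator overloading. This toy is not Beam's actual implementation; the `PColl` and `Transform` classes below are invented to show the mechanism behind the syntax.

```python
# Toy re-creation of Beam-style "| 'Label' >> transform" chaining.
class Transform:
    def __init__(self, fn, name="step"):
        self.fn, self.name = fn, name

    def __rrshift__(self, label):       # 'Label' >> transform names the step
        return Transform(self.fn, label)

class PColl:
    def __init__(self, data):
        self.data = data

    def __or__(self, t):                # pcoll | transform -> new pcoll
        return PColl(t.fn(self.data))

def Map(fn):    return Transform(lambda d: [fn(x) for x in d])
def Filter(fn): return Transform(lambda d: [x for x in d if fn(x)])

result = (
    PColl([80, 120, 250])
    | 'Filter' >> Filter(lambda p: p > 100)
    | 'Double' >> Map(lambda p: p * 2)
)
print(result.data)  # [240, 500]
```

Each `|` application produces a new collection from the previous one, so the chain is literally a linear DAG of named steps, which is what makes the syntax read like Pig Latin.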

Apache Pig vs. Apache Beam (1 of 2)

Aspect | Apache Pig | Apache Beam
Era | 2006–2015 (peak usage) | 2016–present
Paradigm | Dataflow DAG — batch | Dataflow DAG — batch + streaming
Language | Pig Latin (custom DSL) | Python / Java / Go SDK
Execution targets | MapReduce, Tez, Spark | Spark, Flink, GCP Dataflow
Schema | Optional | Defined in pipeline
Status | Maintenance mode | Actively developed
Key contribution | Proved DAG model works | Unified batch/stream with one API

Apache Pig vs. Apache Beam (2 of 2)

The intellectual thread: Pig → Spark (batch DAG) → Beam (batch + stream DAG). The model evolved; the core insight didn't.

Week 7 Key Takeaways (1 of 2)

Hive:

  • SQL for batch data warehousing on HDFS
  • Optimize with partitioning (prune directories) and Parquet (columnar)
  • Metastore = shared schema catalog for the entire ecosystem

Pig:

  • Procedural ETL: express complex multi-step transforms step by step
  • Lazy execution (like Spark); DUMP/STORE triggers the job
  • ILLUSTRATE for debugging — samples trace through the script

Week 7 Key Takeaways (2 of 2)

The bigger picture:

  • There is a spectrum from declarative (SQL) to procedural (MapReduce) — Pig lived in the middle
  • Pig proved the DAG-based dataflow model works; Spark and Apache Beam carry that idea forward
  • Hive + Pig are batch; Spark is interactive; all three share the Hive metastore

Assignment 2 Reminder

Due: Sunday 11:59 PM — this week!

  • RDD-based analysis
  • DataFrame / SQL equivalent queries
  • Performance comparison (RDD vs. DataFrame)
  • Optimization (caching, partitioning, broadcast joins)

Tips:

  • Test on a small subset first, then scale up
  • Use Spark UI at http://localhost:4040 to debug slow stages
  • Review Week 5 (RDDs) and Week 6 (DataFrames) slides

After Spring Break: Midterm

Week 9 — Midterm Exam and Project Design Submission

  • Session 1: 75-minute midterm exam review

    • Closed book
    • Covers Weeks 1–7: HDFS, MapReduce, Spark, Hive, Pig
  • Session 2: 75-minute midterm exam

    • Closed book, 1-page cheat sheet allowed (one side)
    • Covers Weeks 1–7: HDFS, MapReduce, Spark, Hive, Pig

Speaker context: This session is ~15 min theory (partitioning/bucketing), ~15 min Hive optimization demo, then ~30 min of Apache Pig including live Grunt shell demos. Students should leave able to write basic Pig Latin scripts and understand when to pick Hive vs. Pig vs. Spark SQL. After the 3-way comparison, close with the historical "dataflow spectrum" context and a brief Pig vs. Beam comparison — this sets up the intellectual thread that runs from Pig → Spark → Beam. Assignment 2 due Sunday — remind at the start.

ILLUSTRATE is a killer debugging tool — it traces a few sample rows through the entire script. Show this prominently.

Emphasize: Pig didn't fail — it succeeded so well that its core ideas were absorbed into every modern big-data tool. Spark's RDD transformations ARE Pig Latin with a Python syntax.