Hive Optimization & Apache Pig

CS 6500 — Week 7, Session 2

Session 1 Recap

Hive in one sentence: SQL interface that translates HiveQL to distributed jobs running on HDFS data.

Key concepts from Session 1:

  • Metastore — centralized schema catalog shared by Hive, Spark SQL, Presto
  • Managed vs. External tables — who owns the data lifecycle
  • Execution engines — MapReduce (slow), Tez, Spark (fast)

Today: Make Hive faster, then meet Apache Pig.

The Performance Problem

Without optimization: full table scan every query

transactions table: 5 years × 365 days × 10M rows/day = 18 billion rows

You want last month's revenue. Hive reads all 18 billion rows unless you tell it otherwise.

Two solutions:

  1. Partitioning — split data into subdirectories by column value
  2. Bucketing — hash-distribute data into fixed files for joins

Partitioning

Organize HDFS data into subdirectories by column value

/user/hive/warehouse/transactions/
    year=2024/month=1/data.parquet   ← 80 MB
    year=2024/month=2/data.parquet   ← 75 MB
    ...
    year=2023/month=12/data.parquet  ← 90 MB

Query with partition filter → scan only matching directories:

SELECT * FROM transactions WHERE year = 2024 AND month = 1;
-- Hive reads ONLY year=2024/month=1/ — skips everything else

Best columns to partition on: date/time dimensions, low-cardinality categoricals (country, status)
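The pruning mechanism can be sketched in a few lines of Python. The partition keys and row counts below are invented for illustration, but the idea is exactly what Hive does: skip every directory whose key fails the partition filter.

```python
# Toy model of partition pruning: data lives in per-partition "directories",
# and a query with a partition filter only opens the matching ones.
partitions = {
    (2023, 12): ["row"] * 900,   # hypothetical row counts per partition
    (2024, 1):  ["row"] * 800,
    (2024, 2):  ["row"] * 750,
}

def scan(parts, year=None, month=None):
    """Return the rows read; directories whose key fails the filter are skipped."""
    rows_read = []
    for (y, m), rows in parts.items():
        if year is not None and y != year:
            continue  # pruned: this directory is never opened
        if month is not None and m != month:
            continue
        rows_read.extend(rows)
    return rows_read

full = scan(partitions)                        # no filter: reads all 2,450 rows
pruned = scan(partitions, year=2024, month=1)  # reads one directory: 800 rows
print(len(full), len(pruned))
```

Without the partition filter every "directory" is read; with it, only one is touched, which is why the WHERE clause on partition columns matters so much.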

Bucketing

Hash-distribute rows into a fixed number of files

CLUSTERED BY (user_id) INTO 32 BUCKETS

Hive places each user_id in a deterministic bucket: hash(user_id) % 32

Benefits:

  • Efficient sampling: Read 1 bucket = random 3% sample
  • Map-side joins: If both tables bucketed on join key, skip shuffle
  • Works alongside partitioning

Best columns to bucket on: High-cardinality join keys (user_id, product_id)
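A minimal sketch of deterministic bucket assignment, in Python. Hive's real hash function is type-specific and engine-defined; `crc32` here is only a stand-in to show the `hash(key) % num_buckets` mechanism and the sampling property.

```python
import zlib

NUM_BUCKETS = 32

def bucket_for(user_id: str) -> int:
    # Deterministic: the same user_id always lands in the same bucket.
    # (Hive's actual hash is type-specific; crc32 is just a stand-in here.)
    return zlib.crc32(user_id.encode()) % NUM_BUCKETS

# Same key, same bucket: this is what makes bucketed map-side joins possible.
assert bucket_for("u1001") == bucket_for("u1001")

# Reading one bucket yields roughly a 1/32 (~3%) sample of the rows.
users = [f"u{i}" for i in range(10_000)]
sample = [u for u in users if bucket_for(u) == 0]
print(f"bucket 0 holds {len(sample) / len(users):.3f} of all rows")
```

Because both sides of a join hash the key the same way, matching rows are guaranteed to sit in corresponding bucket files, so the join needs no shuffle.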

Partitioning Best Practices

Do | Don't
Partition by date dimensions | Partition by user_id (too many dirs)
100–1,000 partitions total | 100,000+ partitions (metadata overhead)
100 MB–1 GB per partition | Thousands of tiny files
Low-cardinality columns | Columns you never filter on

Over-partitioning is a real problem:

  • NameNode stores one metadata entry per partition directory
  • Millions of tiny files = slow metadata operations cluster-wide
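The scale difference is easy to see with back-of-the-envelope arithmetic; the user cardinality below is an assumed figure, not from the transactions table.

```python
# Hypothetical sizing: why partitioning by user_id explodes NameNode metadata.
years, months_per_year = 5, 12
distinct_users = 2_000_000          # assumed cardinality for illustration

by_date = years * months_per_year   # year=.../month=... directories
by_user = distinct_users            # one directory per distinct user_id

print(by_date)   # 60 directories: the NameNode barely notices
print(by_user)   # 2,000,000 directories: serious metadata pressure
```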

Demo: Creating a Partitioned Table

-- Partitioned + stored as Parquet (columnar, compressed)
CREATE TABLE IF NOT EXISTS transactions_partitioned (
    transaction_id STRING, timestamp STRING, user_id STRING,
    product_id INT, category STRING, quantity INT, price DECIMAL(10,2)
)
PARTITIONED BY (year INT, month INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS PARQUET;

-- Enable dynamic partitioning
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

Demo: Loading and Querying Partitions

-- Populate from unpartitioned table (dynamic partitioning)
INSERT OVERWRITE TABLE transactions_partitioned PARTITION(year, month)
SELECT transaction_id, timestamp, user_id, product_id, category,
       quantity, price,
       YEAR(FROM_UNIXTIME(UNIX_TIMESTAMP(timestamp,'yyyy-MM-dd HH:mm:ss'))),
       MONTH(FROM_UNIXTIME(UNIX_TIMESTAMP(timestamp,'yyyy-MM-dd HH:mm:ss')))
FROM transactions;

SHOW PARTITIONS transactions_partitioned;

-- Query one month — only reads that directory
SELECT category, SUM(price * quantity) AS revenue
FROM transactions_partitioned
WHERE year = 2024 AND month = 1
GROUP BY category;

EXPLAIN SELECT * FROM transactions_partitioned WHERE year = 2024 AND month = 1;

File Formats: Text vs. Columnar

-- Create ORC and Parquet versions
CREATE TABLE transactions_orc    STORED AS ORC     AS SELECT * FROM transactions;
CREATE TABLE transactions_parquet STORED AS PARQUET AS SELECT * FROM transactions;

-- Compare sizes on HDFS
!hdfs dfs -du -h /user/hive/warehouse/ecommerce.db/

Format | Orientation | Compression | Best for
Text/CSV | Row | None | Human-readable, debugging
ORC | Columnar | Yes | Hive-optimized workloads
Parquet | Columnar | Yes | Cross-tool (Hive + Spark + Impala)

Recommendation: Use Parquet for new tables — works everywhere.
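Why columnar wins for analytic queries can be shown with a toy contrast; this is not how Parquet or ORC are actually encoded, just the row-vs-column access pattern for a query that needs one column out of four.

```python
# Toy contrast: row store vs. column store for a query like SUM(price).
rows = [
    {"txn_id": i, "category": "Books", "quantity": 1, "price": 9.99}
    for i in range(1000)
]

# Row orientation: scanning means touching every field of every row.
row_values_touched = sum(len(r) for r in rows)

# Columnar orientation: each column is stored contiguously, so the
# query reads only the "price" column and skips the other three.
columns = {k: [r[k] for r in rows] for k in rows[0]}
col_values_touched = len(columns["price"])

print(row_values_touched, col_values_touched)  # 4000 vs. 1000 values read
```

Columnar layouts also compress better, since each column holds values of one type with repeating patterns.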

What Is Apache Pig?

High-level platform for ETL and data transformation pipelines

  • Pig Latin: Procedural dataflow language (not SQL)
  • Compiles to MapReduce / Tez / Spark — same as Hive
  • Created at Yahoo! Research (2006) for complex multi-step transforms
  • Schema is optional — can process unstructured data

-- "Find total revenue per category for expensive items"
txns     = LOAD '/datasets/ecommerce/transactions.csv' USING PigStorage(',')
               AS (txn_id:chararray, ts:chararray, user_id:chararray,
                   product_id:int, category:chararray, qty:int, price:float);
expensive = FILTER txns BY price > 100;
by_cat    = GROUP expensive BY category;
result    = FOREACH by_cat GENERATE group AS category,
                SUM(expensive.price * expensive.qty) AS revenue;
DUMP result;

Pig vs. Hive: When to Use Which

Aspect | Hive | Pig
Language style | Declarative (SQL) | Procedural (dataflow steps)
Best for | Ad hoc analytics, warehousing | Complex ETL, multi-step transforms
Schema | Required at table creation | Optional
Debugging | EXPLAIN plan | ILLUSTRATE (sample trace)
Users | SQL analysts | ETL engineers
Custom logic | UDFs (Java/Python) | UDFs (Java/Python)

Modern reality: New projects use Spark. But millions of lines of Pig scripts still run in production.

Pig Latin Core Operations (1 of 3)

Data input / output:

Operation | Purpose | SQL equivalent
LOAD | Read data from HDFS | FROM clause
STORE | Write results to HDFS | (INSERT INTO)
DUMP | Print to console — triggers execution | (none)

Pig Latin Core Operations (2 of 3)

Filtering and projection:

Operation | Purpose | SQL equivalent
FILTER | Remove rows by condition | WHERE
FOREACH ... GENERATE | Transform / select columns | SELECT
LIMIT | Cap number of output rows | LIMIT

Pig Latin Core Operations (3 of 3)

Grouping, joining, and ordering:

Operation | Purpose | SQL equivalent
GROUP | Group rows by key | GROUP BY
JOIN | Combine two relations | JOIN
ORDER | Sort output | ORDER BY

Execution model:

  • Pig uses lazy evaluation — transformations build a logical plan
  • Nothing actually runs until DUMP or STORE is reached
  • Similar to Spark's transformation → action model
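The plan-then-execute model can be sketched with a tiny invented `Relation` class; this is a minimal illustration of lazy evaluation, not Pig's internals.

```python
# Minimal lazy-evaluation sketch: operators only append steps to a logical
# plan; nothing executes until dump() (Pig's DUMP/STORE) is called.
class Relation:
    def __init__(self, data, plan=None):
        self._data = data
        self._plan = plan or []          # the logical plan: a list of steps

    def filter(self, pred):
        return Relation(self._data, self._plan + [("filter", pred)])

    def foreach(self, fn):
        return Relation(self._data, self._plan + [("foreach", fn)])

    def dump(self):                      # the "action": run the whole plan
        out = self._data
        for op, fn in self._plan:
            if op == "filter":
                out = [r for r in out if fn(r)]
            else:
                out = [fn(r) for r in out]
        return out

txns = Relation([{"price": 150, "qty": 2}, {"price": 20, "qty": 1}])
pipeline = (txns
            .filter(lambda r: r["price"] > 100)
            .foreach(lambda r: r["price"] * r["qty"]))
# No work has happened yet: the plan is just two recorded steps.
print(pipeline.dump())  # [300]
```

Because the whole plan is visible before execution, the compiler can reorder and fuse steps, which is exactly what Pig (and Spark) exploit.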

Demo: Starting Pig (Grunt Shell)

# Launch Pig interactive shell
docker exec -it pig pig

-- Load transactions
transactions = LOAD '/datasets/ecommerce/transactions.csv'
    USING PigStorage(',')
    AS (txn_id:chararray, ts:chararray, user_id:chararray,
        product_id:int, category:chararray, quantity:int, price:float);

-- Inspect schema
DESCRIBE transactions;

-- Sample the data without full execution
ILLUSTRATE transactions;

Demo: Filter, Transform, Aggregate

-- Filter: only expensive items (price > 100)
expensive = FILTER transactions BY price > 100;

-- Project: compute total per line
with_total = FOREACH expensive GENERATE
    category, price, (price * quantity) AS line_total;

-- Group by category
by_cat = GROUP with_total BY category;

-- Aggregate per group
cat_revenue = FOREACH by_cat GENERATE
    group              AS category,
    COUNT(with_total)  AS num_sales,
    SUM(with_total.line_total) AS total_revenue,
    AVG(with_total.price)      AS avg_price;

-- Sort and display
sorted = ORDER cat_revenue BY total_revenue DESC;
DUMP sorted;

Demo: Joins in Pig

-- Load products
products = LOAD '/datasets/ecommerce/products.csv' USING PigStorage(',')
    AS (product_id:int, name:chararray, category:chararray,
        price:float, stock:int);

-- Join on product_id
joined = JOIN transactions BY product_id, products BY product_id;

-- Project fields from both sides (note :: namespace)
product_sales = FOREACH joined GENERATE
    products::name AS product_name,
    transactions::quantity AS qty;

-- Aggregate and find top 10
by_product = GROUP product_sales BY product_name;
totals     = FOREACH by_product GENERATE
    group AS product_name, SUM(product_sales.qty) AS total_sold;
ordered    = ORDER totals BY total_sold DESC;
top_10     = LIMIT ordered 10;

DUMP top_10;

Storing Results

-- Write results to HDFS (triggers execution like DUMP)
STORE sorted INTO '/user/student/pig_output/category_revenue'
    USING PigStorage(',');

# View output from outside Pig
docker exec -it namenode bash
hdfs dfs -ls /user/student/pig_output/category_revenue/
hdfs dfs -cat /user/student/pig_output/category_revenue/part-r-00000

Pig creates part-r-NNNNN output files (one per reducer) — same as MapReduce.

Activity: Pig Script Challenge

Individual or pairs | 7 minutes

Write a Pig Latin script that:

  1. Loads transactions.csv
  2. Filters for Electronics category only
  3. Computes total spending per user_id (SUM(price × quantity))
  4. Finds the top 5 spenders in Electronics
  5. Stores results in /user/student/pig_output/top_electronics_buyers

Write the script in a text editor (not Grunt) — we'll run it together.

Activity Solution

txns = LOAD '/datasets/ecommerce/transactions.csv' USING PigStorage(',')
    AS (txn_id:chararray, ts:chararray, user_id:chararray,
        product_id:int, category:chararray, quantity:int, price:float);

electronics = FILTER txns BY category == 'Electronics';

with_spend = FOREACH electronics GENERATE
    user_id, (price * quantity) AS spend;

by_user  = GROUP with_spend BY user_id;
totals   = FOREACH by_user GENERATE
    group AS user_id, SUM(with_spend.spend) AS total_spend;
ordered  = ORDER totals BY total_spend DESC;
top5     = LIMIT ordered 5;

STORE top5 INTO '/user/student/pig_output/top_electronics_buyers'
    USING PigStorage(',');

The Three-Way Comparison (1 of 2)

Choose your tool based on the task:

Scenario | Best tool
Analyst needs ad hoc SQL reports | Hive
10-step ETL: clean → join → enrich → aggregate | Pig (or Spark)
Interactive exploration + ML pipeline | Spark SQL
Legacy pipeline already in Hive/Pig | Keep it!

The Three-Way Comparison (2 of 2)

Why each wins in its domain:

Tool | Strength | Weakness
Hive | SQL familiarity, mature warehouse features, ACID | Slow (batch); not for interactive queries
Pig | Procedural multi-step ETL, optional schema, flexible UDFs | Verbose; mostly legacy
Spark SQL | In-memory speed, MLlib integration, iterative algorithms | Higher cluster memory requirements

Real world today: Hive metastore + Spark SQL is the dominant combo — Spark executes, Hive catalogs.

The Dataflow Spectrum

Two extremes — and the middle ground where Pig lived:

Declarative (SQL/Hive)          Procedural (MapReduce)
   "Tell me WHAT"       ←→        "Tell me HOW"
  optimizer decides               developer controls
  every physical step             every physical step

Position | Tool | Trade-off
Fully declarative | SQL / HiveQL | Optimizer has freedom; you have little control
Dataflow middle | Pig, Spark (transforms) | You define steps; compiler optimizes the DAG
Fully procedural | Raw MapReduce | Full control; full responsibility

For distributed systems: declarative enables optimization but limits control. Procedural gives control but requires expertise.

The Pattern That Persists

Pig's real contribution wasn't the tool — it was proving that dataflow programming works at scale.

Modern tool | What it inherited from Pig
Spark | Lazy transformation DAG — build a plan, execute on action
Apache Beam | Unified pipeline model (same API for batch and streaming)
SQL query optimizers | Build dataflow graphs internally to plan and reorder operations

The idea: express computation as a directed acyclic graph of named data steps, let the runtime optimize and execute.

What Is Apache Beam? (1 of 2)

A unified programming model for batch and streaming pipelines — Pig's spiritual successor

  • Google-originated (2016), now Apache top-level project
  • Write once → run on multiple runners: Spark, Flink, Google Dataflow

What Is Apache Beam? (2 of 2)

import apache_beam as beam

# parse_line (assumed defined elsewhere) turns a CSV line into a dict
with beam.Pipeline() as p:
    revenue = (
        p
        | 'Read'    >> beam.io.ReadFromText('transactions.csv')
        | 'Parse'   >> beam.Map(parse_line)
        | 'Filter'  >> beam.Filter(lambda r: r['price'] > 100)
        | 'Revenue' >> beam.Map(lambda r: (r['category'],
                                           r['price'] * r['qty']))
        | 'Total'   >> beam.CombinePerKey(sum)
    )

Notice the | chain — this is a DAG of named steps, exactly like Pig Latin.
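The `|` and `>>` chaining can be re-created in plain Python with operator overloading. This toy is not Beam's actual implementation; the `PColl` and `Transform` classes below are invented to show the mechanism behind the syntax.

```python
# Toy re-creation of Beam-style "| 'Label' >> transform" chaining.
class Transform:
    def __init__(self, fn, name="step"):
        self.fn, self.name = fn, name

    def __rrshift__(self, label):       # 'Label' >> transform names the step
        return Transform(self.fn, label)

class PColl:
    def __init__(self, data):
        self.data = data

    def __or__(self, t):                # pcoll | transform -> new pcoll
        return PColl(t.fn(self.data))

def Map(fn):    return Transform(lambda d: [fn(x) for x in d])
def Filter(fn): return Transform(lambda d: [x for x in d if fn(x)])

result = (
    PColl([80, 120, 250])
    | 'Filter' >> Filter(lambda p: p > 100)
    | 'Double' >> Map(lambda p: p * 2)
)
print(result.data)  # [240, 500]
```

Each `|` application produces a new collection from the previous one, so the chain is literally a linear DAG of named steps, which is what makes the syntax read like Pig Latin.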

Apache Pig vs. Apache Beam (1 of 2)

Aspect | Apache Pig | Apache Beam
Era | 2006–2015 (peak usage) | 2016–present
Paradigm | Dataflow DAG — batch | Dataflow DAG — batch + streaming
Language | Pig Latin (custom DSL) | Python / Java / Go SDK
Execution targets | MapReduce, Tez, Spark | Spark, Flink, GCP Dataflow
Schema | Optional | Defined in pipeline
Status | Maintenance mode | Actively developed
Key contribution | Proved DAG model works | Unified batch/stream with one API

Apache Pig vs. Apache Beam (2 of 2)

The intellectual thread: Pig → Spark (batch DAG) → Beam (batch + stream DAG). The model evolved; the core insight didn't.

Week 7 Key Takeaways (1 of 2)

Hive:

  • SQL for batch data warehousing on HDFS
  • Optimize with partitioning (prune directories) and Parquet (columnar)
  • Metastore = shared schema catalog for the entire ecosystem

Pig:

  • Procedural ETL: express complex multi-step transforms step by step
  • Lazy execution (like Spark); DUMP/STORE triggers the job
  • ILLUSTRATE for debugging — samples trace through the script

Week 7 Key Takeaways (2 of 2)

The bigger picture:

  • There is a spectrum from declarative (SQL) to procedural (MapReduce) — Pig lived in the middle
  • Pig proved the DAG-based dataflow model works; Spark and Apache Beam carry that idea forward
  • Hive + Pig are batch; Spark is interactive; all three share the Hive metastore

Assignment 2 Reminder

Due: Sunday 11:59 PM — this week!

  • RDD-based analysis
  • DataFrame / SQL equivalent queries
  • Performance comparison (RDD vs. DataFrame)
  • Optimization (caching, partitioning, broadcast joins)

Tips:

  • Test on a small subset first, then scale up
  • Use Spark UI at http://localhost:4040 to debug slow stages
  • Review Week 5 (RDDs) and Week 6 (DataFrames) slides

After Spring Break: Midterm

Week 9 — Midterm Exam and Project Design Submission

  • Session 1: 75-minute midterm exam review

    • Closed book
    • Covers Weeks 1–7: HDFS, MapReduce, Spark, Hive, Pig
  • Session 2: 75-minute midterm exam

    • Closed book, 1-page cheat sheet allowed (one side)
    • Covers Weeks 1–7: HDFS, MapReduce, Spark, Hive, Pig

Speaker context: This session is ~15 min theory (partitioning/bucketing), ~15 min Hive optimization demo, then ~30 min of Apache Pig including live Grunt shell demos. Students should leave able to write basic Pig Latin scripts and understand when to pick Hive vs. Pig vs. Spark SQL. After the 3-way comparison, close with the historical "dataflow spectrum" context and a brief Pig vs. Beam comparison — this sets up the intellectual thread that runs from Pig → Spark → Beam. Assignment 2 due Sunday — remind at the start.

ILLUSTRATE is a killer debugging tool — it traces a few sample rows through the entire script. Show this prominently.

Emphasize: Pig didn't fail — it succeeded so well that its core ideas were absorbed into every modern big-data tool. Spark's RDD transformations ARE Pig Latin with a Python syntax.