MongoDB — Documents, Queries, and Aggregation

CS 6500 — Week 10, Session 2

Session 1 Recap

Redis — the key ideas:

  • In-memory → sub-millisecond reads
  • 5 data structures match 5 access patterns
  • Cache-aside: populate on miss; TTL manages staleness
  • Pub/Sub: decouple producers from consumers

The question Redis leaves open:

"Redis is fast — why not store everything there?"

  • Memory-bound (RAM is expensive at scale)
  • No rich queries — no WHERE, no JOIN, no aggregations
  • No durable persistence for complex data

Today: MongoDB — flexible schema + rich queries + disk persistence

CAP Theorem

In any distributed system, you can only guarantee 2 of 3:

Property Meaning
Consistency (C) Every read sees the most recent write
Availability (A) Every request gets a (non-error) response
Partition Tolerance (P) System continues despite network failures

Network partitions are unavoidable in distributed systems — so the real choice is C vs. A when a partition occurs.

  • CP systems: Return an error rather than stale data (e.g., HBase, ZooKeeper)
  • AP systems: Return stale data rather than an error (e.g., Cassandra, DynamoDB)
  • MongoDB replica sets: CP by default (reads from primary); configurable to AP with readPreference: secondary

CAP — Concrete Example

Scenario: 3-node MongoDB replica set. Network partition isolates the primary.

[Node A: Primary] ~~~ ✂ partition ✂ ~~~ [Node B: Secondary]
                                         [Node C: Secondary]

CP behavior (default):

  • Reads fail (or wait) until a new primary is elected
  • Guarantees you never see stale data
  • You sacrifice Availability

AP behavior (readPreference: secondary):

  • Reads succeed — from B or C
  • Data may be seconds behind the last primary write
  • You sacrifice Consistency

CAP trade-offs only matter during partitions. Most of the time, distributed databases give you both C and A.

Documents as the Unit of Storage

MongoDB stores data as BSON (Binary JSON) documents

{
  "_id": ObjectId("64a1b2c3d4e5f67890123456"),
  "name": "Alice Chen",
  "age": 29,
  "email": "alice@example.com",
  "tags": ["python", "spark", "kafka"],
  "address": {
    "city": "Toledo",
    "state": "OH",
    "zip": "43601"
  },
  "created_at": ISODate("2025-03-15T09:30:00Z")
}

Key differences from RDBMS rows:

  • Nested objects and arrays are first-class
  • No enforced schema — documents in the same collection can have different fields
  • _id is always present (auto-generated if omitted)

Collections vs. Tables

RDBMS concept MongoDB equivalent
Database Database
Table Collection
Row Document
Column Field
Primary key _id field
Foreign key Reference (manual) or embedded sub-document
JOIN $lookup pipeline stage
Index Index
Schema No enforced schema (optional validation)

The critical difference: Collections don't enforce a schema. Every document can look different. This is a superpower AND a footgun.
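To make the footgun concrete, here is a plain JavaScript sketch (an in-memory stand-in for a collection — no MongoDB driver, and the sample documents are made up) showing how document shapes can drift apart and why readers must code defensively:

```javascript
// Conceptual sketch: a collection is just a list of documents,
// and nothing stops their shapes from drifting apart.
const users = [
  { _id: 1, name: "Alice", age: 29 },
  { _id: 2, name: "Bob", email: "bob@example.com" },  // no age field at all
  { _id: 3, name: "Carol", age: "27" }                // age stored as a string!
];

// A reader that assumes `age` is always a number will misbehave,
// so defensive code checks both presence and type.
const avgAge = docs => {
  const nums = docs.map(d => d.age).filter(a => typeof a === "number");
  return nums.reduce((s, a) => s + a, 0) / nums.length;
};

console.log(avgAge(users)); // only Alice's numeric age survives the filter
```

Optional schema validation (MongoDB's `$jsonSchema` validator) can rule out documents like Bob's and Carol's — but only if you opt in.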

Schema Design: Embed vs. Reference

The central design decision in MongoDB

Embed sub-documents when:

  • Data is always accessed together (read in one query)
  • 1-to-few relationship (not hundreds of nested items)
  • Nested data doesn't need to be queried independently

{ "_id": 1, "name": "Alice",
  "addresses": [
    { "type": "home", "city": "Toledo" },
    { "type": "work", "city": "Columbus" }
  ]
}

Schema Design: Embed vs. Reference (cont.)

Reference separate documents when:

  • Data is accessed independently
  • 1-to-many (hundreds or thousands of items)
  • Data is shared across multiple parent documents
  • Document size could exceed MongoDB's 16 MB limit

// users collection
{ "_id": 1, "name": "Alice", "order_ids": [101, 102, 103] }

// orders collection
{ "_id": 101, "user_id": 1, "total": 450, "items": [...] }

Rule of thumb:

  • Embed for reads you need fast
  • Reference to avoid duplication and handle unbounded growth
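The read-cost difference between the two patterns can be sketched in plain JavaScript (in-memory stand-ins for the two collections — no driver, illustrative data only):

```javascript
// Embedded: the whole user, addresses included, comes back in ONE lookup.
const usersEmbedded = [
  { _id: 1, name: "Alice", addresses: [{ type: "home", city: "Toledo" }] }
];

// Referenced: fetching a user's orders takes a SECOND lookup per order id.
const users  = [{ _id: 1, name: "Alice", order_ids: [101, 102] }];
const orders = [
  { _id: 101, user_id: 1, total: 450 },
  { _id: 102, user_id: 1, total: 35 }
];

const findById = (coll, id) => coll.find(d => d._id === id);

// One read — addresses are already there:
const alice = findById(usersEmbedded, 1);

// Two-step read — this extra round trip is the price of referencing:
const aliceRef    = findById(users, 1);
const aliceOrders = aliceRef.order_ids.map(id => findById(orders, id));

console.log(alice.addresses.length, aliceOrders.length); // 1 2
```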

Embed vs. Reference — Decision Guide

Scenario Recommendation
Blog post with author name Embed author name (denormalize)
Blog post with all comments Reference — comments can grow unbounded
Order with line items (1–50) Embed — always loaded together
User with all orders (lifetime) Reference — could be thousands
Product with category (shared) Reference — shared across many products
Invoice with billing address Embed — a snapshot that must not change if the customer later updates their address

Bad schema design in MongoDB = slow aggregations and repeated $lookup calls. Design for your queries, not your data.

MongoDB CRUD Operations

// Connect
docker exec -it mongodb mongosh
use ecommerce

// CREATE
db.users.insertOne({ name: "Alice", age: 29, tags: ["python", "spark"] })
db.users.insertMany([
  { name: "Bob", age: 34, tags: ["java"] },
  { name: "Carol", age: 27, tags: ["python", "kafka"] }
])

// READ
db.users.find({ age: { $gt: 25 } })                    // All users > 25
db.users.find({ age: { $gt: 25 } }, { name: 1, age: 1 }) // Project fields
db.users.findOne({ name: "Alice" })                     // First match only

// UPDATE
db.users.updateOne({ name: "Alice" }, { $set: { age: 30 }, $push: { tags: "kafka" } })
db.users.updateMany({ age: { $lt: 30 } }, { $set: { status: "junior" } })

// DELETE
db.users.deleteOne({ name: "Alice" })

Query Operators

// Comparison
{ age: { $gt: 25, $lte: 40 } }       // 25 < age <= 40
{ status: { $in: ["active", "pending"] } }
{ name: { $ne: "Alice" } }

// Logical
{ $and: [{ age: { $gt: 20 } }, { status: "active" }] }
{ $or:  [{ status: "vip" }, { balance: { $gt: 1000 } }] }

// Array operators
{ tags: "python" }                    // tags array contains "python"
{ tags: { $all: ["python", "spark"] } }  // contains all
{ tags: { $size: 3 } }               // exactly 3 elements

// Nested field
{ "address.city": "Toledo" }          // dot notation for nesting

The Aggregation Pipeline

MongoDB's answer to SQL GROUP BY, JOIN, and analytics

Pipeline stages are chained — output of one becomes input of the next:

db.orders.aggregate([
  { $match:   { status: "completed" } },           // Filter rows (WHERE)
  { $group:   { _id: "$category",                  // GROUP BY category
                total: { $sum: "$amount" },
                count: { $sum: 1 } } },
  { $sort:    { total: -1 } },                     // ORDER BY total DESC
  { $limit:   5 },                                 // LIMIT 5
  { $project: { category: "$_id",                  // SELECT (rename fields)
                total: 1, count: 1, _id: 0 } }
])

SQL equivalent:

SELECT category, SUM(amount) AS total, COUNT(*) AS count
FROM orders WHERE status = 'completed'
GROUP BY category ORDER BY total DESC LIMIT 5
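To see what each stage passes to the next, here is the same pipeline hand-translated into plain JavaScript array operations (a conceptual sketch over made-up sample data — not how the server actually executes it):

```javascript
const orders = [
  { category: "Electronics", amount: 450, status: "completed" },
  { category: "Books",       amount: 35,  status: "completed" },
  { category: "Electronics", amount: 800, status: "completed" },
  { category: "Books",       amount: 200, status: "pending" }
];

// $match — keep only completed orders (filter early, shrink the stream)
const matched = orders.filter(o => o.status === "completed");

// $group — accumulate a running total and count per category key
const groups = {};
for (const o of matched) {
  const g = groups[o.category] ?? (groups[o.category] = { _id: o.category, total: 0, count: 0 });
  g.total += o.amount;
  g.count += 1;
}

// $sort (total desc) → $limit 5 → $project (rename _id to category)
const result = Object.values(groups)
  .sort((a, b) => b.total - a.total)
  .slice(0, 5)
  .map(g => ({ category: g._id, total: g.total, count: g.count }));

console.log(result);
// [ { category: 'Electronics', total: 1250, count: 2 },
//   { category: 'Books', total: 35, count: 1 } ]
```

Moving $match after $group would force grouping over pending orders too — same reason you filter before aggregating in SQL.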

Key Pipeline Stages

Stage Purpose SQL Equivalent
$match Filter documents WHERE
$group Aggregate by key GROUP BY
$sort Order results ORDER BY
$limit Cap result count LIMIT
$project Select / rename fields SELECT
$lookup Join another collection JOIN
$unwind Flatten array field (unnest / lateral join)
$addFields Compute new fields (computed column)
$count Count matching docs COUNT(*)

Demo: Aggregation on the E-Commerce Dataset

# Load sample data
docker exec -it mongodb mongosh --eval "
  db = db.getSiblingDB('ecommerce');
  db.orders.insertMany([
    {customer:'Alice', category:'Electronics', amount:450, status:'completed', month:1},
    {customer:'Bob',   category:'Books',       amount:35,  status:'completed', month:1},
    {customer:'Alice', category:'Electronics', amount:800, status:'completed', month:2},
    {customer:'Carol', category:'Clothing',    amount:120, status:'completed', month:1},
    {customer:'Bob',   category:'Electronics', amount:650, status:'pending',   month:1},
    {customer:'Carol', category:'Books',       amount:200, status:'completed', month:2}
  ])
"
// Top categories by total revenue (completed orders only)
db.orders.aggregate([
  { $match:  { status: "completed" } },
  { $group:  { _id: "$category", revenue: { $sum: "$amount" }, orders: { $sum: 1 } } },
  { $sort:   { revenue: -1 } },
  { $project: { category: "$_id", revenue: 1, orders: 1, _id: 0 } }
])

Demo: $lookup — Joining Collections

// Add a customers collection
db.customers.insertMany([
  { _id: "Alice", tier: "premium", country: "US" },
  { _id: "Bob",   tier: "basic",   country: "US" },
  { _id: "Carol", tier: "premium", country: "CA" }
])

// Join orders with customer details
db.orders.aggregate([
  { $match: { status: "completed" } },
  { $lookup: {
      from: "customers",       // collection to join
      localField: "customer",  // field in orders
      foreignField: "_id",     // field in customers
      as: "customer_info"      // new array field
  }},
  { $unwind: "$customer_info" },  // flatten the 1-element array
  { $project: {
      customer: 1, amount: 1, category: 1,
      tier: "$customer_info.tier", _id: 0
  }}
])
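Conceptually, $lookup followed by $unwind is a left join that first wraps matches in an array field and then flattens it. A plain JavaScript sketch (in-memory data, no driver — the collections mirror the demo above):

```javascript
const orders = [
  { customer: "Alice", amount: 450, category: "Electronics" },
  { customer: "Bob",   amount: 35,  category: "Books" }
];
const customers = [
  { _id: "Alice", tier: "premium" },
  { _id: "Bob",   tier: "basic" }
];

// $lookup: for each order, collect matching customers into an ARRAY field
const looked = orders.map(o => ({
  ...o,
  customer_info: customers.filter(c => c._id === o.customer)
}));

// $unwind: one output document per array element
// (by default this DROPS orders whose array came back empty)
const unwound = looked.flatMap(o =>
  o.customer_info.map(ci => ({ ...o, customer_info: ci }))
);

// $project: pull the tier up to the top level
const final = unwound.map(o => ({
  customer: o.customer, amount: o.amount, tier: o.customer_info.tier
}));

console.log(final);
// [ { customer: 'Alice', amount: 450, tier: 'premium' },
//   { customer: 'Bob', amount: 35, tier: 'basic' } ]
```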

Hands-On: MongoDB Analytics Lab (20 min)

Individual | Docker MongoDB container

Using the ecommerce database from the demo:

Task 1: Find all orders over $200 placed in month 1 — return customer name and amount

Task 2: Aggregation — total revenue per customer (completed orders only)

Task 3: $lookup — enrich each order with the customer's tier from the customers collection

Task 4 (Design): You're building a blog platform with:

  • Users, Posts, Comments, Tags

Propose a MongoDB schema. For each collection, show a JSON example and justify each embed vs. reference decision.

Deliverable: Shell commands for Tasks 1–3 + text schema diagram for Task 4

Activity Debrief: Blog Schema Design

A reasonable approach:

// posts collection — embed what you always need
{ "_id": 1, "title": "Redis in Production",
  "author": { "id": 42, "name": "Alice" },  // denormalized name
  "tags": ["redis", "caching", "performance"],
  "comment_count": 47,
  "created_at": "2025-03-01T10:00:00Z" }

// comments collection — reference parent post
{ "_id": 201, "post_id": 1, "author_id": 99,
  "text": "Great article!", "created_at": "2025-03-02T08:15:00Z" }

Key decisions:

  • Author name embedded in post (fast display; stale risk is acceptable)
  • Comments referenced — unbounded growth, paginated independently
  • Tags embedded — small, always loaded with post, queried together
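The "paginated independently" point is the payoff of referencing: comments are fetched a page at a time by post_id. A minimal sketch (plain JavaScript stand-in with invented data; in mongosh this is the shape of find({post_id: 1}).sort({_id: -1}).skip(10).limit(10)):

```javascript
// 47 fake comments on post 1, with ascending _ids 200..246
const comments = Array.from({ length: 47 }, (_, i) => ({
  _id: 200 + i, post_id: 1, text: `comment ${i}`
}));

// One page of a post's comments, newest _id first
const page = (postId, pageNum, size) =>
  comments
    .filter(c => c.post_id === postId)   // find({ post_id })
    .sort((a, b) => b._id - a._id)       // sort({ _id: -1 })
    .slice(pageNum * size, (pageNum + 1) * size); // skip + limit

const page2 = page(1, 1, 10);            // second page, 10 per page
console.log(page2.length, page2[0]._id); // 10 236
```

Embedding all 47 comments in the post would force loading every one of them just to render page 2.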

When to Use What

Scenario Redis MongoDB PostgreSQL
Session storage ✅ First choice ❌ Overkill ❌ Too slow
Product catalog ❌ No rich queries ✅ Flexible schema ✅ If structured
Real-time leaderboard ✅ Sorted sets ❌ Too slow ❌ Too slow
Complex joins ❌ Not designed for ⚠️ $lookup (limited) ✅ First choice
Write-heavy IoT stream ❌ Memory limits ⚠️ Possible ❌ Bottleneck
Financial records ❌ No ACID ⚠️ ACID since 4.0, but slower ✅ First choice
Social graph ❌ No traversal ⚠️ Reference chains ⚠️ Recursive CTEs (awkward)

Key insight: Production systems use all three — each handles what it does best.

MongoDB in Production: Key Considerations

Indexing — queries without indexes do full collection scans:

db.orders.createIndex({ customer: 1 })           // Single field
db.orders.createIndex({ status: 1, amount: -1 }) // Compound
db.orders.createIndex({ category: "text" })       // Full-text search
db.orders.explain("executionStats").find(...)     // Query plan inspection

Document size limit: 16 MB per document — the hard limit that forces referencing for large arrays.

Transactions: MongoDB 4.0+ supports multi-document ACID transactions — but they're slower. Design schemas to avoid needing them.

Replication: Replica sets provide high availability (primary + secondaries). Automatic failover in ~10 seconds.

Common Misconceptions — Addressed

"MongoDB is schema-less so you don't need to design schemas"

  • Wrong. Schema design is MORE important in MongoDB.
  • Bad embedding = 16 MB limit hit, slow scans, difficult updates.
  • Always design for your queries first.

"$lookup is as fast as a SQL JOIN"

  • Wrong. $lookup does not use indexes on the joined collection efficiently in all cases.
  • If you $lookup constantly, consider whether referencing was the right choice.

"CAP means I have to pick 2 and ignore the 3rd"

  • Wrong. CAP trade-offs only matter during network partitions.
  • Most of the time, distributed databases provide both C and A.

Week 10 Summary

Redis:

  • In-memory data structure server: String, Hash, List, Set, Sorted Set
  • Cache-aside and write-through patterns; TTL + LRU eviction
  • Pub/Sub for real-time messaging

MongoDB:

  • Document store: flexible BSON, no enforced schema
  • Schema design: embed (fast reads) vs. reference (avoid duplication)
  • Aggregation pipeline: $match → $group → $sort → $project → $lookup

CAP Theorem: Trade-off is C vs. A during partitions — not a permanent choice.

Quiz 10 is now open on Canvas — due before Week 11.

Homework (Due Before Week 11)

Assignment A — Redis Caching Layer

  • Python script implementing cache-aside with 60-second TTL
  • Run 1000 random lookups, measure hit rate
  • Experiment: change TTL to 10s, explain hit-rate difference
  • Deliverable: redis_cache.py + redis_analysis.md

Assignment B — MongoDB Schema Design

  • Design a music streaming platform schema (Users, Songs, Playlists, Play events)
  • Justify each embed vs. reference decision
  • Write 3 aggregation pipelines: top songs, revenue by tier, avg session length by device
  • Deliverable: music_schema.md

Preview: Week 11 — Cassandra/ScyllaDB

Next week: Wide-column stores for globally distributed, write-heavy workloads

  • Cassandra: AP system — always available, eventually consistent
  • ScyllaDB: Drop-in Cassandra replacement written in C++, lower tail latency
  • CQL: Cassandra Query Language — SQL-like but query-first design required
  • Ring architecture: Consistent hashing, no single point of failure

Key question to think about:

"You have 1 billion IoT sensor readings per day. They need to be written fast and queried by device + time range. Redis? MongoDB? Or something else?"

See you next week.

Speaker context: Students just finished hands-on Redis. Now we shift to MongoDB — flexible schema, rich queries, disk persistence. The key tensions to surface: (1) schema design matters MORE in MongoDB than SQL, not less; (2) embedding vs. referencing is the central decision; (3) CAP theorem becomes concrete here. The $lookup lab task often runs over — have a pre-run version ready to demo. The schema design challenge (Task 4) has no single right answer; grade on justification.

Run this live. Walk through each stage, pausing to explain what the intermediate result looks like. Ask: "What would happen if I moved $match after $group?"

Task 4 discussion points: comments can grow unbounded → reference. Tags can be embedded if the count stays small (< 20). Author name is often embedded (denormalized) on posts for fast display.