Query-First Design and the NoSQL Decision

CS 6500 — Week 11, Session 2

CS 6500 — Big Data Analytics | Week 11

The Driving Question

"Your Cassandra table stores e-commerce orders, partitioned by customer_id. A month after launch, your product team asks: 'Can we show all orders for product X across all customers?' Your query fails instantly with 'Cannot execute this query as it might involve data filtering.' You didn't change any data. Why does changing the question break everything?"


The Answer: Tables Are Answers, Not Storage

In Cassandra, a table is not a place to store data — it is a pre-computed answer to a specific question.

ScyllaDB
Why a C++ rewrite of Cassandra achieves 10× lower p99 latency on the same hardware — and why the CQL is identical

Query-First Modeling
The rules for designing partition keys and clustering columns — and why denormalization is a feature, not a compromise

By the end of this session you'll have a decision framework for choosing Redis, MongoDB, or Cassandra/ScyllaDB for any system design problem.


Session 1 Recap

Cassandra's key architectural guarantees:

  • Ring — no master, every node equal, routes around failures automatically
  • Consistent hashing — partition key → token → responsible node(s)
  • Tunable consistency — R + W > RF for strong; ONE/ONE for maximum availability
  • Write path — commit log + memtable → SSTable (sequential writes, very fast)
  • CQL rule — partition key must appear in every WHERE clause; one table per query pattern
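The consistency bullet can be checked in one line. A toy check of the quorum rule (plain Python, not driver code):

```python
# A read is guaranteed to see the latest write when R + W > RF, because
# every read quorum then overlaps every write quorum in at least one replica.
def is_strongly_consistent(r: int, w: int, rf: int) -> bool:
    return r + w > rf

# QUORUM/QUORUM with RF = 3: R = 2, W = 2, and 4 > 3, so reads are strong
assert is_strongly_consistent(2, 2, 3)
# ONE/ONE with RF = 3: 1 + 1 = 2 <= 3, so only eventual consistency
assert not is_strongly_consistent(1, 1, 3)
```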

Today: learn the rules for designing those tables — and meet ScyllaDB, Cassandra's faster sibling.


ScyllaDB

The same ring, the same CQL — 10× lower tail latency


Why ScyllaDB Exists — The JVM Problem

Apache Cassandra runs on the JVM. The JVM has garbage collection.

Cassandra node under load:
  [Request 1] 2ms ✓
  [Request 2] 3ms ✓
  [Request 3] 3ms ✓
  [GC PAUSE — 40ms freeze]     ← every node, unpredictably
  [Request 4] 44ms ✗ (SLA violation)
  [Request 5] 3ms ✓

p50 latency: 3ms — acceptable
p99 latency: 50ms+ — SLA violation, alerts, on-call pages

GC pauses are unpredictable and worsen under memory pressure. You can tune the JVM — but you cannot eliminate GC.


ScyllaDB Architecture

Complete rewrite in C++ using the Seastar async framework (2015)

Component   | Cassandra (Java)        | ScyllaDB (C++)
Memory      | JVM heap, GC            | Manual allocation, no GC
Concurrency | Threads (JVM scheduler) | Per-core reactor (Seastar)
Networking  | Netty                   | Zero-copy io_uring
CPU model   | Shared memory, locking  | Shared-nothing: each core owns its data
p99 latency | 10–100 ms (GC spikes)   | < 1 ms typical

Shared-nothing: Core 0 owns shard 0's data; Core 1 owns shard 1's. No cross-core locking. No GC pauses. Throughput scales linearly with cores.
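The shard-per-core idea fits in a few lines. A minimal sketch (a hypothetical model in plain Python, not Seastar):

```python
# Each partition key hashes to exactly one core, and only that core ever
# touches that shard's data, so no cross-core locks are needed.
import hashlib

NUM_CORES = 8
shards = [dict() for _ in range(NUM_CORES)]  # each "core" owns one private dict

def owning_core(partition_key: str) -> int:
    token = int(hashlib.md5(partition_key.encode()).hexdigest(), 16)
    return token % NUM_CORES

def write(partition_key: str, value: str) -> None:
    # the request is routed to the owning core; no other core is involved
    shards[owning_core(partition_key)][partition_key] = value

write("sensor-1", "72.5")
# the key lives on exactly one core's shard
assert sum("sensor-1" in shard for shard in shards) == 1
```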


Cassandra vs. ScyllaDB

                    | Cassandra                | ScyllaDB
Language            | Java (JVM)               | C++ (Seastar)
API                 | CQL 3.x                  | CQL 3.x (identical)
p99 latency         | Higher (GC spikes)       | Lower (no GC)
Migration effort    | n/a                      | Near zero (drop-in)
Operational tooling | Mature (nodetool, etc.)  | Compatible + extras
Managed cloud       | Astra (DataStax)         | ScyllaDB Cloud

When to Choose ScyllaDB

Choose ScyllaDB when:

  • Your SLA is p99 < 5ms and Cassandra's GC is causing violations
  • You want to shrink cluster size (ScyllaDB handles ~10× more ops per node)
  • You have Cassandra expertise already — the CQL is identical

Query-First Data Modeling

Don't model your data. Model your queries.


The Modeling Mantra

SQL: Design one normalized schema. Write any query at runtime.

CQL: Decide every query you need at design time. Build one table per query.

Product team query: "All orders for product X, newest first"
→ You need: PRIMARY KEY ((product_id), order_date, order_id)

Product team query: "All orders by customer X, newest first"
→ You need: PRIMARY KEY ((customer_id), order_date, order_id)

Same data? Write to BOTH tables on every INSERT.

Disk is cheap. Query latency is not. Duplicate the data; never scan the cluster.
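The dual-write pattern above can be modeled in a few lines (plain dicts standing in for the two tables; all names are illustrative):

```python
# One table per query: the SAME order is written twice, once per access pattern.
orders_by_customer = {}   # "partition key": customer_id
orders_by_product = {}    # "partition key": product_id

def insert_order(customer_id, product_id, order_date, order_id):
    # the application writes to BOTH tables on every insert
    orders_by_customer.setdefault(customer_id, []).append((order_date, order_id, product_id))
    orders_by_product.setdefault(product_id, []).append((order_date, order_id, customer_id))

insert_order("C42", "P7", "2025-03-14", "o-1")
insert_order("C99", "P7", "2025-03-15", "o-2")

# both questions are now single-partition lookups; no cluster scan
assert len(orders_by_product["P7"]) == 2
assert len(orders_by_customer["C42"]) == 1
```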


Partition Key Selection

Goal: even distribution + efficient queries + bounded partition size

Rule                                            | Reason
High cardinality (user_id, sensor_id)           | Distributes data evenly across nodes
Columns used in most WHERE clauses              | Enables efficient routing to one node
Composite key if needed (user_id, date)         | Bounds partition growth for time-series
Never low-cardinality (status, country)         | 3 statuses = 3 huge hotspot partitions
Never unbounded growth without a date component | A single user_id partition growing forever eventually crashes a node

Partition Key Selection

Hotspot detection: nodetool tablestats — a partition > 100 MB is a warning sign.
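The low-cardinality hotspot is easy to demonstrate numerically. A sketch: hash 10,000 orders by `status` versus by `order_id` and count the distinct partitions (toy hashing, not Cassandra's Murmur3):

```python
import hashlib
from collections import Counter

def partition_of(key: str) -> str:
    return hashlib.md5(key.encode()).hexdigest()

statuses = ["pending", "shipped", "delivered"]
by_status = Counter(partition_of(statuses[i % 3]) for i in range(10_000))
by_order_id = Counter(partition_of(f"order-{i}") for i in range(10_000))

assert len(by_status) == 3          # three giant partitions: hotspots
assert len(by_order_id) == 10_000   # spread evenly across the token ring
```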


Demo: Time-Series IoT

-- IoT sensors report temperature and humidity every 10 seconds.
-- Query: "All readings for sensor S001 on 2025-03-14 between 08:00 and 10:00"

CREATE TABLE sensor_readings (
    sensor_id    TEXT,
    reading_date DATE,
    ts           TIMESTAMP,
    temperature  DECIMAL,
    humidity     DECIMAL,
    PRIMARY KEY ((sensor_id, reading_date), ts)
) WITH CLUSTERING ORDER BY (ts DESC);

Why (sensor_id, reading_date) as composite partition key?


Demo: Time-Series IoT

  • sensor_id alone: a busy sensor writing every 10s for a year = 3M rows in one partition → too large
  • Adding reading_date bounds each partition to one day of readings → ~8,640 rows/partition → healthy
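The bounded-partition arithmetic from the bullets above, made explicit:

```python
SECONDS_PER_DAY = 24 * 60 * 60        # 86,400
READING_INTERVAL_S = 10               # one reading every 10 seconds

rows_per_day = SECONDS_PER_DAY // READING_INTERVAL_S
assert rows_per_day == 8_640          # healthy, bounded partition

rows_per_year = rows_per_day * 365
assert rows_per_year == 3_153_600     # ~3M rows: far too large for one partition
```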

Demo: Efficient Range Query

-- Insert a reading (hardcoded ts so the range query below always returns results)
INSERT INTO sensor_readings (sensor_id, reading_date, ts, temperature, humidity)
VALUES ('S001', '2025-03-14', '2025-03-14 09:15:00+0000', 72.5, 44.8);

-- ✅ Efficient: partition key fully specified, clustering column range
SELECT * FROM sensor_readings
WHERE sensor_id = 'S001'
  AND reading_date = '2025-03-14'
  AND ts > '2025-03-14 08:00:00+0000'
  AND ts < '2025-03-14 10:00:00+0000';

-- Show partitions (useful to verify data distribution)
SELECT DISTINCT sensor_id, reading_date FROM sensor_readings;

This query touches exactly one partition on one node — O(log n) within the partition. No cluster scan.
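Why the range is cheap inside the partition: rows are kept sorted by the clustering column, so a time range is just two binary searches. A simplified ascending model (the real table clusters DESC, which only flips the direction):

```python
import bisect

# one day's hourly readings, sorted by timestamp as Cassandra stores them
partition = sorted(f"2025-03-14 {h:02d}:00:00" for h in range(24))

lo = bisect.bisect_right(partition, "2025-03-14 08:00:00")  # ts > 08:00
hi = bisect.bisect_left(partition, "2025-03-14 10:00:00")   # ts < 10:00
assert partition[lo:hi] == ["2025-03-14 09:00:00"]
```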


Denormalization: One Write, Two Tables

Product team asks for two queries. Design two tables.

-- Table 1: "All activity for user X" (user profile page)
CREATE TABLE activity_by_user (
    user_id TEXT,
    ts      TIMESTAMP,
    action  TEXT,
    detail  TEXT,
    PRIMARY KEY (user_id, ts)
) WITH CLUSTERING ORDER BY (ts DESC);


Denormalization: One Write, Two Tables

Product team asks for two queries. Design two tables.


-- Table 2: "All login events across all users" (security dashboard)
CREATE TABLE activity_by_type (
    action  TEXT,
    ts      TIMESTAMP,
    user_id TEXT,
    detail  TEXT,
    PRIMARY KEY ((action), ts, user_id)
) WITH CLUSTERING ORDER BY (ts DESC);

Application: on every event, write to BOTH tables in the same batch. Each read is then a single-partition lookup; duplicated storage is the trade-off. Note that action alone is low-cardinality and grows without bound, so a production schema would bucket it, e.g. PRIMARY KEY ((action, event_date), ts, user_id).


Lightweight Transactions and TTL

Lightweight Transactions (LWT) — compare-and-set

CREATE TABLE IF NOT EXISTS user_sessions (
    session_id UUID,
    user_id    TEXT,
    created_at TIMESTAMP,
    PRIMARY KEY (session_id)
);

-- Only insert if no row exists (idempotent registration)
INSERT INTO user_sessions (session_id, user_id, created_at)
VALUES (uuid(), 'user_42', toTimestamp(now()))
IF NOT EXISTS;
-- Returns [applied] = true/false

LWT uses Paxos — 4 round trips per operation. Use sparingly: high-write paths should never use LWT.
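The compare-and-set semantics can be mirrored with a dict standing in for the table (a toy model; real LWT pays those 4 Paxos round trips for exactly this guarantee):

```python
sessions = {}

def insert_if_not_exists(session_id, row):
    """Returns the [applied] flag: True only if no row existed."""
    if session_id in sessions:
        return False
    sessions[session_id] = row
    return True

assert insert_if_not_exists("s1", {"user_id": "user_42"}) is True
assert insert_if_not_exists("s1", {"user_id": "user_99"}) is False  # [applied] = false
assert sessions["s1"]["user_id"] == "user_42"  # the original row wins
```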

TTL — auto-expiring rows

-- Session token expires after 24 hours (86,400 seconds)
INSERT INTO user_sessions (session_id, user_id, created_at)
VALUES (uuid(), 'user_42', toTimestamp(now()))
USING TTL 86400;

TTL is perfect for: sessions, short-lived cache entries, IoT data retention windows (e.g., keep last 30 days only).
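A toy TTL model shows the semantics: expiry is enforced lazily at read time, roughly how Cassandra hides expired cells until compaction reclaims them (timestamps passed explicitly so the behavior is deterministic):

```python
rows = {}

def insert_with_ttl(key, value, ttl_s, now):
    rows[key] = (value, now + ttl_s)     # store the expiry deadline

def read(key, now):
    hit = rows.get(key)
    if hit is None or now >= hit[1]:
        return None                      # expired rows are invisible
    return hit[0]

insert_with_ttl("s1", "user_42", ttl_s=86_400, now=0)
assert read("s1", now=3_600) == "user_42"   # still live after 1 hour
assert read("s1", now=90_000) is None       # gone after 25 hours
```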


Collections

Collections — embed multi-values in one row

CREATE TABLE user_preferences (
    user_id   UUID PRIMARY KEY,
    tags      SET<TEXT>,          -- unordered, unique values
    history   LIST<TIMESTAMP>,    -- ordered, allows duplicates
    settings  MAP<TEXT, TEXT>     -- key-value pairs
);

UPDATE user_preferences SET tags = tags + {'analytics', 'premium'}
WHERE user_id = :id;
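SET<TEXT> semantics in miniature: Python's set mirrors the unordered, unique behavior, and CQL's `tags + {...}` / `tags - {...}` map to `|` and `-`:

```python
tags = {"beta"}
tags |= {"analytics", "premium"}   # UPDATE ... SET tags = tags + {'analytics', 'premium'}
assert tags == {"beta", "analytics", "premium"}

tags -= {"premium"}                # SET tags = tags - {'premium'}
assert tags == {"beta", "analytics"}
```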

Secondary Indexes

Secondary indexes — use with caution

CREATE INDEX ON orders_by_customer (status);

Secondary indexes work by querying every node — they are cluster-wide scans in disguise. Only appropriate for low-cardinality columns on small tables. In production: prefer a new table over a secondary index.


The NoSQL Decision

Redis, MongoDB, or Cassandra/ScyllaDB?


NoSQL Comparison

                 | Redis                | MongoDB          | Cassandra / ScyllaDB | HBase
Data model       | Key-value/structures | JSON documents   | Wide-column rows     | Wide-column rows
CAP default      | CP (cluster mode)    | CP (replica set) | AP (tunable)         | CP (ZooKeeper)
Write throughput | Extreme (in-memory)  | High             | Very high            | High (WAL + MemStore)

NoSQL Comparison

                   | Redis           | MongoDB            | Cassandra / ScyllaDB          | HBase
Schema flexibility | None            | Very flexible      | Fixed CQL schema              | Column families, sparse
Read richness      | Key lookup only | Any field, ad hoc  | Partition key required        | Row key + range scans
Multi-DC native    | No              | Replica sets only  | Yes (NetworkTopologyStrategy) | No (requires replication config)
Storage            | RAM             | Disk (WiredTiger)  | Disk (LSM-tree)               | HDFS (LSM-tree)

NoSQL Comparison

         | Redis                    | MongoDB                 | Cassandra / ScyllaDB         | HBase
Best for | Cache, sessions, pub/sub | Documents, rich queries | Time-series, write-heavy, HA | Hadoop-native analytics, sparse data

When to Pick Each Tool

Redis — the answer when speed is the only metric

  • Sub-millisecond latency requirement
  • Data fits in RAM (or you're willing to pay for enough RAM)
  • Use case: session store, leaderboards, rate limiting, pub/sub, distributed locks

MongoDB — the answer when structure is unknown at design time

  • Schema evolves frequently, documents have variable fields
  • Rich ad-hoc queries or aggregation pipelines needed
  • Use case: product catalog, user profiles, CMS content, IoT device metadata

When to Pick Each Tool

Cassandra / ScyllaDB — the answer when writes must never stop

  • >50K writes/second sustained, globally distributed
  • High availability requirement (99.99%+ uptime, no elections, no downtime)
  • Use case: IoT telemetry, clickstream, time-series, global user activity logs

When to Pick Each Tool

HBase — the answer when you're already in the Hadoop ecosystem

  • Data lives in HDFS and needs both random-access reads and batch MapReduce/Spark jobs
  • Sparse, wide tables: millions of columns, most cells empty (e.g., user-feature matrices)
  • Strong row-level consistency required (CP via ZooKeeper coordination)
  • Use case: genomics data, web crawl indexes, feature stores feeding Spark ML pipelines

Do not choose HBase if you have no HDFS — operational overhead is high without an existing Hadoop cluster
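The whole framework compresses into a toy decision function (thresholds and parameter names are illustrative, drawn from the bullets above; not official guidance):

```python
def pick_store(p99_ms, writes_per_sec, fits_in_ram, schema_known_upfront,
               hdfs_already=False):
    if hdfs_already:
        return "HBase"                      # only sensible inside a Hadoop cluster
    if p99_ms < 1 and fits_in_ram:
        return "Redis"
    if writes_per_sec > 50_000 and schema_known_upfront:
        return "Cassandra/ScyllaDB"
    return "MongoDB"

assert pick_store(0.5, 1_000, True, True) == "Redis"
assert pick_store(5, 2_000_000, False, True) == "Cassandra/ScyllaDB"
assert pick_store(50, 500, False, False) == "MongoDB"
```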


Activity

NoSQL selection challenge — groups of 3, 5 minutes


Activity

Groups of 3 | 5 minutes

For each scenario, choose Redis, MongoDB, or Cassandra/ScyllaDB and write one sentence justifying your choice.

Scenario A: A global live-streaming platform needs to track viewer counts per second per stream, retained for 24 hours, for 100,000 concurrent streams.

Scenario B: A healthcare platform stores patient visit notes — variable structure (diagnoses, medications, free-text notes), queried by doctor, patient, and date. HIPAA compliance required.

Scenario C: An ad-click tracking system processes 2 million clicks per second globally, retained for 90 days, queried by campaign and by user.


Activity Solution

Scenario            | Choice               | Reason
A — Viewer counts   | Cassandra            | Time-series; 100K streams × 60 writes/min = massive write volume; TTL for 24 h expiry; per-stream partition key
B — Medical records | MongoDB              | Variable schema (diagnoses differ by patient); rich queries across multiple fields; document model fits naturally
C — Ad clicks       | Cassandra / ScyllaDB | 2M writes/sec exceeds Redis RAM capacity at this scale; time-based TTL (90 days); PRIMARY KEY ((campaign_id, date), ts) is efficient; ScyllaDB if the p99 SLA is strict

Activity Solution

Boundary cases worth discussing:

  • Scenario A: Redis works if viewer count is just an atomic counter (INCR) — but 24h retention of per-second data exceeds typical Redis memory budgets
  • Scenario C: Redis is used for real-time aggregation (sliding windows), Cassandra for durable storage — hybrid is common in production
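The Scenario C memory probe, worked through (the 100-byte record size is an assumption; the order of magnitude is the point):

```python
clicks_per_sec = 2_000_000
retention_s = 90 * 24 * 60 * 60          # 90 days
bytes_per_click = 100                    # assumed small click record

total_bytes = clicks_per_sec * retention_s * bytes_per_click
assert round(total_bytes / 1e15, 1) == 1.6   # about 1.6 PB: no Redis RAM budget survives this
```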

Week 11 Key Takeaways

Cassandra architecture:

  • Masterless ring, consistent hashing, gossip — no single point of failure
  • Tunable consistency: AP by default; QUORUM/QUORUM for strong consistency
  • Write path is fast (commit log + memtable); reads are slightly heavier

ScyllaDB:

  • Same CQL, same ring — C++ eliminates JVM GC pauses
  • 10× lower p99 tail latency; linear core scaling with shared-nothing model
  • Drop-in Cassandra replacement for latency-sensitive workloads

Week 11 Key Takeaways

Query-first data modeling:

  • Design tables to answer specific queries — one table per access pattern
  • Partition key must be high-cardinality and present in every WHERE
  • Composite partition key (entity_id, date) bounds time-series partition growth
  • Denormalize: write to multiple tables; never use ALLOW FILTERING in production

The three-way NoSQL decision:

  • Redis → speed in RAM
  • MongoDB → flexible documents + rich queries
  • Cassandra → writes that must never stop, global distribution

What's Missing?

Cassandra stores events reliably at scale — but it doesn't process them in flight


The Gaps

  • No real-time event processing — Cassandra accepts writes but doesn't react to them as they arrive; triggering alerts, updating aggregations, or feeding downstream systems as events happen requires a dedicated streaming layer
  • No windowed aggregations — "how many failed logins in the last 60 seconds?" cannot be answered from Cassandra without a partition scan; a streaming system computes these windows in memory as events flow
  • No event replay — once a row is stored in Cassandra, consumers can't re-read the stream to rebuild state; you need a durable event log that retains order and allows rewind
  • No cross-store queries — getting a unified view across Cassandra, MongoDB, and S3 requires ETL or a federated query engine (tools like Trino handle this, though that's a separate architecture layer)

What Comes Next

Gap                                               | Solution                                            | When
Durable event log + decoupled producers/consumers | Kafka — topics, partitions, consumer groups         | Week 12
Windowed aggregations over a live stream          | Spark Structured Streaming — micro-batch over Kafka | Week 13
NoSQL schema design + query federation            | Assignment 3                                        | Due end of Week 12

Cassandra is the right tool for storing events at scale — but reacting to those events as they arrive requires a different architecture: the event streaming layer.


Looking Ahead: Week 12

Stream Processing Fundamentals + Kafka

Session 1: Kafka Architecture
Topics, partitions, replication, and consumer groups — why Kafka is a durable distributed log, not just a message queue

Session 2: Producers and Consumers
Python client implementation, offset management, and delivery semantics — at-least-once vs. exactly-once

Assignment 3 is due end of Week 12 — it covers MongoDB and Cassandra schema design plus a query federation task. Start the schema design now.


Speaker context: Session 1 gave students the ring architecture and the ALLOW FILTERING error. Today answers "so how do I design a schema that doesn't need ALLOW FILTERING?" — the query-first modeling section is the heart of this session. The three-way comparison at the end is extremely high-value for the final exam and job interviews. The ScyllaDB section is short (~10 min, conceptual) — its main point is that CQL is identical, so "drop-in replacement" is real. Close with the NoSQL selection activity to cement the decision framework.

Let this land before moving on. The failure isn't a bug — it's the database enforcing its data model. Today explains why, and what to do about it.

This framing — "tables are answers" — is the single most important mental model shift of the week. Repeat it verbatim when students ask why they can't just add a WHERE clause.

Quick verbal check: ask the room "what happens if a node is down when a write arrives?" Expect: hinted handoff. If blank stares, spend 2 minutes here before proceeding — the rest of the session depends on this foundation. (~3 minutes)

Emphasize the p99 framing. p50 looks fine on dashboards; p99 is what users actually experience on the unlucky request. SLA contracts are almost always written in p99. The GC problem is why every major Cassandra user (Discord, Netflix) either runs massive clusters with excess headroom or moved to ScyllaDB. (~5 minutes)

The key insight is "shared-nothing per core" — contrast with Cassandra's thread pool model where cores contend on shared data structures. Students with OS background will recognize this as the classic lock-contention problem. You don't need to go deep — the takeaway is "C++ + no GC + no cross-core contention = dramatically better tail latency." (~4 minutes)

Demo point: run the exact same CQL from Session 1 on ScyllaDB — it works unchanged. The migration is a data migration (backup/restore), not a code change. Stress "identical CQL" — students often assume switching databases means rewriting application code. Here, the driver configuration changes; the queries do not. (~3 minutes)


This is the conceptual turning point. Students trained on relational design will push back: "but that's duplication!" Yes. Intentional duplication. The quote at the bottom is worth reading aloud. In Cassandra, storage cost is the explicit trade-off you make for O(1) reads. Ask: "what happens in SQL if a query changes?" (Add an index, maybe rewrite). "In Cassandra?" (Design a new table, rewrite the app). Both have costs — Cassandra front-loads them at design time. (~6 minutes)

Walk through the "never low-cardinality" rule with a concrete example: if you partition orders by `status` (pending/shipped/delivered), all "pending" orders land on the same token range → one or two nodes handle all pending-order traffic. That's a hotspot. The fix is always to add a high-cardinality column to the partition key. (~5 minutes)


Run this CREATE TABLE live. After creating it, ask: "What if I queried across multiple days — say a week?" Walk through: you'd query 7 partitions (one WHERE clause per day), or use a token range. Neither is as clean as a single-partition query, but it's a predictable access pattern. Bounded partition size is more important than eliminating multi-partition queries. (~6 minutes)


Run this live and show the result. Then deliberately remove the `reading_date` predicate and show the ALLOW FILTERING error — this is the same error from the driving question. Students should now understand *why* it fails: without the full partition key, Cassandra would have to ask every node. (~5 minutes)

The application-side batch write is important: Cassandra has `BATCH` statements, but they are NOT transactions — they don't guarantee atomicity across tables the way a SQL transaction does. A failed batch may partially write. Application code must handle this (idempotent inserts with `IF NOT EXISTS` or just accept eventual consistency). This is the "you bought write throughput; here's what it costs" moment. (~5 minutes)


LWT is a trap: students see "IF NOT EXISTS" and think "great, I'll use this everywhere for safety." Drive home the 4 Paxos round trips — this can reduce write throughput by 10–20× on busy tables. Rule of thumb: LWT for registration/deduplication only. For everything else, design for idempotency instead. TTL demo: insert with TTL 10, then SELECT after 11 seconds to show the row is gone — very effective visually. (~6 minutes)


Collections look convenient but have a pitfall: Cassandra reads the entire collection to update one element. A `LIST` with 10,000 entries that you append to on every event is a performance problem. Rule of thumb: collections work well for small, bounded sets (< 100 elements). For large sets, design a separate child table. (~3 minutes)

This is a common mistake. Secondary indexes look like SQL indexes but behave completely differently — they are distributed, one per node, and a query against a secondary index fans out to every node simultaneously. For a 100-node cluster, that's 100 parallel partition scans. Materialized views (not covered here) are a better production alternative, but still have trade-offs. The safe rule: new table. (~3 minutes)

Give students 30 seconds to scan the table silently before explaining. The most important row for exams: "Read richness." Redis = only key lookup; MongoDB = any field; Cassandra = must know partition key. That single constraint is why query-first design exists. (~4 minutes)



Frame these as interview/design review talking points, not just course content. A systems design interviewer will give a scenario and ask "which NoSQL?" Students who can articulate Redis = speed, Mongo = flexibility, Cassandra = write scale stand out immediately. Spend time on the boundary cases: "what if I need both fast reads AND rich queries?" (probably MongoDB + Redis caching layer). (~5 minutes)


These scenarios are deliberately ambiguous at the edges — Scenario A has a Redis-vs-Cassandra debate built in; Scenario C has a Redis-vs-ScyllaDB debate. The goal is not a single right answer but justified reasoning. Circulate while groups work and listen for misconceptions: "We'd use Redis for C because it's fastest" → probe: "2M writes/sec × 90 days of data — how much RAM is that?" (~5 minutes active + 3 minutes debrief)

Walk through the boundary cases — production systems often use two databases together (Redis for real-time aggregation, Cassandra for durable storage). That layered architecture is the "real" answer for Scenario C at companies like Twitter or Discord. If time allows, ask: "what would the Cassandra schema for Scenario C look like?" — students should propose PRIMARY KEY ((campaign_id, date), ts, click_id). (~5 minutes)


Pause here and ask the room: "If your team uses Cassandra today and GC pauses are causing SLA violations, what's the migration path to ScyllaDB?" Answer: backup data, restore to ScyllaDB cluster, point application at new cluster. No schema changes, no code changes. That's the value of CQL compatibility. (~2 minutes)

The exam question will give a scenario and ask students to justify a database choice. Reinforce: the answer is always "it depends on the access pattern." A student who says "I'd pick Cassandra because it's fast" without mentioning partition key design or query pattern is not demonstrating the week's learning objective. (~2 minutes)

The first three gaps all point toward Kafka + Spark Structured Streaming, which is Weeks 12–13. The fourth gap (cross-store queries) is worth naming so students know it's a solved problem in industry — mention Trino as the canonical answer — but we won't cover it in the course. (~3 minutes)

Remind students that Assignment 3 is due end of Week 12 — they should start the Cassandra schema design this week, since the modeling skills are already in hand. The Kafka mechanics are taught next week. (~2 minutes)

End by connecting back to the driving question: "We started with ALLOW FILTERING breaking your product query. Query-first design fixed it. But what if the business needs to react the moment an event is written — not query it later?" That's Kafka. Strong closing loop that motivates the entire streaming module. (~2 minutes)