Cassandra Time‑Series Data Modeling at Massive Scale Using Bucketing
This article explains how to model massive time‑series data in Cassandra by using bucketing techniques to control partition size, avoid hotspots, and improve write and read performance, including practical CQL schema examples and Python code for concurrent queries.
When starting with Cassandra for time‑series workloads, one of the biggest challenges is understanding how write patterns affect the cluster: writing too fast to a single partition creates hotspots, while overly large partitions hurt repair, streaming, and read performance.
One common modeling technique is bucketing, which lets you control how much data lives in each partition and spreads writes across the cluster.
Typical first‑step schema for raw sensor data:
CREATE TABLE raw_data (
    sensor text,
    ts timeuuid,
    reading int,
    PRIMARY KEY (sensor, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_size': 1,
                    'compaction_window_unit': 'DAYS'};

This uses TimeWindowCompactionStrategy (TWCS) to mitigate the cost of large partitions, but without a TTL the partition size can grow without bound.
To keep partitions under ~100 MB, the article adds a day component to the primary key, creating a new table:
CREATE TABLE raw_data_by_day (
    sensor text,
    day text,
    ts timeuuid,
    reading int,
    PRIMARY KEY ((sensor, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};

Inserts use the current date and a TimeUUID:
INSERT INTO raw_data_by_day (sensor, day, ts, reading)
VALUES ('mysensor', '2017-01-01', now(), 10);

Reading data across many days requires issuing a query per day; the Python driver can execute these concurrently:
from itertools import product

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

days = ["2017-07-01", "2017-07-12", "2017-07-03"]
session = Cluster(["127.0.0.1"]).connect("blog")
prepared = session.prepare("SELECT day, ts, reading FROM raw_data_by_day WHERE sensor = ? and day = ?")
args = product(["mysensor"], days)
results = execute_concurrent_with_args(session, prepared, args)

A variant creates a separate table for each month, e.g.:
CREATE TABLE raw_data_may_2017 (
    sensor text,
    ts timeuuid,
    reading int,
    PRIMARY KEY (sensor, ts)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1};

This makes archiving old data simple: tables can be dropped after exporting to cheap storage such as HDFS or S3.
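On the application side this usually means generating the per-month table names programmatically. A minimal sketch, assuming the raw_data_<month>_<year> naming convention from the example above (the helper names are hypothetical):

```python
# Helpers for the monthly-table variant: derive a table name from a date
# and build the DROP statement used after the month has been exported.
# The naming scheme mirrors the raw_data_may_2017 example; these helper
# names are illustrative, not from the article.
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def monthly_table_name(d: date) -> str:
    # e.g. date(2017, 5, 1) -> "raw_data_may_2017"
    return f"raw_data_{MONTHS[d.month - 1]}_{d.year}"

def archive_statement(d: date) -> str:
    # Issued only after the month's data is safely in HDFS/S3.
    return f"DROP TABLE IF EXISTS {monthly_table_name(d)};"

print(archive_statement(date(2017, 5, 1)))
# DROP TABLE IF EXISTS raw_data_may_2017;
```

Dropping a whole table is far cheaper than deleting rows, since it avoids tombstones entirely.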
The second bucketing technique adds a bucket field to the partition key, allowing multiple partitions per day to spread load. Example schema for a tweet‑like stream:
CREATE TABLE tweet_stream (
    account text,
    day text,
    bucket int,
    ts timeuuid,
    message text,
    PRIMARY KEY ((account, day, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};

Inserting into several buckets:
INSERT INTO tweet_stream (account, day, bucket, ts, message) VALUES ('jon_haddad', '2017-07-01', 0, now(), 'hi');
INSERT INTO tweet_stream (account, day, bucket, ts, message) VALUES ('jon_haddad', '2017-07-01', 1, now(), 'hi2');
INSERT INTO tweet_stream (account, day, bucket, ts, message) VALUES ('jon_haddad', '2017-07-01', 2, now(), 'hi3');
INSERT INTO tweet_stream (account, day, bucket, ts, message) VALUES ('jon_haddad', '2017-07-01', 3, now(), 'hi4');

To fetch the ten newest messages across ten buckets, the article shows a Python concurrent query followed by a k‑way merge:
from itertools import chain, product

from cassandra.util import unix_time_from_uuid1

# session and execute_concurrent_with_args are reused from the snippet above
prepared = session.prepare("SELECT ts, message FROM tweet_stream WHERE account = ? and day = ? and bucket = ? LIMIT 10")
partitions = range(10)
args = product(["jon_haddad"], ["2017-07-01"], partitions)
result = execute_concurrent_with_args(session, prepared, args)
# result_or_exc is a ResultSet on success (or the raised exception on failure)
results = [x.result_or_exc for x in result]
data = chain(*results)
sorted_results = sorted(data, key=lambda x: unix_time_from_uuid1(x.ts), reverse=True)

The example demonstrates merging results from multiple partitions and notes that larger workloads would require a proper k‑way merge algorithm.
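Since each per-bucket query already returns rows newest-first (thanks to CLUSTERING ORDER BY (ts DESC)), the standard library's heapq.merge can serve as that k-way merge. A sketch with plain tuples standing in for driver rows (the data is illustrative):

```python
# k-way merge of per-bucket results that are each already sorted
# newest-first, as the descending clustering-order queries return them.
# Plain (timestamp, message) tuples stand in for driver rows.
import heapq

bucket0 = [(1706, "b0-new"), (1700, "b0-old")]
bucket1 = [(1705, "b1-new"), (1699, "b1-old")]
bucket2 = [(1703, "b2-only")]

# heapq.merge lazily merges the sorted streams instead of concatenating
# and re-sorting everything; reverse=True matches the descending inputs.
merged = heapq.merge(bucket0, bucket1, bucket2,
                     key=lambda row: row[0], reverse=True)
newest_three = list(merged)[:3]
print(newest_three)
# [(1706, 'b0-new'), (1705, 'b1-new'), (1703, 'b2-only')]
```

For small result sets the concatenate-and-sort approach in the article is fine; the heap-based merge matters when each bucket returns many rows and only the first few are needed.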
Overall, the article provides practical guidance on distributing time‑series data across a Cassandra cluster, emphasizing that the optimal strategy depends on workload characteristics and that no single solution fits all scenarios.
Architects Research Society