Cassandra Time‑Series Data Modeling at Massive Scale Using Bucketing
This article explains how to model massive time‑series data in Cassandra by using bucketing techniques to control partition size, avoid hotspots, and improve write and read performance, including practical CQL schema examples and Python code for concurrent queries.
When starting with Cassandra for time‑series workloads, one of the biggest challenges is understanding how write patterns affect the cluster: writing too fast to a single partition creates hotspots, while overly large partitions hurt repair, streaming, and read performance.
One common modeling technique is bucketing, which lets you control how much data lives in each partition and spreads writes across the cluster.
Typical first‑step schema for raw sensor data:
CREATE TABLE raw_data (
    sensor text,
    ts timeuuid,
    reading int,
    PRIMARY KEY (sensor, ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_size': 1,
                    'compaction_window_unit': 'DAYS'};

This uses TimeWindowCompactionStrategy (TWCS) to mitigate the cost of large partitions, but without a TTL the partition size can grow without bound.
To keep partitions under ~100 MB, the article adds a day component to the primary key, creating a new table:
CREATE TABLE raw_data_by_day (
    sensor text,
    day text,
    ts timeuuid,
    reading int,
    PRIMARY KEY ((sensor, day), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};

Inserts use the current date and a TimeUUID:
INSERT INTO raw_data_by_day (sensor, day, ts, reading)
VALUES ('mysensor', '2017-01-01', now(), 10);

Reading data across many days requires issuing a query per day; the Python driver can execute these concurrently:
from itertools import product

from cassandra.cluster import Cluster
from cassandra.concurrent import execute_concurrent_with_args

days = ["2017-07-01", "2017-07-12", "2017-07-03"]
session = Cluster(["127.0.0.1"]).connect("blog")
prepared = session.prepare("SELECT day, ts, reading FROM raw_data_by_day WHERE sensor = ? and day = ?")
args = product(["mysensor"], days)
results = execute_concurrent_with_args(session, prepared, args)

A variant creates a separate table for each month, e.g.:
CREATE TABLE raw_data_may_2017 (
    sensor text,
    ts timeuuid,
    reading int,
    PRIMARY KEY (sensor, ts)
) WITH compaction = {'class': 'TimeWindowCompactionStrategy',
                     'compaction_window_unit': 'DAYS',
                     'compaction_window_size': 1};

This makes archiving old data simple: tables can be dropped after exporting to cheap storage such as HDFS or S3.
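On the application side this usually means generating the per-month table names programmatically. A minimal sketch, assuming the raw_data_<month>_<year> naming convention from the example above (the helper names are hypothetical):

```python
# Helpers for the monthly-table variant: derive a table name from a date
# and build the DROP statement used after the month has been exported.
# The naming scheme mirrors the raw_data_may_2017 example; these helper
# names are illustrative, not from the article.
from datetime import date

MONTHS = ["january", "february", "march", "april", "may", "june",
          "july", "august", "september", "october", "november", "december"]

def monthly_table_name(d: date) -> str:
    # e.g. date(2017, 5, 1) -> "raw_data_may_2017"
    return f"raw_data_{MONTHS[d.month - 1]}_{d.year}"

def archive_statement(d: date) -> str:
    # Issued only after the month's data is safely in HDFS/S3.
    return f"DROP TABLE IF EXISTS {monthly_table_name(d)};"

print(archive_statement(date(2017, 5, 1)))
# DROP TABLE IF EXISTS raw_data_may_2017;
```

Dropping a whole table is far cheaper than deleting rows, since it avoids tombstones entirely.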
The second bucketing technique adds a bucket field to the partition key, allowing multiple partitions per day to spread load. Example schema for a tweet‑like stream:
CREATE TABLE tweet_stream (
    account text,
    day text,
    bucket int,
    ts timeuuid,
    message text,
    PRIMARY KEY ((account, day, bucket), ts)
) WITH CLUSTERING ORDER BY (ts DESC)
  AND compaction = {'class': 'TimeWindowCompactionStrategy',
                    'compaction_window_unit': 'DAYS',
                    'compaction_window_size': 1};

Inserting into several buckets:
INSERT INTO tweet_stream (account, day, bucket, ts, message) VALUES ('jon_haddad', '2017-07-01', 0, now(), 'hi');
INSERT INTO tweet_stream (account, day, bucket, ts, message) VALUES ('jon_haddad', '2017-07-01', 1, now(), 'hi2');
INSERT INTO tweet_stream (account, day, bucket, ts, message) VALUES ('jon_haddad', '2017-07-01', 2, now(), 'hi3');
INSERT INTO tweet_stream (account, day, bucket, ts, message) VALUES ('jon_haddad', '2017-07-01', 3, now(), 'hi4');

To fetch the ten newest messages across ten buckets, the article shows a Python concurrent query followed by a k‑way merge:
from itertools import chain, product

from cassandra.util import unix_time_from_uuid1

# session and execute_concurrent_with_args are reused from the snippet above
prepared = session.prepare("SELECT ts, message FROM tweet_stream WHERE account = ? and day = ? and bucket = ? LIMIT 10")
partitions = range(10)
args = product(["jon_haddad"], ["2017-07-01"], partitions)
result = execute_concurrent_with_args(session, prepared, args)
# result_or_exc is a ResultSet on success (or the raised exception on failure)
results = [x.result_or_exc for x in result]
data = chain(*results)
sorted_results = sorted(data, key=lambda x: unix_time_from_uuid1(x.ts), reverse=True)

The example demonstrates merging results from multiple partitions and notes that larger workloads would require a proper k‑way merge algorithm.
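Since each per-bucket query already returns rows newest-first (thanks to CLUSTERING ORDER BY (ts DESC)), the standard library's heapq.merge can serve as that k-way merge. A sketch with plain tuples standing in for driver rows (the data is illustrative):

```python
# k-way merge of per-bucket results that are each already sorted
# newest-first, as the descending clustering-order queries return them.
# Plain (timestamp, message) tuples stand in for driver rows.
import heapq

bucket0 = [(1706, "b0-new"), (1700, "b0-old")]
bucket1 = [(1705, "b1-new"), (1699, "b1-old")]
bucket2 = [(1703, "b2-only")]

# heapq.merge lazily merges the sorted streams instead of concatenating
# and re-sorting everything; reverse=True matches the descending inputs.
merged = heapq.merge(bucket0, bucket1, bucket2,
                     key=lambda row: row[0], reverse=True)
newest_three = list(merged)[:3]
print(newest_three)
# [(1706, 'b0-new'), (1705, 'b1-new'), (1703, 'b2-only')]
```

For small result sets the concatenate-and-sort approach in the article is fine; the heap-based merge matters when each bucket returns many rows and only the first few are needed.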
Overall, the article provides practical guidance on distributing time‑series data across a Cassandra cluster, emphasizing that the optimal strategy depends on workload characteristics and that no single solution fits all scenarios.
Architects Research Society