Databases 6 min read

MapReduce‑Style Parallel Query Processing with Citus

The article explains how Citus enables sharding and MapReduce‑style parallel query execution in PostgreSQL, showing performance gains, example bucket algorithms, and how standard SQL can replace custom MapReduce code for large‑scale data analytics.

Architects Research Society

Feb 3, 2020

MapReduce‑Style Parallel Query Processing with Citus

Step 1: Sharding

Citus improves performance by partitioning a PostgreSQL table into many smaller shards that are distributed across physical nodes, allowing the system to harness collective compute power; queries are automatically routed to the appropriate shard and results are aggregated.

Think in MapReduce

MapReduce, popularized by Hadoop, solves large‑scale data problems by decomposing a task into parallel sub‑tasks; Citus provides a similar capability without requiring a separate framework.

For example, to count total page views you can split the data into four buckets and process each bucket independently:

for i = 1 to 4:
  for page in pageview:
    bucket[i].append(page)

After distribution, you can compute per‑bucket counts in parallel:

for i = 1 to 4:
  for page in bucket:
    bucket_count[i]++

Aggregating the four partial counts yields a total that is roughly four times faster than running the count on a single node.

MapReduce as a Concept

Citus implements several executors that handle workloads in a MapReduce‑like fashion; its real‑time executor can run the same logic using plain SQL.

When a query such as SELECT count(*) FROM pageviews runs on a table with 32 shards, Citus rewrites it into multiple count queries on each shard, then merges the results on the coordinator. The same approach works for aggregates like AVG:

SELECT avg(page), day FROM pageviews_shard_1 GROUP BY day; SELECT avg(page), day FROM pageviews_shard_2 GROUP BY day;

After collecting the per‑shard averages, Citus computes the final average automatically, so the user can simply write:

SELECT avg(page), day FROM pageviews GROUP BY day;

This demonstrates that with Citus you can achieve significant performance improvements on large data sets using familiar SQL, without writing extensive MapReduce code.

Original article: https://www.citusdata.com/blog/2019/02/21/thinking-in-mapreduce-but-with-sql/

Original Source

Signed-in readers can open the original source through BestHub's protected redirect.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contactand we will review it promptly.

SQL Sharding MapReduce Parallel Query Citus Distributed PostgreSQL

Written by

Architects Research Society

A daily treasure trove for architects, expanding your view and depth. We share enterprise, business, application, data, technology, and security architecture, discuss frameworks, planning, governance, standards, and implementation, and explore emerging styles such as microservices, event‑driven, micro‑frontend, big data, data warehousing, IoT, and AI architecture.

0 followers

Reader feedback

How this landed with the community

Rate this article

Was this worth your time?

Discussion

0 Comments

Thoughtful readers leave field notes, pushback, and hard-won operational detail here.