
Greenplum (GPDB) Architecture, Features, and Operational Tools Overview

This article explains Greenplum's MPP architecture, master‑segment design, high‑availability, interconnect network, rich management tools, parallel query planning, data loading techniques, and additional capabilities such as LDAP authentication and resource queues, demonstrating why it is a strong next‑generation big‑data query engine.

Baidu Waimai Technology Team

Author: Zhou Leihao, Baidu Waimai big‑data engineer.

Source: "Programmer" March technical board original submission.

Abstract: This article introduces the architecture and key technical characteristics of the Greenplum big‑data engine, starting with GPDB background, describing system modules, and then focusing on Greenplum's unique features, parallel execution, and operational details, explaining why Greenplum is chosen as a next‑generation query‑engine solution.

Greenplum MPP Architecture

Greenplum (referred to as GPDB) is an open‑source data warehouse built on a modified PostgreSQL, designed for large‑scale data analysis. Compared with Hadoop, Greenplum is more suitable as a storage, compute, and analysis engine for big data.

GPDB follows a classic Master/Slave (Master–Segment) architecture. A Greenplum cluster contains one Master node and multiple Segment nodes, each of which can run several database instances. Greenplum adopts a shared-nothing MPP architecture: nodes share no state such as database metadata or in-memory caches, and communicate with one another only over the network. Data is distributed across the nodes for storage, and queries are processed in parallel: each node works only on its own slice of the data, and the Master aggregates the results, so performance scales nearly linearly as nodes are added.
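This distribution is declared when a table is created. A minimal sketch (table and column names here are illustrative, loosely following the TPC-H schema):

```sql
-- Rows are hashed on the distribution key; each hash bucket is
-- owned by exactly one Segment, so every node stores and scans
-- only its own slice of the table.
CREATE TABLE orders (
    o_orderkey   BIGINT,
    o_custkey    BIGINT,
    o_totalprice NUMERIC(15,2)
) DISTRIBUTED BY (o_orderkey);

-- Without a suitable key, DISTRIBUTED RANDOMLY spreads rows
-- round-robin, which balances storage but forces data motion
-- whenever the table is joined.
```

Choosing a distribution key that matches common join and group-by columns is what lets each Segment work independently.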

Figure 1: GPDB basic architecture

Clients connect to the GPDB cluster over the network. The Master Host is the sole entry point: it receives client connections and SQL statements, while the Segment Hosts execute the work. The Master stores no user data; Segments store the data and run the queries. The Master parses each SQL statement, builds an execution plan, dispatches tasks to the Segments, and aggregates their results for the client.

Greenplum Master

The Master stores only system metadata; all business data resides on Segments. It handles client connections, SQL parsing, plan generation, task distribution, and result collection, ensuring the Master does not become a performance bottleneck.

Master high‑availability (Figure 2) works like Hadoop NameNode HA: a Standby Master synchronizes catalog and transaction logs with the Primary Master via a synchronization process. If the Primary fails, the Standby takes over all Master duties.

Figure 2: Master node high availability

Segments

A Greenplum cluster contains many Segments. Each Segment stores a portion of the user data and handles access to it. Clients cannot connect to Segments directly; all requests go through the Master. During query execution each Segment processes its own data in parallel, and any rows needed from other Segments are transferred over the Interconnect. Adding Segment servers increases performance roughly linearly, unlike shared-everything databases.

Figure 3: Segment responsible for data storage and access

Each Segment’s data is redundantly stored on another Segment (Mirror). When a Primary Segment fails, the Mirror automatically takes over. After recovery, the gprecoverseg -F tool synchronizes data.

Interconnect

The Interconnect is the network layer of Greenplum (Figure 4). By default it uses UDP, but Greenplum adds checksums, providing reliability comparable to TCP with better performance. With TCP, the number of Segment instances is limited to 1000; UDP has no such limit.

Figure 4: Greenplum Interconnect network layer

Greenplum as a New Solution

Having introduced GPDB’s basic architecture, the following sections describe key features that make GPDB an attractive solution.

Rich Toolset – Operations Made Easy

Compared with other open‑source projects, GPDB provides extensive management tools and a graphical web monitoring interface, helping administrators monitor cluster health and server status.

During a recent public‑cloud migration, Impala became unstable when the total query segments reached 100. After investigation, we discovered kernel‑level issues and leveraged GPDB tools such as gpcheck and gpcheckperf to verify system configuration and perform hardware performance tests.

After using gpssh-exkeys to set up password-less SSH across all machines, the gpssh command can run a command on every node at once — for example, executing pwd on five machines from the Master.

Other useful tools include gprecoverseg for segment recovery and gpactivatestandby for switching primary/standby Masters, simplifying cluster maintenance and enabling custom management solutions.

Query Planning and Parallel Execution – SQL Optimization

GPDB’s execution plan includes traditional operations such as scans, joins, aggregations, and sorts, plus a unique “motion” operation that moves data between Segments during query processing.

A simplified TPC-H-style join illustrates the plan:
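The query text itself did not survive in this copy of the article. A sketch consistent with the execution described below — a join between customer and orders, with all table and column names assumed rather than taken from the original — might be:

```sql
-- customer is distributed on c_custkey and orders on o_orderkey,
-- so the join key o_custkey does not match orders' distribution
-- key; the planner inserts a redistribute motion to co-locate
-- matching rows before the join runs on each Segment.
SELECT c.c_custkey, sum(o.o_totalprice) AS total
FROM customer c
JOIN orders o ON o.o_custkey = c.c_custkey
GROUP BY c.c_custkey;
```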

Execution proceeds from the bottom up:

Each Segment scans its portion of the customer table, applies filters, and sends the result to other Segments.

On each Segment, the orders table is joined with the received results, and the final result is sent back to the Master.

This parallel processing model allows multiple PostgreSQL instances on the same host to handle complex joins and queries, fully utilizing hardware resources.

How to Inspect the Execution Plan?

If a query performs poorly, examining its execution plan can reveal bottlenecks. Key questions include:

Is any operation taking an unusually long time?

Do estimated costs match actual runtime?

Are highly selective predicates applied early?

Is the join order optimal?

Are partitions scanned selectively?

Are hash aggregations and hash joins chosen appropriately?
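These questions can be answered with EXPLAIN and EXPLAIN ANALYZE, which GPDB inherits from PostgreSQL. A minimal usage sketch (table name illustrative):

```sql
-- EXPLAIN shows the planner's estimated costs and row counts;
-- EXPLAIN ANALYZE also executes the query and reports actual
-- times per node, including the slice and motion operations
-- specific to Greenplum's distributed plans.
EXPLAIN ANALYZE
SELECT o_custkey, count(*)
FROM orders
GROUP BY o_custkey;
```

Comparing the estimated row counts against the actual ones in the ANALYZE output is the quickest way to spot stale statistics or a mischosen join order.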

Efficient Data Loading – Bulk Import No Longer a Bottleneck

Since the Master only handles client interaction and control, data loading first distributes rows based on a chosen distribution column. All nodes read data concurrently; a hash algorithm determines which rows stay locally and which are sent to other Segments via the Interconnect. Using external tables and the gpfdist service, GPDB can ingest up to 2 TB per hour, allowing seamless ETL migration from Impala.

The advantage of gpfdist is that all Segments can fully utilize the service, provided they have network access to it.
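A minimal loading sketch, assuming a gpfdist server is already running on an ETL host and a target table orders exists (host name, port, and file pattern are all illustrative):

```sql
-- Each Segment pulls its own share of the files directly from
-- the gpfdist service, so the load parallelizes across the
-- whole cluster instead of funneling through the Master.
CREATE READABLE EXTERNAL TABLE orders_ext (LIKE orders)
LOCATION ('gpfdist://etl-host:8081/orders_*.csv')
FORMAT 'CSV' (DELIMITER ',');

INSERT INTO orders SELECT * FROM orders_ext;
```

The external table is only a definition; data flows at query time, which is why every Segment needs network access to the gpfdist host.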

Other Notable Features

GPDB supports LDAP authentication, enabling seamless migration of Impala role‑based access control.

Built on PostgreSQL 8.2, GPDB can be accessed via the psql CLI, as well as JDBC, ODBC, etc., with minimal adaptation at the application layer.

Resource queues allow independent queues for different workload types (e.g., VIP users, ETL, ad‑hoc), with priority settings to allocate more resources to higher‑priority queues during contention.
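A sketch of how such queues might be defined (queue names, limits, and role names are assumptions, not from the original):

```sql
-- One queue per workload class; ACTIVE_STATEMENTS caps
-- concurrency and PRIORITY governs CPU share under contention.
CREATE RESOURCE QUEUE etl_queue   WITH (ACTIVE_STATEMENTS = 5,  PRIORITY = LOW);
CREATE RESOURCE QUEUE adhoc_queue WITH (ACTIVE_STATEMENTS = 20, PRIORITY = HIGH);

-- Roles are bound to queues; statements issued by these roles
-- then wait in, and are scheduled by, their queue.
ALTER ROLE etl_user   RESOURCE QUEUE etl_queue;
ALTER ROLE adhoc_user RESOURCE QUEUE adhoc_queue;
```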

Recent internal testing with TPCH on a five‑node cluster showed impressive query speeds, though exact performance numbers await a more stable environment. GPDB’s PostgreSQL foundation provides rich statistical functions, linear horizontal scalability, built‑in fault tolerance, and powerful management commands, giving it clear advantages over Impala in SQL support, real‑time capability, and stability.

This article offers a preliminary look at Greenplum; deeper analyses and practical experience will be shared on the DA wiki. For detailed syntax and feature references, consult the official PostgreSQL documentation.

References: Pivotal Greenplum® Database 4.3.9.1 Documentation