
Using GPLoad to Batch Load HDFS Data into Greenplum: Comparison with Hive and MPP Database Options

The article compares Hive and Greenplum as offline and MPP data‑warehouse solutions, reviews Hive query engine alternatives, and provides a detailed tutorial—including YAML configuration and a shell script—for using GPLoad to import HDFS data into Greenplum.

DataFunTalk

If Hive represents the classic offline data warehouse, Greenplum stands for the MPP (Massively Parallel Processing) database, offering both row and column storage, full SQL support, and query times typically between 1 and 10 seconds.

Hive's native query capability is weak, so several engines are commonly used alongside or instead of it:

- Spark (via Thrift Server): moderate speed
- Impala: CDH-specific, C++-based, fast, but hard to install outside CDH
- Presto: Facebook-origin, fast, but its syntax is not Hive-compatible
- Kylin: MOLAP on Spark, sub-10-second queries, but complex modeling
- ClickHouse: fast and Hadoop-independent, but limited SQL and join support
- Greenplum: MPP, robust SQL, 1-10 s queries

For tables under a billion rows, the author recommends using Greenplum directly as the data warehouse or data platform; for larger volumes, Hive on Spark can reduce hardware costs while Greenplum remains a viable ADS-layer option alongside Kylin, ClickHouse, Doris, HANA, etc.

GPLoad is Greenplum’s parallel data‑loading tool that uses external tables and the gpfdist service. Its control file is written in YAML 1.1, requiring a valid YAML structure to define input sources, column mappings, file format, delimiters, error limits, and optional pre‑/post‑SQL statements.

Example GPLoad YAML configuration:

VERSION: 1.0.0.1
DATABASE: ops
USER: gpadmin
HOST: mdw-1
PORT: 5432
GPLOAD:
  INPUT:
    - SOURCE:
        LOCAL_HOSTNAME:
          - etl1-1
          - etl1-2
          - etl1-3
          - etl1-4
        PORT: 8081
        FILE:
          - /var/load/data/*
    - COLUMNS:
        - name: text
        - amount: float4
        - category: text
        - descr: text
        - date: date
    - FORMAT: text
    - DELIMITER: '|'
    - ERROR_LIMIT: 25
    - LOG_ERRORS: True
  OUTPUT:
    - TABLE: payables.expenses
    - MODE: INSERT
  SQL:
    - BEFORE: "INSERT INTO audit VALUES('start', current_timestamp)"
    - AFTER: "INSERT INTO audit VALUES('end', current_timestamp)"
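
Under the hood, gpload wraps Greenplum's external-table mechanism: it starts gpfdist if needed, creates a temporary external table from the INPUT section, and runs the insert described by OUTPUT. The control file above corresponds roughly to DDL like the following sketch; the name ext_expenses is a hand-chosen illustration, since gpload generates its own temporary table names.

```sql
-- Rough hand-written equivalent of what gpload derives from the YAML above
CREATE EXTERNAL TABLE ext_expenses (
    name text, amount float4, category text, descr text, date date
)
LOCATION ('gpfdist://etl1-1:8081/*', 'gpfdist://etl1-2:8081/*',
          'gpfdist://etl1-3:8081/*', 'gpfdist://etl1-4:8081/*')
FORMAT 'TEXT' (DELIMITER '|')
LOG ERRORS SEGMENT REJECT LIMIT 25 ROWS;

INSERT INTO payables.expenses SELECT * FROM ext_expenses;
```

The SEGMENT REJECT LIMIT clause mirrors ERROR_LIMIT: 25 from the YAML, and LOG ERRORS mirrors LOG_ERRORS: True.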

A reusable template YAML (gpload_cfg_template.yml) is provided, followed by a shell script hive2gp_gpload.sh that copies the template, substitutes database and table names, downloads the HDFS text file, runs the GPLoad command, and cleans up temporary files.

#!/bin/bash
# Usage: sh hive2gp_gpload.sh dbname tablename
# Password-less login is assumed via ~/.pgpass
if [ -z "$2" ]; then
    echo "Invalid arguments. Usage: sh hive2gp_gpload.sh dbname tablename"
    exit 1
fi
dbname=$1
tablename=$2
source ~/.bash_profile
shellpath=/data/hdsp/infra/ETL_HOME/shell
yml_file=${shellpath}/gpload_yml/${dbname}.${tablename}.yml
# Generate a per-table control file from the template on first run
if [ ! -f ${yml_file} ]; then
    cp ${shellpath}/gpload_yml/gpload_cfg_template.yml ${yml_file}
    sed -i "s/{table}/${tablename}/g" ${yml_file}
    sed -i "s/{dbname}/${dbname}/g" ${yml_file}
fi
# Remove any previous extraction
rm -rf ${shellpath}/tempdata/${dbname}.${tablename}
echo "[${dbname}.${tablename}]: start to get hdfs file!"
# Pull the table's files from HDFS
hdfs dfs -get hdfs://hdp01/data/apps/hive/warehouse/${dbname}.db/${tablename} \
    ${shellpath}/tempdata/${dbname}.${tablename} || exit 1
echo "[${dbname}.${tablename}]: got hdfs file, starting load!"
# Load data into Greenplum
/data/greenplum6/greenplum-db/bin/gpload -f ${yml_file} -U cabbeenc \
    -l ${shellpath}/gpload_log/${dbname}_${tablename}_$(date +"%Y-%m-%d-%H-%M-%S").log || exit 1
echo "[${dbname}.${tablename}]: load success!"
# Clean up temporary data
rm -rf ${shellpath}/tempdata/${dbname}.${tablename}
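
The template-substitution step of the script can be sketched in isolation. The file path and the one-line template below are hypothetical stand-ins for the real gpload_cfg_template.yml, which would contain the full GPLOAD configuration:

```shell
# Write a one-line stand-in for the template (the real file holds the whole YAML)
printf 'TABLE: {dbname}.{table}\n' > /tmp/demo_template.yml
# Substitute the placeholders the same way hive2gp_gpload.sh does
sed -i 's/{table}/expenses/g; s/{dbname}/payables/g' /tmp/demo_template.yml
cat /tmp/demo_template.yml   # → TABLE: payables.expenses
```

Because the generated file is cached per table, edits to the template only affect tables whose control file has not been created yet; delete the per-table .yml to regenerate it.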

To run the script, execute sh hive2gp_gpload.sh dbname tablename. Prerequisites: the Hive tables must be stored as text with the default \u0001 field delimiter, GPLoad must be installed and on the PATH, the database user needs the appropriate permissions, and the Hive and Greenplum tables must match in schema and column order.
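
Note that Hive's default text delimiter is \u0001 (Ctrl-A), while the example YAML above declares '|'; the two sides must agree, either by setting the control file's DELIMITER to the Ctrl-A character or by converting the downloaded files. A conversion sketch, using hypothetical sample data:

```shell
# Build a sample row delimited the way Hive's default text SerDe writes it (\001)
printf 'alice\001100.5\001food\n' > /tmp/hive_row.txt
# Translate \u0001 field separators into the '|' expected by DELIMITER: '|'
tr '\001' '|' < /tmp/hive_row.txt > /tmp/gp_row.txt
cat /tmp/gp_row.txt   # → alice|100.5|food
```

In practice, keeping DELIMITER as the Ctrl-A character in the template avoids rewriting the data files at all.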

The article concludes with a community invitation to join the “Data Platform Study Club” WeChat group for further discussion.

Tags: big data, data warehouse, Hive, YAML, shell script, Greenplum, GPLoad
Written by

DataFunTalk

Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.
