
Hive Metadata Migration and Merging Tool for Consolidating Multiple Hive Metastores

This article describes how NetEase developed a Hive metadata migration and merging tool that consolidates metadata from multiple independent Hive clusters into a single Hive metastore without moving HDFS data, detailing the challenges, ID handling, database operations, and step‑by‑step migration process.

Big Data Technology Architecture

Within NetEase there are hundreds of Hive clusters. To meet the metadata unification needs of the NetEase Mammoth big‑data platform, we need to merge the metadata of several independent Hive clusters without moving the actual HDFS data files, e.g., merging the metadata of hive2, hive3, and hive4 into the MySQL metastore of hive1 and then accessing the data through hive1.

We first considered Hive's built‑in EXPORT and IMPORT commands, which can export and import both data and metadata. However, these commands do not fit our scenario for several reasons:

We do not need to re‑import the data.

Each Hive instance contains hundreds of thousands of tables and tens of millions of partitions, making it impossible to specify partition names in the IMPORT command.

The IMPORT command is slow, and on occasional failures it cannot roll back partially imported metadata; manual DROP TABLE would also delete the underlying data, which is unacceptable in our production environment.

Therefore we decided to develop a custom Hive metadata migration and merging tool with the following requirements:

All metadata from a source Hive cluster can be migrated to a target Hive cluster without moving the data files.

If migration fails, the system can roll back to the state before the import.

The source Hive service may be stopped, but the target Hive service must remain online during migration.

The migration process should complete within ten minutes to minimize business impact.

Challenges of Metadata Merging

Hive stores its metadata in a relational database known as the metastore; in our deployment it is backed by MySQL. In Hive 1.2.1 the metastore consists of 54 tables, such as DBS (database names), TBLS (table names), and PARTITIONS (partition information).

Complex Table Dependency Relationships

The 54 tables are tightly linked through primary‑key/foreign‑key relationships. For example:

The DB_ID column in DBS is referenced by more than 20 other tables.

The TBL_ID column in TBLS is referenced by more than 20 other tables.

TBLS.DB_ID is a foreign key to DBS, and TBLS.SD_ID is a foreign key to SDS.

PARTITIONS.TBL_ID references TBLS, and PARTITIONS.SD_ID references SDS.

DATABASE_PARAMS.DB_ID references DBS.

These nested dependencies create chains such as [DBS] → [TBLS] → [PARTITIONS] → [PARTITION_KEY_VALS]. Several clusters have dependency depths of five or more, leading to the following problems during metadata merging:

Primary‑key IDs from the source Hive must be changed to avoid conflicts with the target Hive.

After changing IDs, all foreign-key references must be updated consistently (e.g., if DBS.DB_ID changes from 1 to 100, every referencing column, such as TBLS.DB_ID and DATABASE_PARAMS.DB_ID, must also change from 1 to 100).

Tables must be imported in dependency order (parents first, then children, then grandchildren); otherwise the import fails.
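This ordering requirement is a topological sort of the foreign-key graph. As a sketch only (not the tool's actual code, and using just a small illustrative subset of the 54-table graph), Kahn's algorithm yields a parent-first import order:

```java
import java.util.*;

public class ImportOrder {
    /** Returns a parent-first import order for tables linked by FK edges (parent -> children). */
    static List<String> topoOrder(Map<String, List<String>> children) {
        Map<String, Integer> indegree = new HashMap<>();
        for (String t : children.keySet()) indegree.putIfAbsent(t, 0);
        for (List<String> cs : children.values())
            for (String c : cs) indegree.merge(c, 1, Integer::sum);
        Deque<String> ready = new ArrayDeque<>();
        for (Map.Entry<String, Integer> e : indegree.entrySet())
            if (e.getValue() == 0) ready.add(e.getKey());
        List<String> order = new ArrayList<>();
        while (!ready.isEmpty()) {
            String t = ready.poll();
            order.add(t);
            for (String c : children.getOrDefault(t, List.of()))
                if (indegree.merge(c, -1, Integer::sum) == 0) ready.add(c);
        }
        if (order.size() != indegree.size())
            throw new IllegalStateException("cycle in FK graph");
        return order;
    }

    public static void main(String[] args) {
        // Illustrative subset of the metastore FK graph described in the text.
        Map<String, List<String>> fk = Map.of(
            "DBS", List.of("TBLS", "DATABASE_PARAMS"),
            "SDS", List.of("TBLS", "PARTITIONS"),
            "TBLS", List.of("PARTITIONS"),
            "PARTITIONS", List.of("PARTITION_KEY_VALS"));
        // DBS and SDS come before TBLS, TBLS before PARTITIONS,
        // PARTITIONS before PARTITION_KEY_VALS.
        System.out.println(topoOrder(fk));
    }
}
```

Any valid topological order works; the import only needs every parent row to exist before the rows that reference it.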

Modifying Primary and Foreign Key IDs

We solved the ID‑modification problem with a simple arithmetic method:

Query the maximum ID of each table in the target Hive, then add the source table's ID to that maximum. The formula is: new_id = source_id + target_max_id . Because all tables use the same logic, the program does not need to maintain explicit parent‑child ID mappings.

To avoid conflicts when the target Hive creates new databases during online import, we add a fixed offset (e.g., 100) to the formula: new_id = source_id + target_max_id + offset .
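A minimal sketch of this arithmetic (the max ID and offset values below are illustrative, not taken from a real metastore): because a parent key and every foreign key referencing it are shifted by the same per-table amount, the references stay consistent without an explicit ID mapping.

```java
public class IdShift {
    /** new_id = source_id + target_max_id + offset, applied uniformly per metastore table. */
    static long remap(long sourceId, long targetMaxId, long offset) {
        return sourceId + targetMaxId + offset;
    }

    public static void main(String[] args) {
        long dbsMaxInTarget = 37;  // hypothetical result of SELECT MAX(DB_ID) FROM DBS on the target
        long offset = 100;         // headroom for databases created while the target stays online
        // The DBS row's key and the TBLS.DB_ID column referencing it are shifted
        // by the identical amount, so the FK still points at the right parent.
        long newDbId  = remap(1, dbsMaxInTarget, offset);
        long newFkRef = remap(1, dbsMaxInTarget, offset);
        System.out.println(newDbId == newFkRef);  // true
        System.out.println(newDbId);              // 138
    }
}
```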

Database Operations

We used MyBatis to operate on the two MySQL instances. Metadata is read from the source database, IDs are transformed as described above, and the modified rows are inserted into the target database.

Because MyBatis requires a Java bean for each table, manually writing 54 beans would be error‑prone. We therefore parsed Hive DDL statements with Druid, then used CodeModel to auto‑generate 54 Java classes (see com.netease.hivetools.apps.SchemaToMetaBean ).

Metadata Migration Procedure

Back up the source and target metastore databases.

Import the source metadata into a temporary database exchange_db . The temporary database isolates the live source cluster, whose ID counters keep changing.

Using a fixed temporary database name simplifies the workflow and reduces manual command‑parameter errors.

Configure JDBC settings for the source and destination databases in hive-tools.properties:

# exchange_db
exchange_db.jdbc.driverClassName=com.mysql.jdbc.Driver
exchange_db.jdbc.url=jdbc:mysql://10.172.121.126:3306/hivecluster1?useUnicode=true&characterEncoding=utf8&autoReconnect=true&allowMultiQueries=true
exchange_db.jdbc.username=src_hive
exchange_db.jdbc.password=abcdefg

# dest_hive
dest_hive.jdbc.driverClassName=com.mysql.jdbc.Driver
dest_hive.jdbc.url=jdbc:mysql://10.172.121.126:3306/hivecluster1?useUnicode=true&characterEncoding=utf8&autoReconnect=true&allowMultiQueries=true
dest_hive.jdbc.username=dest_hive
dest_hive.jdbc.password=abcdefg

Execute the metadata migration command:

export SOURCE_NAME=exchange_db
export DEST_NAME=dest_hive
/home/hadoop/java-current/jre/bin/java -cp "./hive-tools-current.jar" com.netease.hivetools.apps.MetaDataMerge --s=$SOURCE_NAME --d=$DEST_NAME

The tool first checks for duplicate Hive databases between source and destination and aborts if any are found.
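Conceptually, this duplicate check is a set intersection over the NAME column of each metastore's DBS table. A sketch with hypothetical database names (the actual tool reads these lists over JDBC):

```java
import java.util.*;

public class DuplicateDbCheck {
    /** Returns database names present in both the source and destination metastores, sorted. */
    static Set<String> duplicates(Collection<String> source, Collection<String> dest) {
        Set<String> dup = new TreeSet<>(source);
        dup.retainAll(new HashSet<>(dest));
        return dup;
    }

    public static void main(String[] args) {
        // Hypothetical NAME values read from each metastore's DBS table.
        List<String> source = List.of("default", "nisp_nhids", "real");
        List<String> dest   = List.of("default", "warehouse", "real");
        Set<String> dup = duplicates(source, dest);
        System.out.println(dup);  // [default, real]
        if (!dup.isEmpty()) {
            System.out.println("abort: duplicate databases " + dup);
        }
    }
}
```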

Delete duplicate databases if necessary (modify the DEL_DB and DEL_TBL parameters):

# DEL_DB is comma-separated and must include "default"; an empty DEL_TBL deletes all tables
export SOURCE=exchange_db
export DEL_DB=default,nisp_nhids,real,azkaban_autotest_db
export DEL_TBL=
~/java-current/jre/bin/java -cp "./hive-tools-current.jar" com.netease.hivetools.apps.DelMetaData --s=$SOURCE --d=$DEL_DB --t=$DEL_TBL

Run the migration command again.

Inspect the logs. If errors occurred, delete the partially migrated metadata from the destination and retry; if not, verify through the Hive client that the new metadata works.

Following this procedure, NetEase has successfully merged a large number of Hive metastore databases internally without any issues.

Open Source

The hive-tools project is available at https://github.com/NetEase/hive-tools .

Written by Big Data Technology Architecture (Exploring Open Source Big Data and AI Technologies)
