Mastering Hive Small File Management: Strategies to Boost Performance
This article explains why tiny Hive files degrade storage and query efficiency, outlines how they are created, and presents practical Spark and Hive configuration techniques—including dynamic partitioning, AQE, Reduce tuning, and automated daily merge jobs—to effectively consolidate small files and improve overall data‑warehouse performance.
Background
Small files are a long‑standing pain point in data‑warehouse environments because they consume excessive storage space and degrade query performance. Effective governance of these files is essential for maintaining Hive’s efficiency and stability.
How Small Files Are Generated
Daily batch tasks and dynamic-partition inserts (run on Spark 2 or MapReduce) produce a large number of small files, causing a surge in Map tasks.
More Reduce tasks generate more small files, as each Reduce corresponds to an output file.
Source data may already contain many small files, e.g., from APIs or Kafka.
Real‑time data ingestion into Hive also creates many small files.
Impact of Small Files
From Hive’s perspective, each small file triggers a separate Map task, each launching a JVM, leading to massive resource waste and performance loss.
In HDFS, each file, directory, and block object consumes roughly 150 bytes of NameNode memory; a large number of small files therefore bloats the NameNode heap, slowing metadata operations and increasing read/write latency.
Storage consumption rises dramatically; merging can also reclaim space: in one example, the average per-file footprint dropped from 280 KB to 249 KB after merging.
Solutions
2.1 Use Spark 3 to Merge Small Files
Spark’s Adaptive Query Execution (AQE) can automatically coalesce small shuffle partitions. Spark 3.2+ additionally introduces the REBALANCE hint, which leverages AQE to even out partitions: overly small ones are merged and skewed ones are split.
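A minimal sketch of such a merge pass, assuming Spark 3.2+ and placeholder table names (xxx.target_table, xxx.source_table); exact sizes should be tuned per cluster:
<code>-- enable AQE and small-partition coalescing
set spark.sql.adaptive.enabled=true;
set spark.sql.adaptive.coalescePartitions.enabled=true;
set spark.sql.adaptive.advisoryPartitionSizeInBytes=128MB; -- target size after coalescing
-- REBALANCE asks AQE to even out the output partitions before the write
insert overwrite table xxx.target_table partition(ds)
select /*+ REBALANCE(ds) */ *
from xxx.source_table
where ds='${lst1date}';</code>
The advisory partition size should roughly match the desired output file size; 128MB aligns with a common HDFS block size.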
2.2 Reduce the Number of Reduce Tasks
<code>set mapred.reduce.tasks=100; -- set Reduce count (Mapper:Reduce = 10:1)</code>
2.3 Distribute By Rand()
Using distribute by rand() forces a shuffle that distributes rows evenly across Reduce tasks, so each output partition ends up roughly the same size.
<code>where t0.ds='${lst1date}'
and xxx=xxx
distribute by rand()</code>
2.4 Add a Post‑Ingestion Cleanup Task
Run a cleanup job after data transfer to merge small files before downstream consumption.
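As one possible form of such a cleanup step (a sketch, assuming an ORC-backed table with a placeholder name), Hive's built-in CONCATENATE command can merge a partition's files in place:
<code>-- merges the partition's files in place; supported for ORC and RCFile tables
alter table xxx.ods_api_xxxx partition (ds='${lst1date}') concatenate;</code>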
2.5 Daily Scheduled Merge for Real‑Time Data
For real‑time tasks that write to Hive, schedule a daily Spark 3 job to consolidate the previous day’s small files.
<code>set hive.exec.dynamic.partition.mode=nonstrict;
set spark.sql.hive.convertInsertingPartitionedTable=false;
set spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true;
insert overwrite table xxx.ods_kafka_xxxx partition(ds)
select id, xxx_date, xxx_type, ds
from xxx.ods_kafka_xxxx
where ds='${lst1date}';</code>
2.6 Hive Parameters for Merging
<code>set hive.merge.mapfiles=true; -- merge map‑only task output
set hive.merge.mapredfiles=true; -- merge map‑reduce task output
set hive.merge.size.per.task=256000000; -- target size of merged files (default 256 MB)
set hive.merge.smallfiles.avgsize=16000000; -- trigger a merge when average output file size falls below this
set hive.input.format=org.apache.hadoop.hive.ql.io.CombineHiveInputFormat; -- combine small input files before the map phase</code>
2.7 Spark 2 Setting
<code>set spark.sql.finalStage.adaptive.advisoryPartitionSizeInBytes=2048M;</code>
Existing Small File Handling
3.1 Dynamic Partition Refresh with Spark 3
Example code to rewrite partitions dynamically:
<code>set hive.exec.dynamic.partition.mode=nonstrict;
set spark.sql.hive.convertInsertingPartitionedTable=false;
set spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true;
insert overwrite table xxx.xxx partition(ds)
select id, xxx_date, xxx_type, ds
from xxx.xxx
where ds<='2023-04-20' and ds>='2022-04-20';</code>
3.2 Rebuild Table
If the table is unpartitioned, consider dropping and recreating it, then load data with Spark 3.
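A sketch of that rebuild for a hypothetical unpartitioned table xxx.dim_xxxx (names and validation steps are illustrative): copy the data through Spark 3 so files are merged on write, verify, then swap the tables.
<code>create table xxx.dim_xxxx_new like xxx.dim_xxxx;
insert overwrite table xxx.dim_xxxx_new
select * from xxx.dim_xxxx;
-- after validating row counts, swap the tables
alter table xxx.dim_xxxx rename to xxx.dim_xxxx_bak;
alter table xxx.dim_xxxx_new rename to xxx.dim_xxxx;</code>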
Problem Points Encountered
When using Spark 3 with dynamic partitions, inserts into a fixed-date partition merged down to a single file, while dynamic-partition inserts still left many small files. The cause was the missing setting spark.sql.optimizer.insertRepartitionBeforeWriteIfNoShuffle.enabled=true.
Real‑time data ingestion still generates many small files; a daily Spark 3 job can first re‑process the historical data and then keep merging the previous day's (t‑1) files on a schedule.
When Spark 3 writes to a Hive table that Impala also reads, the setting spark.sql.hive.convertInsertingPartitionedTable=false must be added; otherwise the newly written data may not be visible in Impala.
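An operational note (not part of the original workflow): Impala caches HDFS file metadata, so after a merge job rewrites a table's files, Impala typically needs to be told to re-read it:
<code>refresh xxx.ods_kafka_xxxx;             -- pick up new or rewritten files in existing partitions
invalidate metadata xxx.ods_kafka_xxxx;  -- heavier option, for schema or partition changes</code>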
Tool‑Based Small File Governance
NetEase DataFlow EasyData provides a “Small File Governance” service that automatically generates Spark 3 scheduled tasks to merge files daily, validates results, and rolls back on failure, ensuring data quality.
Identify tables with high small‑file counts via trends and storage metrics.
Configure automatic daily scans and merges (excluding tables with very large partition counts).
Monitor task execution similar to regular offline job operations.
Effectiveness
Across X tables, the total small‑file count dropped from 1,217,927 to 680,133, a reduction of roughly 44%.
Data Thinking Notes
Sharing insights on data architecture, governance, and middle platforms, exploring AI in data, and linking data with business scenarios.