
Optimizing Large-Scale Excel Import in Java: From POI to EasyExcel with Caching, Batch Inserts, and Parallel Streams

This article details a step‑by‑step optimization of a high‑volume Excel import workflow in Java, covering the migration from raw POI to EasyPOI and then EasyExcel, caching database lookups, using MySQL batch inserts, and leveraging parallel streams to cut the import of hundreds of thousands of rows to under two minutes.

Code Ape Tech Column

The author describes a real‑world requirement to import over 100,000 rows of payment records from an Excel template into a MySQL database, which originally performed poorly due to per‑row database queries and inserts.

First version (POI + per‑row query + per‑row insert) read the Excel file into a List, then for each row performed field validation, database lookups for duplicate checks, and individual inserts. This approach caused massive network I/O and slow execution.
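The cost profile of this first version can be sketched without a database: counting stand-in round-trips shows why 100,000 rows turn into 200,000 network calls. A minimal illustration (no real DAO involved; the counter simulates the database):

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicLong;

public class NaiveImport {
    // First-version shape: validate, query for duplicates, insert, one DB
    // round-trip per step, per row. The AtomicLong stands in for the database
    // so the number of round-trips becomes visible.
    static long importRows(List<String> rows) {
        AtomicLong dbCalls = new AtomicLong();
        for (String row : rows) {
            if (row == null || row.isEmpty()) continue;  // field validation
            dbCalls.incrementAndGet();                   // SELECT: duplicate check
            dbCalls.incrementAndGet();                   // INSERT: single row
        }
        return dbCalls.get();
    }

    public static void main(String[] args) {
        // 100,000 valid rows mean 200,000 round-trips before any optimization.
        System.out.println(importRows(Arrays.asList("r1", "r2", "r3"))); // 6
    }
}
```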

Second version (EasyPOI + cache + batch insert) introduced three improvements:

Cache lookup data in a HashMap to avoid repeated database queries.

Implement a custom SessionMapper using MapResultHandler to load house data into the cache.

Use MySQL INSERT ... VALUES(...),(...) syntax for batch insertion.

@Repository
public class SessionMapper extends SqlSessionDaoSupport {
    @Resource
    public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
        super.setSqlSessionFactory(sqlSessionFactory);
    }

    // Loads all houses for an area into a Map<lookupKey, houseId> in a single query.
    @SuppressWarnings("unchecked")
    public Map<String, Long> getHouseMapByAreaId(Long areaId) {
        MapResultHandler handler = new MapResultHandler();
        this.getSqlSession().select(BaseUnitMapper.class.getName() + ".getHouseMapByAreaId", areaId, handler);
        return handler.getMappedResults();
    }
}
public class MapResultHandler implements ResultHandler<Map<String, Object>> {
    private final Map<String, Long> mappedResults = new HashMap<>();

    @Override
    public void handleResult(ResultContext<? extends Map<String, Object>> context) {
        // Each row arrives as a {key=..., value=...} map per the resultMap below.
        Map<String, Object> map = context.getResultObject();
        mappedResults.put((String) map.get("key"), (Long) map.get("value"));
    }

    public Map<String, Long> getMappedResults() { return mappedResults; }
}
@Mapper
@Repository
public interface BaseUnitMapper {
    Map<String, Long> getHouseMapByAreaId(@Param("areaId") Long areaId);
}
<select id="getHouseMapByAreaId" resultMap="mapResultLong">
    SELECT CONCAT(h.bulid_area_name, h.build_name, h.unit_name, h.house_num) k,
           h.house_id v
    FROM base_house h
    WHERE h.area_id = #{areaId}
    GROUP BY h.house_id
</select>
<resultMap id="mapResultLong" type="java.util.HashMap">
    <result property="key" column="k" javaType="string" jdbcType="VARCHAR"/>
    <result property="value" column="v" javaType="long" jdbcType="INTEGER"/>
</resultMap>
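Once getHouseMapByAreaId has loaded the cache, each Excel row resolves its house_id with a pure in-memory lookup instead of a query. A minimal sketch (method and field names here are illustrative, not from the original code):

```java
import java.util.HashMap;
import java.util.Map;

public class HouseCacheDemo {
    // Builds the same key the mapper's CONCAT(...) produces, so an Excel row
    // can be matched against the cache without touching the database.
    static String houseKey(String area, String build, String unit, String houseNum) {
        return area + build + unit + houseNum;
    }

    // Per-row lookup against the cache loaded once via getHouseMapByAreaId.
    static Long resolveHouseId(Map<String, Long> cache, String area, String build,
                               String unit, String houseNum) {
        return cache.get(houseKey(area, build, unit, houseNum));
    }

    public static void main(String[] args) {
        Map<String, Long> cache = new HashMap<>();
        cache.put(houseKey("AreaA", "Bld1", "U2", "301"), 42L);

        System.out.println(resolveHouseId(cache, "AreaA", "Bld1", "U2", "301")); // 42
        System.out.println(resolveHouseId(cache, "AreaA", "Bld1", "U2", "999")); // null: invalid row
    }
}
```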

Batch insert statements were written as:

<insert id="insertList">
    insert into table(column1, column2)
    values
    <foreach collection="list" item="item" separator=",">
        (#{item.column1}, #{item.column2})
    </foreach>
</insert>
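At runtime the <foreach> expands into a single multi-row INSERT, so each batch costs one round-trip. The sketch below only illustrates the generated statement shape (MyBatis itself sends parameterized placeholders, and the column names are placeholders):

```java
import java.util.Arrays;
import java.util.List;
import java.util.StringJoiner;

public class BatchSqlDemo {
    // Builds the statement shape the <foreach> produces: one INSERT carrying
    // every row of the batch. Literal values appear here only for display;
    // MyBatis uses parameter placeholders in the real statement.
    static String expand(String table, List<String[]> rows) {
        StringJoiner values = new StringJoiner(", ");
        for (String[] r : rows) {
            values.add("('" + r[0] + "', '" + r[1] + "')");
        }
        return "insert into " + table + "(column1, column2) values " + values;
    }

    public static void main(String[] args) {
        List<String[]> rows = Arrays.asList(new String[][] {{"a", "1"}, {"b", "2"}});
        System.out.println(expand("t", rows));
        // insert into t(column1, column2) values ('a', '1'), ('b', '2')
    }
}
```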

When the data volume grew to 410,000 rows, EasyPOI caused an OOM; the author switched to Alibaba's EasyExcel, which reads a file of that size in about 50 seconds.

Third version (EasyExcel + cache + batch insert) kept the caching strategy and batch inserts while using EasyExcel for faster reading.

Fourth version (Optimizing batch size) experimented with different batch sizes (30,000; 10,000; 5,000; 3,000; 1,000; 500) and found 1,000 rows per batch gave the best performance, likely due to InnoDB buffer limits.

To fully utilize CPU while waiting for I/O, a parallel‑stream based utility class was created:

public class InsertConsumer {
    // Rows per batch; 1,000 performed best in the author's tests.
    private static final int SIZE = 1000;

    static {
        // Cap the common ForkJoinPool so parallel inserts don't exhaust DB connections.
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");
    }

    public static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
        if (list == null || list.isEmpty()) return;
        // Partition the full list into sublists of at most SIZE rows.
        List<List<T>> streamList = new ArrayList<>();
        for (int i = 0; i < list.size(); i += SIZE) {
            int j = Math.min(i + SIZE, list.size());
            streamList.add(list.subList(i, j));
        }
        // Insert the batches concurrently on the common pool.
        streamList.parallelStream().forEach(consumer);
    }
}

Usage example:

InsertConsumer.insertData(feeList, arrearageMapper::insertList);
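To see the partition-then-parallel behavior without a database, a counting consumer can stand in for arrearageMapper::insertList. The sketch below repeats the same partitioning logic so it is self-contained:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Consumer;

public class InsertConsumerDemo {
    private static final int SIZE = 1000;

    // Same partition-then-parallel shape as InsertConsumer above.
    static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
        if (list == null || list.isEmpty()) return;
        List<List<T>> streamList = new ArrayList<>();
        for (int i = 0; i < list.size(); i += SIZE) {
            streamList.add(list.subList(i, Math.min(i + SIZE, list.size())));
        }
        streamList.parallelStream().forEach(consumer);
    }

    public static void main(String[] args) {
        List<Integer> rows = new ArrayList<>();
        for (int i = 0; i < 4500; i++) rows.add(i);

        AtomicInteger batches = new AtomicInteger();
        AtomicInteger total = new AtomicInteger();
        // The counting consumer stands in for arrearageMapper::insertList.
        insertData(rows, batch -> { batches.incrementAndGet(); total.addAndGet(batch.size()); });

        System.out.println(batches.get() + " batches, " + total.get() + " rows"); // 5 batches, 4500 rows
    }
}
```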

The author also highlighted the impact of excessive logging: printing info logs inside loops increased execution time tenfold, so removing them dramatically improved speed.
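The standard fix is to drop per-row logging to a disabled level (or remove it) and emit one summary line after the loop. A sketch with java.util.logging (the logger setup and names are illustrative):

```java
import java.util.logging.Handler;
import java.util.logging.Level;
import java.util.logging.LogRecord;
import java.util.logging.Logger;

public class LoopLoggingDemo {
    // Counts records that actually reach a handler, to show how much log
    // traffic the loop generates.
    static class CountingHandler extends Handler {
        int published;
        @Override public void publish(LogRecord record) { published++; }
        @Override public void flush() {}
        @Override public void close() {}
    }

    static int runImport(int rows, Logger log) {
        for (int i = 0; i < rows; i++) {
            // Per-row detail stays at FINE: suppressed unless explicitly
            // enabled, so the hot loop does no logging I/O.
            if (log.isLoggable(Level.FINE)) {
                log.fine("imported row " + i);
            }
        }
        log.info("imported " + rows + " rows"); // one summary line instead of N
        return rows;
    }

    public static void main(String[] args) {
        Logger log = Logger.getLogger("import");
        log.setUseParentHandlers(false);
        log.setLevel(Level.INFO); // FINE disabled: loop logs are skipped entirely
        CountingHandler handler = new CountingHandler();
        log.addHandler(handler);

        runImport(100_000, log);
        System.out.println(handler.published + " log record(s) emitted"); // 1
    }
}
```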

Summary of optimization tips:

Use a fast Excel library such as EasyExcel.

Cache data needed for validation to trade space for time.

Batch insert with INSERT ... VALUES(...),(...).

Leverage multithreading (parallel streams) for concurrent inserts.

Avoid unnecessary logging inside tight loops.

Tags: Java, Performance, MySQL, batch insert, EasyExcel, Excel, Parallel Stream
Written by

Code Ape Tech Column

Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn
