Optimizing Large‑Scale Excel Import in Java: From POI to EasyExcel with Caching, Batch Inserts, and Parallel Streams
This article describes how to dramatically speed up the import of massive Excel files in a Java backend by replacing raw POI with EasyExcel, caching database lookups, using MySQL batch inserts, and employing parallel streams — cutting an import of over 100,000 rows (and 400,000+ resulting database records) down to just over a minute.
The project required importing payment records from an Excel template into the system, converting the data into arrears, bills, and bill details, and persisting them to a MySQL 5.7 database. The original implementation handled only a few hundred rows, but version 4.0 needed to process over 100,000 rows and insert more than 300,000 database records.
To meet this demand, the author iteratively refactored the import code, documenting each performance bottleneck and its solution.
First Version: POI + Row‑by‑Row Validation + Row‑by‑Row Insert
The initial version used raw Apache POI to read every row into a List<ArrayList<String>>, then performed per-row field validation, a database lookup per row for duplicate checks, and a single-row INSERT per row. Every row therefore cost at least two database round trips, causing massive network I/O and CPU overhead.
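The first version's code is not shown in the article; as a schematic stand-in (class and method names are hypothetical, and a counter replaces the real MySQL calls), this is the shape of the row-by-row flow and why it is slow — each row pays one duplicate-check query plus one single-row insert:

```java
import java.util.ArrayList;
import java.util.List;

// Schematic stand-in for the first version's row-by-row import loop.
// FakeDao only counts calls; in the real code each call was a network
// round trip to MySQL, which is what made this version so slow.
public class RowByRowImport {

    static class FakeDao {
        int roundTrips = 0;

        boolean existsByKey(String key) { // duplicate check: one query per row
            roundTrips++;
            return false;
        }

        void insertOne(List<String> row) { // one single-row INSERT per row
            roundTrips++;
        }
    }

    static int importRows(List<List<String>> rows, FakeDao dao) {
        for (List<String> row : rows) {
            if (row.isEmpty() || row.get(0).isBlank()) {
                continue; // per-row field validation (simplified)
            }
            if (!dao.existsByKey(row.get(0))) {
                dao.insertOne(row);
            }
        }
        return dao.roundTrips;
    }

    public static void main(String[] args) {
        List<List<String>> rows = new ArrayList<>();
        for (int i = 0; i < 1000; i++) {
            rows.add(List.of("key-" + i, "value-" + i));
        }
        // 1,000 rows already cost 2,000 round trips;
        // at 100,000+ rows this pattern dominates the runtime.
        System.out.println(importRows(rows, new FakeDao()));
    }
}
```

The later versions attack exactly these two per-row round trips: the lookup moves into an in-memory cache, and the inserts are batched.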
Second Version: EasyPOI + Cached Database Queries + Batch Insert
Three optimizations were applied:
Cache frequently queried data (e.g., house information) in a HashMap to replace per‑row database lookups.
Introduce a custom SessionMapper that uses MapResultHandler to load the cache in a single query.
Use MySQL's multi-row INSERT … VALUES syntax via a MyBatis <foreach> to insert many rows in one statement.

```java
@Repository
public class SessionMapper extends SqlSessionDaoSupport {

    @Resource
    public void setSqlSessionFactory(SqlSessionFactory sqlSessionFactory) {
        super.setSqlSessionFactory(sqlSessionFactory);
    }

    // area + building + unit + room number -> house ID
    @SuppressWarnings("unchecked")
    public Map getHouseMapByAreaId(Long areaId) {
        MapResultHandler handler = new MapResultHandler();
        this.getSqlSession().select(
                BaseUnitMapper.class.getName() + ".getHouseMapByAreaId", areaId, handler);
        return handler.getMappedResults();
    }
}

public class MapResultHandler implements ResultHandler {

    private final Map mappedResults = new HashMap();

    @Override
    public void handleResult(ResultContext context) {
        @SuppressWarnings("rawtypes")
        Map map = (Map) context.getResultObject();
        mappedResults.put(map.get("key"), map.get("value"));
    }

    public Map getMappedResults() {
        return mappedResults;
    }
}
```

Batch insert example:

```xml
<insert id="insertList">
    insert into table(col1, col2)
    values
    <foreach collection="list" item="item" separator=",">
        (#{item.col1}, #{item.col2})
    </foreach>
</insert>
```

Third Version: EasyExcel + Cached Queries + Batch Insert

Switching to Alibaba's EasyExcel reduced memory consumption and read time dramatically: 410,000 rows across 25 columns (a 45.5 MB file) were read in about 50 seconds. The same caching and batch-insert strategies were retained.

Fourth Version: Optimizing Insert Batch Size and Parallelism

Experiments showed that inserting 1,000 rows per batch yielded the best throughput, likely because larger batches exceeded the InnoDB buffer pool and caused disk swapping. Parallel streams were then used to run the batch inserts concurrently, bringing the whole import down to 72 seconds for 100,000 Excel rows and roughly 420,000 related database records.
```java
/**
 * Function: quickly insert data using parallel streams.
 *
 * @author Keats
 * @date 2020/7/1 9:25
 */
public class InsertConsumer {

    /** Batch size that performed best in the experiments */
    private static final int SIZE = 1000;

    static {
        // Cap the common ForkJoinPool; note this only takes effect if it
        // runs before the common pool is first used anywhere in the JVM.
        System.setProperty("java.util.concurrent.ForkJoinPool.common.parallelism", "4");
    }

    public static <T> void insertData(List<T> list, Consumer<List<T>> consumer) {
        if (list == null || list.size() < 1) {
            return;
        }
        // Split the full list into sub-lists of at most SIZE rows each
        List<List<T>> streamList = new ArrayList<>();
        for (int i = 0; i < list.size(); i += SIZE) {
            int j = Math.min(i + SIZE, list.size());
            streamList.add(list.subList(i, j));
        }
        // Insert the batches concurrently
        streamList.parallelStream().forEach(consumer);
    }
}
```

Usage example:

```java
InsertConsumer.insertData(feeList, arrearageMapper::insertList);
```

Other Performance Tips

Heavy logging inside loops (e.g., printing an INFO line for every 1,000 rows) can increase processing time by an order of magnitude. Disabling those logs reduced the validation phase to roughly 10% of its original duration.

Conclusion

Use a fast Excel library such as EasyExcel.
Cache the data needed for database validation, trading space for time.
Batch-insert rows with the multi-row INSERT … VALUES syntax.
Leverage multithreading (parallel streams) to hide I/O latency.
Avoid unnecessary logging inside tight loops.
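To make the batch-insert point concrete outside MyBatis, here is a minimal sketch of the multi-row statement that the <foreach> mapping above expands to (the table and column names are illustrative; real code should use a PreparedStatement with placeholders rather than inlined string values):

```java
import java.util.List;
import java.util.StringJoiner;

// Illustrative only: builds the same multi-row INSERT that the MyBatis
// <foreach> generates, so 1,000 rows travel in one statement instead of
// 1,000 separate round trips.
public class BatchSqlBuilder {

    // Each String[] is one row's (col1, col2) pair. Inlining values like
    // this is unsafe against SQL injection; it is done here only to show
    // the shape of the generated statement.
    static String buildInsert(String table, List<String[]> rows) {
        StringJoiner values = new StringJoiner(", ");
        for (String[] row : rows) {
            values.add("('" + row[0] + "', '" + row[1] + "')");
        }
        return "insert into " + table + "(col1, col2) values " + values;
    }

    public static void main(String[] args) {
        List<String[]> rows = List.of(
                new String[]{"a1", "b1"},
                new String[]{"a2", "b2"},
                new String[]{"a3", "b3"});
        System.out.println(buildInsert("fee", rows));
        // -> insert into fee(col1, col2) values ('a1', 'b1'), ('a2', 'b2'), ('a3', 'b3')
    }
}
```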
Code Ape Tech Column
Former Ant Group P8 engineer, pure technologist, sharing full‑stack Java, job interview and career advice through a column. Site: java-family.cn