Managing Database Intermediate Tables with File Storage Using SPL
The article explains how excessive intermediate tables generated by reporting workloads degrade database storage and performance, and proposes using the SPL data‑processing tool to store these intermediate results as external files, thereby reducing capacity pressure, improving I/O speed, and simplifying management.
Many data‑management professionals struggle with the sheer number of intermediate tables in their databases. These tables are not essential base data; they are created by report calculations and accumulate over time. They burden database management, performance, and capacity, yet deleting them is risky because it is rarely clear which reports still depend on them.
These problematic intermediate tables are largely produced by front‑end reporting workloads.
They arise for several reasons. When the raw data volume is huge, pre‑aggregated tables are stored so that reports respond quickly. Complex calculations that cannot be completed in a single SQL statement require intermediate results to be stored between steps. And because multiple reports often need the same intermediate results, the same data ends up duplicated across tables.
External data sources also contribute to intermediate tables when non‑relational data (e.g., JSON, XML, text files) are periodically imported into the database for joint processing.
The hazards of massive intermediate tables include consuming large storage space, triggering capacity alerts that force costly horizontal or vertical scaling, and competing for database compute resources, which degrades overall performance.
The proposed solution is to keep the concept of intermediate data but stop persisting it in the database; instead, store it as files. Historically this was impractical because reports had no convenient way to compute on files, but the integration of SPL (a free, open‑source data‑processing engine) into Runqian reports now makes file‑based computation feasible.
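As a rough illustration of the idea (in Python rather than SPL, and not the article's own implementation), the sketch below runs an aggregation query and writes the result to a tab‑delimited text file with a header row instead of creating an intermediate table. The function name export_intermediate and the use of sqlite3 as the database are assumptions made for the example.

```python
import csv
import sqlite3
from pathlib import Path


def export_intermediate(conn, sql, out_path):
    """Run an aggregation query and write its result to a tab-delimited
    file with a header row, instead of persisting it as a database table.

    `conn` is any sqlite3 connection (a stand-in for the real database);
    the function name and signature are hypothetical.
    Returns the number of data rows written.
    """
    out_path = Path(out_path)
    out_path.parent.mkdir(parents=True, exist_ok=True)

    cur = conn.execute(sql)
    header = [col[0] for col in cur.description]  # column names from the query
    rows = cur.fetchall()

    with out_path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f, delimiter="\t")
        writer.writerow(header)
        writer.writerows(rows)
    return len(rows)
```

The database then holds only the raw table, and the report reads the exported file whenever it needs the pre‑aggregated data.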
SPL is widely used for data calculations, offering powerful yet simple capabilities for engineers who regularly handle data processing tasks.
With SPL, reports can compute directly on file data (plain text or SPL’s high‑performance binary format) instead of database tables. For example, to find the regions whose 2012 sales exceed 8 million, the script imports the file, filters by year, groups by area, and keeps the high‑value groups. Each line below is one cell of the SPL script; y_date appears to be the report parameter that carries the year, and the threshold is written as 8000, which suggests that amounts are recorded in thousands:

    A1  =file("order_year_area_person.txt").import@t()
    A2  =A1.select(dyear==y_date)
    A3  =A2.groups(area;count(name):pnum,sum(amount):amount)
    A4  =A3.select(amount>8000)

Alternatively, the same logic can be expressed in SQL syntax:

    A1  =connect()
    A2  =A1.query("select area,count(name) pnum,sum(amount) amount from order_year_area_person.txt where dyear=? group by area having sum(amount)>8000",y_date)

Storing intermediate results as files simplifies management. The database only needs to keep the raw data tables, while the files can be organized in a hierarchical directory structure and deleted along with the reports that use them when those reports are retired, which removes the fear of accidental data loss.
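The directory‑based housekeeping described above can be sketched as follows. This is a Python illustration under assumed conventions, not something prescribed by SPL or Runqian: the layout root/module/report/ and the helper names intermediate_file and retire_report are hypothetical.

```python
import shutil
from pathlib import Path


def intermediate_file(root, module, report, name):
    """Return the path for one report's intermediate file, creating the
    directory <root>/<module>/<report>/ if it does not exist yet.
    (Hypothetical layout: one directory per report.)"""
    directory = Path(root) / module / report
    directory.mkdir(parents=True, exist_ok=True)
    return directory / name


def retire_report(root, module, report):
    """Delete every intermediate file owned by a retired report.
    Returns True if a directory was removed, False if none existed."""
    directory = Path(root) / module / report
    if not directory.is_dir():
        return False
    shutil.rmtree(directory)
    return True
```

Because each report owns its own directory in this layout, retiring a report cannot touch another report's files. Intermediate results shared by several reports would need a separate location, which this sketch does not cover.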
File storage also offers superior I/O performance because the file system operates closer to the disk. Using SPL’s binary file format further improves throughput with compact storage, optional compression, and no transaction overhead.
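To give a feel for why a compact binary layout with optional compression can be smaller than plain text, here is a small Python sketch. It does not reproduce SPL's binary format: the fixed record layout (year, area id, amount) and the function names are assumptions chosen for illustration, and zlib stands in for whatever compression the real format uses.

```python
import struct
import zlib

# Assumed fixed-width record: year (uint16), area id (uint16), amount (float64).
# "<" disables padding, so every record is exactly 12 bytes.
ROW = struct.Struct("<HHd")


def encode_text(rows):
    """Encode rows as tab-delimited UTF-8 text, one record per line."""
    return "".join(f"{y}\t{a}\t{amt}\n" for y, a, amt in rows).encode("utf-8")


def encode_binary(rows, compress=True):
    """Pack rows into fixed-width binary records, optionally zlib-compressed."""
    raw = b"".join(ROW.pack(y, a, amt) for y, a, amt in rows)
    return zlib.compress(raw) if compress else raw


def decode_binary(blob, compressed=True):
    """Inverse of encode_binary: return the rows as a list of tuples."""
    raw = zlib.decompress(blob) if compressed else blob
    return [ROW.unpack_from(raw, off) for off in range(0, len(raw), ROW.size)]
```

Comparing len(encode_text(rows)) with len(encode_binary(rows)) on your own data shows how much the layout and compression save; the binary form also avoids parsing numbers from text when it is read back.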
SPL additionally supports heterogeneous data sources, allowing multi‑source mixed calculations without importing data into relational databases, thereby reducing the need for additional intermediate tables.
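A mixed calculation of this kind can be sketched in Python (again as an analogy, not SPL code): tab‑delimited order rows are joined with a JSON area‑to‑region mapping and totalled per region, entirely in memory and without loading either source into a database. The field names area, region, and amount and the function name sales_by_region are hypothetical.

```python
import csv
import io
import json
from collections import defaultdict


def sales_by_region(orders_tsv, regions_json):
    """Join tab-delimited order rows with a JSON area-to-region mapping
    and total the amounts per region.

    orders_tsv:   text with a header row containing "area" and "amount"
    regions_json: JSON array of objects with "area" and "region" keys
    Areas missing from the mapping are grouped under "unknown".
    """
    region_of = {r["area"]: r["region"] for r in json.loads(regions_json)}

    totals = defaultdict(float)
    for row in csv.DictReader(io.StringIO(orders_tsv), delimiter="\t"):
        region = region_of.get(row["area"], "unknown")
        totals[region] += float(row["amount"])
    return dict(totals)
```

Since neither source is imported into the database, no intermediate table is created just to make the two formats joinable.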
In summary, intermediate data are valuable for reporting, but their proliferation in the database creates problems; SPL moves these intermediate tables out of the database while preserving their benefits, resulting in a more efficient and manageable data architecture.
Java Captain
Focused on Java technologies: SSM, the Spring ecosystem, microservices, MySQL, MyCat, clustering, distributed systems, middleware, Linux, networking, multithreading; occasionally covers DevOps tools like Jenkins, Nexus, Docker, ELK; shares practical tech insights and is dedicated to full‑stack Java development.