Evolution and Architecture of the Transportation Division Data Warehouse
The article details how the Transportation Division’s data warehouse grew from a simple SQL‑based solution to a multi‑layer, big‑data platform handling petabyte‑scale data with daily 10 TB increments, describing the technical and business architecture, ETL strategies, and future roadmap.
The Transportation Division has grown rapidly: its data volume has reached the petabyte level, with daily increments of about 10 TB. That growth places higher demands on the scalability, stability, and usability of its data warehouse. In response, the division adopted a layered, subject‑oriented structure combined with ETL processes. This article explains both the technical and the business architecture of the warehouse.
In the early stage, the focus was on quickly supporting business needs, so the warehouse architecture was kept simple and designed only after business requirements were clear.
The first‑generation warehouse was built when the division offered only train tickets and its data volume was less than 1 % of today’s. Analysts extracted data from a single commercial database using SQL, but the raw data was messy and querying it directly was inefficient. The team therefore adopted Microsoft BI tools (SSIS + database + CDC) to run T+1 daily incremental ETL, with template‑based development, monitoring, lineage tracking, and retry mechanisms. This setup supported both offline and near‑real‑time applications.
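The T+1 incremental load and the retry mechanism described above can be illustrated with a minimal Python sketch. This is not the division's SSIS implementation; the function names, the `updated_at` field, and the retry count are all assumptions made for illustration.

```python
from datetime import date, datetime, timedelta


def t_plus_1_window(run_date):
    """Return the half-open window [start, end) covering the day before run_date."""
    end = datetime.combine(run_date, datetime.min.time())
    start = end - timedelta(days=1)
    return start, end


def extract_increment(rows, run_date):
    """Keep only rows changed during the previous day (hypothetical 'updated_at' field)."""
    start, end = t_plus_1_window(run_date)
    return [r for r in rows if start <= r["updated_at"] < end]


def run_with_retry(task, max_attempts=3):
    """Run task(); on failure, try again up to max_attempts times in total."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return task()
        except Exception as exc:  # a real job would catch narrower error types
            last_error = exc
    raise RuntimeError(f"task failed after {max_attempts} attempts") from last_error


if __name__ == "__main__":
    source_rows = [
        {"id": 1, "updated_at": datetime(2017, 3, 14, 9, 30)},
        {"id": 2, "updated_at": datetime(2017, 3, 15, 0, 5)},
    ]
    # The job that runs on 2017-03-15 loads changes made on 2017-03-14.
    loaded = run_with_retry(lambda: extract_increment(source_rows, date(2017, 3, 15)))
    print(loaded)
```

The half-open window keeps a row stamped exactly at midnight from being loaded twice by two consecutive runs.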
As data grew, a single server became insufficient; long cleaning tasks (over 4 hours) and other issues led to a partnership with a public data center to build a second‑generation warehouse on a big‑data platform.
The second‑generation architecture became more complex: it went through two major upgrades and integrated multiple clusters and databases. With more than ten data sources, a combination of SQOOP, KAFKA, MQ, API, XDATA, FLUME, STORM, and SSIS was used to feed an ODS (operational data store) layer, followed by extensive offline cleaning. The result was a large improvement in scalability.
Today the warehouse handles petabyte‑scale data with daily increments of about 10 TB, using a mixed cleaning workflow of full, incremental, and near‑real‑time processes. Log data is loaded incrementally into daily partitions, while order data is updated incrementally over a six‑month window. Many tables of different sizes are maintained to reduce Hadoop resource consumption and keep ETL efficiency high.
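One way to read the mixed strategy above is as a rule that decides which partitions each run rewrites. The sketch below is an interpretation, not the division's actual job logic: it assumes log tables are partitioned by day, order tables are partitioned by month, and the six‑month window means the trailing six monthly partitions are refreshed. The function name and the table kinds are hypothetical.

```python
from datetime import date


def partitions_to_refresh(table_kind, data_date, window_months=6):
    """Return the partition keys a run for data_date would rewrite.

    Assumptions (not from the source): logs use daily partitions and never
    change after the day closes; orders use monthly partitions and can still
    change for up to window_months months.
    """
    if table_kind == "log":
        # Append-only data: only the day being loaded is touched.
        return [data_date.isoformat()]
    if table_kind == "order":
        # Count months linearly so the window can cross a year boundary.
        last = data_date.year * 12 + (data_date.month - 1)
        first = last - (window_months - 1)
        return [f"{m // 12:04d}-{m % 12 + 1:02d}" for m in range(first, last + 1)]
    raise ValueError(f"unknown table kind: {table_kind}")


if __name__ == "__main__":
    print(partitions_to_refresh("log", date(2017, 3, 15)))
    print(partitions_to_refresh("order", date(2017, 3, 15)))
```

Limiting order refreshes to a bounded window avoids a full rebuild each day while still picking up late changes such as refunds or rebookings, which is one plausible reason for the design.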
This architecture supports multiple business lines (train tickets, ground transport, international travel, shipping) and enables data‑mining applications such as seat‑availability alerts, intelligent recommendations, transfer suggestions, and smart transportation services, providing a solid foundation for future business intelligence.
Initially, the business logic was simple, so no formal data‑mart layer was built; instead, lightly cleaned detail data was directly aggregated for reporting and dashboards.
The second‑generation warehouse now has five layers: three foundation layers provided by the data center, a FACT layer for detailed data, and a CUBE (data‑mart) layer created at the end of 2016 to improve usability and extensibility. An application layer sits on top of these.
In 2016, with a four‑fold increase in orders, the warehouse adopted a star schema for key business subjects (orders, tickets, revenue, members, etc.) and later transitioned to a snowflake model to better support multidimensional analysis and complex business coupling.
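The difference between the two models can be shown with a small, self‑contained sketch using Python's built‑in `sqlite3` module. All table and column names here are hypothetical and the data is made up. The fact table joins directly to its dimensions (the star part), and the station dimension is further normalized into a city table (the snowflake part).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
-- Snowflaked dimension: stations point to a separate city table.
CREATE TABLE dim_city    (city_id INTEGER PRIMARY KEY, city_name TEXT);
CREATE TABLE dim_station (station_id INTEGER PRIMARY KEY, station_name TEXT,
                          city_id INTEGER REFERENCES dim_city(city_id));
-- Plain star dimension.
CREATE TABLE dim_product (product_id INTEGER PRIMARY KEY, product_name TEXT);
-- Fact table: one row per order, with foreign keys to the dimensions.
CREATE TABLE fact_order  (order_id INTEGER PRIMARY KEY,
                          product_id INTEGER REFERENCES dim_product(product_id),
                          station_id INTEGER REFERENCES dim_station(station_id),
                          amount REAL);

INSERT INTO dim_city    VALUES (1, 'Suzhou'), (2, 'Shanghai');
INSERT INTO dim_station VALUES (10, 'Suzhou North', 1), (11, 'Suzhou', 1),
                               (20, 'Shanghai Hongqiao', 2);
INSERT INTO dim_product VALUES (1, 'train'), (2, 'bus');
INSERT INTO fact_order  VALUES (1, 1, 10, 100.0), (2, 1, 11, 50.0),
                               (3, 2, 20, 30.0),  (4, 1, 20, 200.0);
""")


def revenue_by_city_and_product(connection):
    """Aggregate order amounts across the product and station/city dimensions."""
    return connection.execute("""
        SELECT c.city_name, p.product_name, SUM(f.amount)
        FROM fact_order f
        JOIN dim_product p ON p.product_id = f.product_id
        JOIN dim_station s ON s.station_id = f.station_id
        JOIN dim_city    c ON c.city_id    = s.city_id
        GROUP BY c.city_name, p.product_name
        ORDER BY c.city_name, p.product_name
    """).fetchall()


if __name__ == "__main__":
    for row in revenue_by_city_and_product(conn):
        print(row)
```

The extra join through `dim_city` is the cost of the snowflake form. In exchange, attributes shared by several dimensions are stored once, which tends to help when several business lines reuse the same reference data.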
Future work includes automated warehousing, data‑exchange platforms, governance, security, real‑time data platforms, recommendation engines, self‑service multidimensional analysis, and machine‑learning platforms; emphasis will be on data accuracy, security, coverage, timeliness, and ease of use as data becomes increasingly critical for decision‑making.
The author reflects on two years of responsibility for the warehouse, emphasizing that the ultimate goal of any data‑warehouse effort is to enable business growth, and that a warehouse is valuable only when it delivers real business value.
Tongcheng Travel Technology Center