T3 Travel’s Modern Data Stack and Feature Platform: Architecture and Practices
This article details T3 Travel’s exploration of the Modern Data Stack: what the stack is, the business scenarios that motivated it, the initial MDS implementation built on Apache Hudi and Apache Kyuubi, and the design of a feature platform that integrates MetricFlow, Feast, and other components to support data processing, analytics, and machine‑learning workflows.
01 What Is the Modern Data Stack?
The Modern Data Stack is a recent term referring to a suite of tools built around a data warehouse that simplify data handling for internal teams (algorithm, data engineering, and analytics), thereby improving overall decision‑making efficiency.
1. Characteristics of the Modern Data Stack
It combines various big‑data components to address complex data processing scenarios, aiming to seamlessly integrate and manage data pipelines.
2. Why a Modern Data Stack?
Historically, enterprises were limited to a few traditional databases (e.g., Oracle, IBM). With growing data volume, richer application ecosystems, and cloud adoption, costs have dropped and choices have expanded, allowing tailored, cost‑effective architectures.
3. Composition of the Modern Data Stack
Four layers: unified storage (eliminate data silos), data processing (ETL, scheduling), data analysis (extract insights), and data intelligence (large‑scale ML/DL).
02 T3 Travel Business Scenarios
T3 Travel is a smart mobility platform generating massive, diverse data from vehicle networks. Traditional data warehouses struggled with long‑tail order payments, unstructured data, and numerous small files.
1. Long‑Tail Payments
Order payment cycles can span months, creating extended business windows and costly cascade updates.
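To make the cost of long business windows concrete, the sketch below mimics "latest record wins" upsert semantics, similar to how a Hudi‑style precombine field resolves a payment that arrives months after the order. Only the affected row is rewritten rather than cascading updates across historical partitions. All field names here are illustrative, not T3's actual schema.

```python
# Minimal sketch of "latest record wins" upsert semantics, mimicking how a
# Hudi-style precombine field resolves late-arriving payment updates.
# All names here are illustrative, not T3's actual schema.

def upsert(table: dict, records: list, key: str = "order_id",
           precombine: str = "updated_at") -> dict:
    """Merge records into table by key; the record with the larger
    precombine value (e.g. an update timestamp) wins."""
    for rec in records:
        existing = table.get(rec[key])
        if existing is None or rec[precombine] >= existing[precombine]:
            table[rec[key]] = rec
    return table

orders = {}
upsert(orders, [{"order_id": "o1", "status": "unpaid", "updated_at": 1}])
# A payment arrives months later: only this one row is rewritten.
upsert(orders, [{"order_id": "o1", "status": "paid", "updated_at": 90}])
```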
2. Unstructured Data & Small Files
Besides structured records, T3 handles audio‑video, radar point‑cloud, and log data, leading to many small files and low‑latency requirements.
3. Algorithmic Business Scenarios
Marketing (user profiling, ads), risk control (safety, liability), and fleet dispatch (vehicle management) rely on processed data.
03 T3 Travel MDS Initial Build
The initial modern stack centers on Apache Hudi and Apache Kyuubi.
1. Apache Hudi
Hudi provides a streaming lake‑warehouse platform with atomic updates, supporting copy‑on‑write (COW) and merge‑on‑read (MOR) table types, multiple query modes, and object‑storage integration (OBS, OSS, S3).
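The core of a Hudi write is a handful of DataSource options. The option keys below are standard Hudi configs; the table and field names are illustrative, and the actual write assumes a Spark session with the Hudi bundle on the classpath.

```python
# Sketch: the core Spark DataSource options for a Hudi upsert.
# Option keys are standard Hudi configs; table/field names are illustrative.

def hudi_upsert_options(table: str, record_key: str, precombine_field: str,
                        table_type: str = "MERGE_ON_READ") -> dict:
    assert table_type in ("COPY_ON_WRITE", "MERGE_ON_READ")
    return {
        "hoodie.table.name": table,
        "hoodie.datasource.write.operation": "upsert",
        "hoodie.datasource.write.recordkey.field": record_key,
        "hoodie.datasource.write.precombine.field": precombine_field,
        "hoodie.datasource.write.table.type": table_type,
    }

opts = hudi_upsert_options("orders", "order_id", "updated_at")
# With a Spark session that has the Hudi bundle on the classpath:
#   df.write.format("hudi").options(**opts).mode("append").save("obs://bucket/orders")
```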
2. Apache Kyuubi
Kyuubi acts as a unified Thrift JDBC/ODBC gateway, adding multi‑tenant and high‑availability features to Spark Thrift Server and extending support to Doris, Trino, Presto, and Flink.
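Because Kyuubi speaks the HiveServer2 Thrift protocol (default frontend port 10009), clients connect with an ordinary Hive JDBC URL, and Spark confs appended after `#` apply per session, which is one way per‑tenant queues and resources can be set. A minimal sketch, with hypothetical host and queue names:

```python
# Sketch: building a Kyuubi JDBC URL. Kyuubi speaks the HiveServer2 Thrift
# protocol (default port 10009); confs after '#' apply to the session,
# enabling per-tenant queues/resources. Host and queue names are illustrative.

def kyuubi_jdbc_url(host: str, port: int = 10009, database: str = "default",
                    spark_confs: dict = None) -> str:
    url = f"jdbc:hive2://{host}:{port}/{database}"
    if spark_confs:
        url += "#" + ";".join(f"{k}={v}" for k, v in sorted(spark_confs.items()))
    return url

url = kyuubi_jdbc_url("kyuubi.example.com",
                      spark_confs={"spark.yarn.queue": "tenant_a",
                                   "spark.executor.memory": "4g"})
```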
3. Data Analysis Process
Analysts use HUE or BI tools to connect through Kyuubi, query Spark‑processed Hudi data, and visualize results.
4. Data Processing Process
Dolphin Scheduler orchestrates Spark jobs via Kyuubi, achieving tenant‑isolated resource allocation; the system handles over 50,000 daily tasks reliably.
5. Overall Data Lake Architecture
The architecture combines Hudi for storage, Kyuubi as a unified gateway, Dolphin Scheduler for orchestration, OBS for object storage, and Hive Metastore for metadata.
04 Feature Platform On MDS
1. Model Development Workflow
Data engineers collect raw data, Spark cleans and extracts feature datasets, which are then used by algorithm engineers for model training and deployment; online services consume these features for inference.
2. Feature Platform Role
The platform centralizes feature metadata, provides ETL capabilities, and reduces latency by moving preprocessing out of online model services.
3. Overall Feature Platform Process
Features are extracted, processed, and managed centrally, then served to both offline model training and online inference.
4. Technology Stack Selection
MetricFlow is used for metric definition and SQL generation; Feast provides offline and online feature stores; custom extensions support unstructured OBS data and custom attributes.
(1) MetricFlow
MetricFlow translates simple metric definitions into executable SQL, supports a Python SDK for Jupyter analysis, and materializes metrics for fast access.
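The following toy renderer only illustrates the kind of metric‑to‑SQL translation MetricFlow performs; the real MetricFlow API and its YAML metric specs differ, and every name below is hypothetical.

```python
# Illustration only: a toy metric-to-SQL renderer showing the kind of
# translation MetricFlow performs. The real MetricFlow API and its
# YAML metric specs differ; every name below is hypothetical.

def render_metric_sql(metric: dict) -> str:
    dims = ", ".join(metric["dimensions"])
    return (f"SELECT {dims}, {metric['agg'].upper()}({metric['expr']}) "
            f"AS {metric['name']}\n"
            f"FROM {metric['table']}\n"
            f"GROUP BY {dims}")

gmv = {"name": "gmv", "agg": "sum", "expr": "fare_amount",
       "table": "dwd_orders", "dimensions": ["ds", "city"]}
print(render_metric_sql(gmv))
```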
(2) Dataset Semantics
Extends Metricflow to define dataset names, owners, descriptions, and query logic, handling both structured Hive/Kyuubi sources and unstructured OBS files.
(3) Feast – Feature Store
Feast offers unified offline and online feature storage, accessed through its Python/Java SDKs and HTTP serving APIs, ensuring feature consistency between training and serving with low‑latency lookups.
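The essence of the offline/online consistency Feast provides can be sketched in a few lines: features are computed once offline, materialized into a low‑latency online store keyed by entity, then looked up at inference time. This toy in‑memory version only illustrates the flow; it is not the Feast API, and all feature names are invented.

```python
# Toy in-memory sketch of offline -> online feature materialization.
# Not the Feast API; feature and entity names are illustrative.

offline_store = [  # batch feature rows, e.g. produced by a Spark job
    {"driver_id": "d1", "trips_7d": 42, "avg_rating": 4.8},
    {"driver_id": "d2", "trips_7d": 17, "avg_rating": 4.6},
]

def materialize(rows: list, entity_key: str) -> dict:
    """Load the latest offline feature values into an online key-value view."""
    return {row[entity_key]: {k: v for k, v in row.items() if k != entity_key}
            for row in rows}

online_store = materialize(offline_store, "driver_id")

def get_online_features(entity_id: str, features: list) -> dict:
    """Low-latency point lookup, as an online model service would issue."""
    row = online_store.get(entity_id, {})
    return {f: row.get(f) for f in features}
```

Because both training and serving read values derived from the same offline rows, the feature a model saw at training time matches what it sees online.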
(4) Metadata Management
Combines Metricflow and Feast to manage dataset definitions, custom attributes for video and vehicle‑network data, and Hive Metastore for structured tables.
5. Internal Architecture
Two pipelines: offline processing (Spark cleanses data, stores features in Feast, UI for consumption) and real‑time processing (Kafka streams, feature transforms, storage).
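The real‑time path above can be sketched as a fold over an event stream: consume events (in production, from Kafka), incrementally update per‑entity features, and push each snapshot to the online store. The event schema and feature names here are illustrative.

```python
# Sketch of the real-time pipeline: events in, per-entity features out.
# In production the input would be an unbounded Kafka consumer; the
# event schema and feature names are illustrative.
from collections import defaultdict

def stream_features(events, online_store: dict) -> dict:
    """Incrementally update per-driver trip counts and fare totals."""
    state = defaultdict(lambda: {"trip_count": 0, "fare_sum": 0.0})
    for ev in events:  # in production: an unbounded consumer loop
        feats = state[ev["driver_id"]]
        feats["trip_count"] += 1
        feats["fare_sum"] += ev["fare"]
        online_store[ev["driver_id"]] = dict(feats)  # push latest snapshot
    return online_store

store = stream_features(
    [{"driver_id": "d1", "fare": 12.5},
     {"driver_id": "d1", "fare": 7.5},
     {"driver_id": "d2", "fare": 20.0}],
    online_store={},
)
```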
6. Feature Platform On MDS Architecture
The platform provides a unified entry point for BI, feature, and algorithm services, leveraging Kyuubi for query routing, Dolphin Scheduler for task orchestration, and OBS/YARN (future K8s) for resource management.
05 Summary
The Modern Data Stack simplifies data management, allowing teams to focus on data value rather than infrastructure. T3 Travel’s feature platform built on a data lake demonstrates how such a stack can accelerate business development while reducing operational costs.
06 Q&A
Q1: Which team handles feature computation?
Feature engineering is performed by the algorithm team; the platform empowers them to self‑serve data without heavy reliance on the data‑warehouse team.
Q2: Is risk control a proprietary component?
Risk control typically combines custom strategies and algorithms; there is no single off‑the‑shelf component.
Q3: Core components of feature engineering?
Raw data is processed (e.g., via bagging), stored in Feast, while Hudi handles underlying data storage.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.