End-to-End Machine Learning Application Using OpenMLDB and Alibaba Cloud MaxCompute
This article demonstrates how to build a complete end-to-end machine learning workflow for taxi trip duration prediction by integrating OpenMLDB with Alibaba Cloud MaxCompute's serverless services. It covers environment setup, offline data ingestion, offline feature extraction, model training, deployment, and real-time online inference that completes within 20 ms.
OpenMLDB is an open‑source machine learning database that provides a full‑stack feature store, low‑barrier SQL development experience, and online/offline feature computation capabilities.
MaxCompute is Alibaba Cloud’s serverless, fully managed data‑warehouse service that enables large‑scale data processing and integrates with other cloud products such as DataWorks, PAI, and Quick BI.
The integration of OpenMLDB with MaxCompute (released in v0.7.0) allows developers to build end‑to‑end AI applications entirely on the cloud, from data ingestion to real‑time inference.
Workflow steps:
Environment preparation: create an Alibaba Cloud account and download the MaxCompute and OpenMLDB packages.
Offline data ingestion: upload the CSV sample data to OSS, then use DataWorks to convert it into a MaxCompute table.

```shell
./ossutil mkdir oss://tobe-bucket/openmldb_maxcompute_demo2/
./ossutil cp ./taxi-trip/data/taxi_tour_table_train_simple.csv oss://tobe-bucket/openmldb_maxcompute_demo2/
```
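The article does not reproduce the DDL of the table that DataWorks creates from the CSV. The fragment below is a hypothetical sketch of what such a MaxCompute table could look like, listing only the columns that the feature-extraction SQL later references; the real taxi-trip dataset has additional columns, and the exact types are assumptions.

```sql
-- Hypothetical MaxCompute DDL for the imported taxi-trip sample.
-- Only columns used by the feature SQL are shown; types are assumed.
CREATE TABLE IF NOT EXISTS taxi_tour_table_train_simple (
    vendor_id       STRING,
    pickup_datetime DATETIME,
    passenger_count BIGINT,
    pickup_latitude DOUBLE,
    trip_duration   BIGINT
);
```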
Feature extraction (offline): write OpenMLDB SQL and execute it with the OpenMLDB offline engine on Spark. Example Spark session configuration:

```scala
val spark = SparkSession.builder()
  .appName("OpenmldbExportMaxcomputeTable")
  .config("spark.sql.defaultCatalog", "odps")
  .config("spark.sql.catalog.odps",
    "org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .config("spark.sql.extensions",
    "org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions")
  .config("spark.sql.catalogImplementation", "hive")
  .getOrCreate()

val sess = new OpenmldbSession(spark)

val sql = """
  SELECT trip_duration, passenger_count,
    sum(pickup_latitude) OVER w AS vendor_sum_pl,
    max(pickup_latitude) OVER w AS vendor_max_pl,
    min(pickup_latitude) OVER w AS vendor_min_pl,
    avg(pickup_latitude) OVER w AS vendor_avg_pl,
    sum(pickup_latitude) OVER w2 AS pc_sum_pl,
    max(pickup_latitude) OVER w2 AS pc_max_pl,
    min(pickup_latitude) OVER w2 AS pc_min_pl,
    avg(pickup_latitude) OVER w2 AS pc_avg_pl,
    count(vendor_id) OVER w2 AS pc_cnt,
    count(vendor_id) OVER w AS vendor_cnt
  FROM t1
  WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime
               ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
         w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime
                ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)
""".stripMargin

val outputDf = sess.sql(sql).getSparkDf()
```
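To make the window semantics concrete: `ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW` aggregates, for each row, over all rows in the same partition whose timestamp falls within the preceding one day, including the current row. A minimal pure-Python sketch of the `PARTITION BY vendor_id` window follows; the function and field names are illustrative, not part of OpenMLDB.

```python
from datetime import datetime, timedelta
from collections import defaultdict

DAY = timedelta(days=1)

def vendor_window_features(rows):
    """Mirror the vendor_id window: for each row, aggregate pickup_latitude
    over same-vendor rows whose pickup_datetime lies within the preceding
    1 day, inclusive of the current row."""
    rows = sorted(rows, key=lambda r: r["pickup_datetime"])
    history = defaultdict(list)  # vendor_id -> [(timestamp, pickup_latitude)]
    out = []
    for r in rows:
        ts = r["pickup_datetime"]
        hist = history[r["vendor_id"]]
        # Evict entries that fell out of the 1-day window.
        hist[:] = [(t, v) for t, v in hist if ts - t <= DAY]
        hist.append((ts, r["pickup_latitude"]))
        vals = [v for _, v in hist]
        out.append({
            "vendor_sum_pl": sum(vals),
            "vendor_max_pl": max(vals),
            "vendor_min_pl": min(vals),
            "vendor_avg_pl": sum(vals) / len(vals),
            "vendor_cnt": len(vals),
        })
    return out

rows = [
    {"vendor_id": "v1", "pickup_datetime": datetime(2016, 1, 1, 10), "pickup_latitude": 40.7},
    {"vendor_id": "v1", "pickup_datetime": datetime(2016, 1, 1, 12), "pickup_latitude": 40.8},
    {"vendor_id": "v1", "pickup_datetime": datetime(2016, 1, 3, 12), "pickup_latitude": 40.6},
]
feats = vendor_window_features(rows)
```

Here the second row sees both same-day trips (`vendor_cnt` of 2), while the third row, two days later, sees only itself. The `passenger_count` window (`w2`) works identically with a different partition key.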
Offline model training: export the feature table from MaxCompute, then train a model with LightGBM (or TensorFlow/PyTorch). Example command:

```shell
python3 train.py /tmp/feature_data /tmp/model.txt
```
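The article does not show `train.py` itself. The sketch below is a hypothetical structure for it, assuming the exported feature data is a CSV whose `trip_duration` column is the label; the helper names are illustrative, and the LightGBM calls follow the library's standard `Dataset`/`train` API.

```python
import csv
import sys

LABEL = "trip_duration"

def load_features(path):
    """Split a feature CSV into (feature_rows, labels), parsing values as float."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        feats, labels = [], []
        for row in reader:
            labels.append(float(row.pop(LABEL)))
            feats.append([float(v) for v in row.values()])
    return feats, labels

def train(feature_path, model_path):
    # Imported lazily so load_features stays usable without LightGBM installed.
    import lightgbm as lgb
    feats, labels = load_features(feature_path)
    dataset = lgb.Dataset(feats, label=labels)
    booster = lgb.train({"objective": "regression"}, dataset, num_boost_round=100)
    booster.save_model(model_path)  # produces the /tmp/model.txt used later

if __name__ == "__main__":
    train(sys.argv[1], sys.argv[2])
```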
Deploy the feature-extraction SQL to the online engine: start an OpenMLDB cluster, create the target table, and use the DEPLOY statement.

```shell
./sbin/deploy-all.sh
./sbin/start-all.sh
./openmldb_cli.sh
```

```sql
show components;
CREATE DATABASE demo_db;
USE demo_db;
CREATE TABLE t1(...);
SET @@execute_mode='online';
DEPLOY demo SELECT trip_duration, passenger_count,
  sum(pickup_latitude) OVER w AS vendor_sum_pl,
  ...
FROM t1
WINDOW w AS (...),
       w2 AS (...);
```
Online prediction: load a small sample into the online table, start the prediction service, and send a request.

```sql
> USE demo_db;
> SET @@execute_mode='online';
> LOAD DATA INFILE 'file:///work/taxi-trip/data/taxi_tour_table_train_simple.csv'
  INTO TABLE t1 options(format='csv', header=true, mode='append');
```

```shell
./start_predict_server.sh 127.0.0.1:9080 /tmp/model.txt
python3 predict.py
```
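`predict.py` is likewise not reproduced in the article. A minimal hypothetical sketch is shown below, assuming the prediction server accepts a JSON POST at `/predict` on `127.0.0.1:9080`; the endpoint path, payload shape, and field values are assumptions for illustration, not the demo's documented API.

```python
import json
from urllib import request

PREDICT_URL = "http://127.0.0.1:9080/predict"  # assumed endpoint

def build_request(row):
    """Serialize one raw input row as a JSON POST for the predict server."""
    body = json.dumps({"input": [row]}).encode("utf-8")
    return request.Request(
        PREDICT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(row):
    # The server extracts features via the deployed SQL, then scores the model.
    with request.urlopen(build_request(row)) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    sample = {
        "vendor_id": "1",
        "pickup_datetime": 1467302350000,  # millisecond timestamp
        "passenger_count": 1,
        "pickup_latitude": 40.767937,
    }
    print(predict(sample))
```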
The whole pipeline achieves feature-driven inference in less than 20 ms, demonstrating a practical end-to-end AI solution on Alibaba Cloud.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.