End-to-End Machine Learning Application Using OpenMLDB and Alibaba Cloud MaxCompute
This article demonstrates how to build a complete end-to-end machine learning workflow for taxi trip duration prediction by integrating OpenMLDB with Alibaba Cloud MaxCompute's serverless services. It covers environment setup, offline data ingestion, offline feature extraction, model training, deployment, and real-time online inference that completes within 20 ms.
OpenMLDB is an open‑source machine learning database that provides a full‑stack feature store, low‑barrier SQL development experience, and online/offline feature computation capabilities.
MaxCompute is Alibaba Cloud’s serverless, fully managed data‑warehouse service that enables large‑scale data processing and integrates with other cloud products such as DataWorks, PAI, and Quick BI.
The integration of OpenMLDB with MaxCompute (released in v0.7.0) allows developers to build end‑to‑end AI applications entirely on the cloud, from data ingestion to real‑time inference.
Workflow steps:
Environment preparation: create an Alibaba Cloud account and download the MaxCompute and OpenMLDB packages.
Offline data ingestion: upload the CSV sample data to OSS, then use DataWorks to convert it into a MaxCompute table.

```shell
./ossutil mkdir oss://tobe-bucket/openmldb_maxcompute_demo2/
./ossutil cp ./taxi-trip/data/taxi_tour_table_train_simple.csv oss://tobe-bucket/openmldb_maxcompute_demo2/
```
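The article does not reproduce the DDL of the table that DataWorks creates from the CSV. The fragment below is a hypothetical sketch of what such a MaxCompute table could look like, listing only the columns that the feature-extraction SQL later references; the real taxi-trip dataset has additional columns, and the exact types are assumptions.

```sql
-- Hypothetical MaxCompute DDL for the imported taxi-trip sample.
-- Only columns used by the feature SQL are shown; types are assumed.
CREATE TABLE IF NOT EXISTS taxi_tour_table_train_simple (
    vendor_id       STRING,
    pickup_datetime DATETIME,
    passenger_count BIGINT,
    pickup_latitude DOUBLE,
    trip_duration   BIGINT
);
```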
Feature extraction (offline): write OpenMLDB SQL and execute it with the OpenMLDB offline engine on Spark. Example Spark session configuration:

```scala
val spark = SparkSession.builder()
  .appName("OpenmldbExportMaxcomputeTable")
  .config("spark.sql.defaultCatalog", "odps")
  .config("spark.sql.catalog.odps",
    "org.apache.spark.sql.execution.datasources.v2.odps.OdpsTableCatalog")
  .config("spark.sql.sources.partitionOverwriteMode", "dynamic")
  .config("spark.sql.extensions",
    "org.apache.spark.sql.execution.datasources.v2.odps.extension.OdpsExtensions")
  .config("spark.sql.catalogImplementation", "hive")
  .getOrCreate()

val sess = new OpenmldbSession(spark)

val sql = """
  SELECT trip_duration, passenger_count,
    sum(pickup_latitude) OVER w AS vendor_sum_pl,
    max(pickup_latitude) OVER w AS vendor_max_pl,
    min(pickup_latitude) OVER w AS vendor_min_pl,
    avg(pickup_latitude) OVER w AS vendor_avg_pl,
    sum(pickup_latitude) OVER w2 AS pc_sum_pl,
    max(pickup_latitude) OVER w2 AS pc_max_pl,
    min(pickup_latitude) OVER w2 AS pc_min_pl,
    avg(pickup_latitude) OVER w2 AS pc_avg_pl,
    count(vendor_id) OVER w2 AS pc_cnt,
    count(vendor_id) OVER w AS vendor_cnt
  FROM t1
  WINDOW w AS (PARTITION BY vendor_id ORDER BY pickup_datetime
               ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW),
         w2 AS (PARTITION BY passenger_count ORDER BY pickup_datetime
                ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW)
""".stripMargin

val outputDf = sess.sql(sql).getSparkDf()
```
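To make the window semantics concrete: `ROWS_RANGE BETWEEN 1d PRECEDING AND CURRENT ROW` aggregates, for each row, over all rows in the same partition whose timestamp falls within the preceding one day, including the current row. A minimal pure-Python sketch of the `PARTITION BY vendor_id` window follows; the function and field names are illustrative, not part of OpenMLDB.

```python
from datetime import datetime, timedelta
from collections import defaultdict

DAY = timedelta(days=1)

def vendor_window_features(rows):
    """Mirror the vendor_id window: for each row, aggregate pickup_latitude
    over same-vendor rows whose pickup_datetime lies within the preceding
    1 day, inclusive of the current row."""
    rows = sorted(rows, key=lambda r: r["pickup_datetime"])
    history = defaultdict(list)  # vendor_id -> [(timestamp, pickup_latitude)]
    out = []
    for r in rows:
        ts = r["pickup_datetime"]
        hist = history[r["vendor_id"]]
        # Evict entries that fell out of the 1-day window.
        hist[:] = [(t, v) for t, v in hist if ts - t <= DAY]
        hist.append((ts, r["pickup_latitude"]))
        vals = [v for _, v in hist]
        out.append({
            "vendor_sum_pl": sum(vals),
            "vendor_max_pl": max(vals),
            "vendor_min_pl": min(vals),
            "vendor_avg_pl": sum(vals) / len(vals),
            "vendor_cnt": len(vals),
        })
    return out

rows = [
    {"vendor_id": "v1", "pickup_datetime": datetime(2016, 1, 1, 10), "pickup_latitude": 40.7},
    {"vendor_id": "v1", "pickup_datetime": datetime(2016, 1, 1, 12), "pickup_latitude": 40.8},
    {"vendor_id": "v1", "pickup_datetime": datetime(2016, 1, 3, 12), "pickup_latitude": 40.6},
]
feats = vendor_window_features(rows)
```

Here the second row sees both same-day trips (`vendor_cnt` of 2), while the third row, two days later, sees only itself. The `passenger_count` window (`w2`) works identically with a different partition key.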
Offline model training: export the feature table from MaxCompute, then train a model with LightGBM (or TensorFlow/PyTorch). Example command:

```shell
python3 train.py /tmp/feature_data /tmp/model.txt
```
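The article does not show `train.py` itself. The sketch below is a hypothetical structure for it, assuming the exported feature data is a CSV whose `trip_duration` column is the label; the helper names are illustrative, and the LightGBM calls follow the library's standard `Dataset`/`train` API.

```python
import csv
import sys

LABEL = "trip_duration"

def load_features(path):
    """Split a feature CSV into (feature_rows, labels), parsing values as float."""
    with open(path, newline="") as f:
        reader = csv.DictReader(f)
        feats, labels = [], []
        for row in reader:
            labels.append(float(row.pop(LABEL)))
            feats.append([float(v) for v in row.values()])
    return feats, labels

def train(feature_path, model_path):
    # Imported lazily so load_features stays usable without LightGBM installed.
    import lightgbm as lgb
    feats, labels = load_features(feature_path)
    dataset = lgb.Dataset(feats, label=labels)
    booster = lgb.train({"objective": "regression"}, dataset, num_boost_round=100)
    booster.save_model(model_path)  # produces the /tmp/model.txt used later

if __name__ == "__main__":
    train(sys.argv[1], sys.argv[2])
```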
Deploy the feature-extraction SQL to the online engine: start an OpenMLDB cluster, create the target table, and use the DEPLOY statement.

```shell
./sbin/deploy-all.sh
./sbin/start-all.sh
./openmldb_cli.sh
```

```sql
show components;
CREATE DATABASE demo_db;
USE demo_db;
CREATE TABLE t1(...);
SET @@execute_mode='online';
DEPLOY demo SELECT trip_duration, passenger_count,
  sum(pickup_latitude) OVER w AS vendor_sum_pl,
  ...
FROM t1
WINDOW w AS (...),
       w2 AS (...);
```
Online prediction: load a small sample into the online table, start the prediction service, and send a request.

```sql
> USE demo_db;
> SET @@execute_mode='online';
> LOAD DATA INFILE 'file:///work/taxi-trip/data/taxi_tour_table_train_simple.csv'
  INTO TABLE t1 options(format='csv', header=true, mode='append');
```

```shell
./start_predict_server.sh 127.0.0.1:9080 /tmp/model.txt
python3 predict.py
```
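`predict.py` is likewise not reproduced in the article. A minimal hypothetical sketch is shown below, assuming the prediction server accepts a JSON POST at `/predict` on `127.0.0.1:9080`; the endpoint path, payload shape, and field values are assumptions for illustration, not the demo's documented API.

```python
import json
from urllib import request

PREDICT_URL = "http://127.0.0.1:9080/predict"  # assumed endpoint

def build_request(row):
    """Serialize one raw input row as a JSON POST for the predict server."""
    body = json.dumps({"input": [row]}).encode("utf-8")
    return request.Request(
        PREDICT_URL,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def predict(row):
    # The server extracts features via the deployed SQL, then scores the model.
    with request.urlopen(build_request(row)) as resp:
        return json.loads(resp.read())

if __name__ == "__main__":
    sample = {
        "vendor_id": "1",
        "pickup_datetime": 1467302350000,  # millisecond timestamp
        "passenger_count": 1,
        "pickup_latitude": 40.767937,
    }
    print(predict(sample))
```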
The whole pipeline achieves feature-driven inference in less than 20 ms, demonstrating a practical end-to-end AI solution on Alibaba Cloud.
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.