Alink: A Flink‑Based Machine Learning Platform – Overview, Features, and Quick‑Start Guide
This article introduces Alink, Alibaba's open‑source machine‑learning platform built on Flink, explains its core algorithms, performance comparison with Spark ML, version‑wise feature evolution, and provides practical quick‑start instructions for both Java (Maven) and Python (PyAlink) users, including data source handling, type conversion components, unified file‑system operations, and an overview of its FM algorithm implementation.
Guest: Yang Xu, Senior Algorithm Expert at Alibaba.
Editor: Zhu Rong.
Overview: Alink is a machine‑learning platform built on Flink that supports both batch and streaming processing. It offers a rich algorithm library for statistical analysis, real‑time prediction, personalized recommendation, and anomaly detection. Alink provides Java APIs and a Python API called PyAlink, which can be deployed on single‑node or cluster environments via Jupyter, Zeppelin, etc. The project was open‑sourced at the Flink Forward Asia conference in November 2019 and has been continuously iterated.
Key Topics Covered:
Basic introduction of Alink.
Quick start with Alink.
1. What is Alink?
Alink was developed by Alibaba's Computing Platform Division. Its name is formed from the letters shared by the words Alibaba, Algorithm, AI, Flink, and Blink. It provides a comprehensive algorithm library that natively supports both batch and stream processing, covering the entire ML workflow from data preprocessing and feature engineering to model training and prediction. The Java API targets engineers integrating Alink into existing systems, while PyAlink enables rapid experimentation for data scientists.
2. Alink Features
Alink includes 62 functional points across 13 categories, such as classification, clustering, regression, model evaluation, association rules, collaborative filtering, and similarity algorithms. It also offers data‑preprocessing tools, anomaly detection, text processing, online learning via FTRL, and parameter‑tuning services.
3. Performance Comparison
In benchmarks against Spark ML, Alink is faster on most algorithms and slower on a few, so the two platforms are broadly comparable overall.
4. Development History
July 2020 (v1.2.0): Multi‑version Flink support (1.11, 1.10, 1.9), multiple file systems (local, HDFS, OSS), CSV I/O, AK format I/O, model summary, FM classification/regression.
June 2020 (v1.1.2): Added 30 data‑format conversion components, Hive source support, SQL Select in Pipeline and LocalPredictor.
April 2020 (v1.1.1): Smarter parameter checking.
February 2020 (v1.1.0): Support for Flink 1.10 and 1.9, PyFlink compatibility, improved UDF/UDTF, Maven/PyPI installation, multi‑version Kafka source.
December 2019 (v1.0.1): Fixed Windows installation issues.
November 2019 (v1.0): First open‑source release at Flink Forward Asia.
5. Quick‑Start Guide
5.1 Maven Build
Create a Maven project, add the Alink dependencies to pom.xml, copy the Java demo code, then build and run.
5.2 PyAlink Installation
Install PyAlink from PyPI in a supported environment (macOS, Windows, or Alibaba Cloud). If an older version is already installed, uninstall it first.
5.3 Running PyAlink in Notebooks
PyAlink supports both local and cluster execution. Since v1.1.1, specifying the cluster address has been simplified.
5.4 Integration with PyFlink
Alink operators can be converted to PyFlink Table objects, so Alink and PyFlink steps can be combined in one workflow. The getMLEnv interface allows a Python script to be submitted directly to a Flink cluster (e.g., python kmeans.py).
6. Data Sources
Alink supports five data‑source types: four batch sources (file, Hive, MySQL, in‑memory) and one streaming source (Kafka). Example: serving a logistic‑regression model on Kafka data involves defining the Kafka source, parsing the JSON messages, loading the model, predicting, and writing the results back to Kafka.
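The five steps of the Kafka example can be illustrated with a plain‑Python stand‑in. This is only a sketch of the data flow, not Alink code: the message list replaces the Kafka source and sink, and the field names, weights, and helper functions (parse_json, predict, run_pipeline) are hypothetical.

```python
import json
import math


def parse_json(message, fields):
    """Step 2: turn one JSON message into an ordered feature list."""
    record = json.loads(message)
    return [float(record[name]) for name in fields]


def predict(features, weights, bias):
    """Step 4: score one record with an already-trained logistic-regression model."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    prob = 1.0 / (1.0 + math.exp(-z))
    return {"label": int(prob >= 0.5), "probability": prob}


def run_pipeline(messages, fields, model):
    """Steps 1-5: consume, parse, predict, and emit JSON results."""
    for message in messages:                      # 1. read from the source
        features = parse_json(message, fields)    # 2. parse JSON
        result = predict(features,                # 3-4. apply the loaded model
                         model["weights"], model["bias"])
        yield json.dumps(result)                  # 5. serialize for the sink


# Hypothetical model and messages, made up for illustration.
FIELDS = ["f0", "f1"]
MODEL = {"weights": [1.0, -2.0], "bias": 0.0}
incoming = ['{"f0": 2.0, "f1": 0.5}', '{"f0": 0.0, "f1": 1.0}']

outgoing = list(run_pipeline(incoming, FIELDS, MODEL))
for line in outgoing:
    print(line)
```

In Alink itself each step would be a stream operator linked to the next one; the sketch only shows the order in which the data is transformed.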
7. Type Conversion Components
Alink provides components to convert among the Triplet, CSV, JSON, KV, Columns, and Vector types, with 30 batch and 25 streaming conversion operators. Component names follow the pattern SourceToTargetBatchOp() or SourceToTargetStreamOp().
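The naming pattern can be sketched in a few lines of Python. The format spellings below are taken from the list above and are an assumption; the actual class names in the library may be capitalized or abbreviated differently. Six formats give 6 × 5 = 30 ordered source/target pairs, which matches the batch count quoted above. The sketch does not try to reproduce the streaming count of 25.

```python
from itertools import permutations

# Format names as listed in the text; exact spelling in Alink class names is assumed.
FORMATS = ["Triplet", "Csv", "Json", "Kv", "Columns", "Vector"]


def op_name(source, target, mode="Batch"):
    """Build an operator name following the SourceToTarget{Batch,Stream}Op pattern."""
    if mode not in ("Batch", "Stream"):
        raise ValueError("mode must be 'Batch' or 'Stream'")
    if source == target:
        raise ValueError("source and target formats must differ")
    return f"{source}To{target}{mode}Op"


# Every ordered pair of distinct formats yields one batch operator name.
batch_ops = [op_name(s, t) for s, t in permutations(FORMATS, 2)]

print(len(batch_ops))
print(op_name("Json", "Csv"))
print(op_name("Kv", "Vector", mode="Stream"))
```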
8. Unified File‑System Operations
A unified API abstracts the local, HDFS, and OSS file systems, offering consistent methods for reading, writing, copying, and streaming files across environments.
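The idea of a unified file‑system layer can be sketched as follows. This is not Alink's API: the class names (FileSystem, LocalFileSystem, InMemoryFileSystem) and the copy helper are hypothetical, and the in‑memory class merely stands in for a remote store such as HDFS or OSS. The point is that code written against the common interface does not care which backend it talks to.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical common interface shared by every backend."""

    @abstractmethod
    def read(self, path):
        """Return the file content as bytes."""

    @abstractmethod
    def write(self, path, data):
        """Store bytes at the given path."""


class LocalFileSystem(FileSystem):
    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

    def write(self, path, data):
        with open(path, "wb") as f:
            f.write(data)


class InMemoryFileSystem(FileSystem):
    """Stand-in for a remote store (e.g. HDFS or OSS) in this sketch."""

    def __init__(self):
        self._files = {}

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = bytes(data)


def copy(src_fs, src_path, dst_fs, dst_path):
    """Copy a file between any two backends through the shared interface."""
    dst_fs.write(dst_path, src_fs.read(src_path))


local = LocalFileSystem()
remote = InMemoryFileSystem()
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "iris.csv")
    local.write(src, b"5.1,3.5,1.4,0.2\n")
    copy(local, src, remote, "/data/iris.csv")
    copied = remote.read("/data/iris.csv")

print(copied)
```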
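9. FM Algorithm
Alink implements Factorization Machines (FM) for large‑scale sparse data. FM represents each feature with a low‑dimensional vector and models the interaction between two features as the dot product of their vectors. This balances model expressiveness against computational cost: the score can be computed in time linear in the number of non‑zero features (times the vector dimension), avoiding the quadratic blow‑up of a full second‑order model that learns a separate weight for every feature pair.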
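The linear‑time claim rests on a standard identity for second‑order FM: for each latent dimension, the sum over all feature pairs equals half of (the square of the sum minus the sum of the squares). The sketch below is plain Python, not Alink's implementation, and the toy parameters are made up. It compares the naive pairwise score with the linear‑time version.

```python
from itertools import combinations


def fm_predict_naive(x, w0, w, V):
    """Second-order FM score by summing over every feature pair: O(k * n^2)."""
    n = len(x)
    score = w0 + sum(w[i] * x[i] for i in range(n))
    for i, j in combinations(range(n), 2):
        dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
        score += dot * x[i] * x[j]
    return score


def fm_predict_linear(x_sparse, w0, w, V):
    """Same score via the sum-of-squares identity: O(k * number of non-zeros).

    x_sparse maps feature index -> non-zero value.
    """
    k = len(V[0])
    score = w0 + sum(w[i] * xi for i, xi in x_sparse.items())
    for f in range(k):
        s = 0.0     # sum over i of V[i][f] * x_i
        s_sq = 0.0  # sum over i of (V[i][f] * x_i)^2
        for i, xi in x_sparse.items():
            t = V[i][f] * xi
            s += t
            s_sq += t * t
        score += 0.5 * (s * s - s_sq)
    return score


# Toy parameters (illustrative only): 4 features, k = 2 latent factors.
w0 = 0.1
w = [0.2, -0.1, 0.0, 0.3]
V = [[0.1, 0.2], [0.3, -0.1], [0.0, 0.5], [-0.2, 0.4]]
x_dense = [1.0, 0.0, 2.0, 0.0]
x_sparse = {i: xi for i, xi in enumerate(x_dense) if xi != 0.0}

naive = fm_predict_naive(x_dense, w0, w, V)
fast = fm_predict_linear(x_sparse, w0, w, V)
print(naive, fast)
```

Both functions return the same score up to floating‑point rounding; only the second one scales to inputs with millions of mostly zero features.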
Open‑Source Repository https://github.com/alibaba/Alink
DataFunTalk
Dedicated to sharing and discussing big data and AI technology applications, aiming to empower a million data scientists. Regularly hosts live tech talks and curates articles on big data, recommendation/search algorithms, advertising algorithms, NLP, intelligent risk control, autonomous driving, and machine learning/deep learning.