
Alink: An Open‑Source Machine Learning Platform on Flink – Features, Performance, and Quick‑Start Guide

This article introduces Alink, Alibaba's open‑source machine‑learning platform built on Flink, detailing its core algorithms, performance advantages over Spark ML, version evolution, Maven and PyAlink installation steps, data‑source integrations, FM algorithm support, and unified file‑system operations for both batch and streaming workloads.

DataFunSummit

Alink is a machine‑learning platform developed by Alibaba's Computing Platform Division, based on Flink's unified batch‑and‑stream processing model. It offers a rich algorithm library covering classification, clustering, regression, recommendation, and anomaly detection, with Java and Python (PyAlink) APIs for easy integration.

The platform provides 13 algorithm categories and 62 functional points, including model evaluation methods, data preprocessing tools, online learning via FTRL, and parameter tuning services. In performance tests, Alink is comparable to Spark ML on most algorithms and faster on some.
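To make the online-learning idea concrete, here is a minimal pure-Python sketch of the per-coordinate FTRL-Proximal update for sparse logistic regression. It is not Alink's implementation or API; the class name `FTRLProximal` and its parameters are hypothetical and chosen only for illustration.

```python
import math


class FTRLProximal:
    """Illustrative per-coordinate FTRL-Proximal for sparse logistic regression.

    Hypothetical helper for this article, not an Alink class.
    """

    def __init__(self, alpha=0.1, beta=1.0, l1=0.0, l2=0.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = {}  # per-feature accumulated (adjusted) gradients
        self.n = {}  # per-feature accumulated squared gradients

    def _weight(self, i):
        # Closed-form weight; L1 regularization keeps small coordinates at zero.
        z = self.z.get(i, 0.0)
        if abs(z) <= self.l1:
            return 0.0
        n = self.n.get(i, 0.0)
        sign = 1.0 if z > 0 else -1.0
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(n)) / self.alpha + self.l2)

    def predict(self, x):
        # x is a sparse sample: {feature: value}
        s = sum(self._weight(i) * v for i, v in x.items())
        s = max(min(s, 35.0), -35.0)  # clamp to avoid overflow in exp
        return 1.0 / (1.0 + math.exp(-s))

    def update(self, x, y):
        # One online step on a single labeled sample (y is 0 or 1).
        p = self.predict(x)
        for i, v in x.items():
            g = (p - y) * v  # gradient of log loss for this coordinate
            n = self.n.get(i, 0.0)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[i] = self.z.get(i, 0.0) + g - sigma * self._weight(i)
            self.n[i] = n + g * g
        return p


# Feed samples one at a time, as a streaming job would.
model = FTRLProximal(alpha=0.1, l1=0.01)
for _ in range(50):
    model.update({"a": 1.0}, 1)
    model.update({"b": 1.0}, 0)
```

Because each update touches only the features present in the current sample, the model can be refreshed sample by sample on a stream without retraining from scratch.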

Since its open‑source release at Flink Forward Asia 2019, Alink has seen multiple version updates: v1.2.0 added multi‑version Flink support, various file‑system connectors (local, HDFS, OSS), CSV/AK I/O components, and FM classification/regression algorithms; earlier releases introduced Hive connectors, enhanced UDF/UDTF capabilities, and Maven/PyPI installation options.

Quick‑start guides cover building an Alink project with Maven (four steps: create project, add Alink dependency, copy demo code, build and run) and installing PyAlink via PyPI (handling OS‑specific environment setup, version compatibility, and uninstalling older releases).

PyAlink jobs can run locally or on a cluster, and cluster address configuration was improved after v1.1.1. Integration with PyFlink allows conversion between Alink operators and Flink Table APIs, so Python scripts can be submitted directly to a Flink cluster.

Alink supports multiple data sources: batch sources (files, Hive, MySQL, in‑memory) and streaming sources (Kafka). Examples demonstrate reading/writing Kafka streams, parsing JSON to columns, and handling log strings, using components such as JsonToColumnsStreamOp.
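The JSON-to-columns step can be pictured with a small standard-library sketch. This is only a conceptual stand-in for what a component such as JsonToColumnsStreamOp does, not its API; the function `json_to_columns`, the `payload` column, and the sample messages are hypothetical.

```python
import json


def json_to_columns(rows, json_col, schema):
    """Expand a JSON string column into typed columns, one row at a time.

    rows:     iterable of dicts (stands in for a stream of messages)
    json_col: name of the column holding the JSON string
    schema:   {output column name: cast function}
    Unparseable JSON or failed casts yield None instead of raising.
    """
    for row in rows:
        try:
            obj = json.loads(row[json_col])
        except (TypeError, ValueError):
            obj = {}
        if not isinstance(obj, dict):
            obj = {}
        out = {k: v for k, v in row.items() if k != json_col}
        for name, cast in schema.items():
            raw = obj.get(name)
            try:
                out[name] = cast(raw) if raw is not None else None
            except (TypeError, ValueError):
                out[name] = None
        yield out


# Hypothetical messages, e.g. as they might arrive from a Kafka topic.
messages = [
    {"ts": 1, "payload": '{"user": "u1", "clicks": "3"}'},
    {"ts": 2, "payload": "not json"},
]
parsed = list(json_to_columns(messages, "payload", {"user": str, "clicks": int}))
```

Writing the function as a generator mirrors the streaming case: each message is parsed as it arrives, and a malformed record degrades to null columns rather than stopping the job.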

The platform includes comprehensive type‑conversion components (e.g., TripleToJsonBatchOp) covering six data formats (Triple, CSV, JSON, KV, Columns, Vector) with 30 batch and 25 streaming operators, following a consistent naming convention.
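As a rough illustration, the sketch below shows what a Triple-to-JSON conversion does conceptually and how the naming convention yields the operator counts. The helper names are hypothetical and this is not Alink code; in particular, which format is left out as a streaming source is an assumption made here only so that the count matches 25.

```python
import json

FORMATS = ["Triple", "Csv", "Json", "Kv", "Columns", "Vector"]


def triples_to_json(triples):
    """Group (row_id, column, value) triples into one JSON string per row."""
    rows = {}
    for row_id, col, val in triples:
        rows.setdefault(row_id, {})[col] = val
    return {rid: json.dumps(cols, sort_keys=True) for rid, cols in rows.items()}


def operator_names(suffix, sources):
    """Enumerate names of the form <Source>To<Target><suffix>."""
    return [f"{s}To{t}{suffix}" for s in sources for t in FORMATS if s != t]


# 6 source formats x 5 other target formats = 30 batch operators.
batch_ops = operator_names("BatchOp", FORMATS)

# Assumption: one format (Triple here) is not offered as a streaming source,
# which gives 5 x 5 = 25 streaming operators.
stream_ops = operator_names("StreamOp", [f for f in FORMATS if f != "Triple"])

json_rows = triples_to_json([
    (1, "name", "alice"),
    (1, "age", 30),
    (2, "name", "bob"),
])
```

The enumeration reproduces names that appear in the text, such as `TripleToJsonBatchOp` and `JsonToColumnsStreamOp`.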

Alink also provides a unified file‑system interface that abstracts local, HDFS, and OSS storage, offering standard methods for file creation, reading, writing, and copying across environments.
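The value of such an abstraction is easiest to see in a small sketch. The classes below are hypothetical and are not Alink's file-system API: `InMemoryFileSystem` merely stands in for a remote store such as HDFS or OSS, so the example runs without any external service.

```python
import os
import tempfile
from abc import ABC, abstractmethod


class FileSystem(ABC):
    """Hypothetical common interface; each backend implements the same methods."""

    @abstractmethod
    def exists(self, path): ...

    @abstractmethod
    def read(self, path): ...

    @abstractmethod
    def write(self, path, data): ...


class LocalFileSystem(FileSystem):
    def exists(self, path):
        return os.path.exists(path)

    def read(self, path):
        with open(path, "rb") as f:
            return f.read()

    def write(self, path, data):
        os.makedirs(os.path.dirname(path) or ".", exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)


class InMemoryFileSystem(FileSystem):
    """Stand-in for a remote backend (assumption, for illustration only)."""

    def __init__(self):
        self._files = {}

    def exists(self, path):
        return path in self._files

    def read(self, path):
        return self._files[path]

    def write(self, path, data):
        self._files[path] = bytes(data)


def copy(src_fs, src_path, dst_fs, dst_path):
    # Works for any pair of backends because both share the same interface.
    dst_fs.write(dst_path, src_fs.read(src_path))


remote, local = InMemoryFileSystem(), LocalFileSystem()
remote.write("/models/demo.csv", b"id,score\n1,0.9\n")
with tempfile.TemporaryDirectory() as tmp:
    target = os.path.join(tmp, "models", "demo.csv")
    copy(remote, "/models/demo.csv", local, target)
    copied = local.read(target)
```

Code written against the shared interface, like `copy` here, does not need to change when the storage backend does.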

For large‑scale sparse data, Alink implements a factorization‑machine (FM) algorithm with linear computational complexity, offering a practical trade‑off between model expressiveness and efficiency.
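The linear complexity comes from a standard algebraic rewrite of the pairwise interaction term: the sum over all feature pairs of ⟨v_i, v_j⟩·x_i·x_j equals ½·Σ_f[(Σ_i v_if·x_i)² − Σ_i (v_if·x_i)²], which costs O(k·nnz) for k latent factors and nnz non-zero features. The sketch below is a pure-Python illustration with made-up parameters, not Alink's FM code; it compares the rewritten form against the naive pairwise sum.

```python
def fm_predict(x, w0, w, v):
    """FM score in O(k * nnz) time.

    x: sparse sample {feature index: value}
    w: {feature index: linear weight}
    v: {feature index: list of k latent factors}
    """
    k = len(next(iter(v.values())))
    linear = sum(w[i] * xi for i, xi in x.items())
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * xi for i, xi in x.items())
        sq = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pairwise += 0.5 * (s * s - sq)
    return w0 + linear + pairwise


def fm_predict_naive(x, w0, w, v):
    """Same score via the explicit pairwise sum, O(k * nnz^2); for checking only."""
    idx = sorted(x)
    linear = sum(w[i] * x[i] for i in idx)
    pairwise = 0.0
    for a in range(len(idx)):
        for b in range(a + 1, len(idx)):
            i, j = idx[a], idx[b]
            dot = sum(vi * vj for vi, vj in zip(v[i], v[j]))
            pairwise += dot * x[i] * x[j]
    return w0 + linear + pairwise


# Made-up parameters for three non-zero features and k = 2 latent factors.
x = {0: 1.0, 3: 2.0, 7: 0.5}
w0 = 0.1
w = {0: 0.2, 3: -0.1, 7: 0.4}
v = {0: [0.1, 0.3], 3: [-0.2, 0.5], 7: [0.4, -0.1]}

fast = fm_predict(x, w0, w, v)
slow = fm_predict_naive(x, w0, w, v)
```

Only the non-zero entries of `x` are visited, which is why the approach stays practical when the feature space is very large but each sample is sparse.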

The article concludes with the open‑source repository link (https://github.com/alibaba/Alink) and acknowledgments.

machine learning, Flink, Data Processing, streaming, Alink, PyAlink
Written by

DataFunSummit

Official account of the DataFun community, dedicated to sharing big data and AI industry summit news and speaker talks, with regular downloadable resource packs.
