Using Fluid Cloud‑Native Data Caching to Boost Performance and Elasticity of a Quantitative Research Platform on Alibaba Cloud
This article describes how JoinQuant built a cloud‑native quantitative research platform on Alibaba Cloud, identified performance, cost, data‑management, and security challenges, and addressed them with Fluid’s JindoRuntime data caching, elastic scaling, and Python‑driven workflows, cutting average data‑access time from roughly 15 minutes to 38.5 seconds and compute cost to about one tenth.
Background
Quantitative investment relies on data‑driven decisions, and JoinQuant uses large‑scale market data, AI, and automated trading. Their research workflow includes factor mining, return prediction, portfolio optimization, and back‑testing, all of which are data‑intensive tasks running on Alibaba Cloud services such as ECS, ECI, ACK, NAS, and OSS.
Challenges
The platform faced performance bottlenecks, high and variable bandwidth costs, complex data management across NAS and OSS, data‑security isolation, a steep learning curve for Kubernetes/YAML, and the need for dynamic data‑source mounting without restarting Jupyter notebooks.
Solution Overview
JoinQuant discovered that the native Kubernetes CSI could not meet multi‑source acceleration needs, so they adopted the CNCF Fluid project. Fluid provides a unified way to manage and accelerate multiple Persistent Volume Claims (PVCs) from OSS and NAS, with JindoRuntime offering the best performance and stability.
Key capabilities include:
Per‑data‑type storage policies (read‑only for training data, read‑write for feature data and checkpoints) via Fluid Datasets.
Elastic scaling of cache workers on high‑IO, large‑memory ECS/ECI instances, including Spot instances for cost savings.
Scheduled cache scaling (CronHorizontalPodAutoscaler) to match workload tides.
Data pre‑warming and metadata synchronization to keep caches up‑to‑date.
Namespace‑based isolation for secure multi‑team data access while allowing shared public datasets.
Python SDK for end‑to‑end dataset creation, runtime binding, cache scaling, and pre‑loading, eliminating the need for YAML.
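For the scheduled scaling item above, a CronHorizontalPodAutoscaler from ACK's kubernetes-cronhpa-controller can target the cache runtime, since Fluid runtimes expose the Kubernetes scale subresource. A hedged sketch follows; the resource names, schedules, and replica counts are illustrative, not taken from JoinQuant's actual setup:

```yaml
apiVersion: autoscaling.alibabacloud.com/v1beta1
kind: CronHorizontalPodAutoscaler
metadata:
  name: cache-cronhpa          # illustrative name
spec:
  scaleTargetRef:
    apiVersion: data.fluid.io/v1alpha1
    kind: JindoRuntime
    name: training-data        # the runtime backing the Dataset
  jobs:
    - name: scale-up-workday-morning
      schedule: "0 0 8 ? * 1-5"    # 08:00 on weekdays (illustrative)
      targetSize: 4                # grow the cache before peak research hours
    - name: scale-down-after-hours
      schedule: "0 0 20 ? * 1-5"   # 20:00 on weekdays (illustrative)
      targetSize: 1                # shrink the cache overnight to save cost
```

Scaling the worker count up before the daily workload tide and back down afterwards is what lets cache bandwidth track demand without paying for idle capacity.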
Implementation Details
Example Fluid Dataset definitions (YAML):

```yaml
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: training-data
spec:
  mounts:
    - mountPoint: "pvc://nas/training-data"
      path: "/training-data"
  accessModes:
    - ReadOnlyMany
---
apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: checkpoint
spec:
  mounts:
    - mountPoint: "pvc://nas/checkpoint"
      path: "/checkpoints"
  accessModes:
    - ReadWriteMany
```

Python example:
```python
import fluid
from fluid import constants

# Connect to the cluster
client_config = fluid.ClientConfig()
fluid_client = fluid.FluidClient(client_config)

# Create a read-only Dataset backed by an existing NAS PVC
fluid_client.create_dataset(dataset_name="mydata", mount_name="/",
                            mount_point="pvc://static-pvc-nas/mydata")

# Fetch the Dataset object, then bind JindoRuntime, scale the cache, and pre-load
dataset = fluid_client.get_dataset(dataset_name="mydata")
dataflow = dataset.bind_runtime(runtime_type=constants.JINDO_RUNTIME_KIND,
                                replicas=1, cache_capacity_GiB=30,
                                cache_medium="MEM", wait=True) \
                  .scale_cache(replicas=2) \
                  .preload(target_path="/train")

# Run the dataflow and wait for completion
run = dataflow.run()
run.wait()
```

Performance Evaluation
In tests with up to 100 concurrent Pods, Fluid reduced average data‑access time from about 15 minutes to 38.5 seconds and cut compute cost to roughly one tenth, because aggregate cache bandwidth grows with the number of JindoRuntime worker replicas.
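The reported figures imply roughly a 23× reduction in average access time; a quick sanity check on the numbers quoted above:

```python
# Figures reported in the evaluation above
baseline_seconds = 15 * 60   # ~15 minutes average access time without Fluid
fluid_seconds = 38.5         # average access time with Fluid caching

speedup = baseline_seconds / fluid_seconds
print(f"speedup: {speedup:.1f}x")  # → speedup: 23.4x
```

Because Pods read mostly from the distributed cache rather than from NAS/OSS directly, this ratio improves further as more JindoRuntime workers are added.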
Summary and Outlook
Fluid provides elastic, high‑performance data caching that integrates with Kubernetes scaling, enabling flexible, cost‑effective quantitative research. Future work includes tighter coupling of task and cache elasticity, and improving Dataflow data‑affinity for better node locality.