Improving OSS Small‑File Access Performance with StrmVol Storage Volumes in Kubernetes
StrmVol storage volumes replace the FUSE-based OSS mount with a virtual block device and a kernel-mode file system, dramatically reducing latency for massive small-file reads in Kubernetes workloads such as AI training datasets. This article demonstrates setup, configuration, and performance testing using Argo Workflows.
Object Storage Service (OSS) is widely used for massive unstructured data, but accessing millions of small files through the traditional FUSE‑based CSI driver incurs high latency due to frequent user‑kernel context switches and metadata overhead.
The StrmVol storage volume, supported by Alibaba Cloud Container Service (ACK), eliminates the FUSE middle‑layer by exposing a virtual block device backed by a kernel‑mode file system such as EROFS, thereby shortening the data path and accelerating read performance for read‑only, small‑file workloads like AI training sets and time‑series log analysis.
Core mechanisms and optimizations
Fast index construction: only file metadata (name, path, size) is synchronized, reducing initialization time.
Memory prefetch: data blocks are prefetched concurrently based on the index, lowering I/O wait.
Kernel-mode file system: reads are served directly from memory, avoiding user-space FUSE overhead; EROFS provides compression and efficient access.
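To make the first two mechanisms concrete, here is a minimal Python sketch (purely illustrative, not the StrmVol implementation): build a lightweight index holding only file metadata, then concurrently load the indexed files into a memory cache so later reads avoid per-file I/O.

```python
# Illustrative sketch of index-then-prefetch (not StrmVol's actual code).
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def build_index(root):
    """Fast index construction: record only name, path, and size."""
    index = []
    for name in sorted(os.listdir(root)):
        path = os.path.join(root, name)
        index.append({"name": name, "path": path, "size": os.path.getsize(path)})
    return index

def prefetch(index, workers=4):
    """Concurrently load every file named in the index into a memory cache."""
    cache = {}
    def load(entry):
        with open(entry["path"], "rb") as f:
            cache[entry["name"]] = f.read()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(load, index))
    return cache

if __name__ == "__main__":
    # Simulate a shard of small files, index it, and prefetch it.
    root = tempfile.mkdtemp()
    for i in range(8):
        with open(os.path.join(root, f"img_{i}.jpeg"), "wb") as f:
            f.write(b"x" * 128)
    cache = prefetch(build_index(root))
    print(len(cache))  # 8 files now served from memory
```

In the real system the index and cache live below the virtual block device, so applications see an ordinary read-only file system rather than an explicit cache.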
Applicable scenarios
Read‑only workloads with massive small files (e.g., AI training image sets).
Data stored in OSS that does not require frequent updates.
Random‑read patterns where low latency is critical.
To use StrmVol, deploy the strmvol-csi-driver component from the ACK marketplace. After installation, define a PersistentVolume (PV) and PersistentVolumeClaim (PVC) similar to standard OSS volumes:
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-strmvol
spec:
  capacity:
    # Up to 16 TiB can be stored under the OSS mount point.
    storage: 20Gi
  accessModes:
    - ReadOnlyMany
  persistentVolumeReclaimPolicy: Retain
  csi:
    driver: strmvol.csi.alibabacloud.com
    volumeHandle: pv-strmvol
    nodeStageSecretRef:
      name: strmvol-secret
      namespace: default
    volumeAttributes:
      bucket: imagenet
      path: /data
      url: oss-cn-hangzhou-internal.aliyuncs.com
      directMode: "false"
      resourceLimit: "4c8g"
---
kind: PersistentVolumeClaim
apiVersion: v1
metadata:
  name: pvc-strmvol
  namespace: default
spec:
  accessModes:
    - ReadOnlyMany
  resources:
    requests:
      storage: 20Gi
  volumeName: pv-strmvol
The directMode flag controls whether prefetch and local caching are disabled (useful for pure random-read scenarios). resourceLimit defines the maximum CPU and memory the virtual block device may consume on the node (e.g., "4c8g" = 4 vCPU, 8 GiB RAM).
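The PV above references a `nodeStageSecretRef` named strmvol-secret, which must hold the OSS access credentials. A minimal sketch of that Secret, assuming the key names (`akId`, `akSecret`) follow the convention of Alibaba Cloud's other OSS CSI plugins — check the strmvol-csi-driver documentation for the exact fields it expects:

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: strmvol-secret
  namespace: default
stringData:
  # Key names assumed here; replace the placeholders with a RAM user's
  # credentials that have read access to the target bucket.
  akId: <your-access-key-id>
  akSecret: <your-access-key-secret>
```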
Performance testing uses an Argo Workflow that simulates distributed image-set loading. The workflow consists of three stages: listing shard directories, processing each shard in parallel with GNU parallel, and aggregating the timing results.
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: distributed-imagenet-training-
spec:
  entrypoint: main
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: "node-type"
                operator: In
                values:
                  - "argo"
  volumes:
    - name: pvc-volume
      persistentVolumeClaim:
        claimName: pvc-strmvol
  templates:
    - name: main
      steps:
        - - name: list-shards
            template: list-imagenet-shards
        - - name: parallel-processing
            template: process-shard
            arguments:
              parameters:
                - name: paths
                  value: "{{item}}"
            withParam: "{{steps.list-shards.outputs.result}}"
        - - name: calculate-statistics
            template: calculate-averages
    - name: list-imagenet-shards
      script:
        image: mirrors-ssl.aliyuncs.com/python:latest
        command: [python]
        source: |
          import subprocess, json
          output = subprocess.check_output("ls /mnt/data", shell=True, text=True)
          files = [f for f in output.split('\n') if f]
          print(json.dumps(files, indent=2))
        volumeMounts:
          - name: pvc-volume
            mountPath: /mnt/data
    - name: process-shard
      inputs:
        parameters:
          - name: paths
      container:
        image: alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/alinux3:latest
        command: [/bin/bash, -c]
        args:
          - |
            yum install -y parallel
            SHARD_JSON="/mnt/data/{{inputs.parameters.paths}}"
            START_TIME=$(date +%s)
            find "$SHARD_JSON" -maxdepth 1 -name "*.JPEG" -print0 | parallel -0 -j4 'cp {} /dev/null'
            END_TIME=$(date +%s)
            ELAPSED=$((END_TIME - START_TIME))
            mkdir -p /tmp/output
            echo $ELAPSED > /tmp/output/time_shard_{{inputs.parameters.paths}}.txt
        resources:
          requests:
            memory: "4Gi"
            cpu: "1000m"
          limits:
            memory: "4Gi"
            cpu: "2000m"
        volumeMounts:
          - name: pvc-volume
            mountPath: /mnt/data
      outputs:
        artifacts:
          - name: time_shard
            path: /tmp/output/time_shard_{{inputs.parameters.paths}}.txt
            oss:
              key: results/results-{{workflow.creationTimestamp}}/time_shard_{{inputs.parameters.paths}}.txt
            archive: {}
    - name: calculate-averages
      inputs:
        artifacts:
          - name: results
            path: /tmp/output
            oss:
              key: "results/results-{{workflow.creationTimestamp}}"
      container:
        image: registry-vpc.cn-beijing.aliyuncs.com/acs/busybox:1.33.1
        command: [sh, -c]
        args:
          - |
            echo "Merging results..."
            TOTAL_TIME=0
            SHARD_COUNT=0
            for time_file in /tmp/output/time_shard_*.txt; do
              TIME=$(cat $time_file)
              SHARD_ID=${time_file##*_}
              SHARD_ID=${SHARD_ID%.txt}
              echo "Shard $SHARD_ID: $TIME s"
              TOTAL_TIME=$((TOTAL_TIME + TIME))
              SHARD_COUNT=$((SHARD_COUNT + 1))
            done
            if [ $SHARD_COUNT -gt 0 ]; then
              AVERAGE=$((TOTAL_TIME / SHARD_COUNT))
              echo "--------------------------------"
              echo "Total shards: $SHARD_COUNT"
              echo "Total processing time: $TOTAL_TIME s"
              echo "Average processing time: $AVERAGE s/shard"
              echo "Average: $AVERAGE seconds" > /tmp/output/time_stats.txt
            else
              echo "Error: no shard timing data found"
              exit 1
            fi
      outputs:
        artifacts:
          - name: test-file
            path: /tmp/output/time_stats.txt
            oss:
              key: results/results-{{workflow.creationTimestamp}}/time_stats.txt
            archive: {}
The workflow completed in about 21 seconds per shard, for an average of 21 seconds across the four ImageNet sub-directories.
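The aggregation step above can be sanity-checked outside the cluster before embedding it in the workflow. A self-contained local run of the same POSIX arithmetic, using synthetic per-shard timing files (the directory and values here are invented for illustration):

```shell
#!/bin/sh
# Create synthetic per-shard timing files like those the workflow emits,
# then run the same aggregation logic locally.
OUT=$(mktemp -d)
echo 20 > "$OUT/time_shard_a.txt"
echo 22 > "$OUT/time_shard_b.txt"

TOTAL_TIME=0
SHARD_COUNT=0
for time_file in "$OUT"/time_shard_*.txt; do
  TIME=$(cat "$time_file")
  TOTAL_TIME=$((TOTAL_TIME + TIME))
  SHARD_COUNT=$((SHARD_COUNT + 1))
done

# Integer average, exactly as in the busybox container.
AVERAGE=$((TOTAL_TIME / SHARD_COUNT))
echo "Average: $AVERAGE seconds"   # prints "Average: 21 seconds"
```

Note the average is integer division, matching the workflow's busybox script; fractional seconds are truncated.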
Alibaba Cloud also provides an open‑source implementation based on the containerd/overlaybd project, which can be combined with OCI image volumes for read‑only data mounts; see the KubeCon Europe 2025 talk for details.
In summary, StrmVol offers a lightweight, kernel-direct storage solution that dramatically improves read latency for massive small-file, read-only workloads on OSS, with simple CSI deployment, configurable resource limits, and performance gains demonstrated via Argo Workflows.
Alibaba Cloud Infrastructure