Taming High Cardinality in AI & Autonomous Driving with Prometheus
This article shares practical experience from Volcengine's managed Prometheus service (VMP) and its deep integration with large‑model and autonomous‑driving platforms, explaining what high cardinality is, its impact on monitoring systems, its root causes, and a range of design, collection, and analysis techniques to mitigate it.
What Is Cardinality
Cardinality describes the number of distinct label combinations that a metric can have. For example, the metric
http_request_total{cluster="cluster1",service="service1",endpoint="endpoint1",method="GET",resp_code="200"} can explode in combinations when many clusters, services, endpoints, methods, and response codes exist.
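The series count grows multiplicatively with the size of each label's value set. A quick back‑of‑the‑envelope calculation makes this concrete (the per‑label counts below are illustrative assumptions, not figures from this article):

```python
# Illustrative counts of distinct values per label; real deployments vary.
label_values = {
    "cluster": 10,
    "service": 50,
    "endpoint": 20,
    "method": 4,
    "resp_code": 5,
}

# Worst-case series count is the product of the per-label value counts.
series = 1
for count in label_values.values():
    series *= count

print(series)  # 10 * 50 * 20 * 4 * 5 = 200000 potential series
```

Adding just one more label with 100 distinct values (say, a pod name) would multiply this by 100, which is why a single careless label can overwhelm a TSDB.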
Impact of High Cardinality
Increased monitoring costs: CPU, memory, and storage usage rise because each unique series creates its own index entries and caches.
Slower read/write latency: index creation delays writes, and larger result sets slow reads.
Quota exhaustion: multi‑tenant systems may hit per‑tenant metric quotas faster.
Common Causes
Large numbers of Prometheus targets.
Each target exposing many time series.
High churn labels such as user IDs or URLs.
High Cardinality in Large‑Model and Autonomous‑Driving Domains
In these domains, the most frequent high‑cardinality label is the pod name, because training jobs create and destroy thousands of pods in quick succession. Additionally, large‑model platforms expose many “access points” that behave like microservices, further inflating series counts.
Typical Solutions
Solutions span metric design, collection, and analysis.
Metric Design
Reasonable modeling: use metrics for high‑level state and rely on logs or traces for detailed per‑entity data.
Metric decomposition: split labels into separate metrics instead of aggregating everything into one.
Metric lifecycle: align a metric's lifespan with the underlying resource and delete stale series.
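As an illustration of decomposition (the metric names below are hypothetical), a single counter that carries every label multiplies all of its value sets together, while split metrics keep each label set small:

```
# One metric with every label: all value sets multiply together.
http_request_total{cluster="c1",service="s1",endpoint="/api/v1/users",method="GET",resp_code="200"}

# Decomposed: request outcomes keep only low-cardinality labels...
http_request_total{service="s1",method="GET",resp_code="200"}
# ...while per-endpoint detail lives in its own metric, or in logs/traces.
http_endpoint_request_total{service="s1",endpoint="/api/v1/users"}
```

The trade‑off is that you can no longer correlate endpoint and response code in a single query, which is exactly the detail better served by logs or traces.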
Metric Switches
Exporters often provide switches to disable unneeded metrics (e.g., node‑exporter).
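For node‑exporter specifically, collectors can be toggled with per‑collector flags; check your exporter version's `--help` output for the exact names, as they vary across releases:

```
# Disable collectors you do not need:
node_exporter --no-collector.wifi --no-collector.infiniband

# Or start from nothing and enable only what you use:
node_exporter --collector.disable-defaults --collector.cpu --collector.meminfo
```

Disabling a collector removes all of its series at the source, which is cheaper than dropping them later in the pipeline.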
Write‑Side Discard
When an exporter lacks such switches, apply relabel_config rules to drop unwanted series before ingestion.
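A minimal sketch of such rules, using Prometheus's metric_relabel_configs (which runs after a scrape, before ingestion); the job, target, metric name, and label name here are assumptions for illustration:

```yaml
scrape_configs:
  - job_name: "app"
    static_configs:
      - targets: ["app:8080"]
    metric_relabel_configs:
      # Drop an entire high-cardinality metric by name.
      - source_labels: [__name__]
        regex: "http_request_duration_seconds_bucket"
        action: drop
      # Or keep the metric but strip a high-churn label from all series.
      - regex: "user_id"
        action: labeldrop
```

Note that labeldrop merges series that differed only in the removed label, so it is only safe for labels that don't distinguish otherwise‑identical samples.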
Pre‑aggregation
Volcengine’s VMP can pre‑aggregate metrics (e.g., aggregating by node pool instead of individual pods) before storage, reducing series count.
Aggregation Queries
Use PromQL aggregation operators that can be pushed down to storage nodes, minimizing data transfer.
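For example (metric names as above, assumed for illustration), aggregating before returning results keeps the payload proportional to the grouping labels rather than the raw series count:

```
# Returns one series per cluster instead of one per pod.
sum by (cluster) (rate(container_cpu_usage_seconds_total[5m]))

# topk bounds the result set further when only outliers matter.
topk(5, sum by (service) (rate(http_request_total[5m])))
```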
Query Sharding and RemoteRead
Distribute data across multiple Prometheus instances and use RemoteRead or VMP’s query‑pushdown to combine results efficiently.
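In open‑source Prometheus, a query‑layer instance can fan reads out to sharded storage via remote_read; the shard URLs below are placeholders:

```yaml
# Query-layer Prometheus that merges results from two storage shards.
remote_read:
  - url: "http://prom-shard-0:9090/api/v1/read"
    read_recent: true
  - url: "http://prom-shard-1:9090/api/v1/read"
    read_recent: true
```

The query layer merges series from all shards at query time, so each shard only has to hold and index a fraction of the total cardinality.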
Write‑Side Pre‑aggregation
VMP’s collector can shard writes across workspaces, and its query engine can perform distributed aggregation, handling high‑cardinality queries without full data scans.
Conclusion
By applying these practices—proper metric modeling, selective collection, pre‑aggregation, and distributed querying—organizations can manage high cardinality in AI large‑model and autonomous‑driving workloads while keeping monitoring performance and cost under control.
ByteDance Cloud Native
Sharing ByteDance's cloud-native technologies, technical practices, and developer events.