
Taming High Cardinality in AI & Autonomous Driving with Prometheus

This article shares practical experience from Volcengine's managed Prometheus service and its deep integration with large‑model and autonomous‑driving platforms, explaining what high cardinality is, its impact on monitoring systems, root causes, and a range of design, collection, and analysis techniques to mitigate it.

ByteDance Cloud Native

What Is Cardinality

Cardinality describes the number of distinct label combinations that a metric can have. For example, the metric

http_request_total{cluster="cluster1",service="service1",endpoint="endpoint1",method="GET",resp_code="200"}

can explode in combinations when many clusters, services, endpoints, methods, and response codes exist.
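The worst-case series count is the product of the number of distinct values each label can take. A quick sketch (the per-label counts below are illustrative assumptions, not measurements from a real system):

```python
# Hypothetical numbers of distinct values per label for http_request_total
label_values = {
    "cluster": 10,
    "service": 50,
    "endpoint": 20,
    "method": 5,
    "resp_code": 10,
}

# Worst-case cardinality is the product of all label value counts
series = 1
for count in label_values.values():
    series *= count

print(series)  # 500000 potential time series for a single metric
```

Five modest-looking labels multiply into half a million potential series; adding one more high-churn label (say, a pod name with thousands of values) multiplies the total again.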

Impact of High Cardinality

Increased monitoring costs: CPU, memory, and storage usage rise because each unique series creates indexes and caches.

Slower read/write latency: index creation delays writes, and larger result sets slow reads.

Quota exhaustion: multi-tenant systems may hit per-tenant metric quotas faster.

Common Causes

Large numbers of Prometheus targets.

Each target exposing many time series.

High-churn labels such as user IDs or URLs.
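To find out which of these causes applies to a given deployment, standard PromQL cardinality queries (not specific to any vendor) can be run against the instance; `http_request_total` here stands in for any suspect metric:

```promql
# Top 10 metric names by number of active series
topk(10, count by (__name__)({__name__=~".+"}))

# Number of distinct values a suspect label contributes to one metric
count(count by (pod) (http_request_total))
```

Running these periodically makes cardinality regressions visible before they exhaust quotas.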

High Cardinality in Large‑Model and Autonomous‑Driving Domains

In these domains, the most frequent high‑cardinality label is pod name, because training jobs create and destroy thousands of pods quickly. Additionally, large‑model platforms expose many “access points” that act like micro‑services, further inflating metric counts.

Typical Solutions

Solutions span metric design, collection, and analysis.

Metric Design

Reasonable modeling: use metrics for high‑level state and rely on logs or traces for detailed per‑entity data.

Metric decomposition: split labels into separate metrics instead of aggregating everything.

Metric lifecycle: align metric lifespan with the underlying resource and delete stale series.
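Decomposition works because cardinality multiplies across labels on one metric but only adds across separate metrics. A sketch in Prometheus exposition format (the metric names are hypothetical):

```text
# One metric with every dimension:
# up to |endpoint| x |method| x |resp_code| series
http_request_total{endpoint="/login", method="GET", resp_code="200"} 42

# Decomposed into two metrics:
# |endpoint| x |method| + |resp_code| series
http_request_by_endpoint_total{endpoint="/login", method="GET"} 42
http_request_by_code_total{resp_code="200"} 42
```

The trade-off is losing the ability to correlate across the split labels (e.g., response codes per endpoint), so decomposition fits labels that are rarely queried together.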

Metric Switches

Exporters often provide switches to disable unneeded metrics (e.g., node‑exporter).
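For node‑exporter specifically, collectors can be toggled with `--no-collector.<name>` flags (the set of available collectors varies by version, so check `--help` for your build). A sketch:

```shell
# Disable collectors that are not needed and exclude noisy mount points
node_exporter \
  --no-collector.wifi \
  --no-collector.infiniband \
  --collector.filesystem.mount-points-exclude='^/(dev|proc|sys)($|/)'
```

Disabling a collector removes every series it would have exposed, which is cheaper than dropping them downstream.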

Write‑Side Discard

When exporters lack switches, apply relabel_config rules to drop unwanted series before ingestion.
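In a vanilla Prometheus scrape config this is done with `metric_relabel_configs`, which runs against scraped series before they are stored. A minimal sketch (the job, target, metric, and label names are illustrative):

```yaml
scrape_configs:
  - job_name: app
    static_configs:
      - targets: ["app:8080"]
    metric_relabel_configs:
      # Drop an entire family of unwanted metrics by name
      - source_labels: [__name__]
        regex: http_request_duration_by_url_.*
        action: drop
      # Remove a high-churn label, collapsing its series
      - regex: user_id
        action: labeldrop
```

Note that `labeldrop` merges series that differed only in the dropped label, so it is only safe for labels whose values you never need to distinguish.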

Pre‑aggregation

Volcengine’s VMP can pre‑aggregate metrics (e.g., aggregating by node pool instead of individual pods) before storage, reducing series count.

Aggregation Queries

Use PromQL aggregation operators that can be pushed down to storage nodes, minimizing data transfer.
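For example, a service-level query lets the storage layer collapse thousands of per-pod series before returning results:

```promql
# Only per-service series cross the wire, not per-pod series
sum by (service) (rate(http_request_total[5m]))
```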

Query Sharding and RemoteRead

Distribute data across multiple Prometheus instances and use RemoteRead or VMP’s query‑pushdown to combine results efficiently.
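In stock Prometheus, a global instance can federate shards through its `remote_read` configuration (the shard URLs below are placeholders):

```yaml
# Global Prometheus reading from two shards at query time
remote_read:
  - url: http://prom-shard-a:9090/api/v1/read
    read_recent: true
  - url: http://prom-shard-b:9090/api/v1/read
    read_recent: true
```

The global instance fans queries out to each shard and merges the results, so no single node has to hold the full series set.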

Write‑Side Pre‑aggregation

VMP’s collector can shard writes across workspaces, and its query engine can perform distributed aggregation, handling high‑cardinality queries without full data scans.

Conclusion

By applying these practices—proper metric modeling, selective collection, pre‑aggregation, and distributed querying—organizations can manage high cardinality in AI large‑model and autonomous‑driving workloads while keeping monitoring performance and cost under control.

Tags: monitoring, AI, observability, Prometheus, autonomous driving, high cardinality
Written by ByteDance Cloud Native

Sharing ByteDance's cloud-native technologies, technical practices, and developer events.