Migration from Thanos to VictoriaMetrics: Architecture, Plan, Issues, and Benefits
This article details the end‑to‑end migration from Thanos to VictoriaMetrics: the background analysis and architectural comparison, a phased migration plan, the configuration and performance issues hit along the way and how they were resolved, and the resulting gains in performance, cost, and scalability for the monitoring system.
1. Background Introduction
Previous research compared Thanos and VictoriaMetrics across resource consumption, query latency, and maintenance cost. VictoriaMetrics came out ahead on every key metric, which led to the decision to migrate the entire monitoring stack to VictoriaMetrics in Q4.
2. Architecture Overview
The comparison splits Thanos and VictoriaMetrics into three layers:
Storage Layer: Provides time‑series storage and high‑availability read/write separation.
Collection Layer: vmagent is fully compatible with Prometheus and offers better performance for large‑scale deployments.
Alerting Layer: vmalert improves alert persistence and component restart handling while remaining compatible with Prometheus alert rules.
3. Migration Plan
1. Preparation Phase
Requirement analysis: define performance, resource, and reliability goals.
Environment setup: build a VictoriaMetrics cluster in a test environment for validation.
2. Testing Phase
Functional testing: verify all features of VictoriaMetrics meet existing requirements.
Performance testing: compare query response time and resource usage against Thanos.
Compatibility testing: ensure custom configs and scripts work after migration.
3. Switch Phase
Step‑by‑step rollout: start with a small pilot, monitor, and adjust.
Gradual expansion: extend the migration based on pilot feedback.
Real‑time monitoring: continuously observe both old and new systems during the switch.
Fast rollback: keep historical data intact for immediate rollback without user impact.
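One way to keep rollback cheap during the switch is to dual‑write: vmagent accepts repeated -remoteWrite.url flags, so the same agent can feed both the old Thanos receive endpoint and the new cluster in parallel. A sketch, with hostnames and ports as assumptions rather than the team's actual topology:

```shell
vmagent -promscrape.config=/etc/prometheus/prometheus.yml \
  -remoteWrite.url=http://thanos-receive:19291/api/v1/receive \
  -remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write
```

Once the new cluster is verified, dropping the first flag completes the cutover without losing data on either side.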
4. Monitoring and Optimization
Performance tuning: adjust VictoriaMetrics parameters based on observed metrics.
User feedback: collect and act on feedback, especially for large queries that may hit default limits.
4. Migration Process
1. Cluster Splitting
Large monolithic clusters cause performance bottlenecks, high maintenance cost, and low resource utilization; splitting into multiple smaller VictoriaMetrics clusters improves manageability and performance.
2. Scaling Strategy
Vertical scaling: increase CPU/memory of a node.
Horizontal scaling: add more vmstorage, vminsert, and vmselect nodes via service discovery.
Automatic scaling: deploy vmagent on Kubernetes and use HPA to scale based on CPU/memory metrics.
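As a sketch, an HPA for a vmagent Deployment might look like the following (names, replica counts, and thresholds are assumptions, not the team's actual manifest):

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmagent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmagent
  minReplicas: 2
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 80
```

Note that scaling scrapers horizontally also requires splitting targets across replicas; vmagent's -promscrape.cluster.* flags exist for this purpose.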
3. vmagent Migration
vmagent acts as a Prometheus‑compatible remote‑write proxy, improving data collection efficiency. It reuses the existing prometheus.yml file, passed in via the -promscrape.config flag:
global:
scrape_interval: 15s
evaluation_interval: 15s
scrape_configs:
- job_name: 'node_exporter'
static_configs:
- targets: ['localhost:9100']
The remote‑write endpoint is set to the cluster's vminsert address (vmselect only serves queries):
-remoteWrite.url=http://localhost:8480/insert/0/prometheus/api/v1/write
-promscrape.concurrentScrapers=10
-promscrape.scrapeTimeout=10s
-promscrape.relabelConfig.file=/etc/prometheus/relabel.yaml
Metric filtering and sample limits are configured via metric_relabel_configs and sample_limit to control ingestion volume.
metric_relabel_configs:
- action: keep
regex: "^(metric_to_keep_1|metric_to_keep_2)$"
source_labels: [__name__]
sample_limit: 1000
4. Grafana Migration
Grafana dashboards (>2500) use Thanos as a data source; the migration replaces the data source UID in batches, validates each dashboard, and ensures alert rules are correctly switched.
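The batch replacement described above can be sketched as a simple pass over exported dashboard JSON; the team's actual tool isn't shown, and the UIDs and paths below are placeholders:

```shell
# Batch-replace the Thanos datasource UID in exported dashboard JSON files.
mkdir -p dashboards
# Stand-in for a real exported dashboard:
echo '{"datasource": {"type": "prometheus", "uid": "thanos-ds"}}' > dashboards/example.json

OLD_UID="thanos-ds"   # assumed old (Thanos) datasource UID
NEW_UID="vm-ds"       # assumed new (VictoriaMetrics) datasource UID
for f in dashboards/*.json; do
  sed -i "s/\"uid\": \"${OLD_UID}\"/\"uid\": \"${NEW_UID}\"/g" "$f"
done
```

In practice the same pass also has to touch alert rules that reference the old data source, as the article notes, and each dashboard still needs a visual check after the swap.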
5. Alert Rule Migration
Thanos Ruler rules are exported from MySQL and imported into vmalert, placed under /etc/vmalert/rules/, and validated with:
vmalert -rule=/etc/vmalert/rules/example_rules.yaml -datasource.url=http://localhost:8481/select/0/prometheus -dryRun
5. Issues Encountered During Migration
1. Component Configuration Problems
vmagent lacked alert rule configuration, causing startup failures; removed rule sections from Prometheus config.
Incompatible scrape_configs parameters (e.g., refresh_interval ) required adjustment in vmagent startup flags.
Remote‑write settings needed to be moved from config files to command‑line arguments.
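For illustration, the move from config‑file remote write to flags looks roughly like this (the vminsert address is an assumption):

```shell
# Prometheus declared the endpoint inside prometheus.yml:
#   remote_write:
#     - url: http://vminsert:8480/insert/0/prometheus/api/v1/write
# vmagent does not read that section; the endpoint moves to a flag,
# and the remote_write block is removed from the config file:
vmagent -promscrape.config=/etc/prometheus/prometheus.yml \
  -remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write
```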
2. Collection Metric Issues
Large metric sets exceeded default --promscrape.maxScrapeSize ; increased the limit.
Excessive label counts hit the maxLabelsPerTimeseries limit; raised -maxLabelsPerTimeseries accordingly.
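The two limits above map to flags along these lines (values are illustrative; flag names follow the upstream VictoriaMetrics docs and should be checked against your version, where the defaults are on the order of 16MiB and 30 labels):

```shell
# vmagent: raise the per-target scrape response size limit
-promscrape.maxScrapeSize=128MB
# vminsert / single-node: allow more labels per series
-maxLabelsPerTimeseries=60
```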
3. Query Metric Issues
Query body size exceeded --search.maxQueryLen ; increased the limit.
Memory exhaustion for massive result sets; mitigated by reducing time range, increasing step, or raising resource limits.
Number of matching time series exceeded default 300,000; adjusted vmselect limits or narrowed queries.
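On the vmselect side, the limits involved in these three issues look roughly as follows (values are illustrative; stated defaults are from upstream docs and may differ by version):

```shell
# maximum accepted query length (default 16KiB)
-search.maxQueryLen=65536
# maximum unique series a single query may touch (default 300000)
-search.maxUniqueTimeseries=1000000
```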
4. Dashboard Display Issues
Obsolete panels caused migration friction; batch‑updated dashboards and alerts via a custom tool to replace data source UIDs.
6. Post‑Migration Benefits
1. Performance Gains
VictoriaMetrics delivers ~50% faster query response times on the same hardware.
2. Operational Cost and Resource Consumption
Component restarts are quicker with minimal impact.
Horizontal scaling allows rapid resource addition.
Memory and CPU usage reduced by ~30% compared to Thanos.
3. Strong Scalability
Supports seamless horizontal expansion to handle future data growth.
Custom development on VM components enables hot scaling via Consul service discovery.
7. Conclusion
The migration from Thanos to VictoriaMetrics resolved performance bottlenecks, lowered operational costs, and improved scalability, providing a solid foundation for future monitoring enhancements and serving as a valuable reference for similar migrations.
Soul Technical Team
Technical practice sharing from Soul