
Migration from Thanos to VictoriaMetrics: Architecture, Plan, Issues, and Benefits

This article walks through the end‑to‑end migration from Thanos to VictoriaMetrics: the background analysis, an architectural comparison, the phased migration plan, the configuration and performance issues hit along the way and how they were resolved, and the resulting gains in performance, cost, and scalability for the monitoring system.

Soul Technical Team

1. Background Introduction

Earlier evaluation compared Thanos and VictoriaMetrics on resource consumption, query latency, and maintenance cost. VictoriaMetrics came out ahead on every key metric, so the decision was made to migrate the entire monitoring stack to VictoriaMetrics in Q4.

2. Architecture Overview

The comparison splits Thanos and VictoriaMetrics into three layers:

Storage Layer: provides time‑series storage with high availability; in the VictoriaMetrics cluster, vminsert and vmselect separate the write and read paths on top of vmstorage.

Collection Layer: vmagent is fully compatible with Prometheus and offers better performance for large‑scale deployments.

Alerting Layer: vmalert improves alert persistence and component restart handling while remaining compatible with Prometheus alert rules.

3. Migration Plan

1. Preparation Phase

Requirement analysis: define performance, resource, and reliability goals.

Environment setup: build a VictoriaMetrics cluster in a test environment for validation.

2. Testing Phase

Functional testing: verify all features of VictoriaMetrics meet existing requirements.

Performance testing: compare query response time and resource usage against Thanos.

Compatibility testing: ensure custom configs and scripts work after migration.

3. Switch Phase

Step‑by‑step rollout: start with a small pilot, monitor, and adjust.

Gradual expansion: extend the migration based on pilot feedback.

Real‑time monitoring: continuously observe both old and new systems during the switch.

Fast rollback: keep historical data intact so traffic can switch back immediately without user impact; double‑writing to both stacks, sketched below, makes this practical.
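During the parallel‑run window, a common pattern (a sketch of one option, not necessarily the exact setup used here) is to let vmagent double‑write, since it replicates every collected sample to each configured -remoteWrite.url; both endpoints below are placeholders:

# vmagent replicates all samples to every -remoteWrite.url target,
# keeping the old and new stacks in sync until cutover (URLs are placeholders)
-remoteWrite.url=http://thanos-receive:19291/api/v1/receive
-remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write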

4. Monitoring and Optimization

Performance tuning: adjust VictoriaMetrics parameters based on observed metrics.

User feedback: collect and act on feedback, especially for large queries that may hit default limits.

4. Migration Process

1. Cluster Splitting

Large monolithic clusters cause performance bottlenecks, high maintenance cost, and low resource utilization; splitting into multiple smaller VictoriaMetrics clusters improves manageability and performance.

2. Scaling Strategy

Vertical scaling: increase a node's CPU/memory. Horizontal scaling: add more vmstorage, vminsert, and vmselect nodes via service discovery.
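In the cluster version, vminsert and vmselect are pointed at the vmstorage nodes through the -storageNode flag (which service discovery can populate); a sketch with placeholder hostnames:

# vminsert: fan writes out across storage nodes (8400 is vmstorage's vminsert port)
-storageNode=vmstorage-1:8400,vmstorage-2:8400,vmstorage-3:8400

# vmselect: read from the same nodes (8401 is vmstorage's vmselect port)
-storageNode=vmstorage-1:8401,vmstorage-2:8401,vmstorage-3:8401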

Automatic scaling: deploy vmagent on Kubernetes and use an HPA to scale on CPU/memory metrics.
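A minimal sketch of such an HPA, assuming vmagent runs as a Deployment named vmagent; the replica bounds and CPU threshold are illustrative:

# Hypothetical HPA for a vmagent Deployment; names and thresholds are illustrative
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: vmagent
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: vmagent
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 70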

3. vmagent Migration

vmagent acts as a Prometheus‑compatible remote‑write proxy, improving data collection efficiency. It reuses the existing prometheus.yml, passed in via the -promscrape.config flag.

global:
  scrape_interval: 15s
  evaluation_interval: 15s
scrape_configs:
- job_name: 'node_exporter'
  static_configs:
  - targets: ['localhost:9100']

The remote‑write endpoint must point at the VM cluster's vminsert nodes (vmselect serves only the read path); the cluster write URL takes the form http://<vminsert>:8480/insert/<accountID>/prometheus/api/v1/write. Scrape timeouts stay in prometheus.yml (scrape_timeout), while relabeling before remote write is attached via a flag:

-promscrape.config=/etc/prometheus/prometheus.yml
-remoteWrite.url=http://localhost:8480/insert/0/prometheus/api/v1/write
-remoteWrite.relabelConfig=/etc/prometheus/relabel.yaml

Metric filtering and sample limits are configured per scrape job via metric_relabel_configs and sample_limit to control ingestion volume (relabel regexes are fully anchored, so ^ and $ are redundant):

scrape_configs:
- job_name: 'node_exporter'
  sample_limit: 1000
  metric_relabel_configs:
  - source_labels: [__name__]
    regex: "metric_to_keep_1|metric_to_keep_2"
    action: keep

4. Grafana Migration

More than 2,500 Grafana dashboards used Thanos as their data source. The migration replaced the data source UID in batches, validated each dashboard, and verified that alert rules were switched correctly.
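A minimal sketch of such a batch update against the Grafana HTTP API, assuming an API token, jq 1.6+, and placeholder datasource UIDs in OLD_UID/NEW_UID; each dashboard should still be validated after the rewrite:

#!/bin/sh
# Sketch: swap the Thanos datasource UID for the VictoriaMetrics one in every
# dashboard. GRAFANA_URL, TOKEN, OLD_UID and NEW_UID are placeholders.
GRAFANA_URL=http://grafana:3000
AUTH="Authorization: Bearer $TOKEN"

for uid in $(curl -s -H "$AUTH" "$GRAFANA_URL/api/search?type=dash-db" | jq -r '.[].uid'); do
  curl -s -H "$AUTH" "$GRAFANA_URL/api/dashboards/uid/$uid" \
    | jq --arg old "$OLD_UID" --arg new "$NEW_UID" \
        '.dashboard
         | walk(if type == "object" and (.datasource? | type == "object")
                   and .datasource.uid == $old
                then .datasource.uid = $new else . end)
         | {dashboard: ., overwrite: true}' \
    | curl -s -H "$AUTH" -H "Content-Type: application/json" \
        -X POST -d @- "$GRAFANA_URL/api/dashboards/db" > /dev/null
done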

5. Alert Rule Migration

Thanos Ruler rules were exported from MySQL, converted into vmalert rule files under /etc/vmalert/rules/, and validated with vmalert's dry‑run mode:

vmalert -rule=/etc/vmalert/rules/example_rules.yaml -datasource.url=http://localhost:8481/select/0/prometheus -dryRun
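vmalert consumes standard Prometheus rule groups, so a migrated file looks like ordinary Prometheus rules; the alert below is a hypothetical illustration:

# Hypothetical migrated rule file; the alert name and threshold are illustrative
groups:
- name: node-alerts
  rules:
  - alert: HighNodeMemoryUsage
    expr: (1 - node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) > 0.9
    for: 5m
    labels:
      severity: warning
    annotations:
      summary: "Memory usage above 90% on {{ $labels.instance }}"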

5. Issues Encountered During Migration

1. Component Configuration Problems

vmagent does not evaluate alerting rules, so rule_files sections left in the Prometheus config caused startup failures; they were removed (rule evaluation moved to vmalert).

Incompatible scrape_configs parameters (e.g., refresh_interval ) required adjustment in vmagent startup flags.

Remote‑write settings had to move from the Prometheus config file to vmagent command‑line flags; the example below shows the shape.
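A sketch of the before/after (the endpoint is a placeholder):

# prometheus.yml: remote write lives in the config file
remote_write:
- url: http://vminsert:8480/insert/0/prometheus/api/v1/write

# vmagent: the same setting moves to a command-line flag
-remoteWrite.url=http://vminsert:8480/insert/0/prometheus/api/v1/write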

2. Collection Metric Issues

Large scrape targets exceeded the default -promscrape.maxScrapeSize; the limit was raised.

Series with excessive label counts hit the maxLabelsPerTimeseries cap; -maxLabelsPerTimeseries was raised accordingly.
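The corresponding flags, with illustrative values (the defaults are 16 MiB for scrape size and 30 labels per series):

-promscrape.maxScrapeSize=67108864   # vmagent: allow scrape responses up to 64 MiB
-maxLabelsPerTimeseries=60           # vminsert: raise the per-series label cap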

3. Query Metric Issues

Query bodies exceeded the default -search.maxQueryLen; the limit was raised.

Very large result sets exhausted memory; mitigated by narrowing the time range, increasing the query step, or raising resource limits.

The number of matching time series exceeded the default cap of 300,000; the vmselect limit was raised or queries were narrowed.
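Both limits are vmselect flags; a sketch with illustrative values (defaults are 16384 bytes and 300,000 series):

-search.maxQueryLen=65536            # vmselect: allow longer query bodies
-search.maxUniqueTimeseries=1000000  # vmselect: raise the matching-series cap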

4. Dashboard Display Issues

Obsolete panels caused migration friction; dashboards and alerts were batch‑updated with a custom tool that replaces data source UIDs.

6. Post‑Migration Benefits

1. Performance Gains

VictoriaMetrics delivers roughly 50% faster query response times on the same hardware.

2. Operational Cost and Resource Consumption

Component restarts are quicker with minimal impact.

Horizontal scaling allows rapid resource addition.

Memory and CPU usage reduced by ~30% compared to Thanos.

3. Strong Scalability

Supports seamless horizontal expansion to handle future data growth.

Custom development on VM components enables hot scaling via Consul service discovery.

7. Conclusion

The migration from Thanos to VictoriaMetrics resolved performance bottlenecks, lowered operational costs, and improved scalability, providing a solid foundation for future monitoring enhancements and serving as a valuable reference for similar migrations.
