
Exploring and Practicing Apache Pulsar at vivo: Cluster Management, Monitoring, and Optimization

The vivo big‑data team details how they migrated massive real‑time workloads from Kafka to Apache Pulsar, describing cluster‑level bundle and ledger management, retention policies, a Prometheus‑Kafka‑Druid monitoring pipeline, load‑balancing tweaks, client tuning, rapid broker‑failure recovery, and future cloud‑native tracing and migration plans.

vivo Internet Technology

This article is based on a talk given by the vivo Internet Big Data team at an Apache Pulsar meetup, titled "Apache Pulsar in vivo: Exploration and Practice". It describes how vivo applies Pulsar for cluster management and monitoring in a production environment serving over 400 million smartphone users.

vivo’s distributed messaging middleware team provides high‑throughput, low‑latency data ingestion and queuing services for real‑time computing workloads such as the app store, short video, and advertising. Daily data volume reaches the hundred‑trillion level.

System Architecture

The overall architecture consists of a data‑ingress layer (SDK direct connections), a messaging layer where Kafka and Pulsar coexist (Pulsar handles traffic at the trillion‑level), and a processing layer using Flink, Spark, etc. Kafka is deployed in multiple clusters (billing, search, log) with physical isolation of topics per business importance.

Why Pulsar?

Facing rapid traffic growth, Kafka showed its limits: mounting performance pressure as topic and partition counts grew, difficulty scaling dynamically, fast‑growing metadata, and a large blast radius when hardware failed. Pulsar’s compute‑storage separation, stateless brokers, and BookKeeper‑based storage address these issues, allowing compute and storage to scale independently.

Pulsar provides four subscription models (Exclusive, Failover, Shared, Key_Shared) and supports massive partition counts, real‑time load balancing, rapid scaling, fault tolerance, tiered storage, and cloud‑native deployment.

Cluster Management Practices

1. Bundle Management

Bundles control traffic distribution to brokers. Topics are hashed to bundles, and bundle metadata is stored in ZooKeeper. Key points:

The number of bundles affects load‑balancing granularity; more bundles give finer control.

Bundle metadata must be planned to avoid excessive ZooKeeper load.

Operations include unloading bundles (instead of whole namespaces) and splitting large bundles.
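As a rough illustration of how a topic lands in a bundle, the sketch below splits the 32‑bit hash space into equal ranges and maps a topic name into one of them. This is a simplified Python model using crc32 as a stand‑in for the hash function the broker actually applies; the real logic lives inside Pulsar’s namespace‑bundle code.

```python
import bisect
from zlib import crc32

FULL_RANGE = 1 << 32  # the 32-bit hash space bundles are carved from

def bundle_boundaries(num_bundles):
    # Evenly split the hash space into num_bundles contiguous ranges.
    step = FULL_RANGE // num_bundles
    return [i * step for i in range(num_bundles)] + [FULL_RANGE]

def bundle_for_topic(topic, boundaries):
    # Hash the topic name into the space, then locate its range.
    h = crc32(topic.encode()) % FULL_RANGE
    i = bisect.bisect_right(boundaries, h) - 1
    return boundaries[i], boundaries[i + 1]
```

With more bundles, each range is narrower, so traffic can be balanced across brokers at a finer granularity, which is exactly the trade‑off noted above.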

loadBalancerLoadSheddingStrategy=org.apache.pulsar.broker.loadbalance.impl.ThresholdShedder
loadBalancerAutoBundleSplitEnabled=false
loadBalancerAutoUnloadSplitBundlesEnabled=false
loadBalancerCPUResourceWeight=0.0
loadBalancerMemoryResourceWeight=0.0
loadBalancerDirectMemoryResourceWeight=0.0
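The unload and split operations mentioned above can be driven with pulsar-admin; a sketch (the tenant/namespace and bundle range here are placeholders):

```shell
# Unload one hot bundle instead of the whole namespace
bin/pulsar-admin namespaces unload my-tenant/my-ns --bundle 0x00000000_0x40000000

# Split an oversized bundle and unload the resulting halves
bin/pulsar-admin namespaces split-bundle my-tenant/my-ns --bundle 0x00000000_0x40000000 --unload
```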

2. Ledger Rotation

Each topic partition writes to a ledger for a limited period. Once the minimum rotation time has elapsed, the ledger rolls over when any of the following is met:

Maximum rotation time elapsed.

Maximum number of entries per ledger reached.

Maximum ledger size reached.

managedLedgerMinLedgerRolloverTimeMinutes=10
managedLedgerMaxLedgerRolloverTimeMinutes=240
managedLedgerMaxEntriesPerLedger=50000
managedLedgerMaxSizePerLedgerMbytes=2048

Proper tuning prevents oversized ledgers and uneven disk usage.
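Read together, the conditions behave like a guard (minimum time) plus three triggers. A minimal Python sketch of that logic as described, with defaults mirroring the configuration values above; this is an illustration, not the broker’s actual code:

```python
def should_rollover(age_minutes, entries, size_mb,
                    min_time=10, max_time=240,
                    max_entries=50000, max_size_mb=2048):
    # Guard: never roll over before the minimum rotation time.
    if age_minutes < min_time:
        return False
    # Trigger: after that, any single limit being reached is enough.
    return (age_minutes >= max_time
            or entries >= max_entries
            or size_mb >= max_size_mb)
```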

3. Data Retention & Deletion

Messages pass through four stages: unacknowledged, acknowledged (subject to TTL), retained per the retention policy, and finally physically deleted by BookKeeper’s garbage collection. Example GC/compaction settings:

minorCompactionInterval=3600
minorCompactionThreshold=0.2
majorCompactionInterval=86400
majorCompactionThreshold=0.8

Best practice: align TTL with retention period to simplify management.
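With TTL and retention aligned as suggested, both namespace‑level policies can be set with pulsar-admin; a sketch using placeholder names and a 3‑day window:

```shell
# Retain acknowledged data for 3 days, capped at 100 GB of backlog
bin/pulsar-admin namespaces set-retention my-tenant/my-ns --time 3d --size 100G

# TTL: automatically acknowledge messages older than 3 days (259200 s)
bin/pulsar-admin namespaces set-message-ttl my-tenant/my-ns --messageTTL 259200
```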

Monitoring Stack

The monitoring pipeline uses Prometheus to scrape Pulsar metrics and forward them, via remote write, to Kafka; Druid then ingests the Kafka stream and serves as the Grafana data source. Key metric categories include client‑side metrics, broker‑side metrics (topic traffic, broker load), and bookie‑side metrics (read/write latency). Additional custom metrics cover bundle distribution, P95/P99 broker latency, and network load per request.
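Prometheus cannot write to Kafka natively, so a remote‑write adapter sits between them. A sketch of the relevant prometheus.yml fragment; the adapter endpoint is a placeholder:

```yaml
remote_write:
  - url: "http://remote-write-adapter.example:8080/receive"
    write_relabel_configs:
      # Forward only Pulsar metrics to the Kafka adapter
      - source_labels: [__name__]
        regex: "pulsar_.*"
        action: keep
```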

Load‑Balancing Optimizations

Case study: one topic with 30 partitions in a namespace with 180 bundles suffered severe traffic imbalance and frequent bundle unloads. After the partition count was increased to 120, traffic spread much more evenly, unload frequency dropped from more than 200 per day to roughly 10, and client connection churn fell accordingly.

Client Configuration & Performance

Important client parameters:

memoryLimitBytes=209715200
maxPendingMessages=2000
maxPendingMessagesAcrossPartitions=40000
batchingMaxPublishDelayMicros=50
batchingMaxMessages=2000
batchingMaxBytes=5242880

After the partition count was increased to 120, the effective pending budget per partition dropped (40000 / 120 ≈ 333), causing a noticeable throughput reduction until the client was restarted. Reducing batchingMaxMessages further improved performance, by up to tenfold.
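The per‑partition squeeze can be seen with a little arithmetic. A sketch of the effective limit, assuming the client splits maxPendingMessagesAcrossPartitions evenly across partitions (the function name is ours):

```python
def per_partition_pending(max_across_partitions, max_pending, partitions):
    # Each partition gets the smaller of its own limit and an even
    # share of the across-partitions budget.
    return min(max_pending, max_across_partitions // partitions)

# 30 partitions: the across-partitions budget is not yet the bottleneck.
print(per_partition_pending(40000, 2000, 30))   # 1333
# 120 partitions: the effective per-partition queue shrinks to ~333.
print(per_partition_pending(40000, 2000, 120))  # 333
```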

Handling Broker Failures

When a broker crashes, traffic on the surviving brokers can also drop because producer send threads block on the failed connection. Optimizations include moving the blocking point from ProducerImpl up to PartitionedProducerImpl, maintaining separate lists of usable and unusable partition producers, and quickly re‑routing traffic to healthy brokers. In a kill -9 test, this approach restored traffic flow rapidly.
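The usable/unusable split can be pictured with a toy router that skips partitions whose broker is down instead of blocking the send path. This is a conceptual Python sketch, not the Pulsar client’s actual classes:

```python
class PartitionRouter:
    """Toy model: track usable and unusable partition producers
    separately so sends route only to partitions on healthy brokers."""

    def __init__(self, partitions):
        self.usable = set(partitions)
        self.unusable = set()
        self._order = sorted(partitions)
        self._i = 0

    def mark_down(self, p):
        # Broker for partition p failed: move it to the unusable list.
        self.usable.discard(p)
        self.unusable.add(p)

    def mark_up(self, p):
        # Partition p was reassigned to a healthy broker.
        self.unusable.discard(p)
        self.usable.add(p)

    def next_partition(self):
        # Round-robin over all partitions, skipping unusable ones
        # instead of blocking the send thread on a dead connection.
        for _ in range(len(self._order)):
            p = self._order[self._i % len(self._order)]
            self._i += 1
            if p in self.usable:
                return p
        raise RuntimeError("no usable partitions")
```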

Future Outlook

vivo plans to build end‑to‑end tracing from producers to consumers, integrate big‑data components (Flink, Spark, Druid) more tightly, migrate remaining Kafka traffic to Pulsar via KoP, and adopt container‑native deployments to fully leverage Pulsar’s cloud‑native capabilities.

Tags: Monitoring, Big Data, Load Balancing, Apache Pulsar, Cluster Management
Written by

vivo Internet Technology

Sharing practical vivo Internet technology insights and salon events, plus the latest industry news and hot conferences.
