
Practical Experience of Deploying and Optimizing Apache Pulsar on Kubernetes at 360

This article shares the architecture design, cluster deployment, storage selection, multi‑region mode, service discovery, performance tuning, monitoring, alerting, and future plans of a production‑grade Apache Pulsar platform running on Kubernetes, providing valuable insights for engineers adopting Pulsar.

360 Tech Engineering

1. Architecture Design

1.1 Cluster Deployment

The Pulsar platform runs on a Kubernetes cluster to maximize resource utilization, improve flexibility, and simplify operations. Local adaptations include integrating with the company's log‑management system, unifying monitoring and alerting, using the open‑local PVC plugin for storage, consolidating permission management, and tightly coupling with the internal load balancer for high availability.

1. Log Management Integration: Automated log collection feeds the company big‑data analytics platform for log analysis, business insight, and intelligent alerts.

2. Monitoring & Alerting Integration: Automatic registration with the internal monitoring system provides real‑time visibility of Pulsar cluster health.

3. Storage Adaptation & Optimization: Open‑local PVC plugin is used to align with corporate storage policies, enhancing flexibility and performance.

4. Permission Management Integration: Unified access control simplifies user management and improves security.

5. Load‑Balancing Integration: Integration with the corporate load‑balancer ensures stable, low‑latency message delivery across the organization.
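For the storage adaptation above, a local volume claim with the open-local provisioner might look as follows. This is a minimal sketch: the storage class name `open-local-lvm` and the claim size are illustrative assumptions, not 360's actual configuration.

```yaml
# Hypothetical PVC for a bookie journal volume backed by open-local.
# Storage class name and size are illustrative assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: journal-bookie-0
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: open-local-lvm
  resources:
    requests:
      storage: 100Gi
```

Because open-local provisions node-local volumes, the scheduler pins the pod to the node that holds its volume, which fits BookKeeper's expectation of stable local disks.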

Future work includes exploring Pulsar Operator for finer‑grained automation.

1.2 Disk Selection

Benchmarking shows NVMe disks achieve sub‑8 ms end‑to‑end latency at 1 GB/s, while standard SSDs stay around 40 ms. The team adopts SSD local disks as the default storage and provides NVMe for latency‑critical services, while testing a heterogeneous mix of HDD and NVMe to balance cost and performance.

Ledger data, which is written asynchronously, is stored on high‑capacity HDDs to reduce cost without hurting write throughput, while journal logs and RocksDB indexes reside on fast NVMe drives.
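In bookkeeper.conf this split maps to the journal, ledger, and index directory settings. A sketch, assuming NVMe mounted at /mnt/nvme and HDDs at /mnt/hdd1 and /mnt/hdd2 (mount points are illustrative):

```
# Latency-critical, synchronously flushed data on NVMe.
journalDirectories=/mnt/nvme/journal
indexDirectories=/mnt/nvme/index

# Asynchronously written entry logs on high-capacity HDDs.
ledgerDirectories=/mnt/hdd1/ledgers,/mnt/hdd2/ledgers
```

With this layout, the fsync-bound journal path never competes with bulk entry-log writes for the same spindle.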

1.3 Cluster Modes

Multiple deployment modes are offered—single‑node, multi‑zone active‑active, and multi‑region active‑active—to meet varying stability and flexibility requirements. In active‑active mode, clusters are fully meshed to synchronize messages and consumption progress in real time, providing near‑zero‑downtime failover.

Testing revealed a risk of duplicate consumption when producing and consuming across two clusters simultaneously; the recommended practice is “single‑cluster produce‑consume” to maintain data consistency.
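Replication between the meshed clusters is configured per namespace. A sketch with pulsar-admin, where the tenant/namespace and the cluster names `bj` and `sh` are illustrative placeholders:

```
# Allow the tenant on both clusters, then mesh the namespace.
bin/pulsar-admin tenants update my-tenant --allowed-clusters bj,sh
bin/pulsar-admin namespaces set-clusters my-tenant/my-ns --clusters bj,sh
```

Once both clusters are listed, Pulsar geo-replication asynchronously forwards messages published in either cluster to the other; the "single‑cluster produce‑consume" practice above then avoids the duplicate-consumption edge case.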

2. Pulsar Service Discovery

Broker pods run in Kubernetes while client applications may run on bare metal, VMs, or other clusters. To expose brokers externally, the pod IP is captured via an environment variable and set as advertisedAddress in the broker configuration, which registers the IP in ZooKeeper.

Because pod IPs are unstable, an internal LoadBalancer proxies all brokers, and a single domain name is used to hide underlying IP changes, simplifying future data‑center migrations.

env:
  - name: POD_IP
    valueFrom:
      fieldRef:
        apiVersion: v1
        fieldPath: status.podIP
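A sketch of how the captured pod IP can reach broker.conf at startup, assuming the official apachepulsar/pulsar image, whose apply-config-from-env.py script copies `PULSAR_PREFIX_`-prefixed environment variables into the config file (image tag and command are illustrative):

```yaml
# Illustrative broker container spec: the pod IP becomes advertisedAddress.
containers:
  - name: broker
    image: apachepulsar/pulsar:2.10.2   # version is an assumption
    env:
      - name: PULSAR_PREFIX_advertisedAddress
        valueFrom:
          fieldRef:
            fieldPath: status.podIP
    command: ["sh", "-c",
              "bin/apply-config-from-env.py conf/broker.conf && exec bin/pulsar broker"]
```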

Clients resolve the domain name to obtain broker addresses, then connect directly to the broker pods for publishing and consuming.

3. Performance Optimization

3.1 Challenges

Using OpenMessagingBenchmark (OMB) the team identified high latency in journal creation, publishing, and consumption (P999 reaching seconds), as well as uneven traffic between Bookie and Broker nodes.

3.2 Solutions

3.2.1 Benchmark Tool Optimization

Adjust OMB version to match the Pulsar client, add TTL and retention policies to clean test data, customize persistence parameters, fine‑tune batching delays, and ensure the tool exits cleanly to avoid memory leaks.
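A representative OMB workload file for this kind of test, with illustrative values rather than the exact workloads used:

```yaml
# Hypothetical OMB workload definition (values are examples only).
name: 1-topic-16-partitions-1kb
topics: 1
partitionsPerTopic: 16
messageSize: 1024
payloadFile: payload/payload-1Kb.data
subscriptionsPerTopic: 1
consumerPerSubscription: 1
producersPerTopic: 1
producerRate: 50000
consumerBacklogSizeGB: 0
testDurationMinutes: 15
```

Persistence and batching overrides, as well as the TTL/retention cleanup mentioned above, live in the driver configuration rather than the workload file.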

3.2.2 Pulsar Optimizations

Build comprehensive Grafana monitoring, develop a deep understanding of broker and bookie data flows, separate ZooKeeper, ledger, and journal directories onto SSD/NVMe, enable journalSyncData, limit the number of journal directories, and adjust parameters such as journalPageCacheFlushIntervalMSec, numJournalCallbackThreads, and the I/O thread counts.
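A bookkeeper.conf fragment illustrating the journal-side knobs named above; the values are starting points for tuning, not the production numbers:

```
# fsync journal entries before acknowledging writes (durability guarantee).
journalSyncData=true
# Page-cache flush cadence; this knob matters when sync is disabled (value illustrative).
journalPageCacheFlushIntervalMSec=1000
# Threads that run write-completion callbacks (value illustrative).
numJournalCallbackThreads=8
# Keep journal directories few so one fast device sees sequential I/O.
journalDirectories=/mnt/nvme/journal
```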

Storage tuning includes setting appropriate EnsembleSize, WriteQuorum, and AckQuorum values, and configuring the RocksDB write cache, buffer sizes, and SST file sizes to reduce write amplification.
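The quorum settings also shape per-bookie load: entries are striped across the ensemble, and each entry is written to WriteQuorum bookies, so under uniform striping each ensemble member sees roughly WriteQuorum/EnsembleSize of a ledger's write traffic. A toy calculation of that share:

```python
# Toy model of per-bookie write share under BookKeeper ensemble striping.
# Assumes uniform striping; real placement also depends on rack awareness
# and bookie availability.

def per_bookie_write_share(ensemble_size: int, write_quorum: int) -> float:
    """Approximate fraction of a ledger's writes each ensemble member receives."""
    return write_quorum / ensemble_size

# E=3, Qw=3: no striping benefit, every bookie takes the full stream.
print(per_bookie_write_share(3, 3))  # 1.0
# E=5, Qw=2: writes spread out, each bookie takes ~40% of the stream.
print(per_bookie_write_share(5, 2))  # 0.4
```

This is one reason uneven bookie traffic shows up when EnsembleSize equals WriteQuorum: there is no striping left to spread the load.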

3.2.3 Read Optimizations

Increase write cache size to improve cache hits, adjust dbStorage_readAheadCacheBatchSize for pre‑loading, and tune RocksDB block size for better index caching.
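These read-path knobs live in bookkeeper.conf as well; a fragment with illustrative values:

```
# DbLedgerStorage read-path tuning (values are illustrative starting points).
dbStorage_writeCacheMaxSizeMb=2048       # larger write cache -> more tail reads served from memory
dbStorage_readAheadCacheMaxSizeMb=1024
dbStorage_readAheadCacheBatchSize=1000   # entries pre-loaded per read-ahead pass
dbStorage_rocksDB_blockSize=65536        # bigger SST blocks -> denser index caching
```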

3.2.4 Optimization Results

After tuning, the cluster achieved up to 2 GB/s peak throughput, P99 publish latency ≤ 5 ms at 1 GB/s sustained load, average end‑to‑end latency ≤ 3.2 ms (P99 ≤ 8 ms), and stable 1 GB/s catch‑up reads under 1.8 GB/s tail‑read scenarios.

4. Monitoring & Alerting

Pulsar’s built‑in monitoring covers JVM, broker, Bookie, ZooKeeper, topic, and core K8s metrics. When Grafana consumes data from the K8s Prometheus exporter, chart adjustments are required due to format differences.

A multi‑layer alerting system monitors hardware resources, K8s cluster and node health, pod status, message E2E latency (using a mock client), and Pulsar‑specific metrics such as Bookie read/write status.
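The E2E latency check follows a simple pattern: the mock client publishes timestamped messages, consumes them back, and alerts on the delta. A minimal stdlib sketch of the idea, with a `queue.Queue` standing in for the real Pulsar producer/consumer pair (an assumption for the sake of a self-contained example):

```python
import queue
import threading
import time

# Sketch of the mock-client E2E probe. In production the queue would be a
# Pulsar topic accessed through the Pulsar client; queue.Queue is a stand-in.

def measure_e2e_latencies(n: int = 100) -> list:
    q: queue.Queue = queue.Queue()
    latencies: list = []

    def consume() -> None:
        for _ in range(n):
            sent_at = q.get()                      # "receive" the message
            latencies.append(time.perf_counter() - sent_at)

    consumer = threading.Thread(target=consume)
    consumer.start()
    for _ in range(n):
        q.put(time.perf_counter())                 # "publish" a timestamped message
    consumer.join()
    return latencies

samples = measure_e2e_latencies()
print(len(samples))  # 100
```

A real probe would attach message timestamps as payloads or properties and export a latency percentile to the alerting system.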

Log data is also shipped to the big‑data platform for deeper analysis, reducing operational workload and accelerating automation.

5. Future Plans

Upcoming work includes S3 integration for long‑term low‑cost storage, heterogeneous storage strategies combining SATA HDDs with NVMe SSDs, refined load‑balancing algorithms, automatic elastic scaling, and maturing the Pulsar Operator for easier K8s management. Integration with core platforms such as big‑data processing, log management, and function compute will further expand Pulsar’s ecosystem.

Tags: monitoring, cloud-native, performance optimization, Kubernetes, Apache Pulsar, distributed messaging
Written by 360 Tech Engineering

Official tech channel of 360, building the most professional technology aggregation platform for the brand.