Introduction to Apache Pulsar Architecture and Features
This article introduces Apache Pulsar, an open‑source cloud‑native distributed messaging platform, detailing its storage‑compute separation architecture, multi‑tenant support, load balancing, fault tolerance, schema handling, functions, IO connectors, tiered storage, cross‑region replication, and operational commands for managing brokers, bookies, and namespaces.
Apache Pulsar is an open‑source cloud‑native distributed messaging platform that separates storage and compute, integrating messaging, storage, and lightweight function execution. It supports multi‑tenant isolation, high throughput, low latency, and strong consistency, making it suitable for large‑scale, cross‑region data streams.
The Pulsar architecture differs from traditional brokers by delegating message serving to stateless brokers while persisting data on dedicated BookKeeper nodes (bookies). ZooKeeper stores metadata, and clients interact only with brokers, which simplifies read/write flows and lets topics move between brokers for load balancing.
Multi‑tenant capabilities are achieved through a hierarchical naming scheme (/property/namespace/topic) where a property represents a tenant, a namespace groups topics, and ACLs enforce security and resource quotas. Tenants can be isolated both logically and physically using broker and bookie affinity groups.
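As a concrete sketch of that hierarchy, the tenant, namespace, and topic levels can be created with the admin CLI (assuming a running cluster; the tenant, namespace, cluster, and role names below are placeholders):

```shell
# Create a tenant (property), restricting it to one cluster and one admin role
bin/pulsar-admin tenants create my-tenant \
    --allowed-clusters my-cluster \
    --admin-roles my-admin-role

# Create a namespace under the tenant to group related topics
bin/pulsar-admin namespaces create my-tenant/my-namespace

# Create a topic at the bottom of the hierarchy
bin/pulsar-admin topics create persistent://my-tenant/my-namespace/my-topic
```

Quotas and ACLs are then attached at the tenant or namespace level and inherited by every topic beneath them.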
Security features include multiple authentication mechanisms (mTLS, Athenz, Kerberos, JWT, OAuth2, OpenID Connect, HTTP basic) and end‑to‑end encryption with symmetric (AES) and asymmetric key handling, ensuring data confidentiality during transmission.
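The symmetric half of that scheme can be illustrated with a stdlib-only sketch: the producer encrypts each payload with an AES data key before it leaves the client, and the consumer decrypts it on receipt. The class and method names here are illustrative, not Pulsar APIs; in the real client the asymmetric part (wrapping the AES data key with the consumer's public key) is handled for you.

```java
import javax.crypto.Cipher;
import javax.crypto.KeyGenerator;
import javax.crypto.SecretKey;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Toy model of the symmetric payload-encryption layer.
public class PayloadCrypto {
    // Generate a fresh AES-128 data key (in Pulsar this key is then
    // wrapped with the consumer's public key and shipped in metadata).
    static SecretKey newDataKey() throws Exception {
        KeyGenerator kg = KeyGenerator.getInstance("AES");
        kg.init(128);
        return kg.generateKey();
    }

    // Encrypt a message payload with the data key.
    static byte[] encrypt(SecretKey key, byte[] plain) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.ENCRYPT_MODE, key);
        return c.doFinal(plain);
    }

    // Decrypt on the consumer side with the same data key.
    static byte[] decrypt(SecretKey key, byte[] sealed) throws Exception {
        Cipher c = Cipher.getInstance("AES");
        c.init(Cipher.DECRYPT_MODE, key);
        return c.doFinal(sealed);
    }

    public static void main(String[] args) throws Exception {
        SecretKey key = newDataKey();
        byte[] msg = "order-42".getBytes(StandardCharsets.UTF_8);
        byte[] sealed = encrypt(key, msg);
        System.out.println(Arrays.equals(msg, decrypt(key, sealed))); // true
    }
}
```

Because only the payload is encrypted, brokers and bookies can still route and store messages without ever seeing plaintext.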
Rate limiting can be configured at tenant and namespace levels, controlling message flow, message count, and producer/consumer limits. Configuration can be updated dynamically without service interruption.
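A namespace-level example of such dynamic limits, via the admin CLI (assuming a running cluster; names and numbers are placeholders):

```shell
# Cap producers in this namespace at 1000 msgs/s and 1 MiB/s per topic
bin/pulsar-admin namespaces set-publish-rate my-tenant/my-namespace \
    --msg-publish-rate 1000 \
    --byte-publish-rate 1048576

# Cap delivery to consumers at 2000 msgs/s and 2 MiB/s per topic,
# evaluated over a 1-second window
bin/pulsar-admin namespaces set-dispatch-rate my-tenant/my-namespace \
    --msg-dispatch-rate 2000 \
    --byte-dispatch-rate 2097152 \
    --dispatch-rate-period 1
```

The settings take effect on the running brokers without a restart.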
Load balancing is performed by a LoadManager on each broker, which reports resource usage to the metadata store and triggers bundle splitting or rebalancing based on policies such as OverloadShedder, ThresholdShedder, UniformLoadShedder, and TransferShedder.
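The core idea behind ThresholdShedder can be shown with a toy model (this is a simplification for illustration, not Pulsar's source): compute the average broker load and mark any broker whose load exceeds that average by more than a configured threshold as a candidate for unloading bundles.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Simplified sketch of threshold-based shedding; broker names and
// load figures are illustrative.
public class ThresholdShedderSketch {
    // Return brokers whose load exceeds (cluster average + threshold).
    static List<String> brokersToShed(Map<String, Double> load, double threshold) {
        double avg = load.values().stream()
                .mapToDouble(Double::doubleValue)
                .average()
                .orElse(0.0);
        List<String> overloaded = new ArrayList<>();
        for (Map.Entry<String, Double> e : load.entrySet()) {
            if (e.getValue() > avg + threshold) {
                overloaded.add(e.getKey());
            }
        }
        return overloaded;
    }

    public static void main(String[] args) {
        Map<String, Double> load =
                Map.of("broker-1", 0.90, "broker-2", 0.40, "broker-3", 0.35);
        // Average load is 0.55; with a 0.10 threshold, only broker-1
        // (0.90 > 0.65) is asked to shed bundles.
        System.out.println(brokersToShed(load, 0.10)); // [broker-1]
    }
}
```

Relative-to-average shedding avoids the ping-pong effect of a fixed cutoff: a uniformly busy cluster sheds nothing, while a skewed one sheds only its outliers.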
Fault tolerance is achieved through rapid failover: when a broker fails, the leader broker reassigns its topics to healthy brokers without moving data, and bookie failures trigger re-replication of the affected ledger fragments onto healthy bookies (BookKeeper autorecovery), enabling millisecond‑level recovery.
Transactional messaging guarantees atomicity of a group of produce/consume actions, ensuring all-or-nothing commits, exactly‑once delivery across partitions, and consistent acknowledgments.
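Assuming a client built with enableTransaction(true) and brokers running the transaction coordinator, the atomic produce-plus-acknowledge flow looks roughly like this sketch (the `client`, `producer`, `consumer`, and `msg` variables and the payload are placeholders, not a complete program):

```java
import java.util.concurrent.TimeUnit;
import org.apache.pulsar.client.api.transaction.Transaction;

// Open a transaction with an explicit timeout.
Transaction txn = client.newTransaction()
        .withTransactionTimeout(5, TimeUnit.MINUTES)
        .build()
        .get();

// The produce and the acknowledgment below commit or abort together.
producer.newMessage(txn).value("state-change").send();
consumer.acknowledgeAsync(msg.getMessageId(), txn);

// All-or-nothing: until commit, neither side effect is visible.
txn.commit().get();
```

If commit fails or the timeout elapses, the transaction aborts and the message is redelivered, which is what makes exactly-once pipelines across partitions possible.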
Pulsar Schema automates serialization/deserialization for primitive, complex (KeyValue), and POJO/AVRO/JSON/ProtoBuf types, with versioning and compatibility checks to handle schema evolution.
Pulsar Functions provide lightweight stream processing with at‑most‑once, at‑least‑once, and effectively‑once semantics, supporting ThreadRuntime, ProcessRuntime, and KubernetesRuntime execution environments.
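A minimal function, along the lines of the example in the Pulsar documentation, implements the Functions API interface; the class name and topics in the deploy command are placeholders:

```java
import org.apache.pulsar.functions.api.Context;
import org.apache.pulsar.functions.api.Function;

// Consumes a string from the input topic, appends "!", and
// publishes the result to the output topic.
public class ExclamationFunction implements Function<String, String> {
    @Override
    public String process(String input, Context context) {
        return input + "!";
    }
}
```

Packaged as a JAR, it is deployed with `bin/pulsar-admin functions create --jar my-function.jar --classname ExclamationFunction --inputs persistent://my-tenant/my-namespace/in --output persistent://my-tenant/my-namespace/out`, and the chosen runtime (thread, process, or Kubernetes pod) determines where the `process` calls execute.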
Pulsar IO (Connectors) extends data ingestion and egress to external systems. Built‑in sources include Kafka, MySQL (Canal, Debezium), PostgreSQL, MongoDB, and many others; built‑in sinks support databases, search engines, and storage services. Users can also implement custom sources and sinks.
Tiered storage (Offloader) allows hot data to remain on local BookKeeper while archiving cold data to cloud storage services such as Amazon S3, Google Cloud Storage, Azure Blob, or Aliyun OSS, reducing storage costs without sacrificing latency for recent data.
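With an offload driver configured on the brokers, offloading can be automatic per namespace or triggered manually per topic (names and sizes below are placeholders):

```shell
# Automatically offload a topic's data to the configured cloud store
# once its BookKeeper footprint exceeds 10 GB
bin/pulsar-admin namespaces set-offload-threshold --size 10G my-tenant/my-namespace

# One-off: offload everything except the most recent 1 GB of this topic
bin/pulsar-admin topics offload --size-threshold 1G \
    persistent://my-tenant/my-namespace/my-topic
```

Reads of offloaded segments are transparent to consumers; only the tail of the topic stays on the bookies.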
Cross‑region replication provides four modes—Full Mesh, Active‑Standby, Aggregation, and Geo‑Replication—enabling data redundancy and high availability across data centers. Replicators use cursors and producers to copy topics between clusters.
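Geo-replication is enabled per namespace by listing the clusters that should hold copies (assuming both clusters are already registered with each other; names are placeholders):

```shell
# Replicate every topic in this namespace between us-west and us-east
bin/pulsar-admin namespaces set-clusters my-tenant/my-namespace \
    --clusters us-west,us-east
```

From then on, each cluster's replicator cursor tracks what has been shipped, so a temporarily unreachable region catches up automatically.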
Operational commands for managing isolation policies, bookie affinity groups, and namespace configurations are executed via the Pulsar admin CLI, for example:

```shell
bin/pulsar-admin ns-isolation-policy set \
    --auto-failover-policy-type min_available \
    --auto-failover-policy-params min_limit=1,usage_threshold=80 \
    --namespaces my-tenant/my-namespace \
    --primary 10.111.16.* \
    --secondary 10.111.17.* \
    my-cluster policy-name
```
Example producer code using schema:

```java
Producer<User> producer = client.newProducer(JSONSchema.of(User.class))
        .topic(topic)
        .create();
producer.send(user);
```
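A matching consumer sketch reads the same schema back as a typed object (the subscription name is a placeholder, and `client` and `topic` are assumed to exist as in the producer example):

```java
Consumer<User> consumer = client.newConsumer(JSONSchema.of(User.class))
        .topic(topic)
        .subscriptionName("my-subscription")
        .subscribe();

Message<User> msg = consumer.receive();
User user = msg.getValue();   // deserialized by the registered schema
consumer.acknowledge(msg);
```

The broker validates the schema at subscribe time, so an incompatible consumer fails fast instead of silently misreading payloads.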
Integrations include monitoring with Prometheus/Grafana, logging via Flume/Log4j/Logstash, Spark streaming, Kafka protocol compatibility via KoP (Kafka‑on‑Pulsar), and CDC connectors for Oracle, MySQL, and MongoDB.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.