Exploring Observability in Cloud‑Native Architecture: Practices from Ant Financial
This article reviews Ant Financial's cloud‑native observability journey, covering its origins, the three pillars of tracing, metrics and logging, community projects like OpenTelemetry, practical implementations, sampling strategies, and future directions for unified microservice, mesh, and serverless monitoring.
This article is a written recap of the SOFA Meetup #3 Guangzhou session titled “Ant Financial’s Exploration and Practice of Observability under Cloud‑Native Architecture”; the video and slides of the talk are also available.
As applications shift toward cloud‑native architectures, traditional monitoring can no longer satisfy operational requirements, prompting the introduction of the observability paradigm.
The discussion covers the origin of observability in cloud‑native environments, its driving forces, its relationship to monitoring, the three foundational pillars (tracing, metrics, logging), community developments, current product status, and Ant Financial’s own understanding and practice.
Observability became a buzzword in the second half of 2017, when Cindy Sridharan wrote about it on Medium. Matt Stine later listed it as one of six essential traits of cloud‑native architectures: modularity, observability, deployability, testability, disposability, and replaceability.
From a control‑theory perspective, observability is the degree to which a system’s internal state can be inferred from its external outputs. In cloud‑native contexts those outputs are called telemetry, which consists of tracing, metrics, and logging.
Unlike traditional monitoring that focuses on infrastructure, observability describes application behavior and requires developers to embed telemetry libraries (e.g., OpenTracing, OpenCensus) during development, aligning with DevOps and SRE principles.
The community projects OpenCensus and OpenTracing merged into OpenTelemetry in 2019. OpenTelemetry provides a vendor‑neutral specification and language‑specific libraries intended to unify the three pillars, though its logging support is still evolving.
Current open‑source and commercial products often address only one or two of the pillars, so a unified solution that handles tracing, metrics, and logging together is still missing, as is a single model covering microservices, service mesh, and serverless workloads.
Ant Financial’s practice is trace‑centric: tracing is the backbone that ties metrics and logging together. Through an SDK, users configure their logger (e.g., log4j) so that every log entry automatically includes the TraceId and RPCId. The trace view can then use those IDs to find the log entries that belong to a given trace.
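The trace-to-log correlation described above can be illustrated with a minimal Python sketch. It is not Ant Financial's SDK (the talk refers to Java loggers such as log4j); it only shows the general idea of stamping each log line with the current trace context. The names `trace_id_var`, `rpc_id_var`, `TraceContextFilter`, `build_logger`, `logs_for_trace`, and the ID values are all hypothetical.

```python
import contextvars
import io
import logging

# Hypothetical holders for the current trace context; a real tracing SDK
# would populate these when a request enters the service.
trace_id_var = contextvars.ContextVar("trace_id", default="-")
rpc_id_var = contextvars.ContextVar("rpc_id", default="-")


class TraceContextFilter(logging.Filter):
    """Copy the current TraceId and RPCId onto every log record."""

    def filter(self, record):
        record.trace_id = trace_id_var.get()
        record.rpc_id = rpc_id_var.get()
        return True  # never drop the record, only enrich it


def build_logger(stream):
    """Return a logger whose output lines carry [TraceId,RPCId]."""
    logger = logging.getLogger("order-service")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(TraceContextFilter())
    handler.setFormatter(
        logging.Formatter("%(levelname)s [%(trace_id)s,%(rpc_id)s] %(message)s")
    )
    logger.addHandler(handler)
    return logger


def logs_for_trace(log_text, trace_id):
    """Emulate the trace view's lookup: all log lines of one trace."""
    return [line for line in log_text.splitlines() if f"[{trace_id}," in line]


stream = io.StringIO()
logger = build_logger(stream)

trace_id_var.set("trace-001")
rpc_id_var.set("0.1")
logger.info("order created")

trace_id_var.set("trace-002")
rpc_id_var.set("0.1")
logger.info("order rejected")

matching = logs_for_trace(stream.getvalue(), "trace-001")
```

Because the IDs are attached by a filter rather than passed to each logging call, application code does not change; only the logger configuration does, which mirrors the configuration-only integration the talk describes.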
Topology and metric correlation are also supported: when an application uploads a trace, associated metrics are linked to the call graph, allowing users to drill down from topology to metric details.
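One way to picture the link between the call graph and its metrics is to derive both from the same spans. The sketch below is an assumption about how such a correlation could work, not a description of Ant Financial's system: the span fields (`span_id`, `parent_id`, `app`, `duration_ms`, `error`) and the function `build_topology` are hypothetical.

```python
from collections import defaultdict

# Hypothetical spans of one uploaded trace: gateway -> order -> payment.
spans = [
    {"span_id": "1", "parent_id": None, "app": "gateway", "duration_ms": 120, "error": False},
    {"span_id": "2", "parent_id": "1", "app": "order", "duration_ms": 80, "error": False},
    {"span_id": "3", "parent_id": "2", "app": "payment", "duration_ms": 50, "error": True},
]


def build_topology(spans):
    """Aggregate per-edge metrics, keyed by (caller app, callee app)."""
    by_id = {s["span_id"]: s for s in spans}
    edges = defaultdict(lambda: {"calls": 0, "errors": 0, "total_ms": 0})
    for span in spans:
        parent = by_id.get(span["parent_id"])
        if parent is None:
            continue  # root span: it has no caller, so no edge
        edge = edges[(parent["app"], span["app"])]
        edge["calls"] += 1
        edge["errors"] += int(span["error"])
        edge["total_ms"] += span["duration_ms"]
    return dict(edges)


topology = build_topology(spans)
# Drilling down from an edge of the topology to its metric details:
order_to_payment = topology[("order", "payment")]
```

Keying the metrics by edge is what makes the drill-down possible: selecting an edge in the topology view is a dictionary lookup into the metrics for that caller and callee.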
For sampling, Ant Financial adopts a tail‑based approach: all spans of a trace are buffered in memory, and once the trace completes, the system checks whether it contains errors or slow spans. If it does, the trace is persisted. Abnormal traces are therefore always captured, whereas the head‑based fixed‑rate sampling common in open‑source tools decides at the start of a request, before it is known whether the trace will turn out to be abnormal.
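The tail-based decision can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions, not the production implementation: the `Span` fields, the `TailSampler` class, and the 500 ms slow threshold are hypothetical, and a real system would also need timeouts and memory limits for traces that never complete.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float
    error: bool = False


class TailSampler:
    """Buffer spans per trace and decide whether to keep the trace at its end."""

    def __init__(self, slow_threshold_ms=500.0):
        self.slow_threshold_ms = slow_threshold_ms  # assumed threshold
        self._buffer = defaultdict(list)  # in-memory spans of open traces
        self.stored = {}  # stands in for persistent storage

    def add_span(self, span):
        self._buffer[span.trace_id].append(span)

    def finish_trace(self, trace_id):
        """Persist the trace only if any span failed or was slow."""
        spans = self._buffer.pop(trace_id, [])
        abnormal = any(
            s.error or s.duration_ms >= self.slow_threshold_ms for s in spans
        )
        if abnormal:
            self.stored[trace_id] = spans
        return abnormal


sampler = TailSampler(slow_threshold_ms=500.0)
sampler.add_span(Span("t1", "GET /orders", 40))
sampler.add_span(Span("t2", "GET /orders", 35))
sampler.add_span(Span("t2", "db.query", 12, error=True))
sampler.add_span(Span("t3", "POST /pay", 900))

decisions = {tid: sampler.finish_trace(tid) for tid in ("t1", "t2", "t3")}
```

Here `t1` is dropped while `t2` (an error) and `t3` (a slow span) are kept. A head-based sampler with a fixed rate would have made its choice before either problem occurred, so it could have discarded both.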
Future work aims to create a unified model that can manage classic microservices, service mesh, and serverless workloads together, reflecting the growing complexity of hybrid cloud‑native environments.
Video recordings, slides, and reference links are provided for further reading; for more details, see the referenced articles and the Ant Financial technology site.
AntTech
Technology is the core driver of Ant's future creation.