
How Pinterest Scaled Its Monitoring, Logging, and Tracing Over Seven Years

This article chronicles Pinterest's seven‑year evolution from a single‑machine time‑series monitor to a multi‑component system that integrates metrics, log search, and distributed tracing, sharing architectural choices, scaling challenges, and lessons learned for building reliable, high‑performance operations platforms.

Different business scenarios demand different operational systems. Pinterest, a Silicon Valley startup, has continuously refined and upgraded its monitoring stack, which now delivers an integrated solution for alerts, log search, and distributed tracing.

Introduction

Pinterest is a Silicon Valley startup that helps users discover creative ideas. Since its launch in 2010, monthly active users have grown to 190 million and its cloud platform now runs on tens of thousands of virtual machines.

Like the company’s growth, its monitoring platform has evolved from a single‑machine, time‑series monitor to a suite of subsystems capable of handling millions of data points per second. The following sections describe the seven‑year evolution of this system and share practical experience and lessons.

1. About Pinterest

Founded in 2010, Pinterest is the world’s largest image‑exploration engine, distinct from search engines like Baidu or Google. Users share ideas as images, which are collected into billions of pins saved across billions of boards, generating roughly 2 billion searches per month.

Unlike many social networks, Pinterest emphasizes exploration over social interaction, resulting in steady, sustainable growth.

1.1 Backend Architecture

The backend runs on more than 30,000 Amazon EC2 instances, hosting over 100 internally developed micro‑services and custom data‑storage platforms for search, content, and advertising.

1.2 Operations

Pinterest employs a dedicated SRE team of about a dozen engineers, plus embedded SREs within each product group. The goal is 99.9% availability (no more than about 43 minutes of downtime per month), which requires close collaboration between operations and development teams.

2. Monitoring System Composition and Evolution

The monitoring stack consists of three core tools:

Time‑series metrics and alerting: collects and visualizes system‑wide performance indicators and triggers alerts.

Log search: indexes and searches log data.

Distributed tracing: records end‑to‑end request flows across micro‑services, capturing timing and rich context.

Initially, a few SRE engineers built these tools. As user volume grew, Pinterest adopted open‑source solutions: Ganglia, then Graphite (which struggled with scaling), followed by OpenTSDB in 2014, a custom log‑search platform in 2015, and a home‑grown tracing system in 2016.

3. Metrics, Log Search, and Tracing

3.1 Time‑Series Metrics and Alerts

Data from thousands of VMs and applications is published to a Kafka‑based pub/sub system, then consumed, sampled, and stored in a disk‑based database.

The system currently processes millions of data points per second and serves roughly 35,000 queries per second, 90% of which originate from the alerting subsystem.
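The article does not show the consume‑and‑sample stage, but its effect can be sketched as a downsampling step that collapses raw points into per‑interval averages before they hit disk. This is a minimal illustration, not Pinterest's actual pipeline (which consumes from Kafka); the function name and bucket size are assumptions.

```python
import statistics
from collections import defaultdict

def downsample(points, bucket_seconds=60):
    """Collapse raw (timestamp, value) points into per-bucket averages.

    Hypothetical stand-in for the consume-and-sample stage: group each
    point by its time bucket, then average the values in each bucket.
    """
    buckets = defaultdict(list)
    for ts, value in points:
        buckets[ts - ts % bucket_seconds].append(value)
    return {bucket: statistics.fmean(values)
            for bucket, values in sorted(buckets.items())}

raw = [(0, 1.0), (30, 3.0), (61, 5.0)]
print(downsample(raw))  # {0: 2.0, 60: 5.0}
```

Averaging is only one rollup policy; a production store would typically also keep min/max/count per bucket so spikes are not averaged away.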

3.1.1 Use Case: Data Visualization

Engineers use a front‑end tool to define expressions and view metric trends over selected time ranges.

3.1.2 Use Case: Dashboards

Custom dashboards aggregate multiple charts for a holistic view of related metrics.

3.1.3 Use Case: Alerting

The alerting system lets users replay historical data against a threshold to fine‑tune alert conditions and integrates with JIRA for incident management.
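The replay feature described above can be approximated with a few lines: run a candidate threshold over historical samples and count how often it would have fired. The "fire after N consecutive breaches" rule and all names here are assumptions for illustration, not Pinterest's actual alert semantics.

```python
def replay_alerts(points, threshold, min_consecutive=3):
    """Count alerts a threshold would have fired on historical data.

    An alert fires once the metric exceeds `threshold` for
    `min_consecutive` samples in a row (a common debouncing rule).
    """
    alerts, streak = 0, 0
    for value in points:
        streak = streak + 1 if value > threshold else 0
        if streak == min_consecutive:
            alerts += 1
    return alerts

history = [10, 95, 96, 97, 20, 99, 99, 99, 99]
print(replay_alerts(history, threshold=90))  # 2
```

Tuning `threshold` and `min_consecutive` against replayed history is what lets engineers trade alert sensitivity against noise before the rule goes live.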

3.2 Pain Points, Challenges, and Solutions

Massive data volume: roughly 100 TB of data generated daily. Solution: early dimensionality reduction, sampling, and tiered storage for cold data.

Reliability: two monitoring systems watch each other, and active probing (heartbeat packets) complements passive checks.

Query latency: average chart load times of 0.5–5 seconds. Solution: sharding data by type (e.g., Java vs. Python) and by recency (hot data on SSD, older data on HDD).

Sharding by data type sends different language streams to separate clusters; sharding by time keeps recent data on SSD‑based clusters with replication, while older data resides on HDD clusters.
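The two sharding dimensions compose into a simple routing decision: data type picks the cluster family, age picks the storage tier. The cluster naming scheme and the seven‑day hot window below are assumptions; the article only says streams are split by runtime and by recency.

```python
import time

HOT_WINDOW_SECONDS = 7 * 24 * 3600  # assumed cutoff for "recent" data

def route(metric_type, timestamp, now=None):
    """Pick a storage cluster by data type and recency.

    Recent points go to an SSD-backed cluster for the given runtime
    (e.g. "java", "python"); older points go to the HDD tier.
    """
    now = time.time() if now is None else now
    tier = "ssd" if now - timestamp < HOT_WINDOW_SECONDS else "hdd"
    return f"{metric_type}-{tier}"

print(route("java", timestamp=0, now=10 * 24 * 3600))  # java-hdd
print(route("python", timestamp=0, now=3600))          # python-ssd
```

Keeping the routing rule this mechanical matters: both the write path and the query planner must agree on it, or queries will miss shards.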

We are also prototyping an in‑memory time‑series database that keeps the most recent 24 hours of data entirely in RAM, delivering order‑of‑magnitude query speedups.

3.3 Log Search

3.3.1 Log Search Tool

Logs from processes are streamed via Kafka, indexed, and made searchable through UI tools. Pinterest uses both Sumo Logic and Elasticsearch, handling 500–800 GB of logs per day.

Users can define alerting rules that trigger when log patterns match.

3.3.2 Log Standardization

Each programming language uses a standard API that enriches logs with file, line, process ID, request ID, and user context, outputting JSON for easy downstream processing.
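A standard API like the one described can be sketched as a wrapper that captures call‑site context automatically and emits one JSON object per event. The field names and function signature here are assumptions, not Pinterest's actual logging interface.

```python
import inspect
import json
import os

def log_event(message, request_id=None, user=None):
    """Emit one JSON log line enriched with call-site context.

    The caller's file and line are captured via the interpreter's
    stack, so application code only supplies the message and the
    request/user context.
    """
    caller = inspect.stack()[1]
    record = {
        "message": message,
        "file": os.path.basename(caller.filename),
        "line": caller.lineno,
        "pid": os.getpid(),
        "request_id": request_id,
        "user": user,
    }
    return json.dumps(record)

print(log_event("cache miss", request_id="req-42", user="u1"))
```

Because every language emits the same JSON shape, the downstream Kafka consumers and indexers need only one parser, which is the point of the standardization.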

3.4 Distributed Tracing System

3.4.1 Microservice Architecture

With over a hundred micro‑services, manual tracing is infeasible. The tracing system automatically records inter‑service calls, timestamps, and rich context.

3.4.2 Call Graphs

The collected data forms a waterfall‑style graph that shows the sequence, timing, and context of service calls.

3.4.3 Implementation

Libraries for Python and Java capture trace data, which is sent via Kafka to an Elasticsearch backend. The data supports performance analysis, bottleneck identification, and per‑service CPU usage estimation.
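The span‑capture side of such a library can be sketched as a context manager: each instrumented call records its service, operation, parent, and duration, and nesting spans is what produces the waterfall graph. The record fields and the in‑process `spans` list are assumptions; the real system ships these records through Kafka to Elasticsearch.

```python
import time
import uuid
from contextlib import contextmanager

spans = []  # stand-in for the Kafka -> Elasticsearch pipeline

@contextmanager
def span(service, operation, trace_id=None, parent_id=None):
    """Record one timed span; nested spans form a waterfall call graph."""
    record = {
        "trace_id": trace_id or uuid.uuid4().hex,
        "span_id": uuid.uuid4().hex,
        "parent_id": parent_id,
        "service": service,
        "operation": operation,
        "start": time.time(),
    }
    try:
        yield record
    finally:
        record["duration_ms"] = (time.time() - record["start"]) * 1000
        spans.append(record)

# A frontend request that fans out to the search service:
with span("frontend", "render") as root:
    with span("search", "query", trace_id=root["trace_id"],
              parent_id=root["span_id"]):
        pass

print(len(spans), spans[0]["service"])  # 2 search
```

Summing `duration_ms` per service across many traces is one way such data supports the per‑service CPU‑usage estimates the article mentions.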

3.4.4 One‑Stop Dashboard

Pinterest is building a unified dashboard that merges tracing results with metric data, providing a holistic view from front‑end servers to backend services, with visual cues (e.g., red for services below a 99.9% success rate).

4. Lessons Learned and Future Directions

4.1 Challenges and Countermeasures

4.1.1 Unpredictable Read/Write Patterns

Variable usage spikes make capacity planning difficult. Pinterest mitigates this by designing for rapid scaling and favoring redundancy to prioritize reliability.

4.1.2 90 % of Data Never Read

Most generated metrics are never queried, inflating cost. The team educates engineers on data volume and encourages pruning unnecessary instrumentation.

4.1.3 Adoption of New Tools

To improve usability, Pinterest invests in polished UI components (e.g., third‑party charting libraries) that make tools more approachable.

4.2 The Three Core Tools

4.2.1 Time‑Series Metrics

Low‑cost, high‑frequency counters provide a global health view but lack request‑level context.

4.2.2 Log Search

Rich, event‑level data with full context, but expensive to index at scale.

4.2.3 Distributed Tracing

End‑to‑end request flow with precise timing; sampling is limited to ~0.1% to control overhead.
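One common way to implement such a low sampling rate is head‑based sampling keyed on the trace id, so every service in a request makes the same keep/drop decision. This is a generic sketch under that assumption; the article does not say how Pinterest samples, and the hash choice (CRC‑32) is illustrative.

```python
import zlib

SAMPLE_RATE = 0.001  # ~0.1 %, as cited above

def should_sample(trace_id):
    """Deterministic sampling: hash the trace id into 0..999 and keep
    only ids that land below the rate threshold, so all services agree
    on the decision without coordination."""
    return (zlib.crc32(trace_id.encode()) % 1000) < SAMPLE_RATE * 1000

kept = sum(should_sample(f"trace-{i}") for i in range(100_000))
print(kept)
```

Because the decision is a pure function of the trace id, no sampling state needs to be propagated between services beyond the id itself.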

4.3 Future Work

4.3.1 Integration

Combine metrics, logs, and traces into a single interface, allowing engineers to correlate data by request ID, service, container, or VM without switching tools.

4.3.2 Intelligence

Apply machine‑learning techniques to de‑duplicate alerts, route incidents to the right on‑call team, and automatically surface root‑cause hypotheses.

Tags: monitoring, operations, SRE, distributed tracing, log search, time-series metrics
Written by Efficient Ops

Efficient Ops is a public account maintained by Xiaotianguo and friends that regularly publishes original technical articles on operations transformation.