
Evaluating the Use of mmap in Prometheus TSDB: Advantages, Disadvantages, and Performance Implications

This article examines mmap's historical origins and its performance benefits and drawbacks, then analyzes how Prometheus' time-series database employs memory-mapped files. It shows why mmap does not degrade Prometheus performance despite known kernel-level issues such as TLB misses and lock contention.

Aikesheng Open Source Community

Recently I read Andrew Pavlo's article warning developers against using mmap as a replacement for a DBMS-managed buffer pool built on explicit I/O (pread/pwrite). The argument prompted a closer look at how Prometheus' TSDB uses mmap and whether this design choice is problematic.

1. Advantages and Disadvantages of mmap

How mmap originated

In the late 1980s, when RAM was scarce, SunOS 4.0 introduced file mapping so the kernel could load library files once and share them across processes, avoiding duplicate loads.

Advantages of mmap

1. Avoid system calls

Mapping a file requires a single mmap(2) system call, after which reads are plain memory accesses. Each pread/pwrite call, by contrast, crosses the boundary between user and kernel space, and that transition can be costly, partly because of TLB flushes.

#include <unistd.h>
ssize_t read(int fd, void *buf, size_t count);

Modern CPUs use tagged TLB entries, which reduce the penalty of these switches, but the overhead is still noticeable in I/O-intensive workloads.
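To make the per-call cost concrete, here is a minimal Python sketch that reads a file with os.pread in fixed-size chunks and counts the calls. The function name read_in_chunks is illustrative only and does not come from Prometheus or any library.

```python
import os


def read_in_chunks(path, chunk_size):
    """Read a whole file with os.pread; return (data, number_of_pread_calls)."""
    fd = os.open(path, os.O_RDONLY)
    try:
        parts = []
        calls = 0
        offset = 0
        while True:
            # Every iteration is one system call, i.e. one user/kernel crossing.
            buf = os.pread(fd, chunk_size, offset)
            calls += 1
            if not buf:  # an empty result signals end of file
                break
            parts.append(buf)
            offset += len(buf)
        return b"".join(parts), calls
    finally:
        os.close(fd)
```

The number of calls grows with file size divided by chunk size, plus one final call that detects end of file. A mapping pays for one mmap(2) call up front and then relies on page faults instead.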

2. No user‑space buffer

Traditional I/O copies data from the kernel page cache to a user-space buffer (e.g., buf in read(2)). mmap eliminates this copy, which can improve throughput for large sequential reads.

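import (
    "os"

    "golang.org/x/sys/unix"
)

// MmapFile pairs an open file with its memory-mapped contents.
type MmapFile struct {
    f *os.File
    b []byte // memory-mapped data
}

// Bytes returns the mapped region as an ordinary byte slice.
func (f *MmapFile) Bytes() []byte { return f.b }

// mmap maps the first length bytes of f as a read-only, shared mapping.
func mmap(f *os.File, length int) ([]byte, error) {
    return unix.Mmap(int(f.Fd()), 0, length, unix.PROT_READ, unix.MAP_SHARED)
}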

Prometheus maps its TSDB files read‑only, avoiding the need for a writable user buffer.
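The same idea can be sketched in Python. The example below assumes a read-only mapping and uses memoryview so that selecting a byte range does not copy it into a new buffer. checksum_range is a hypothetical helper written for this illustration, and the plain byte sum stands in for a real checksum.

```python
import mmap


def checksum_range(path, start, end):
    """Sum the bytes in [start, end) of a file through a read-only mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # A memoryview exposes the mapped pages directly; slicing it
            # yields another view rather than a copied bytes object.
            with memoryview(mm) as view, view[start:end] as window:
                return sum(window)
```

Slicing the mmap object itself (mm[start:end]) would return a new bytes object, which reintroduces the copy that the mapping was meant to avoid.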

3. Ease of use

Prometheus accesses mapped files as ordinary byte slices. Example from prometheus/tsdb/index/index.go:

// code that shows how Bytes() may be used
hash := crc32.Checksum(w.symbolFile.Bytes()[w.toc.Symbols+4:hashPos], castagnoliTable)

Python can also use mmap directly:

import mmap
with open('example.txt', 'rb') as f:
    # Length 0 maps the whole file; the file must not be empty.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mmapped_file:
        print(mmapped_file[:10])

Disadvantages of mmap

When a mapped file is shared and writable, the OS decides when dirty pages reach disk, so a DBMS has to layer its own scheme (such as copy-on-write) on top to keep data consistent, which can be complex for DBMS workloads.

In Linux, three operations touch a process's address-space metadata: mmap, munmap, and page-fault handling. The mmap_lock (formerly mmap_sem) protects the mm_struct and its VMA list, and becomes a contention point under heavy mapping activity.

struct mm_struct {
    ...
    struct rw_semaphore mmap_lock; /* memory area semaphore */
    ...
};

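mmap and munmap take this lock in write mode, while page faults take it in read mode. In multithreaded programs that map and unmap frequently, the write lock blocks every other thread's faults and causes a significant slowdown.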
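Of the three operations, page faults are the one a read-only mapping triggers all the time. The following Python sketch, assuming a Unix system where the resource module is available, touches one byte per page of a mapped file and reports how many faults the process accumulated meanwhile. fault_count and touch_every_page are names invented for this illustration.

```python
import mmap
import resource


def fault_count():
    """Return the minor plus major page faults of this process so far."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    return usage.ru_minflt + usage.ru_majflt


def touch_every_page(path):
    """Read one byte from each page of a mapped file.

    Returns (pages_touched, faults_observed_while_touching).
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            before = fault_count()
            pages = 0
            for offset in range(0, len(mm), mmap.PAGESIZE):
                _ = mm[offset]  # first access to a page may fault it in
                pages += 1
            return pages, fault_count() - before
```

The fault count is typically lower than the page count, because the kernel may map several neighbouring pages while handling a single fault. The exact number depends on the kernel and on what is already in the page cache.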

1. Page‑table contention

Only one mm_struct exists per process, so many threads contend on mmap_lock, which serialises VMA modifications.

2. Single‑threaded page eviction

Linux's kswapd thread evicts pages from the cache. Because it is single-threaded (one kswapd per NUMA node), heavy mmap usage can be throttled by eviction latency.

3. TLB shoot‑downs

Changing page tables may require TLB invalidation across CPUs. Modern CPUs support tagged TLB entries, but shoot-downs still occur on munmap or when mapped pages are evicted.

2. How Prometheus Stores Data

Memory‑mapped contents

Running cat /proc/<pid>/maps on a machine with the umon component shows many read-only shared mappings under /opt/umon/prometheus-data. Each line corresponds to a vm_area_struct describing start/end addresses, file, offset, and permissions.
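As a rough illustration of how such lines can be read, the Python sketch below parses one /proc/<pid>/maps entry into its fields. The sample line is hypothetical (the addresses, device, inode, and the EXAMPLE_BLOCK path component are made up), and Mapping and parse_maps_line are illustrative names.

```python
from dataclasses import dataclass

# Hypothetical sample entry; real output will differ on every machine.
SAMPLE = ("7f3a5c000000-7f3a5c800000 r--s 00000000 fd:01 1311234 "
          "/opt/umon/prometheus-data/EXAMPLE_BLOCK/chunks/000001")


@dataclass
class Mapping:
    start: int
    end: int
    perms: str   # e.g. "r--s": readable, not writable/executable, shared
    offset: int
    path: str


def parse_maps_line(line):
    """Parse 'start-end perms offset dev inode [path]' into a Mapping."""
    fields = line.split(None, 5)
    start, end = (int(part, 16) for part in fields[0].split("-"))
    path = fields[5].strip() if len(fields) > 5 else ""
    return Mapping(start, end, fields[1], int(fields[2], 16), path)


def is_readonly_shared(mapping):
    """True for mappings that are readable, not writable, and shared."""
    return mapping.perms.startswith("r-") and mapping.perms.endswith("s")


if __name__ == "__main__":
    m = parse_maps_line(SAMPLE)
    print(m.path, (m.end - m.start) // 1024, "KiB", is_readonly_shared(m))
```

Filtering the real file for entries where is_readonly_shared is true and the path lies under the data directory gives the set of files that Prometheus currently has mapped.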

Prometheus TSDB model

The TSDB stores series as time-ordered samples. Incoming samples are recorded in a write-ahead log (WAL) and appended to an in-memory head chunk (≈32 KB); the head is eventually compacted into immutable block files.

Series_A -> (t0,A0), (t1,A1), ...
Series_B -> (t0,B0), (t1,B1), ...
Series_C -> (t0,C0), (t1,C1), ...

Because each chunk holds at most 120 samples (≈30 minutes of data at a 15-second scrape interval) and the head can hold up to 3 hours, Prometheus keeps at most six head chunks per series, with only one actively writable.
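The arithmetic behind these figures is short enough to check directly. The sketch below assumes a 15-second scrape interval, which is what makes 120 samples span about 30 minutes. The constant and function names are illustrative.

```python
SAMPLES_PER_CHUNK = 120        # upper bound per chunk, from the text above
SCRAPE_INTERVAL_S = 15         # assumption: one sample every 15 seconds
HEAD_SPAN_S = 3 * 60 * 60      # the head covers up to 3 hours


def chunk_span_seconds(samples, interval_s):
    """Time covered by one full chunk."""
    return samples * interval_s


def max_head_chunks(head_span_s, chunk_span_s):
    """How many full chunks fit into the head's time span."""
    return head_span_s // chunk_span_s


if __name__ == "__main__":
    span = chunk_span_seconds(SAMPLES_PER_CHUNK, SCRAPE_INTERVAL_S)
    print(span // 60, "minutes per chunk")           # 30 minutes per chunk
    print(max_head_chunks(HEAD_SPAN_S, span), "chunks")  # 6 chunks
```

With a different scrape interval the chunk span changes proportionally, so the count of six applies only under the 15-second assumption.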

WAL and checkpoints

Before persisting data, Prometheus records operations in a WAL. Periodically a checkpoint is created, after which obsolete WAL segments are truncated.

// compactable returns whether the head has a compactable range.
func (h *Head) compactable() bool {
    return h.MaxTime()-h.MinTime() > h.chunkRange.Load()/2*3
}
// truncateWAL removes old data before mint from the WAL.
func (h *Head) truncateWAL(mint int64) error {
    first, last, err := wlog.Segments(h.wal.Dir())
    // ...
    last = first + (last-first)*2/3
    if last <= first { return nil }
    // ...
}

The checkpoint covers roughly the oldest two-thirds of the WAL segments, which can then be removed, so that only the most recent third remains as raw WAL for recovery.
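The integer arithmetic in the two Go functions can be traced with a small Python model. This is a sketch under stated assumptions, not the Prometheus implementation: it assumes a 2-hour chunkRange in milliseconds, it assumes that segments up to and including the computed bound are the ones folded into the checkpoint, and it does not model the elided parts of truncateWAL. All Python names are illustrative.

```python
TWO_HOURS_MS = 2 * 60 * 60 * 1000  # assumption: chunkRange of 2 hours


def compactable(min_time_ms, max_time_ms, chunk_range_ms=TWO_HOURS_MS):
    """Model of `MaxTime - MinTime > chunkRange/2*3` (integer division)."""
    return max_time_ms - min_time_ms > chunk_range_ms // 2 * 3


def checkpoint_upper_bound(first, last):
    """Model of `last = first + (last-first)*2/3` (integer division)."""
    return first + (last - first) * 2 // 3


def split_segments(first, last):
    """Return (segments covered by the checkpoint, segments left as WAL)."""
    upper = checkpoint_upper_bound(first, last)
    if upper <= first:  # mirrors the early return in the Go code
        return [], list(range(first, last + 1))
    return list(range(first, upper + 1)), list(range(upper + 1, last + 1))


if __name__ == "__main__":
    print(split_segments(0, 9))  # ([0, 1, 2, 3, 4, 5, 6], [7, 8, 9])
```

With a 2-hour chunkRange the compaction threshold works out to 3 hours, which matches the head span mentioned earlier. With only two segments the bound does not move past the first one, so no checkpoint is taken.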

3. Conclusion

mmap is an old mechanism, designed at a time when memory was scarce and workloads were small. Modern hardware mitigates many of its classic penalties, but page-table contention, single-threaded eviction, and TLB shoot-downs can still make raw mmap I/O slower than direct I/O in benchmarks.

Prometheus' TSDB, however, buffers writes in an in‑memory head chunk and only uses mmap for read‑only block files, so the mmap‑related drawbacks highlighted by Pavlo do not materially affect Prometheus performance.

Contrary to some claims, VictoriaMetrics also relies on mmap for file I/O, which suggests that mmap remains a viable strategy for many time-series databases.


