
PetPS: A Persistent‑Memory Parameter Server for Large‑Scale Embedding Models

PetPS is a persistent‑memory‑based parameter server that redesigns indexing around the PetHash hash table and offloads parameter aggregation to the NIC through NIC Gathering, achieving up to 1.7× higher throughput and significantly lower latency for industrial‑scale embedding models in recommendation, search, and advertising workloads.

Kuaishou Tech

Embedding models are widely used in industry for recommendation, advertising, and search. They convert high‑dimensional sparse ID features into low‑dimensional dense vectors via large embedding tables. As model sizes grow to the trillion‑parameter level, traditional DRAM‑based parameter servers become costly and suffer from long recovery times.
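To make the lookup concrete, here is a minimal sketch of an embedding table and a rough size estimate. The dimension, the sample IDs, the helper names, and the 4‑byte (float32) parameter width are all assumptions chosen for illustration; none of them come from PetPS itself.

```python
import random

EMBEDDING_DIM = 4  # illustrative only; production tables use larger dimensions


def make_table(ids, dim=EMBEDDING_DIM, seed=0):
    """Build a toy embedding table mapping sparse IDs to dense vectors."""
    rng = random.Random(seed)
    return {i: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for i in ids}


def lookup(table, ids, dim=EMBEDDING_DIM):
    """Return one dense vector per requested ID (zeros for unknown IDs)."""
    zero = [0.0] * dim
    return [table.get(i, zero) for i in ids]


def table_bytes(num_params, bytes_per_param=4):
    """Raw storage for the parameters alone, assuming float32 values."""
    return num_params * bytes_per_param


table = make_table([42, 10_000_019, 987_654_321])  # hypothetical sparse IDs
vectors = lookup(table, [42, 987_654_321, 7])      # ID 7 is not in the table
size_tb = table_bytes(10**12) / 10**12             # trillion parameters, in TB
```

Under the float32 assumption, a trillion parameters need about 4 TB before any index overhead, which gives a sense of why keeping everything in DRAM is expensive.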

To address these challenges, the authors collected trace data from Kuaishou's online inference service and identified three key load characteristics: read‑intensive access, stable capacity load, and batch processing of thousands of IDs per request.

The proposed system, PetPS, builds on these insights with two main innovations: (1) PetHash, a persistent‑memory‑optimized hash index featuring a single‑layer structure, hotspot‑aware migration, and prefetching to minimize PM reads; and (2) NIC Gathering, which offloads the aggregation of embedding parameters to the network interface card using scatter‑gather DMA, reducing CPU involvement.

PetHash employs a single‑layer bucket layout with open addressing, storing metadata such as fingerprints, version numbers, and overflow counters. A dedicated migration thread moves hot key‑value pairs to their home buckets, while a prefetch mechanism issues fetch instructions for the next bucket during batch processing, effectively hiding PM latency.
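The bucket layout and hot‑key migration can be sketched in a few dozen lines. This is a toy in‑memory model, not PetHash's actual layout: the slot count, the 8‑bit fingerprint, the use of the version field as a change counter, the "evict slot 0" policy, and every class and method name are assumptions. Migration runs synchronously here rather than in a dedicated thread, and prefetching is omitted because Python has no prefetch instruction.

```python
SLOTS = 4  # slots per bucket; illustrative, not the real PetHash value


class Bucket:
    def __init__(self):
        self.fps = [None] * SLOTS    # 8-bit fingerprints, None = empty slot
        self.slots = [None] * SLOTS  # (key, value) pairs
        self.version = 0             # assumed use: bumped on every change
        self.overflow = 0            # keys homed here but stored elsewhere

    def find(self, key, fp):
        for i in range(SLOTS):
            # Check the cheap fingerprint before comparing the full key.
            if self.fps[i] == fp and self.slots[i][0] == key:
                return i
        return None

    def free_slot(self):
        return next((i for i in range(SLOTS) if self.slots[i] is None), None)

    def store(self, i, fp, pair):
        self.fps[i], self.slots[i] = fp, pair
        self.version += 1


class ToyHash:
    def __init__(self, num_buckets, hash_fn=hash):
        self.buckets = [Bucket() for _ in range(num_buckets)]
        self.hash_fn = hash_fn

    def _home_fp(self, key):
        h = self.hash_fn(key)
        return h % len(self.buckets), (h >> 56) & 0xFF

    def _locate(self, key):
        """Return (bucket, slot, buckets_read); bucket is None on a miss."""
        home, fp = self._home_fp(key)
        n = len(self.buckets)
        for step in range(n):
            b = (home + step) % n
            slot = self.buckets[b].find(key, fp)
            if slot is not None:
                return b, slot, step + 1
            if step == 0 and self.buckets[home].overflow == 0:
                break  # nothing homed here lives elsewhere: definite miss
        return None, None, step + 1

    def put(self, key, value):
        home, fp = self._home_fp(key)
        b, slot, _ = self._locate(key)
        if b is not None:  # key already present: update in place
            self.buckets[b].store(slot, fp, (key, value))
            return
        n = len(self.buckets)
        for step in range(n):  # open addressing: probe linearly from home
            b = (home + step) % n
            slot = self.buckets[b].free_slot()
            if slot is not None:
                self.buckets[b].store(slot, fp, (key, value))
                if step > 0:
                    self.buckets[home].overflow += 1
                return
        raise RuntimeError("table full")

    def get(self, key):
        b, slot, reads = self._locate(key)
        return (None if b is None else self.buckets[b].slots[slot][1]), reads

    def migrate_hot(self, key):
        """Move a hot key into its home bucket, swapping out a resident if full."""
        home, fp = self._home_fp(key)
        b, slot, _ = self._locate(key)
        if b is None or b == home:
            return False
        src, dst = self.buckets[b], self.buckets[home]
        pair = src.slots[slot]
        free = dst.free_slot()
        if free is not None:
            src.store(slot, None, None)  # leave a hole where the key was
        else:
            free = 0  # toy policy; a real system would pick a cold entry
            victim_fp, victim = dst.fps[0], dst.slots[0]
            src.store(slot, victim_fp, victim)
            victim_home = self._home_fp(victim[0])[0]
            if victim_home == home:
                dst.overflow += 1  # victim now lives outside its home
            elif victim_home == b:
                src.overflow -= 1  # victim happened to land back home
        dst.store(free, fp, pair)
        dst.overflow -= 1
        return True


t = ToyHash(2, hash_fn=lambda k: k)  # identity hash keeps the demo deterministic
for k in (0, 2, 4, 6, 8):            # all five keys call bucket 0 home
    t.put(k, f"v{k}")
before = t.get(8)  # key 8 overflowed into bucket 1
t.migrate_hot(8)   # pull it home, swapping out key 0
after = t.get(8)
```

In this demo the hot key costs two bucket reads before migration and one afterwards, while the evicted key now costs two. That trade is only worthwhile when access frequency is skewed, which is the premise of hotspot‑aware migration.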

NIC Gathering leverages the NIC's scatter‑gather DMA capability to collect parameters directly from PM, eliminating costly CPU reads and cache misses. The system ensures DMA safety using copy‑on‑write and an epoch‑list reclamation scheme.
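The interplay between gathering, copy‑on‑write, and epoch‑based reclamation can be illustrated with a small simulation. A `bytearray` stands in for persistent memory and a Python function stands in for the NIC's DMA engine; no real RDMA or NIC API is used, and all class and function names (`ToyPM`, `EpochReclaimer`, `nic_gather`, and so on) are hypothetical. The epoch bookkeeping shown is one simple way to delay frees and may differ from the scheme PetPS uses.

```python
from collections import deque


class ToyPM:
    """A byte array standing in for a persistent-memory region."""

    def __init__(self, size):
        self.mem = bytearray(size)
        self.next_free = 0

    def alloc_write(self, data):
        off = self.next_free
        self.mem[off:off + len(data)] = data
        self.next_free += len(data)
        return off


class EpochReclaimer:
    def __init__(self):
        self.epoch = 0
        self.in_flight = {}     # epoch -> requests still being gathered
        self.retired = deque()  # (retire_epoch, offset, length)

    def begin_request(self):
        self.in_flight[self.epoch] = self.in_flight.get(self.epoch, 0) + 1
        return self.epoch

    def end_request(self, epoch):
        self.in_flight[epoch] -= 1
        if self.in_flight[epoch] == 0:
            del self.in_flight[epoch]

    def retire(self, offset, length):
        # Requests that began in the current epoch may still read this region.
        self.retired.append((self.epoch, offset, length))
        self.epoch += 1

    def reclaim(self):
        """Free regions retired before the oldest in-flight request began."""
        oldest = min(self.in_flight, default=self.epoch)
        freed = []
        while self.retired and self.retired[0][0] < oldest:
            _, off, length = self.retired.popleft()
            freed.append((off, length))
        return freed


def build_sg_list(index, ids):
    """CPU side: emit (offset, length) descriptors without reading values."""
    return [index[i] for i in ids]


def nic_gather(pm, sg_list):
    """Stand-in for scatter-gather DMA: read each region and concatenate."""
    view = memoryview(pm.mem)
    return b"".join(bytes(view[off:off + n]) for off, n in sg_list)


def cow_update(pm, index, reclaimer, key, new_value):
    old_off, old_len = index[key]
    new_off = pm.alloc_write(new_value)     # write the new copy elsewhere
    index[key] = (new_off, len(new_value))  # then switch the index entry
    reclaimer.retire(old_off, old_len)      # old copy stays readable for now


pm, index, rec = ToyPM(1024), {}, EpochReclaimer()
for key, val in [(1, b"AAAA"), (2, b"BBBB"), (3, b"CCCC")]:
    index[key] = (pm.alloc_write(val), len(val))

e = rec.begin_request()
sg = build_sg_list(index, [3, 1])       # descriptors handed to the "NIC"
cow_update(pm, index, rec, 1, b"ZZZZ")  # an update races with the request
early = rec.reclaim()                   # nothing may be freed yet
payload = nic_gather(pm, sg)            # still sees the old, intact copy
rec.end_request(e)
freed = rec.reclaim()                   # the old copy can now be freed
fresh = nic_gather(pm, build_sg_list(index, [1]))
```

The point of the demo is ordering: the descriptor list was built before the update, so the old bytes must stay valid until that request completes. Copy‑on‑write guarantees they are never overwritten in place, and the epoch list guarantees they are not freed too early.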

Experimental evaluation on Intel Optane DC PM and real production workloads from Kuaishou shows that PetPS achieves 1.3‑1.7× higher peak throughput and reduces median and P99 latencies by up to 5× compared to baseline parameter servers (PSLite, DashPS, KuaiPS). PetHash improves index throughput by 1.3‑2.5×, and NIC Gathering cuts aggregation time from 180 µs to 14 µs, yielding up to 1.2× end‑to‑end throughput gains.
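A quick back‑of‑the‑envelope check relates the two NIC Gathering numbers. The sketch below applies Amdahl's law, which is an interpretive assumption added here and not part of the reported evaluation; it also assumes throughput scales inversely with per‑request time. The function names are illustrative.

```python
def speedup(before, after):
    """Ratio of old to new time for the accelerated step."""
    return before / after


def amdahl(fraction, local_speedup):
    """Overall speedup when only `fraction` of the time is accelerated."""
    return 1.0 / ((1.0 - fraction) + fraction / local_speedup)


def fraction_for(overall, local_speedup):
    """Invert Amdahl's law: the share of time that was accelerated."""
    return (1.0 - 1.0 / overall) / (1.0 - 1.0 / local_speedup)


agg = speedup(180.0, 14.0)      # aggregation step: roughly 12.9x faster
share = fraction_for(1.2, agg)  # roughly 0.18 under the stated assumptions
check = amdahl(share, agg)      # recovers the 1.2x end-to-end figure
```

Read this way, a 1.2× end‑to‑end gain from a roughly 12.9× faster aggregation step would be consistent with aggregation taking on the order of 18% of request time in the best case. This is an inference from the reported figures, not a measured value.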

In summary, PetPS is the first industry‑grade persistent‑memory parameter server, demonstrating that tailored indexing and NIC‑offloaded aggregation can effectively mitigate PM read latency and CPU bottlenecks for massive embedding models, while also reducing hardware cost by about 30% without sacrificing performance.
