
Reading Notes on OSDI20 Best Paper hXDP: Efficient Software Packet Processing on FPGA NICs

This article reviews the OSDI20 best paper on hXDP, explaining why combining XDP with FPGA NICs can alleviate CPU bottlenecks, describing the challenges of eBPF offload to low‑frequency FPGA, and summarizing the custom instruction‑set optimizations and test results that achieve comparable performance to multi‑GHz CPUs.

Cloud Native Technology Community

This article is a set of reading notes on the OSDI20 best paper "hXDP: Efficient Software Packet Processing on FPGA NICs". Interested readers can consult the original paper, slides, and video.

Why combine the two?

The motivation is to address two limitations at once: XDP running on CPUs is hitting a wall, because the slowdown of Moore's law leaves CPU cycles unable to keep up with high-speed NIC bandwidth, while FPGA NICs remain hard to program because they require hardware description languages. Offloading XDP to the FPGA reduces latency, frees CPU cycles for application work, and exploits the FPGA's parallelism.

Challenges

Although an eBPF-compatible IP core can be implemented on an FPGA, the FPGA's low clock frequency (~150 MHz vs. 2–3 GHz for CPUs) creates performance problems: eBPF's sequential instruction set does not parallelize well, the lower frequency increases per-instruction latency, and limited FPGA resources constrain overall throughput.

Solutions

The paper proposes two main optimization directions; these notes focus on the instruction-set extensions:

Remove zero-initialization operations, since the hardware can guarantee a zeroed initial state.

Replace two-operand eBPF instructions with three-operand forms to reduce instruction count (e.g., combine r4=r2; r4+=14 into a single r4=r2+14).

Remove array bounds checks, delegating safety to hardware.

Add 6-byte load/store instructions to match the 6-byte MAC addresses in the Ethernet header.

Parameterize return actions, e.g., collapse r0=1; exit; into a single exit_drop instruction.

Test Results

The authors evaluated several XDP examples, including a simple firewall and Facebook's Katran load balancer, comparing the FPGA implementation against CPU cores running at 1.2 GHz, 2.1 GHz, and 3.7 GHz. Key findings:

Instruction‑set optimizations reduce instruction count by roughly 40%.

FPGA latency is about one‑tenth of CPU latency due to reduced data‑transfer overhead.

Throughput on FPGA matches a 2.1 GHz single‑core CPU for XDP examples.

For firewall and Katran tests, FPGA throughput falls between 2.1 GHz and 3.7 GHz CPU performance.

Personal Thoughts

The custom instruction‑set approach offers a novel perspective for software developers and could inspire optimizations for eBPF JIT on x86 or ASIC NICs.

Although the FPGA + XDP combo reaches CPU‑level performance, it does not surpass pure FPGA solutions that achieve several times higher throughput, limiting its attractiveness.

Given eBPF’s general‑purpose design, extending it with network‑specific instructions might also benefit x86 implementations.
