
JDDLB Architecture and QAT SSL/TLS Hardware Acceleration Optimization

This article details the overall architecture of JD.com Data Science's JDDLB load balancer, its high‑performance and high‑availability features, and presents a comprehensive performance comparison of SSL/TLS offloading using Intel QAT acceleration cards, including async processing, user‑space driver zero‑copy implementation, crash analysis, and process‑level engine scheduling.


JD.com Data Science's JDDLB serves as the primary public‑traffic entry point, replacing commercial F5 devices and supporting massive concurrent connections with a four‑layer SLB, custom NGINX extensions, and a unified management platform.

Key capabilities:
- High performance: millions of concurrent connections
- High availability: ECMP, session synchronization, health checks
- Extensibility: layer-4/7 clustering, horizontal scaling, gray release
- Layer-4 load balancing: VIP announcement via OSPF, multiple balancing algorithms, FullNAT
- Layer-7 load balancing: domain/URL routing, various algorithms
- SSL/TLS management: certificates, SNI, hardware offload
- Traffic control: SYN-Flood protection, WAF, QoS, ACL
- Comprehensive monitoring and alerting

The article focuses on accelerating the layer-7 SSL/TLS path by offloading CPU-intensive cryptographic operations to Intel QAT cards. Test methodology: deploy NGINX built with QAT support, generate TLS handshake traffic until CPU utilization reaches 100%, and measure the sustained new-connection rate.

Performance results: with a single QAT card, RSA handshake rates increased ~3× and ECDH ~2.5×; with dual cards, RSA improved ~6× and ECDH ~5.5× compared to pure software encryption.

Performance gains stem from zero‑copy user‑space driver (UIO + mmap) eliminating kernel‑to‑user copies, asynchronous OpenSSL integration, and support for multiple concurrent QAT cards.

Asynchronous framework: NGINX's native epoll loop is extended to handle the async file descriptors exposed by the QAT engine. The flow is: (1) NGINX starts the crypto operation inside ASYNC_start_job; (2) the QAT engine submits the RSA/ECDH request to the hardware and pauses the job, returning control to the event loop; (3) epoll monitors the engine's async file descriptor; (4) when the hardware completes, the QAT driver posts an event on that descriptor and NGINX wakes the job to finish the handshake.
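The pause/resume pattern above can be sketched without any QAT hardware. In this minimal simulation, a generator stands in for an OpenSSL ASYNC job, a pipe stands in for the engine's async file descriptor, and a background thread plays the role of the accelerator; `selectors` wraps epoll on Linux. All names here are illustrative, not the real NGINX/QAT APIs.

```python
import os
import threading
import time
import selectors

sel = selectors.DefaultSelector()  # wraps epoll on Linux


def fake_qat_offload(fd, result_box):
    """Simulate the QAT card: compute off the main thread, then signal the async fd."""
    def work():
        time.sleep(0.01)                          # pretend hardware latency
        result_box.append(pow(7, 65537, 9973))    # stand-in for an RSA operation
        os.write(fd, b"\x01")                     # driver posts the completion event
    threading.Thread(target=work).start()


def handshake_job():
    """Generator standing in for an async crypto job."""
    r, w = os.pipe()                  # stand-in for the engine's async fd
    result = []
    fake_qat_offload(w, result)
    yield r                           # "pause the job": hand the fd to the loop
    os.read(r, 1)                     # resumed: drain the completion event
    os.close(r)
    os.close(w)
    yield ("done", result[0])         # job finishes with the crypto result


def event_loop(job):
    fd = next(job)                    # start the job; it pauses on the async fd
    sel.register(fd, selectors.EVENT_READ, job)
    for key, _ in sel.select(timeout=1):   # epoll_wait equivalent
        sel.unregister(key.fd)
        return next(key.data)              # wake the paused job


status, value = event_loop(handshake_job())
print(status, value)
```

The key property this models: the event loop never blocks on the crypto operation, so one worker can keep many handshakes in flight while the card works.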

The user‑space driver uses UIO to map device registers into user space and employs a custom usdm kernel module for page‑table management, enabling true zero‑copy between kernel, user, and hardware memory.
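The register-mapping idea can be illustrated with Python's `mmap` module. With real UIO, the process opens `/dev/uioX` and selects mapping N by passing `offset = N * PAGESIZE` to mmap; here a temp file stands in for the device BAR, so this is a sketch of the access pattern only, not a working driver.

```python
import mmap
import os
import struct
import tempfile

PAGE = mmap.PAGESIZE

# Stand-in for /dev/uio0: a temp file plays the role of the device's register BAR.
fd, path = tempfile.mkstemp()
os.ftruncate(fd, PAGE)

# Map the "registers" into user space. With real UIO the same call targets the
# device file, and each mapping is selected by offset = N * PAGESIZE.
regs = mmap.mmap(fd, PAGE, access=mmap.ACCESS_WRITE)

# "Ring a doorbell": write a 32-bit value straight into the mapped region;
# against a real device this lands in a CSR with no kernel copy in between.
struct.pack_into("<I", regs, 0, 0xDEADBEEF)
doorbell = struct.unpack_from("<I", regs, 0)[0]
print(hex(doorbell))  # → 0xdeadbeef

regs.close()
os.close(fd)
os.unlink(path)
```

Because the store goes directly through the mapping, no read()/write() syscall and no kernel-to-user copy sits between the application and the hardware, which is the zero-copy property the article describes.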

Crash analysis revealed that mixing two memory allocation methods (large‑page pre‑allocation and ioctl‑based 4 KB pages) caused mismatched CSR physical addresses, leading to kernel crashes during message‑queue release. The fix involved using a single allocation strategy per process.
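The fix can be expressed as a small process-level guard: whichever allocation strategy is used first becomes the pinned one, and any attempt to mix in the other is rejected before it can hand the driver inconsistent physical addresses. This class and its names are a conceptual sketch, not part of the QAT driver.

```python
class QatMemPolicy:
    """Pin one memory allocation strategy per process.

    Mixing huge-page pre-allocation with ioctl-backed 4 KB pages gave the
    driver mismatched CSR physical addresses, so the first strategy used
    wins and any other one raises instead of corrupting the message queues.
    """
    _chosen = None

    @classmethod
    def allocate(cls, strategy, size):
        if cls._chosen is None:
            cls._chosen = strategy
        elif cls._chosen != strategy:
            raise RuntimeError(
                f"process already uses {cls._chosen!r}; refusing {strategy!r}")
        return bytearray(size)  # stand-in for the real DMA-able buffer


buf1 = QatMemPolicy.allocate("hugepage", 4096)   # first use pins the strategy
try:
    QatMemPolicy.allocate("ioctl_4k", 4096)      # mixing is rejected
except RuntimeError as e:
    print("rejected:", e)
```

Failing loudly in user space is cheap; the crash the article describes happened much later, at message-queue release time inside the kernel, where the root cause was far harder to see.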

Process-level engine scheduling initializes QAT per worker process: each worker registers the application via /dev/qat_dev_processes, creates its service instances, and the engine is shared across forked workers.
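The per-worker initialization can be sketched as a lazy, pid-keyed singleton: a handle created in the master is never reused after fork, and each worker builds its own engine on first use. `get_qat_engine` and the instance names are hypothetical stand-ins for the real registration and instance-creation calls.

```python
import os

_ENGINE = None
_ENGINE_PID = None


def get_qat_engine():
    """Return this process's engine handle, initializing lazily.

    If the cached handle was created in a different process (i.e. before
    fork), it is discarded and rebuilt, mirroring the rule that each
    worker registers itself with the driver after fork.
    """
    global _ENGINE, _ENGINE_PID
    pid = os.getpid()
    if _ENGINE is None or _ENGINE_PID != pid:
        # Stand-in for opening /dev/qat_dev_processes and creating
        # per-process crypto service instances.
        _ENGINE = {"pid": pid, "instances": ["cy0", "cy1"]}
        _ENGINE_PID = pid
    return _ENGINE


e1 = get_qat_engine()
e2 = get_qat_engine()
print(e1 is e2, e1["pid"] == os.getpid())  # → True True
```

The pid check is what makes the pattern fork-safe: a child inheriting the parent's module state still gets a fresh engine bound to its own pid on first call.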

The QAT component stack consists of the Application layer (async patches and QAT engine), SAL (service access layer) providing crypto/compression services, and ADF (acceleration driver framework) including intel_qat.ko, 8950pci driver, and usdm.

In conclusion, JDDLB now supports both Freescale and Intel QAT hardware acceleration, offering a financial-grade, secure, and high-performance gateway solution, with future plans for QUIC, MQTT, and broader protocol support.

Tags: Performance Optimization · load balancing · nginx · Hardware Offload · QAT · SSL/TLS acceleration · user-space driver
Written by JD Tech Talk
Official JD Tech public account delivering best practices and technology innovation.