Optimizing BERT Online Service Deployment at 360 Search
This article describes the challenges of deploying a large BERT model as an online service for 360 Search, where a deep, parameter-heavy model faces severe latency and throughput constraints. It details the engineering optimizations, including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling, that dramatically improved latency, throughput, and resource utilization.
Background
In the 360 Search scenario, online BERT services must meet extremely low latency and high throughput requirements. Three main difficulties are identified: massive model parameters (over 100 M for a 12‑layer BERT), long inference time (≈200 ms on CPU, ≈80 ms on unoptimized GPU), and high computational resource demand (hundreds of GPUs needed for full traffic).
After evaluating several inference frameworks (TF‑Serving, ONNX Runtime, TorchJIT, TensorRT) on criteria such as quantization support, preprocessing needs, variable‑length handling, stability, performance, and community activity, Nvidia’s open‑source TensorRT was chosen for further optimization.
Framework‑Level Optimizations
Layer and tensor fusion to reduce kernel calls and improve GPU utilization.
Automatic kernel tuning that selects optimal layers and parallel algorithms for the target GPU.
Multi‑stream execution that shares weights across streams to increase concurrency and memory efficiency.
Dynamic tensor memory allocation, allocating GPU memory only when tensors are needed.
Model quantization, which boosts throughput and reduces latency while preserving accuracy.
Knowledge Distillation
The 12‑layer BERT model was distilled into a lightweight 6‑layer version, achieving 99 % of the original accuracy while halving computational cost; the 6‑layer model’s TP99 latency is roughly half that of the 12‑layer model.
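The article does not give the exact distillation objective, so the following is only a sketch of a common choice: a temperature-scaled soft-target cross-entropy in the style of Hinton-style distillation. The temperature value and the T² scaling are assumptions, not details from the 360 deployment.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the
    # teacher's distribution so "dark knowledge" in small logits survives.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between teacher soft targets and student predictions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    ce = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    return ce * temperature ** 2
```

In practice this soft loss is usually mixed with the ordinary hard-label loss; the 6-layer student then learns to mimic the 12-layer teacher's output distribution.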
FP16 Quantization
By converting most tensors from FP32 to FP16 during inference (no back‑propagation needed), latency dropped to one‑third and throughput tripled, with memory usage halved; the minor precision loss at the ten‑thousandth level had negligible impact on search quality.
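The ten-thousandth-level precision loss is easy to reproduce without a GPU: Python's struct module can round-trip a value through IEEE 754 half precision (the 'e' format), which stands in for what happens to each tensor element under FP16 inference. The sample scores below are illustrative, not values from the article.

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE 754 half precision
    # ('e' format, supported by the struct module since Python 3.6).
    return struct.unpack('e', struct.pack('e', x))[0]

# Illustrative relevance scores; half precision near 1.0 has a step of
# about 5e-4, so the rounding error stays at the ten-thousandth level.
scores = [0.93187, 0.10452, 0.87731]
errors = [abs(s - to_fp16(s)) for s in scores]
```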
Stream Optimization
Profiling revealed that GPU utilization plateaued at ~80 % because data transfer (H2D/D2H) left the GPU idle. Adding an extra stream allowed overlapping kernel execution between streams, raising utilization above 98 %.
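A toy timing model makes the effect concrete (all numbers here are invented for illustration, not measurements from the article): with one stream the H2D, kernel, and D2H phases run serially, while with two streams one stream's copies can hide under the other stream's kernel, so kernels run nearly back to back.

```python
def gpu_busy_fraction(h2d, kernel, d2h, batches, streams):
    # Toy pipeline model: each batch needs an H2D copy, a kernel
    # launch, and a D2H copy, with the given durations.
    if streams == 1:
        # Serial: the GPU idles during every copy.
        total = batches * (h2d + kernel + d2h)
    else:
        # Two-stream steady state: kernels run back to back as long as the
        # next batch's copies fit inside the current kernel's runtime.
        total = h2d + batches * max(kernel, h2d + d2h) + d2h
    return batches * kernel / total
```

With copy times of 2 units each and an 8-unit kernel, the single-stream busy fraction is about 67%, while two streams push it above 99%, matching the direction of the utilization gain described above.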
In this pipeline, H2D copies input data from host to GPU memory, the kernel performs the actual inference, and D2H copies results back to the host; only the kernel phase exercises the GPU's compute units.

Runtime Architecture
At runtime, a single BERT service process drives two GPUs, each loading a model engine compiled by TensorRT. Incoming requests are queued; worker threads fetch tasks, copy inputs to the GPU via their streams, invoke the kernels, and return results, with all streams on a GPU sharing one copy of the model weights.
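The request flow can be sketched with a queue and a small thread pool, where each worker thread stands in for one CUDA stream holding its own execution context over shared engine weights. The `fake_infer` function and its dummy score are placeholders for the real TensorRT engine call, which the article does not show.

```python
import queue
import threading

def fake_infer(text):
    # Placeholder for the TensorRT engine call
    # (H2D copy, kernel launch, D2H copy).
    return len(text) % 2  # dummy relevance score

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:          # poison pill: shut this worker down
            break
        req_id, text = item
        results[req_id] = fake_infer(text)

def serve(requests, num_workers=2):
    # Requests are queued; each worker thread models one stream that
    # shares the engine weights but owns its own execution context.
    tasks, results = queue.Queue(), {}
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for req in requests:
        tasks.put(req)
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return results
```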
Cache Optimization
Introducing request‑level caching for hot search terms achieved a 35 % cache‑hit rate, significantly reducing the load on the online BERT service.
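The article does not describe the cache implementation, so the following is only a minimal LRU sketch; the keying scheme, capacity, and hit-rate bookkeeping are assumptions. The idea is that hot search terms repeat often, so even a modest capacity absorbs a large share of traffic.

```python
from collections import OrderedDict

class QueryCache:
    # LRU cache for BERT relevance scores, keyed here by (query, doc).
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self.data:
            self.data.move_to_end(key)      # mark as recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1
        value = compute(key)                # fall through to the BERT service
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used
        return value
```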
Dynamic Sequence Length
Instead of a fixed maximum sequence length, the service now pads inputs to the longest sequence in each batch, improving performance by about 7 %.
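The batching change can be sketched as a small padding helper (the function name is invented; `pad_id=0` matches BERT's usual `[PAD]` token id): each batch is padded only to its own longest sequence, so batches of short queries do less wasted computation.

```python
def pad_batch(token_id_seqs, pad_id=0):
    # Pad every sequence to the longest one in this batch rather than to a
    # fixed global maximum length.
    max_len = max(len(s) for s in token_id_seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in token_id_seqs]
```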
BERT Online Service Exploration
Dynamic Model Loading Increases Latency
When hot‑loading new model versions, weight transfer over the PCI bus temporarily blocks input data transfers, causing a brief TP99 latency spike (a few seconds), though most requests remain unaffected.
Precision Oscillation
Inference results varied slightly (0.93–0.95) across different batch sizes due to a TensorRT 7.1.3.4 bug; the issue was resolved in TensorRT 7.2.2.3.
GPU Memory Consumption
Each model occupies a few hundred MB of GPU memory; loading 5–8 versions can risk OOM. Memory usage stems from model weights, runtime context (inputs/outputs and intermediate tensors), and fixed CUDA runtime overhead.
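A back-of-the-envelope estimator captures why loading many versions is risky: weights and per-engine runtime context scale with the number of loaded versions, while the CUDA runtime cost is paid once per process. Every constant below is an illustrative assumption, not a measurement from the article.

```python
def estimate_gpu_mem_mb(params_m, bytes_per_param, context_mb,
                        versions, cuda_runtime_mb=300):
    # params_m: parameter count in millions; bytes_per_param: 2 for FP16.
    # Weights and runtime context (I/O plus intermediate tensors) are
    # duplicated per loaded version; the CUDA runtime overhead is fixed.
    weights_mb = params_m * bytes_per_param
    return versions * (weights_mb + context_mb) + cuda_runtime_mb
```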
Summary and Outlook
After extensive framework evaluation, model and engineering optimizations, and deployment experiments, the optimized 6‑layer BERT model runs at 1500 queries per second on a single T4 GPU, with a peak‑time TP99 latency of 13 ms. Stability and performance are now solid, delivering noticeable business gains.
Future work includes containerizing the service on Kubernetes for easier scaling and resilience, and integrating training, distillation, data, model management, and deployment into an internal ML platform to streamline the end‑to‑end workflow.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.