
Deploying BERT as an Online Service: Challenges and Optimizations at 360 Search

This article details the engineering challenges of serving a large BERT model in real‑time for 360 Search and describes a series of optimizations—including TensorRT‑based kernel fusion, model quantization, knowledge distillation, multi‑stream execution, caching, and dynamic sequence handling—that together achieve low latency, high throughput, and stable deployment on GPU clusters.

360 Tech Engineering

The 360 Search team faced significant latency and throughput challenges when deploying a deep, parameter‑heavy 12‑layer BERT model as an online service, with raw inference times of ~200 ms on CPU and ~80 ms on GPU.

After evaluating several inference frameworks (TF‑Serving, ONNX Runtime, TorchJIT, TensorRT), they selected Nvidia's open‑source TensorRT for its support of quantization, variable‑length inputs, and strong performance.

Framework‑Level Optimizations

TensorRT provides layer‑fusion, tensor‑fusion, automatic kernel tuning, multi‑stream execution, dynamic memory allocation, and model quantization, all of which improve GPU utilization and reduce inference latency.

Knowledge Distillation

The original 12‑layer model was distilled to a 6‑layer lightweight version, preserving 99 % of the original accuracy while halving inference time and roughly halving TP99 latency.
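The article does not show the distillation objective, but the standard soft-target loss (Hinton-style, with a temperature and the usual T² gradient correction) can be sketched in plain Python; the function names and temperature value here are illustrative, not taken from 360's implementation:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's soft targets.

    Scaled by T^2 so gradient magnitudes stay comparable as the
    temperature changes (the standard correction).
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss;
# any divergence produces a positive loss.
print(distillation_loss([1.0, 2.0], [1.0, 2.0]))
print(distillation_loss([2.0, 1.0], [1.0, 2.0]) > 0)
```

In practice this soft-target term is usually mixed with the ordinary cross-entropy on hard labels; the 6-layer student trained this way is what keeps 99 % of the 12-layer teacher's accuracy.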

FP16 Quantization

Switching most tensors from FP32 to FP16 reduced inference latency to one‑third and tripled throughput, with only negligible accuracy loss in the search scenario.
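Why FP16 is usually safe for ranking scores can be seen by round-tripping a value through IEEE 754 half precision with Python's stdlib `struct` module (the `'e'` format code); this is a toy illustration of the precision trade-off, not the TensorRT conversion path itself:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision (FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

w = 0.1234567
w16 = to_fp16(w)
rel_err = abs(w16 - w) / abs(w)
# FP16's 11-bit significand keeps roughly 3 significant decimal digits,
# a relative error under 2**-11 — usually negligible for relevance
# scores, while halving memory traffic versus FP32.
print(w16, rel_err)
```

The latency win comes from halved memory bandwidth and from the GPU's FP16 tensor cores, which is why most (but not all) tensors are converted; accuracy-sensitive layers can stay in FP32.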

Pipeline (Stream) Optimization

Profiling revealed that GPU memory copy phases left the GPU idle, capping utilization at ~80 %. Adding a second stream allowed overlapping of copy and compute phases, raising utilization above 98 %.
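The overlap idea can be simulated without a GPU: model copy and compute as two pipeline stages running in separate threads, so the copy of batch i+1 proceeds while batch i computes, just as a second CUDA stream would. The timings and batch count below are made up for illustration:

```python
import queue
import threading
import time

COPY_S = 0.03     # simulated host-to-device memcpy per batch
COMPUTE_S = 0.03  # simulated kernel execution per batch
BATCHES = 4

def run_pipelined():
    """Overlap copy and compute, like two CUDA streams."""
    ready = queue.Queue()
    results = []

    def copier():
        for i in range(BATCHES):
            time.sleep(COPY_S)   # stage 1: copy batch i to the device
            ready.put(i)
        ready.put(None)          # sentinel: no more batches

    def computer():
        while (i := ready.get()) is not None:
            time.sleep(COMPUTE_S)  # stage 2: run the kernel on batch i
            results.append(i)

    t0 = time.perf_counter()
    threads = [threading.Thread(target=copier), threading.Thread(target=computer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - t0

results, elapsed = run_pipelined()
serial = BATCHES * (COPY_S + COMPUTE_S)
# Pipelined wall time approaches COPY_S + BATCHES * COMPUTE_S,
# well under the fully serial COPY+COMPUTE per batch.
print(f"pipelined {elapsed:.3f}s vs serial {serial:.3f}s")
```

On a real GPU the same effect comes from issuing `cudaMemcpyAsync` and kernel launches on two streams, which is what lifted utilization from ~80 % to above 98 %.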

Runtime Architecture

The service runs on two GPU cards, each loading a TensorRT‑compiled engine. Multiple streams share model weights, and a thread pool pulls requests from a task queue, copies data via streams, executes kernels, and returns results.
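A minimal sketch of that runtime shape, assuming a toy stand-in for the engine (the `Engine` class and its scoring rule are hypothetical): weights are loaded once and shared, while a small worker pool drains a task queue, with each worker playing the role of one stream/execution context:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

class Engine:
    """Stands in for a TensorRT engine: weights loaded once, shared by all workers."""
    def __init__(self, weights):
        self.weights = weights

    def infer(self, query):
        # hypothetical scoring: one multiply per token id
        return [tok * self.weights for tok in query]

engine = Engine(weights=2)  # loaded once per GPU card
tasks = queue.Queue()
results = {}
lock = threading.Lock()

def worker():
    """Pull requests until a sentinel arrives; each worker = one stream."""
    while (item := tasks.get()) is not None:
        req_id, query = item
        out = engine.infer(query)  # copy + execute on this worker's stream
        with lock:
            results[req_id] = out  # return result to the caller

for i, q in enumerate([[1, 2], [3], [4, 5, 6]]):
    tasks.put((i, q))

with ThreadPoolExecutor(max_workers=2) as pool:
    for _ in range(2):
        pool.submit(worker)
    for _ in range(2):
        tasks.put(None)  # one sentinel per worker

print(results)
```

The key property mirrored here is that only per-request activations are per-worker; the engine's weights exist once, which is what makes multi-stream execution memory-cheap.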

Cache Optimization

Request‑level caching of hot query terms achieved a 35 % cache hit rate, significantly reducing compute load.
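A request-level cache over query terms can be as simple as memoizing the scoring function; this sketch uses Python's `functools.lru_cache` and a toy scoring function (the real service would key a proper cache on the query text in front of the BERT forward pass):

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def score_query(query: str) -> float:
    """Stand-in for a full BERT forward pass over the query."""
    return sum(ord(c) for c in query) / 1000.0

# Search traffic is skewed: hot terms repeat, and repeats hit the cache.
traffic = ["weather", "news", "weather", "stocks", "news", "weather"]
for q in traffic:
    score_query(q)

info = score_query.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(f"hit rate: {hit_rate:.0%}")  # 3 hits out of 6 lookups = 50%
```

With a 35 % hit rate in production, roughly a third of requests skip the model entirely, which cuts GPU load by about the same fraction.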

Dynamic Sequence Length

By padding batches to the longest sequence in the batch rather than a fixed maximum, they reduced unnecessary computation and improved performance by about 7 %.
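The difference between static and dynamic padding is easy to see in a toy batcher (the token ids below are arbitrary; 64 stands in for whatever fixed maximum the static scheme used):

```python
def pad_batch(sequences, pad_id=0, max_len=None):
    """Pad every sequence to max_len, or to the longest one in the batch."""
    target = max_len if max_len is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]

batch = [[101, 7592, 102], [101, 102]]
static = pad_batch(batch, max_len=64)  # fixed maximum: 2 * 64 = 128 token slots
dynamic = pad_batch(batch)             # batch maximum:  2 * 3  =   6 token slots
print(sum(map(len, static)), sum(map(len, dynamic)))
```

Since attention cost grows with padded length, every padding token the model never needed is wasted compute; shrinking batches to their true maximum is where the ~7 % gain comes from, and TensorRT's variable-length input support makes the varying shapes cheap to handle.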

Operational Issues Discovered

Dynamic model loading caused brief PCI‑bus contention, temporarily raising TP99 latency; precision oscillations were observed with certain batch sizes in TensorRT 7.1.3.4 but fixed in later releases; and loading multiple model versions risked OOM, mitigated by pre‑checking memory usage.
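The OOM mitigation amounts to an admission check before loading another engine version; the numbers and safety margin here are invented for illustration, and a real check would query the device's actual free memory:

```python
def can_load(engine_size_mb: int, free_mb: int, safety_margin_mb: int = 512) -> bool:
    """Admit a new model version only if free GPU memory covers the
    engine plus a safety margin for activations and workspace."""
    return free_mb - engine_size_mb >= safety_margin_mb

# A 1.2 GB engine on a card with only 1.5 GB free is rejected;
# with 2 GB free it is admitted.
print(can_load(engine_size_mb=1200, free_mb=1500))
print(can_load(engine_size_mb=1200, free_mb=2000))
```

Rejecting the load up front degrades gracefully (the old version keeps serving) instead of letting an OOM take down the process mid-deployment.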

Conclusion and Future Work

After the optimizations, the 6‑layer BERT model on a single T4 GPU processes ~1500 queries per second with a 13 ms TP99 latency. Future work includes migrating the service to Kubernetes for better scalability and integrating training, distillation, and deployment pipelines into an internal ML platform.

Tags: model optimization, TensorRT, GPU, knowledge distillation, BERT, online inference
Written by 360 Tech Engineering

Official tech channel of 360, building the brand's professional technology aggregation platform.