
Deploying BERT as an Online Service: Challenges and Optimizations at 360 Search

This article details the engineering challenges of serving a large BERT model in real‑time for 360 Search and describes a series of optimizations—including TensorRT‑based kernel fusion, model quantization, knowledge distillation, multi‑stream execution, caching, and dynamic sequence handling—that together achieve low latency, high throughput, and stable deployment on GPU clusters.

360 Tech Engineering

The 360 Search team faced significant latency and throughput challenges when deploying a deep, parameter‑heavy 12‑layer BERT model as an online service, with raw inference times of ~200 ms on CPU and ~80 ms on GPU.

After evaluating several inference frameworks (TF‑Serving, ONNX Runtime, TorchJIT, TensorRT), they selected Nvidia's open‑source TensorRT for its support of quantization, variable‑length inputs, and strong performance.

Framework‑Level Optimizations

TensorRT provides layer‑fusion, tensor‑fusion, automatic kernel tuning, multi‑stream execution, dynamic memory allocation, and model quantization, all of which improve GPU utilization and reduce inference latency.

Knowledge Distillation

The original 12‑layer model was distilled to a 6‑layer lightweight version, preserving 99 % of the original accuracy while halving inference time and roughly halving TP99 latency.
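The article does not show the distillation objective, but the standard soft-target loss (Hinton-style, with a temperature and the usual T² gradient correction) can be sketched in plain Python; the function names and temperature value here are illustrative, not taken from 360's implementation:

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """KL divergence between the teacher's and student's soft targets.

    Scaled by T^2 so gradient magnitudes stay comparable as the
    temperature changes (the standard correction).
    """
    p = softmax(teacher_logits, temperature)  # teacher soft targets
    q = softmax(student_logits, temperature)  # student predictions
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl

# A student that matches the teacher exactly incurs zero loss;
# any divergence produces a positive loss.
print(distillation_loss([1.0, 2.0], [1.0, 2.0]))
print(distillation_loss([2.0, 1.0], [1.0, 2.0]) > 0)
```

In practice this soft-target term is usually mixed with the ordinary cross-entropy on hard labels; the 6-layer student trained this way is what keeps 99 % of the 12-layer teacher's accuracy.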

FP16 Quantization

Switching most tensors from FP32 to FP16 reduced inference latency to one‑third and tripled throughput, with only negligible accuracy loss in the search scenario.
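Why FP16 is usually safe for ranking scores can be seen by round-tripping a value through IEEE 754 half precision with Python's stdlib `struct` module (the `'e'` format code); this is a toy illustration of the precision trade-off, not the TensorRT conversion path itself:

```python
import struct

def to_fp16(x):
    """Round-trip a Python float through IEEE 754 half precision (FP16)."""
    return struct.unpack('<e', struct.pack('<e', x))[0]

w = 0.1234567
w16 = to_fp16(w)
rel_err = abs(w16 - w) / abs(w)
# FP16's 11-bit significand keeps roughly 3 significant decimal digits,
# a relative error under 2**-11 — usually negligible for relevance
# scores, while halving memory traffic versus FP32.
print(w16, rel_err)
```

The latency win comes from halved memory bandwidth and from the GPU's FP16 tensor cores, which is why most (but not all) tensors are converted; accuracy-sensitive layers can stay in FP32.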

Pipeline (Stream) Optimization

Profiling revealed that GPU memory copy phases left the GPU idle, capping utilization at ~80 %. Adding a second stream allowed overlapping of copy and compute phases, raising utilization above 98 %.
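The overlap idea can be simulated without a GPU: model copy and compute as two pipeline stages running in separate threads, so the copy of batch i+1 proceeds while batch i computes, just as a second CUDA stream would. The timings and batch count below are made up for illustration:

```python
import queue
import threading
import time

COPY_S = 0.03     # simulated host-to-device memcpy per batch
COMPUTE_S = 0.03  # simulated kernel execution per batch
BATCHES = 4

def run_pipelined():
    """Overlap copy and compute, like two CUDA streams."""
    ready = queue.Queue()
    results = []

    def copier():
        for i in range(BATCHES):
            time.sleep(COPY_S)   # stage 1: copy batch i to the device
            ready.put(i)
        ready.put(None)          # sentinel: no more batches

    def computer():
        while (i := ready.get()) is not None:
            time.sleep(COMPUTE_S)  # stage 2: run the kernel on batch i
            results.append(i)

    t0 = time.perf_counter()
    threads = [threading.Thread(target=copier), threading.Thread(target=computer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, time.perf_counter() - t0

results, elapsed = run_pipelined()
serial = BATCHES * (COPY_S + COMPUTE_S)
# Pipelined wall time approaches COPY_S + BATCHES * COMPUTE_S,
# well under the fully serial COPY+COMPUTE per batch.
print(f"pipelined {elapsed:.3f}s vs serial {serial:.3f}s")
```

On a real GPU the same effect comes from issuing `cudaMemcpyAsync` and kernel launches on two streams, which is what lifted utilization from ~80 % to above 98 %.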

Runtime Architecture

The service runs on two GPU cards, each loading a TensorRT‑compiled engine. Multiple streams share model weights, and a thread pool pulls requests from a task queue, copies data via streams, executes kernels, and returns results.
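A minimal sketch of that runtime shape, assuming a toy stand-in for the engine (the `Engine` class and its scoring rule are hypothetical): weights are loaded once and shared, while a small worker pool drains a task queue, with each worker playing the role of one stream/execution context:

```python
import queue
import threading
from concurrent.futures import ThreadPoolExecutor

class Engine:
    """Stands in for a TensorRT engine: weights loaded once, shared by all workers."""
    def __init__(self, weights):
        self.weights = weights

    def infer(self, query):
        # hypothetical scoring: one multiply per token id
        return [tok * self.weights for tok in query]

engine = Engine(weights=2)  # loaded once per GPU card
tasks = queue.Queue()
results = {}
lock = threading.Lock()

def worker():
    """Pull requests until a sentinel arrives; each worker = one stream."""
    while (item := tasks.get()) is not None:
        req_id, query = item
        out = engine.infer(query)  # copy + execute on this worker's stream
        with lock:
            results[req_id] = out  # return result to the caller

for i, q in enumerate([[1, 2], [3], [4, 5, 6]]):
    tasks.put((i, q))

with ThreadPoolExecutor(max_workers=2) as pool:
    for _ in range(2):
        pool.submit(worker)
    for _ in range(2):
        tasks.put(None)  # one sentinel per worker

print(results)
```

The key property mirrored here is that only per-request activations are per-worker; the engine's weights exist once, which is what makes multi-stream execution memory-cheap.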

Cache Optimization

Request‑level caching of hot query terms achieved a 35 % cache hit rate, significantly reducing compute load.
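A request-level cache over query terms can be as simple as memoizing the scoring function; this sketch uses Python's `functools.lru_cache` and a toy scoring function (the real service would key a proper cache on the query text in front of the BERT forward pass):

```python
from functools import lru_cache

@lru_cache(maxsize=10_000)
def score_query(query: str) -> float:
    """Stand-in for a full BERT forward pass over the query."""
    return sum(ord(c) for c in query) / 1000.0

# Search traffic is skewed: hot terms repeat, and repeats hit the cache.
traffic = ["weather", "news", "weather", "stocks", "news", "weather"]
for q in traffic:
    score_query(q)

info = score_query.cache_info()
hit_rate = info.hits / (info.hits + info.misses)
print(f"hit rate: {hit_rate:.0%}")  # 3 hits out of 6 lookups = 50%
```

With a 35 % hit rate in production, roughly a third of requests skip the model entirely, which cuts GPU load by about the same fraction.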

Dynamic Sequence Length

By padding batches to the longest sequence in the batch rather than a fixed maximum, they reduced unnecessary computation and improved performance by about 7 %.
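The difference between static and dynamic padding is easy to see in a toy batcher (the token ids below are arbitrary; 64 stands in for whatever fixed maximum the static scheme used):

```python
def pad_batch(sequences, pad_id=0, max_len=None):
    """Pad every sequence to max_len, or to the longest one in the batch."""
    target = max_len if max_len is not None else max(len(s) for s in sequences)
    return [s + [pad_id] * (target - len(s)) for s in sequences]

batch = [[101, 7592, 102], [101, 102]]
static = pad_batch(batch, max_len=64)  # fixed maximum: 2 * 64 = 128 token slots
dynamic = pad_batch(batch)             # batch maximum:  2 * 3  =   6 token slots
print(sum(map(len, static)), sum(map(len, dynamic)))
```

Since attention cost grows with padded length, every padding token the model never needed is wasted compute; shrinking batches to their true maximum is where the ~7 % gain comes from, and TensorRT's variable-length input support makes the varying shapes cheap to handle.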

Operational Issues Discovered

Dynamic model loading caused brief PCI‑bus contention, temporarily raising TP99 latency; precision oscillations were observed with certain batch sizes in TensorRT 7.1.3.4 but fixed in later releases; and loading multiple model versions risked OOM, mitigated by pre‑checking memory usage.
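The OOM mitigation amounts to an admission check before loading another engine version; the numbers and safety margin here are invented for illustration, and a real check would query the device's actual free memory:

```python
def can_load(engine_size_mb: int, free_mb: int, safety_margin_mb: int = 512) -> bool:
    """Admit a new model version only if free GPU memory covers the
    engine plus a safety margin for activations and workspace."""
    return free_mb - engine_size_mb >= safety_margin_mb

# A 1.2 GB engine on a card with only 1.5 GB free is rejected;
# with 2 GB free it is admitted.
print(can_load(engine_size_mb=1200, free_mb=1500))
print(can_load(engine_size_mb=1200, free_mb=2000))
```

Rejecting the load up front degrades gracefully (the old version keeps serving) instead of letting an OOM take down the process mid-deployment.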

Conclusion and Future Work

After the optimizations, the 6‑layer BERT model on a single T4 GPU processes ~1500 queries per second with a 13 ms TP99 latency. Future work includes migrating the service to Kubernetes for better scalability and integrating training, distillation, and deployment pipelines into an internal ML platform.

Tags: model optimization, TensorRT, GPU, knowledge distillation, BERT, online inference
Written by 360 Tech Engineering

Official tech channel of 360, building the brand's professional technology aggregation platform.