Optimizing BERT Online Service Deployment at 360 Search
This article describes the challenges of deploying a large BERT model as an online service for 360 Search, where a deep, parameter-heavy model faces severe latency and throughput constraints. It details the engineering optimizations, including framework selection, model quantization, knowledge distillation, stream scheduling, caching, and dynamic sequence handling, that dramatically improved latency, throughput, and resource utilization.
Background
In the 360 Search scenario, online BERT services must meet extremely low latency and high throughput requirements. Three main difficulties are identified: massive model parameters (over 100 M for a 12‑layer BERT), long inference time (≈200 ms on CPU, ≈80 ms on unoptimized GPU), and high computational resource demand (hundreds of GPUs needed for full traffic).
After evaluating several inference frameworks (TF‑Serving, ONNX Runtime, TorchJIT, TensorRT) on criteria such as quantization support, preprocessing needs, variable‑length handling, stability, performance, and community activity, Nvidia’s open‑source TensorRT was chosen for further optimization.
Framework‑Level Optimizations
Layer and tensor fusion to reduce kernel calls and improve GPU utilization.
Automatic kernel tuning that selects optimal layers and parallel algorithms for the target GPU.
Multi‑stream execution that shares weights across streams to increase concurrency and memory efficiency.
Dynamic tensor memory allocation, allocating GPU memory only when tensors are needed.
Model quantization, which boosts throughput and reduces latency while preserving accuracy.
Knowledge Distillation
The 12‑layer BERT model was distilled into a lightweight 6‑layer version, achieving 99 % of the original accuracy while halving computational cost; the 6‑layer model’s TP99 latency is roughly half that of the 12‑layer model.
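The article does not give the exact distillation objective, so the following is only a sketch of a common choice: a temperature-scaled soft-target cross-entropy in the style of Hinton-style distillation. The temperature value and the T² scaling are assumptions, not details from the 360 deployment.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax; a higher temperature softens the
    # teacher's distribution so "dark knowledge" in small logits survives.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    # Cross-entropy between teacher soft targets and student predictions,
    # scaled by T^2 so gradient magnitudes stay comparable across temperatures.
    teacher_probs = softmax(teacher_logits, temperature)
    student_probs = softmax(student_logits, temperature)
    ce = -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))
    return ce * temperature ** 2
```

In practice this soft loss is usually mixed with the ordinary hard-label loss; the 6-layer student then learns to mimic the 12-layer teacher's output distribution.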
FP16 Quantization
By converting most tensors from FP32 to FP16 during inference (no back‑propagation needed), latency dropped to one‑third and throughput tripled, with memory usage halved; the minor precision loss at the ten‑thousandth level had negligible impact on search quality.
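The ten-thousandth-level precision loss is easy to reproduce without a GPU: Python's struct module can round-trip a value through IEEE 754 half precision (the 'e' format), which stands in for what happens to each tensor element under FP16 inference. The sample scores below are illustrative, not values from the article.

```python
import struct

def to_fp16(x):
    # Round-trip a Python float through IEEE 754 half precision
    # ('e' format, supported by the struct module since Python 3.6).
    return struct.unpack('e', struct.pack('e', x))[0]

# Illustrative relevance scores; half precision near 1.0 has a step of
# about 5e-4, so the rounding error stays at the ten-thousandth level.
scores = [0.93187, 0.10452, 0.87731]
errors = [abs(s - to_fp16(s)) for s in scores]
```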
Stream Optimization
Profiling revealed that GPU utilization plateaued at ~80 % because data transfer (H2D/D2H) left the GPU idle. Adding an extra stream allowed overlapping kernel execution between streams, raising utilization above 98 %.
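A toy timing model makes the effect concrete (all numbers here are invented for illustration, not measurements from the article): with one stream the H2D, kernel, and D2H phases run serially, while with two streams one stream's copies can hide under the other stream's kernel, so kernels run nearly back to back.

```python
def gpu_busy_fraction(h2d, kernel, d2h, batches, streams):
    # Toy pipeline model: each batch needs an H2D copy, a kernel
    # launch, and a D2H copy, with the given durations.
    if streams == 1:
        # Serial: the GPU idles during every copy.
        total = batches * (h2d + kernel + d2h)
    else:
        # Two-stream steady state: kernels run back to back as long as the
        # next batch's copies fit inside the current kernel's runtime.
        total = h2d + batches * max(kernel, h2d + d2h) + d2h
    return batches * kernel / total
```

With copy times of 2 units each and an 8-unit kernel, the single-stream busy fraction is about 67%, while two streams push it above 99%, matching the direction of the utilization gain described above.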
In this pipeline, H2D copies input data from host to GPU memory, the kernel performs the actual inference, and D2H copies results back to the host; only the kernel phase exercises the GPU's compute units.

Runtime Architecture
At runtime, a single BERT service process drives two GPUs, each loading a model engine compiled by TensorRT. Incoming requests are queued; worker threads fetch tasks, copy inputs to the GPU via their streams, invoke the kernels, and return results, with all streams on a GPU sharing one copy of the model weights.
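The request flow can be sketched with a queue and a small thread pool, where each worker thread stands in for one CUDA stream holding its own execution context over shared engine weights. The `fake_infer` function and its dummy score are placeholders for the real TensorRT engine call, which the article does not show.

```python
import queue
import threading

def fake_infer(text):
    # Placeholder for the TensorRT engine call
    # (H2D copy, kernel launch, D2H copy).
    return len(text) % 2  # dummy relevance score

def worker(tasks, results):
    while True:
        item = tasks.get()
        if item is None:          # poison pill: shut this worker down
            break
        req_id, text = item
        results[req_id] = fake_infer(text)

def serve(requests, num_workers=2):
    # Requests are queued; each worker thread models one stream that
    # shares the engine weights but owns its own execution context.
    tasks, results = queue.Queue(), {}
    threads = [threading.Thread(target=worker, args=(tasks, results))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for req in requests:
        tasks.put(req)
    for _ in threads:
        tasks.put(None)
    for t in threads:
        t.join()
    return results
```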
Cache Optimization
Introducing request‑level caching for hot search terms achieved a 35 % cache‑hit rate, significantly reducing the load on the online BERT service.
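The article does not describe the cache implementation, so the following is only a minimal LRU sketch; the keying scheme, capacity, and hit-rate bookkeeping are assumptions. The idea is that hot search terms repeat often, so even a modest capacity absorbs a large share of traffic.

```python
from collections import OrderedDict

class QueryCache:
    # LRU cache for BERT relevance scores, keyed here by (query, doc).
    def __init__(self, capacity):
        self.capacity = capacity
        self.data = OrderedDict()
        self.hits = self.misses = 0

    def get_or_compute(self, key, compute):
        if key in self.data:
            self.data.move_to_end(key)      # mark as recently used
            self.hits += 1
            return self.data[key]
        self.misses += 1
        value = compute(key)                # fall through to the BERT service
        self.data[key] = value
        if len(self.data) > self.capacity:
            self.data.popitem(last=False)   # evict least recently used
        return value
```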
Dynamic Sequence Length
Instead of a fixed maximum sequence length, the service now pads inputs to the longest sequence in each batch, improving performance by about 7 %.
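The batching change can be sketched as a small padding helper (the function name is invented; `pad_id=0` matches BERT's usual `[PAD]` token id): each batch is padded only to its own longest sequence, so batches of short queries do less wasted computation.

```python
def pad_batch(token_id_seqs, pad_id=0):
    # Pad every sequence to the longest one in this batch rather than to a
    # fixed global maximum length.
    max_len = max(len(s) for s in token_id_seqs)
    return [s + [pad_id] * (max_len - len(s)) for s in token_id_seqs]
```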
BERT Online Service Exploration
Dynamic Model Loading Increases Latency
When hot‑loading new model versions, weight transfer over the PCI bus temporarily blocks input data transfers, causing a brief TP99 latency spike (a few seconds), though most requests remain unaffected.
Precision Oscillation
Inference results varied slightly (0.93–0.95) across different batch sizes due to a TensorRT 7.1.3.4 bug; the issue was resolved in TensorRT 7.2.2.3.
GPU Memory Consumption
Each model occupies a few hundred MB of GPU memory; loading 5–8 versions can risk OOM. Memory usage stems from model weights, runtime context (inputs/outputs and intermediate tensors), and fixed CUDA runtime overhead.
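A back-of-the-envelope estimator captures why loading many versions is risky: weights and per-engine runtime context scale with the number of loaded versions, while the CUDA runtime cost is paid once per process. Every constant below is an illustrative assumption, not a measurement from the article.

```python
def estimate_gpu_mem_mb(params_m, bytes_per_param, context_mb,
                        versions, cuda_runtime_mb=300):
    # params_m: parameter count in millions; bytes_per_param: 2 for FP16.
    # Weights and runtime context (I/O plus intermediate tensors) are
    # duplicated per loaded version; the CUDA runtime overhead is fixed.
    weights_mb = params_m * bytes_per_param
    return versions * (weights_mb + context_mb) + cuda_runtime_mb
```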
Summary and Outlook
After extensive framework evaluation, model and engineering optimizations, and deployment experiments, the optimized 6‑layer BERT model runs at 1500 queries per second on a single T4 GPU, with a peak‑time TP99 latency of 13 ms. Stability and performance are now solid, delivering noticeable business gains.
Future work includes containerizing the service on Kubernetes for easier scaling and resilience, and integrating training, distillation, data, model management, and deployment into an internal ML platform to streamline the end‑to‑end workflow.
360 Smart Cloud
Official service account of 360 Smart Cloud, dedicated to building a high-quality, secure, highly available, convenient, and stable one‑stop cloud service platform.