Accelerating TensorFlow Model Inference with NVIDIA TensorRT: Methods, Experiments, and Results
This article shows how to accelerate TensorFlow model inference with NVIDIA TensorRT: it describes the TensorRT architecture and its optimization techniques such as layer fusion and precision calibration, details the conversion of both frozen_graph and saved_model formats, presents the experimental setup and performance comparisons, and summarizes the achieved speed‑up.
Deep learning inference speed directly determines the applicability of models in low‑latency scenarios such as video and speech processing, making model acceleration on existing hardware a common challenge.
TensorRT Overview
TensorRT is a high‑performance inference platform consisting of an optimizer and a low‑latency runtime, and it supports models from major frameworks such as TensorFlow, Caffe, and PyTorch. Since its introduction in 2017, it has evolved through several versions; at the time of writing, the latest release is TensorRT 5.1.5.
Optimization Techniques
- Layer and tensor fusion to reduce kernel launches and memory bandwidth bottlenecks.
- Weight and activation precision calibration (FP16/INT8) to shrink model size and improve throughput.
- Kernel auto‑tuning based on the target GPU, input size, filter dimensions, and tensor layout.
- Dynamic tensor memory allocation to reuse buffers and lower allocation overhead.
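As a rough illustration of the precision‑calibration point, halving or quartering the bytes stored per weight shrinks the model proportionally. The parameter count below is an approximate figure for Inception‑V4 and is our assumption, not a number from the article:

```python
# Rough model-size arithmetic for FP32 -> FP16/INT8 precision calibration.
# PARAMS is an assumed, approximate weight count for Inception-V4.
PARAMS = 43_000_000

def model_size_mb(num_params, bytes_per_weight):
    """Size of the stored weights in megabytes."""
    return num_params * bytes_per_weight / (1024 ** 2)

fp32 = model_size_mb(PARAMS, 4)  # FP32: 4 bytes per weight
fp16 = model_size_mb(PARAMS, 2)  # FP16: 2 bytes per weight
int8 = model_size_mb(PARAMS, 1)  # INT8: 1 byte per weight

print(f"FP32 ~{fp32:.0f} MB, FP16 ~{fp16:.0f} MB, INT8 ~{int8:.0f} MB")
```

The throughput gain from FP16/INT8 comes on top of this size reduction, since lower‑precision kernels also move less data per inference.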
TensorFlow Model Formats
TensorFlow produces training checkpoints (ckpt) and protobuf graphs (pb). A frozen graph merges the graph definition with constant weights into a single pb file, while a saved_model additionally records input‑output signatures and is the format consumed by TensorFlow Serving.
tf‑trt Experiments
Test Environment
Experiments were run on an NVIDIA Tesla V100 (16 GB) using the Docker images tensorflow/serving:latest-gpu and nvcr.io/nvidia/tensorrt:19.02-py3. The model was an Inception‑V4 classifier trained on a custom dataset.
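For reproducibility, the two containers can be launched along these lines; the volume paths, port, and model name are our assumptions, not given in the article:

```shell
# Assumed launch commands; adjust paths, port, and model name to your setup.
docker run --rm --runtime=nvidia -p 8501:8501 \
  -v /data/v4:/models/inception_v4 \
  -e MODEL_NAME=inception_v4 tensorflow/serving:latest-gpu

docker run --rm --runtime=nvidia -it \
  -v /data:/workspace/data nvcr.io/nvidia/tensorrt:19.02-py3
```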
Frozen Graph Generation and Conversion
```shell
python models/research/slim/export_inference_graph.py \
  --model_name=inception_v4 \
  --dataset_name=my_imagenet \
  --output_file=v4/my_v4_infer_graph.pb

bazel build tensorflow/python/tools:freeze_graph && \
bazel-bin/tensorflow/python/tools/freeze_graph \
  --input_graph=v4/my_v4_infer_graph.pb \
  --input_checkpoint=ckpt/model.ckpt-1200000 \
  --output_graph=v4/frozen_graph.pb \
  --output_node_names=InceptionV4/Logits/Predictions
```
```python
# encoding: utf-8
"""The uff module must match the installed TensorRT version."""
from __future__ import division, print_function

import uff

try:
    import tensorrt as trt
except ImportError:
    import tensorrt.legacy as trt
from tensorrt.parsers import uffparser

MAX_WORKSPACE = 1 << 30  # 1 GB builder workspace
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.WARNING)
CHANNEL = 3
INPUT_W = 299
INPUT_H = 299
MAX_BATCHSIZE = 1

def main():
    tf_freeze_model = 'v4/frozen_graph.pb'
    input_node = 'input'
    out_node = 'InceptionV4/Logits/Predictions'
    # Convert the frozen TensorFlow graph to UFF, then parse it
    # into a serialized TensorRT engine.
    uff_model = uff.from_tensorflow_frozen_model(tf_freeze_model, [out_node])
    parser = uffparser.create_uff_parser()
    parser.register_input(input_node, (CHANNEL, INPUT_H, INPUT_W), 0)
    parser.register_output(out_node)
    engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser,
                                         MAX_BATCHSIZE, MAX_WORKSPACE)
    trt.utils.write_engine_to_file('v4/v4_tensorrt.engine', engine.serialize())

if __name__ == "__main__":
    main()
```
Saved Model Conversion and Acceleration
```shell
python tensorflow_serving/example/inception_saved_model.py \
  --checkpoint_dir=ckpt \
  --image_size=299 \
  --model_version=1 \
  --label_num=3200 \
  --label_dir=ckpt/label_synset.txt \
  --output_dir=v4/

docker run --rm --runtime=nvidia -it -v /data:/tmp tensorflow/tensorflow:latest-gpu \
  /usr/local/bin/saved_model_cli convert \
  --dir /tmp/v4/1 \
  --output_dir /tmp/cvt_trt \
  --tag_set serve \
  tensorrt --precision_mode FP32 \
  --max_batch_size 1 --is_dynamic_op True
```
Inference Time Comparison
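Latency figures of this kind are typically collected with a warm‑up‑then‑measure loop. The sketch below is our illustration, where `infer` is a hypothetical stand‑in for whichever call is benchmarked (a session run, a REST request, or a TensorRT engine execution):

```python
import time

def average_latency_ms(infer, inputs, warmup=10):
    """Average per-call latency in milliseconds, excluding warm-up calls."""
    for x in inputs[:warmup]:   # warm-up: first calls pay one-off setup costs
        infer(x)
    start = time.perf_counter()
    for x in inputs[warmup:]:
        infer(x)
    elapsed = time.perf_counter() - start
    return elapsed / max(len(inputs) - warmup, 1) * 1000.0
```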
Four configurations were benchmarked on 3,941 images, measuring average inference latency after warm‑up:
| Model | Environment | Precision | Time (ms) | Speed‑up |
| --- | --- | --- | --- | --- |
| Frozen Graph | python3.5 + TF 1.14.0 | FP32 | 17.09 | baseline |
| TensorRT‑optimized Frozen Graph | python3.5 + TensorRT 5.0.2 | FP32 | 13.54 | 20.77 % |
| Saved Model (TF‑Serving) | TF‑Serving + TF 1.14.0 + RESTful | FP32 | 23.64 | baseline |
| TensorRT‑optimized Saved Model | TF‑Serving + TF 1.14.0 + RESTful | FP32 | 18.57 | 21.45 % |
The results show that TensorRT conversion improves inference speed by roughly 20 % for both frozen graphs and saved models. The TensorRT‑optimized saved model remains slower than the optimized frozen graph because of additional preprocessing and serving overhead.
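The speed‑up column follows directly from the latencies; as a quick check, with the numbers copied from the table:

```python
def speedup_pct(baseline_ms, optimized_ms):
    """Latency reduction relative to the baseline, in percent."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100.0

print(round(speedup_pct(17.09, 13.54), 2))  # frozen graph: 20.77
print(round(speedup_pct(23.64, 18.57), 2))  # saved model: 21.45
```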
Conclusion
The article systematically reviews TensorRT concepts, details the conversion process for TensorFlow models, and demonstrates measurable latency reductions, leading the authors’ image‑recognition service to adopt TensorRT‑optimized frozen graphs for production.
HomeTech tech sharing