Accelerating TensorFlow Model Inference with NVIDIA TensorRT: Methods, Experiments, and Results
This article shows how to accelerate TensorFlow model inference with NVIDIA TensorRT: it describes the TensorRT architecture and its optimization techniques such as layer fusion and precision calibration, details the conversion of both frozen_graph and saved_model formats, presents the experimental setup and performance comparisons, and summarizes the achieved speed‑up.
Deep learning inference speed directly determines the applicability of models in low‑latency scenarios such as video and speech processing, making model acceleration on existing hardware a common challenge.
TensorRT Overview
TensorRT is a high‑performance inference platform consisting of an optimizer and a low‑latency runtime, and it supports models from major frameworks such as TensorFlow, Caffe, and PyTorch. Since its introduction in 2017, it has evolved through several versions; at the time of writing, the latest release is TensorRT 5.1.5.
Optimization Techniques
- Layer and tensor fusion to reduce kernel launches and memory bandwidth bottlenecks.
- Weight and activation precision calibration (FP16/INT8) to shrink model size and improve throughput.
- Kernel auto‑tuning based on the target GPU, input size, filter dimensions, and tensor layout.
- Dynamic tensor memory allocation to reuse buffers and lower allocation overhead.
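As a rough illustration of the precision‑calibration point, halving or quartering the bytes stored per weight shrinks the model proportionally. The parameter count below is an approximate figure for Inception‑V4 and is our assumption, not a number from the article:

```python
# Rough model-size arithmetic for FP32 -> FP16/INT8 precision calibration.
# PARAMS is an assumed, approximate weight count for Inception-V4.
PARAMS = 43_000_000

def model_size_mb(num_params, bytes_per_weight):
    """Size of the stored weights in megabytes."""
    return num_params * bytes_per_weight / (1024 ** 2)

fp32 = model_size_mb(PARAMS, 4)  # FP32: 4 bytes per weight
fp16 = model_size_mb(PARAMS, 2)  # FP16: 2 bytes per weight
int8 = model_size_mb(PARAMS, 1)  # INT8: 1 byte per weight

print(f"FP32 ~{fp32:.0f} MB, FP16 ~{fp16:.0f} MB, INT8 ~{int8:.0f} MB")
```

The throughput gain from FP16/INT8 comes on top of this size reduction, since lower‑precision kernels also move less data per inference.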
TensorFlow Model Formats
TensorFlow produces training checkpoints (ckpt) and protobuf graphs (pb). A frozen graph merges the graph definition with constant weights into a single pb file, while a saved_model additionally records input‑output signatures and is the format consumed by TensorFlow Serving.
tf‑trt Experiments
Test Environment
Experiments were run on an NVIDIA Tesla V100 (16 GB) using the Docker images tensorflow/serving:latest-gpu and nvcr.io/nvidia/tensorrt:19.02-py3. The model was an Inception‑V4 classifier trained on a custom dataset.
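For reproducibility, the two containers can be launched along these lines; the volume paths, port, and model name are our assumptions, not given in the article:

```shell
# Assumed launch commands; adjust paths, port, and model name to your setup.
docker run --rm --runtime=nvidia -p 8501:8501 \
  -v /data/v4:/models/inception_v4 \
  -e MODEL_NAME=inception_v4 tensorflow/serving:latest-gpu

docker run --rm --runtime=nvidia -it \
  -v /data:/workspace/data nvcr.io/nvidia/tensorrt:19.02-py3
```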
Frozen Graph Generation and Conversion
```shell
python models/research/slim/export_inference_graph.py \
  --model_name=inception_v4 \
  --dataset_name=my_imagenet \
  --output_file=v4/my_v4_infer_graph.pb

bazel build tensorflow/python/tools:freeze_graph && \
bazel-bin/tensorflow/python/tools/freeze_graph \
  --input_graph=v4/my_v4_infer_graph.pb \
  --input_checkpoint=ckpt/model.ckpt-1200000 \
  --output_graph=v4/frozen_graph.pb \
  --output_node_names=InceptionV4/Logits/Predictions
```
```python
# encoding: utf-8
"""The uff module must match the installed TensorRT version."""
from __future__ import division, print_function

import uff

try:
    import tensorrt as trt
except ImportError:
    import tensorrt.legacy as trt
from tensorrt.parsers import uffparser

MAX_WORKSPACE = 1 << 30  # 1 GB builder workspace
G_LOGGER = trt.infer.ConsoleLogger(trt.infer.LogSeverity.WARNING)
CHANNEL = 3
INPUT_W = 299
INPUT_H = 299
MAX_BATCHSIZE = 1

def main():
    tf_freeze_model = 'v4/frozen_graph.pb'
    input_node = 'input'
    out_node = 'InceptionV4/Logits/Predictions'
    # Convert the frozen TensorFlow graph to UFF, then parse it
    # into a serialized TensorRT engine.
    uff_model = uff.from_tensorflow_frozen_model(tf_freeze_model, [out_node])
    parser = uffparser.create_uff_parser()
    parser.register_input(input_node, (CHANNEL, INPUT_H, INPUT_W), 0)
    parser.register_output(out_node)
    engine = trt.utils.uff_to_trt_engine(G_LOGGER, uff_model, parser,
                                         MAX_BATCHSIZE, MAX_WORKSPACE)
    trt.utils.write_engine_to_file('v4/v4_tensorrt.engine', engine.serialize())

if __name__ == "__main__":
    main()
```
Saved Model Conversion and Acceleration
```shell
python tensorflow_serving/example/inception_saved_model.py \
  --checkpoint_dir=ckpt \
  --image_size=299 \
  --model_version=1 \
  --label_num=3200 \
  --label_dir=ckpt/label_synset.txt \
  --output_dir=v4/

docker run --rm --runtime=nvidia -it -v /data:/tmp tensorflow/tensorflow:latest-gpu \
  /usr/local/bin/saved_model_cli convert \
  --dir /tmp/v4/1 \
  --output_dir /tmp/cvt_trt \
  --tag_set serve \
  tensorrt --precision_mode FP32 \
  --max_batch_size 1 --is_dynamic_op True
```
Inference Time Comparison
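Latency figures of this kind are typically collected with a warm‑up‑then‑measure loop. The sketch below is our illustration, where `infer` is a hypothetical stand‑in for whichever call is benchmarked (a session run, a REST request, or a TensorRT engine execution):

```python
import time

def average_latency_ms(infer, inputs, warmup=10):
    """Average per-call latency in milliseconds, excluding warm-up calls."""
    for x in inputs[:warmup]:   # warm-up: first calls pay one-off setup costs
        infer(x)
    start = time.perf_counter()
    for x in inputs[warmup:]:
        infer(x)
    elapsed = time.perf_counter() - start
    return elapsed / max(len(inputs) - warmup, 1) * 1000.0
```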
Four configurations were benchmarked on 3,941 images, measuring average inference latency after warm‑up:
| Model | Environment | Precision | Time (ms) | Speed‑up |
| --- | --- | --- | --- | --- |
| Frozen Graph | python3.5 + TF 1.14.0 | FP32 | 17.09 | baseline |
| TensorRT‑optimized Frozen Graph | python3.5 + TensorRT 5.0.2 | FP32 | 13.54 | 20.77 % |
| Saved Model (TF‑Serving) | TF‑Serving + TF 1.14.0 + RESTful | FP32 | 23.64 | baseline |
| TensorRT‑optimized Saved Model | TF‑Serving + TF 1.14.0 + RESTful | FP32 | 18.57 | 21.45 % |
The results show that TensorRT conversion improves inference speed by roughly 20 % for both frozen graphs and saved models. The TensorRT‑optimized saved model remains slower than the optimized frozen graph because of additional preprocessing and serving overhead.
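The speed‑up column follows directly from the latencies; as a quick check, with the numbers copied from the table:

```python
def speedup_pct(baseline_ms, optimized_ms):
    """Latency reduction relative to the baseline, in percent."""
    return (baseline_ms - optimized_ms) / baseline_ms * 100.0

print(round(speedup_pct(17.09, 13.54), 2))  # frozen graph: 20.77
print(round(speedup_pct(23.64, 18.57), 2))  # saved model: 21.45
```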
Conclusion
The article systematically reviews TensorRT concepts, details the conversion process for TensorFlow models, and demonstrates measurable latency reductions, leading the authors’ image‑recognition service to adopt TensorRT‑optimized frozen graphs for production.
HomeTech tech sharing