
ZTFace: A High‑Precision, Fast Face Recognition Algorithm

This article presents ZTFace, an end‑to‑end deep‑learning face recognition solution that integrates face detection, alignment, feature embedding, verification, anti‑spoofing and attribute recognition. It details the backbone networks, loss functions and training datasets, reports experimental results on WIDER FACE and LFW, and demonstrates inference acceleration with TensorRT.

Zhengtong Technical Team

Background – Face recognition is a fundamental yet challenging problem in computer vision, especially when deployed in real‑world commercial scenarios. Historically, large Chinese companies such as Baidu, Face++ and ArcSoft have dominated the market. The ZTFace algorithm breaks the reliance on third‑party services, offering flexible, high‑performance detection and recognition.

Technical Overview – A complete face‑recognition pipeline typically includes the following modules: face detection, face alignment, feature embedding, face verification, face recognition, anti‑spoofing (liveness detection) and attribute recognition (gender, age, ethnicity, expression, mask wearing, etc.). Deep learning advances have made these modules feasible for practical deployment.
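The modules above chain into a single pipeline, each stage consuming the output of the previous one. A minimal sketch of how such stages could be composed (the class and stage names here are hypothetical illustrations, not the actual ZTFace API):

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class FacePipeline:
    """Ordered chain of named processing stages (detect, align, embed, ...)."""
    stages: list = field(default_factory=list)

    def add(self, name: str, fn: Callable[[dict], Any]) -> "FacePipeline":
        # Each stage sees the accumulated results of all earlier stages.
        self.stages.append((name, fn))
        return self

    def run(self, image) -> dict:
        result = {"image": image}
        for name, fn in self.stages:
            result[name] = fn(result)
        return result
```

In practice each stage would wrap a trained model; the skeleton only shows how the modules hand results to one another.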

Deep Learning Foundations – Convolutional Neural Networks (CNNs) are the backbone of image‑based tasks. Since the 2012 ImageNet breakthrough (AlexNet), deeper networks (10–100+ layers) and frameworks such as TensorFlow and PyTorch have enabled rapid development. The article briefly explains the biological inspiration behind deep learning and illustrates the training of a CNN for cat‑dog classification as an example.

Face Detection – The detection model is built on a RetinaNet‑style anchor‑based detector. Anchors are designed with multiple scales but a fixed 1:1 aspect ratio to match typical face shapes. The network predicts three branches: (1) binary classification of each anchor, (2) regression of bounding‑box offsets (x, y, w, h), and (3) regression of facial key‑point offsets.
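Generating multi‑scale square anchors over a feature map can be sketched as follows (the stride and scale values are illustrative, not ZTFace's actual configuration):

```python
import itertools

def make_anchors(feature_size: int, stride: int, scales: list) -> list:
    """Generate (cx, cy, w, h) anchor boxes for one feature map.

    All anchors are square (width == height), matching the fixed 1:1
    aspect ratio described in the text; multiple scales cover faces
    of different sizes.
    """
    anchors = []
    for i, j in itertools.product(range(feature_size), repeat=2):
        # Anchor centers sit at the middle of each feature-map cell,
        # projected back to input-image coordinates via the stride.
        cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
        for s in scales:
            anchors.append((cx, cy, s, s))
    return anchors
```

Each anchor is then scored by the classification branch and refined by the two regression branches.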

Backbone Network – Pre‑trained ImageNet models such as VGG16 or ResNet serve as feature extractors. Feature Pyramid Network (FPN) aggregates multi‑scale features, and a Context Module further fuses them to capture surrounding semantics.

Multi‑Task Loss – The overall loss combines three components: classification loss (binary softmax), bounding‑box regression loss (Smooth‑L1), and key‑point regression loss (Smooth‑L1). Weighting factors λ₁, λ₂, λ₃ balance the tasks.
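The combined objective can be written as L = λ₁·L_cls + λ₂·L_box + λ₃·L_pts. A minimal sketch of the Smooth‑L1 terms and their weighted combination (the λ values below are illustrative placeholders, not ZTFace's tuned weights):

```python
def smooth_l1(pred: list, target: list, beta: float = 1.0) -> float:
    """Mean Smooth-L1: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        d = abs(p - t)
        total += 0.5 * d * d / beta if d < beta else d - 0.5 * beta
    return total / len(pred)

def multitask_loss(cls_loss: float,
                   box_pred: list, box_tgt: list,
                   pts_pred: list, pts_tgt: list,
                   lam1: float = 1.0, lam2: float = 1.0,
                   lam3: float = 1.0) -> float:
    # lam1..lam3 play the role of the weighting factors λ1, λ2, λ3.
    return (lam1 * cls_loss
            + lam2 * smooth_l1(box_pred, box_tgt)
            + lam3 * smooth_l1(pts_pred, pts_tgt))
```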

Training and Results – Models were trained on the public WIDER FACE dataset (32,203 images, 393,703 faces) using three backbones: MobileNet, ResNet‑50 and ResNet‑152. Evaluation on the WIDER FACE test set shows:

| Backbone   | Easy   | Medium | Hard   |
|------------|--------|--------|--------|
| MobileNet  | 90.55% | 88.08% | 73.52% |
| ResNet‑50  | 94.28% | 92.52% | 80.45% |
| ResNet‑152 | 94.74% | 93.35% | 82.91% |

Face Alignment – Alignment normalizes pose and scale by detecting facial landmarks (e.g., 5‑point, 68‑point). Detected landmarks are transformed to a set of reference points using an affine transformation. The reference points used are:

# Standard 5-point reference layout: left eye, right eye, nose tip,
# left mouth corner, right mouth corner. The +8 x-offset recenters the
# original 112x96 template inside a 112x112 crop.
REFERENCE_FACIAL_POINTS = [
    [30.29459953 + 8, 51.69630051],
    [65.53179932 + 8, 51.50139999],
    [48.02519989 + 8, 71.73660278],
    [33.54930115 + 8, 92.3655014],
    [62.72990036 + 8, 92.20410156],
]
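Mapping the detected landmarks onto these reference points amounts to a least‑squares similarity transform (scale + rotation + translation, a restricted affine map). A numpy sketch using Umeyama's method; the function name and interface are illustrative, not the ZTFace implementation:

```python
import numpy as np

def similarity_transform(src, dst) -> np.ndarray:
    """Least-squares similarity transform mapping src landmarks onto dst.

    Returns a 2x3 matrix M such that dst ≈ [x, y, 1] @ M.T, directly
    usable with an affine image warp.
    """
    src, dst = np.asarray(src, float), np.asarray(dst, float)
    src_mean, dst_mean = src.mean(0), dst.mean(0)
    src_c, dst_c = src - src_mean, dst - dst_mean

    # Cross-covariance between the centered point sets.
    cov = dst_c.T @ src_c / len(src)
    U, S, Vt = np.linalg.svd(cov)

    # D flips one axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(U @ Vt))
    D = np.diag([1.0, d])
    R = U @ D @ Vt
    scale = np.trace(np.diag(S) @ D) / src_c.var(0).sum()

    t = dst_mean - scale * R @ src_mean
    return np.hstack([scale * R, t[:, None]])  # 2x3 affine matrix
```

The resulting matrix is what an image‑warping routine (e.g. an affine warp) consumes to produce the normalized face crop.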

Feature Embedding – The core of recognition is mapping a normalized face image to a high‑dimensional vector (e.g., 128‑D or 512‑D). ZTFace uses a ResNet‑50 backbone with a 512‑D fully‑connected layer, trained with Triplet‑Loss to enforce intra‑class compactness and inter‑class separation. Softmax loss alone is insufficient for verification tasks.
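Triplet loss penalizes an anchor embedding that sits closer to a different identity (negative) than to the same identity (positive), up to a margin. A minimal sketch (the margin value is illustrative, not ZTFace's setting):

```python
import math

def l2(a, b) -> float:
    """Euclidean distance between two embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def triplet_loss(anchor, positive, negative, margin: float = 0.2) -> float:
    """Hinge on the gap between anchor-positive and anchor-negative distances.

    Zero when the negative is already at least `margin` farther away than
    the positive, i.e. the triplet is satisfied.
    """
    return max(0.0, l2(anchor, positive) - l2(anchor, negative) + margin)
```

During training, triplets are mined from batches so that the loss pulls same‑identity embeddings together and pushes different identities apart.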

Verification & Recognition – Verification computes the Euclidean distance between two 512‑D vectors; a distance below a threshold (e.g., 1.0) indicates the same identity. A sigmoid function converts distance to similarity. Recognition matches a probe vector against a gallery database.
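The verification step described above can be sketched as follows; the exact sigmoid mapping from distance to similarity is an assumption, since the article does not give ZTFace's formula:

```python
import math

def verify(emb_a, emb_b, threshold: float = 1.0):
    """Same-identity decision from two embedding vectors.

    Returns (is_same, similarity). The sigmoid is centered on the
    threshold, so similarity crosses 0.5 exactly at the decision boundary.
    """
    dist = math.sqrt(sum((a - b) ** 2 for a, b in zip(emb_a, emb_b)))
    similarity = 1.0 / (1.0 + math.exp(dist - threshold))
    return dist < threshold, similarity
```

Recognition then reduces to running this comparison between a probe embedding and every gallery embedding and keeping the best match.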

Anti‑Spoofing – A dedicated liveness detector trained on >90,000 real and spoof samples achieves 99.8% accuracy, protecting against printed photos, video replays and masks.

Attribute Recognition – A multi‑task CNN simultaneously predicts six attributes (age, gender, ethnicity, makeup, expression, mask) using six dedicated heads and combined loss functions.
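With six dedicated heads on a shared trunk, the combined objective is typically a weighted sum of per‑head classification losses. A minimal sketch (equal head weights are an assumption):

```python
import math

def cross_entropy(probs: list, label: int) -> float:
    """Cross-entropy of one head's predicted distribution vs. its label."""
    return -math.log(probs[label])

def attribute_loss(head_probs: list, labels: list, weights=None) -> float:
    """Weighted sum of cross-entropies over the attribute heads
    (age, gender, ethnicity, makeup, expression, mask)."""
    weights = weights or [1.0] * len(head_probs)
    return sum(w * cross_entropy(p, y)
               for w, p, y in zip(weights, head_probs, labels))
```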

Algorithm Acceleration – TensorRT, NVIDIA's high‑performance inference optimizer, was used to accelerate both the detection and embedding models. Detection‑model speed comparisons (PyTorch vs. TensorRT) show:

| Backbone   | Framework     | Faces detected | Time      | Model Size |
|------------|---------------|----------------|-----------|------------|
| MobileNet  | PyTorch       | 897            | 453 ms    | 1.8 MB     |
| ResNet‑50  | PyTorch       | 777            | 1158.2 ms | 109.5 MB   |
| ResNet‑152 | PyTorch       | 721            | 1278.8 ms | 248.6 MB   |
| ResNet‑50  | TensorRT‑FP32 | 766            | 65 ms     | 195.9 MB   |
| ResNet‑50  | TensorRT‑FP16 | 766            | 19 ms     | 68.1 MB    |

For the embedding model, TensorRT‑FP32 reduces inference from 31.9 ms (PyTorch) to 3 ms; half‑precision was avoided there to preserve accuracy.

Conclusion & Outlook – ZTFace delivers a highly accurate and fast face‑recognition pipeline suitable for attendance, mobile login, smart city management, construction site safety and in‑vehicle data collection. Future work will continue to improve robustness and expand application scenarios.

Tags: computer vision, deep learning, TensorRT, face recognition, ZTFace
Written by Zhengtong Technical Team

How do 700+ nationwide projects deliver quality service? What inspiring stories lie behind dozens of product lines? Where is the efficient solution for tens of thousands of customer needs each year? This is Zhengtong Digital's technical practice sharing—a bridge connecting engineers and customers!
