ZTFace: A High‑Precision, Fast Face Recognition Algorithm
This article presents ZTFace, an end‑to‑end face recognition solution that integrates face detection, alignment, feature embedding, verification, anti‑spoofing and attribute recognition using deep learning. It details the backbone networks, loss functions and training datasets, reports experimental results on WIDER FACE and LFW, and demonstrates inference acceleration with TensorRT.
Background – Face recognition is a fundamental yet challenging problem in computer vision, especially when deployed in real‑world commercial scenarios. Historically, large Chinese companies such as Baidu, Face++ and ArcSoft have dominated the market. The ZTFace algorithm breaks the reliance on third‑party services, offering flexible, high‑performance detection and recognition.
Technical Overview – A complete face‑recognition pipeline typically includes the following modules: face detection, face alignment, feature embedding, face verification, face recognition, anti‑spoofing (liveness detection) and attribute recognition (gender, age, ethnicity, expression, mask wearing, etc.). Deep learning advances have made these modules feasible for practical deployment.
Deep Learning Foundations – Convolutional Neural Networks (CNNs) are the backbone of image‑based tasks. Since AlexNet's 2012 ImageNet breakthrough, deeper networks (10 to 100+ layers) and frameworks such as TensorFlow and PyTorch have enabled rapid development. The article briefly explains the biological inspiration behind deep learning and illustrates CNN training with a cat‑vs‑dog classification example.
4.1 Face Detection – The detection model is built on a RetinaNet‑style anchor‑based detector. Anchors are designed with multiple scales but a fixed 1:1 aspect ratio to match typical face shapes. The network predicts three branches: (1) binary classification of each anchor, (2) regression of bounding‑box offsets (x, y, w, h), and (3) regression of facial key‑point offsets.
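The anchor layout described above can be sketched as follows. This is a minimal illustration of multi‑scale, 1:1 anchors on a feature‑map grid; the grid size, stride and scale values are assumptions for the example, not the actual ZTFace settings:

```python
import numpy as np

def generate_anchors(feature_size, stride, scales):
    """Generate square (1:1 aspect ratio) anchors for one feature map.

    Illustrative helper, not the ZTFace implementation. Each grid cell
    gets one anchor per scale, centered on the cell. Rows are (cx, cy, w, h).
    """
    h, w = feature_size
    anchors = []
    for y in range(h):
        for x in range(w):
            # Anchor centers sit at stride-spaced grid positions on the input image.
            cx, cy = (x + 0.5) * stride, (y + 0.5) * stride
            for s in scales:
                # 1:1 aspect ratio to match typical face shapes: width == height.
                anchors.append([cx, cy, s, s])
    return np.array(anchors, dtype=np.float32)

anchors = generate_anchors((4, 4), stride=8, scales=(16, 32))
print(anchors.shape)  # (32, 4): 4*4 positions * 2 scales
```

At inference time, the classification branch scores each such anchor as face/non‑face, while the two regression branches refine its box and predict key‑point offsets relative to it.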
Backbone Network – Pre‑trained ImageNet models such as VGG16 or ResNet serve as feature extractors. Feature Pyramid Network (FPN) aggregates multi‑scale features, and a Context Module further fuses them to capture surrounding semantics.
Multi‑Task Loss – The overall loss combines three components: classification loss (binary softmax), bounding‑box regression loss (Smooth‑L1), and key‑point regression loss (Smooth‑L1). Weighting factors λ₁, λ₂, λ₃ balance the tasks.
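The weighted combination above can be sketched in NumPy as follows. The λ weights and the Smooth‑L1 β are placeholders, since the article does not give their values:

```python
import numpy as np

def smooth_l1(pred, target, beta=1.0):
    """Smooth-L1 (Huber) loss, used here for both box and key-point regression."""
    d = np.abs(pred - target)
    return np.where(d < beta, 0.5 * d ** 2 / beta, d - 0.5 * beta).mean()

def binary_ce(prob_face, is_face):
    """Binary classification loss on each anchor's face probability."""
    eps = 1e-7
    p = np.clip(prob_face, eps, 1 - eps)
    return -(is_face * np.log(p) + (1 - is_face) * np.log(1 - p)).mean()

def detection_loss(cls_prob, cls_gt, box_pred, box_gt, kp_pred, kp_gt,
                   lambdas=(1.0, 1.0, 1.0)):
    """Weighted sum of the three branch losses; lambdas stand in for
    the λ1, λ2, λ3 weights, whose actual values are not given."""
    l1, l2, l3 = lambdas
    return (l1 * binary_ce(cls_prob, cls_gt)
            + l2 * smooth_l1(box_pred, box_gt)
            + l3 * smooth_l1(kp_pred, kp_gt))
```

In training, the regression terms are typically computed only on anchors matched to a ground‑truth face, while the classification term covers all anchors.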
Training and Results – Models were trained on the public WIDER FACE dataset (32,203 images, 393,703 faces) using three backbones: MobileNet, ResNet‑50 and ResNet‑152. Evaluation on the WIDER FACE test set shows:
| Backbone | Easy | Medium | Hard |
| --- | --- | --- | --- |
| MobileNet | 90.55% | 88.08% | 73.52% |
| ResNet‑50 | 94.28% | 92.52% | 80.45% |
| ResNet‑152 | 94.74% | 93.35% | 82.91% |
Face Alignment – Alignment normalizes pose and scale by detecting facial landmarks (e.g., 5‑point, 68‑point). Detected landmarks are transformed to a set of reference points using an affine transformation. The reference points used are:
```python
REFERENCE_FACIAL_POINTS = [
    [30.29459953 + 8, 51.69630051],   # left eye
    [65.53179932 + 8, 51.50139999],   # right eye
    [48.02519989 + 8, 71.73660278],   # nose tip
    [33.54930115 + 8, 92.3655014],    # left mouth corner
    [62.72990036 + 8, 92.20410156],   # right mouth corner
]
```

Feature Embedding – The core of recognition is mapping a normalized face image to a fixed‑length vector (e.g., 128‑D or 512‑D). ZTFace uses a ResNet‑50 backbone with a 512‑D fully‑connected layer, trained with triplet loss to enforce intra‑class compactness and inter‑class separation; softmax loss alone is insufficient for verification tasks.
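The triplet criterion can be sketched as follows: the anchor‑positive distance must be smaller than the anchor‑negative distance by at least a margin. The margin value and the synthetic embeddings below are illustrative; the article does not state the actual training hyperparameters:

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=0.2):
    """Triplet loss on embedding vectors. Zero once the negative is at least
    `margin` farther from the anchor than the positive is."""
    d_ap = np.linalg.norm(anchor - positive)
    d_an = np.linalg.norm(anchor - negative)
    return max(0.0, float(d_ap - d_an + margin))

def l2_normalize(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(0)
a = l2_normalize(rng.normal(size=512))               # anchor embedding
p = l2_normalize(a + 0.01 * rng.normal(size=512))    # same identity, slight perturbation
n = l2_normalize(rng.normal(size=512))               # different identity

print(triplet_loss(a, p, n))  # 0.0: the negative is already margin-far from the anchor
```

In practice the network, not random noise, produces the embeddings, and hard‑negative mining is usually needed so that most triplets contribute a non‑zero gradient.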
Verification & Recognition – Verification computes the Euclidean distance between two 512‑D vectors; a distance below a threshold (e.g., 1.0) indicates the same identity. A sigmoid function converts distance to similarity. Recognition matches a probe vector against a gallery database.
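The verification rule can be sketched as follows. The 1.0 distance threshold comes from the article; the exact sigmoid used to map distance to a similarity score is not specified, so the mapping below is one plausible choice:

```python
import numpy as np

def verify(emb1, emb2, threshold=1.0):
    """Compare two embeddings: Euclidean distance below the threshold means
    'same identity'. The sigmoid mapping to a similarity score is an
    assumed form, centered on the threshold."""
    dist = float(np.linalg.norm(emb1 - emb2))
    similarity = 1.0 / (1.0 + np.exp(dist - threshold))
    return dist < threshold, similarity

rng = np.random.default_rng(1)
e1 = rng.normal(size=512); e1 /= np.linalg.norm(e1)
e2 = e1 + 0.01 * rng.normal(size=512); e2 /= np.linalg.norm(e2)  # near-duplicate

same, score = verify(e1, e2)
print(same)  # True: nearly identical embeddings fall well under the threshold
```

Recognition then reduces to running this comparison (or a nearest‑neighbor search on the same distance) between the probe vector and every gallery vector.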
Anti‑Spoofing – A dedicated liveness detector trained on >90,000 real and spoof samples achieves 99.8% accuracy, protecting against printed photos, video replays and masks.
Attribute Recognition – A multi‑task CNN simultaneously predicts six attributes (age, gender, ethnicity, makeup, expression, mask) using six dedicated heads and combined loss functions.
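The six‑head layout can be sketched as one prediction head per attribute over a shared backbone feature. The class counts per head and the random weights are assumptions for illustration, since the article only names the attributes:

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical number of classes per head; not given in the article.
HEADS = {"age": 8, "gender": 2, "ethnicity": 4,
         "makeup": 2, "expression": 7, "mask": 2}

# One linear head per attribute; in the real multi-task CNN these are
# trained jointly, with the per-head losses summed into a combined loss.
weights = {name: rng.normal(size=(512, n)) * 0.01 for name, n in HEADS.items()}

def predict_attributes(feature):
    """Run every head on the shared 512-D feature and take the argmax class."""
    return {name: int(np.argmax(feature @ w)) for name, w in weights.items()}

feat = rng.normal(size=512)  # stand-in for one face's backbone feature
print(predict_attributes(feat))
```

Sharing the backbone across heads is what makes a single forward pass cover all six attributes.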
Algorithm Acceleration – TensorRT, NVIDIA's high‑performance inference optimizer, was used to accelerate both the detection and embedding models. Detection speed comparisons (PyTorch vs. TensorRT):

| Backbone | Framework | Faces detected | Time | Model size |
| --- | --- | --- | --- | --- |
| MobileNet | PyTorch | 897 | 453 ms | 1.8 MB |
| ResNet‑50 | PyTorch | 777 | 1158.2 ms | 109.5 MB |
| ResNet‑152 | PyTorch | 721 | 1278.8 ms | 248.6 MB |
| ResNet‑50 | TensorRT‑FP32 | 766 | 65 ms | 195.9 MB |
| ResNet‑50 | TensorRT‑FP16 | 766 | 19 ms | 68.1 MB |
For the embedding model, TensorRT‑FP32 reduces inference from 31.9 ms (PyTorch) to 3 ms, a roughly 10× speedup; half‑precision was avoided here to preserve accuracy.
Conclusion & Outlook – ZTFace delivers a highly accurate and fast face‑recognition pipeline suitable for attendance, mobile login, smart city management, construction site safety and in‑vehicle data collection. Future work will continue to improve robustness and expand application scenarios.
Zhengtong Technical Team
How do 700+ nationwide projects deliver quality service? What inspiring stories lie behind dozens of product lines? Where are the efficient solutions to tens of thousands of customer needs each year? This is Zhengtong Digital's technical practice series: a bridge connecting engineers and customers!