
TDQA: A No-Reference Deep Learning Based Video Quality Assessment Algorithm for Live Streaming

TDQA is a no‑reference, deep‑learning video quality assessment algorithm designed for live streaming. Built on a large subjectively annotated dataset and an end‑to‑end architecture with fine‑tuned backbones, it achieves state‑of‑the‑art accuracy and sub‑second inference, enabling real‑time quality monitoring and pipeline optimization.

Tencent Music Tech Team

Background: With the rise of live streaming, short videos, and other audio‑video applications, platforms such as QQ Music and 全民K歌 (WeSing) serve hundreds of millions of active users who generate massive amounts of video content daily. Ensuring end‑to‑end video quality for live‑streaming and video services, and optimizing the whole delivery chain, therefore depends on an efficient and accurate video quality assessment system.

This article shares the TDQA (TME Deep‑learning based Quality Assessment) algorithm developed by the Tencent Music technology team, a no‑reference video clarity assessment method tailored for live‑streaming scenarios. It describes dataset construction, model design and training, and result analysis.

Research status: Video Quality Assessment (VQA) is a specialized field. Subjective methods are labor‑intensive, while objective methods can be full‑reference, reduced‑reference, or no‑reference. No‑reference methods are most practical but challenging due to the diversity of video content. Traditional no‑reference approaches rely on handcrafted features, whereas recent deep‑learning methods use CNNs to extract robust quality features and map them to subjective scores.

TDQA algorithm: TDQA is an end‑to‑end, no‑reference video clarity assessment algorithm that does not require a high‑quality reference video. It processes video frames through preprocessing, feature extraction, and a regression/classification head to output a quality score.

3.1 Dataset construction: Existing public IQA datasets (e.g., TID2013, KonIQ‑10K) are limited in size and mismatched in domain for live‑streaming content. Therefore, a large, high‑quality, subjectively annotated dataset specific to live‑streaming scenarios was built.

3.1.1 Data sampling: Video clips were sampled according to three principles: (1) continuous clarity distribution from blurry to HD; (2) inclusion of diverse scenes (indoor, outdoor, avatar, sandwich, etc.); (3) coverage of various Tencent products beyond 全民K歌.

3.1.2 Scoring method: A three‑level rating scheme was adopted (instead of five) to simplify crowdsourcing. Approximately 30 groups of videos were annotated by about 100 external crowd workers, with each video receiving around 70 ratings. Redundant samples were used for consistency checks.
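As a concrete illustration, the roughly 70 crowd ratings per video can be aggregated into a mean opinion score. The 1–3 numeric scale and simple averaging below are assumptions for illustration; the article only specifies that a three‑level scheme was used.

```python
from statistics import mean

def mean_opinion_score(ratings):
    """Aggregate crowd ratings into a mean opinion score (MOS).

    Assumed scale (illustrative): 1 = blurry, 2 = fair, 3 = HD.
    Simple averaging is an assumption; weighting or trimming schemes
    are also common in practice.
    """
    return mean(ratings)

# e.g. a subset of the ~70 ratings collected for one clip
ratings = [3, 2, 3, 3, 2, 1, 3, 2, 3, 3]
mos = mean_opinion_score(ratings)  # 2.5
```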

3.1.3 Data cleaning: Three rules were applied: (1) if a worker’s scores for the same video differed greatly, all their annotations in that group were discarded; (2) if a worker gave the same score for >90% of items, their data were discarded; (3) if a worker’s scores deviated significantly from the group mean, their data were discarded.
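The three cleaning rules can be sketched as a per‑worker filter. The 90% same‑score threshold comes from the article; the repeat tolerance and deviation limit are assumed values, since exact thresholds are not stated.

```python
from statistics import mean

def clean_worker(worker_scores, group_means,
                 repeat_tol=1.0, same_frac=0.9, dev_tol=1.0):
    """Apply the three data-cleaning rules to one worker's annotations.

    worker_scores: {video_id: [scores given, including redundant repeats]}
    group_means:   {video_id: mean score across all workers}
    repeat_tol and dev_tol are illustrative assumptions; same_frac=0.9
    corresponds to the article's ">90% identical scores" rule.
    Returns True if the worker's data should be kept.
    """
    all_scores = [s for v in worker_scores.values() for s in v]
    # Rule 1: inconsistent scores on repeats of the same video
    for scores in worker_scores.values():
        if len(scores) > 1 and max(scores) - min(scores) > repeat_tol:
            return False
    # Rule 2: the same score given for more than 90% of items
    most_common = max(all_scores.count(s) for s in set(all_scores))
    if most_common / len(all_scores) > same_frac:
        return False
    # Rule 3: large average deviation from the group mean
    devs = [abs(mean(v) - group_means[vid])
            for vid, v in worker_scores.items()]
    if mean(devs) > dev_tol:
        return False
    return True
```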

3.2 Algorithm and analysis:

3.2.1 Frame input size: Unlike traditional methods that use full resolution, TDQA resizes frames to 384×512 (close to 16:9/4:3) and pads with black borders, preserving sufficient visual information while keeping GPU memory manageable.
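A minimal letterboxing sketch of this preprocessing step, assuming centered black padding; nearest‑neighbor scaling keeps the sketch dependency‑free, whereas production code would use a proper bilinear resize:

```python
import numpy as np

def resize_pad(frame, target_h=384, target_w=512):
    """Scale a frame to fit 384x512 and pad the rest with black borders.

    Nearest-neighbor index sampling stands in for a real resize so that
    the sketch needs only NumPy; the centered padding layout is an
    assumption for illustration.
    """
    h, w = frame.shape[:2]
    scale = min(target_h / h, target_w / w)       # fit inside the target
    new_h, new_w = int(h * scale), int(w * scale)
    ys = (np.arange(new_h) / scale).astype(int)   # nearest source rows
    xs = (np.arange(new_w) / scale).astype(int)   # nearest source cols
    resized = frame[ys][:, xs]
    out = np.zeros((target_h, target_w) + frame.shape[2:], dtype=frame.dtype)
    top = (target_h - new_h) // 2                 # center the content
    left = (target_w - new_w) // 2
    out[top:top + new_h, left:left + new_w] = resized
    return out

frame = np.full((720, 1280, 3), 255, dtype=np.uint8)  # a white 720p frame
padded = resize_pad(frame)                            # -> shape (384, 512, 3)
```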

3.2.2 End‑to‑end network architecture: The model consists of four mandatory modules—preprocessing, backbone network, Global Average Pooling (GAP), and a classification/regression head. Optional feature‑fusion modules (e.g., hyper‑column, bilinear pooling) can be added after the backbone. Common backbones include ResNet‑18, MobileNet, or Inception‑ResNet‑v2.
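The four‑module pipeline can be sketched in PyTorch. The tiny convolutional backbone below is merely a stand‑in for ResNet‑18 / MobileNet / Inception‑ResNet‑v2, and the three‑way head assumes the dataset's three‑level rating; preprocessing is taken to happen outside the module.

```python
import torch
import torch.nn as nn

class TDQANet(nn.Module):
    """Sketch of the backbone -> GAP -> head structure described above.

    The two-layer conv stack is an illustrative stand-in for the real
    backbones named in the article; n_classes=3 assumes the three-level
    rating scheme (a regression head would be nn.Linear(32, 1)).
    """
    def __init__(self, n_classes=3):
        super().__init__()
        self.backbone = nn.Sequential(             # stand-in feature extractor
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.gap = nn.AdaptiveAvgPool2d(1)         # Global Average Pooling
        self.head = nn.Linear(32, n_classes)       # classification head

    def forward(self, x):
        feats = self.backbone(x)                   # (B, 32, H/4, W/4)
        pooled = self.gap(feats).flatten(1)        # (B, 32)
        return self.head(pooled)                   # (B, n_classes)

frames = torch.randn(2, 3, 384, 512)               # batch of padded frames
scores = TDQANet()(frames)                         # -> shape (2, 3)
```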

3.2.3 Model training: A two‑step fine‑tuning strategy is used. First, the backbone is pretrained on ImageNet; second, it is fine‑tuned on the constructed live‑streaming dataset. Example hyper‑parameters for Inception‑ResNet‑v2 are shown below.

Step 1:
{ epoch: 70, lr: 3e-4, batch_size: 36, weight_decay: 5e-6, … }

Step 2:
{ epoch: 35, lr: 6e-5, batch_size: 36, weight_decay: 5e-4, … }
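A sketch of how the two stages might drive an optimizer. The choice of Adam is an assumption (the article lists only epochs, learning rate, batch size, and weight decay), and the small linear model is a stand‑in for the ImageNet‑pretrained Inception‑ResNet‑v2:

```python
import torch
import torch.nn as nn

# Stand-in model; in the article this would be an ImageNet-pretrained
# Inception-ResNet-v2 backbone plus head.
model = nn.Linear(8, 3)

stages = [  # hyper-parameters from the article's two training steps
    dict(epochs=70, lr=3e-4, weight_decay=5e-6),  # step 1
    dict(epochs=35, lr=6e-5, weight_decay=5e-4),  # step 2
]

for stage in stages:
    # Adam is an assumed optimizer choice, not stated in the article.
    opt = torch.optim.Adam(model.parameters(),
                           lr=stage["lr"],
                           weight_decay=stage["weight_decay"])
    # for epoch in range(stage["epochs"]):
    #     ... train on batches of 36 frames ...
```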

3.3 Deployment and applications: After fine‑tuning, TDQA achieves PLCC and SRCC comparable to state‑of‑the‑art methods. Optimizations in video decoding, inference, and service loading enable prediction times under one second per video, suitable for real‑time monitoring. The model has been integrated into multiple internal live‑streaming and video services, providing continuous quality monitoring and guiding quality improvements.
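PLCC and SRCC, the two metrics cited here, can be computed with a dependency‑free sketch; note the simple ranking used for SRCC below ignores ties, which a production implementation would handle:

```python
from math import sqrt

def plcc(x, y):
    """Pearson linear correlation coefficient between predictions and MOS."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def srcc(x, y):
    """Spearman rank correlation: PLCC computed on the ranks.

    This naive ranking assigns distinct ranks even to tied values,
    which is fine for illustration but not for real evaluation.
    """
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        for rank, idx in enumerate(order):
            r[idx] = float(rank)
        return r
    return plcc(ranks(x), ranks(y))

mos = [1, 2, 3, 4, 5]
pred = [1.1, 1.9, 3.2, 3.8, 5.0]   # monotone with MOS, so SRCC is 1.0
```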

Conclusion: End‑to‑end video quality assessment is indispensable for enhancing user experience and optimizing the live‑streaming pipeline. This paper presented the research status, dataset construction, network design, training, and deployment of a no‑reference clarity assessment algorithm for live‑streaming. Future work includes extending the method to other scenarios such as image aesthetic assessment and improving cross‑domain generalization.

References: (omitted for brevity)

Tags: live streaming, deep learning, model training, dataset construction, no-reference, video quality assessment, TDQA
Written by the Tencent Music Tech Team, the public account of Tencent Music's development team, focusing on technology sharing and communication.