
How to Achieve Ultra-Fast Video Transcoding with Multi‑Threading and Clustering

This article examines the factors affecting media transcoding duration, explains why decoding and encoding dominate processing time, and presents multi‑threaded FFmpeg techniques and a cluster‑based architecture that splits video into TS segments to dramatically reduce overall transcoding latency.

360 Zhihui Cloud Developer

Media Transcoding Overview

With the rapid development of digital media, media formats and usage scenarios have become increasingly diverse. Media transcoding services convert original media files into formats suitable for different platforms, devices, and applications, meeting user needs for viewing, playback, storage, and transmission.

Transcoding duration is a key metric for evaluating service performance and efficiency. Understanding the factors that affect transcoding time and how to optimize the process is crucial for improving user experience, reducing costs, and enhancing business competitiveness.

Transcoding Process

Video transcoding essentially involves demuxing, decoding, re‑encoding, and remuxing compressed video, as illustrated below:

The majority of processing time is spent on decoding and re‑encoding, not on demuxing or remuxing.

1. Video encoding requires complex operations such as motion estimation, DCT transformation, quantization, and entropy coding. For example, H.264 motion estimation traverses multiple reference frames, and the cost of an exhaustive search grows roughly quadratically with the search range. Demuxing only parses file headers, which is far less intensive.
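A back‑of‑envelope calculation makes the motion‑estimation cost concrete. The sketch below (all function names hypothetical) counts candidate positions for a full search over 16×16 macroblocks; real encoders use faster hierarchical searches, but the scaling trend is the point:

```python
# Back-of-envelope cost of exhaustive block-matching motion estimation.
# For a search range of R pixels, each macroblock compares (2R + 1)^2
# candidate positions, each costing one SAD over the block's pixels.

def me_candidates(search_range: int) -> int:
    """Candidate positions per macroblock for a full search."""
    return (2 * search_range + 1) ** 2

def me_pixel_ops(width: int, height: int, search_range: int, block: int = 16) -> int:
    """Approximate pixel comparisons per frame, per reference frame."""
    macroblocks = (width // block) * (height // block)
    return macroblocks * me_candidates(search_range) * block * block

# Doubling the search range roughly quadruples the work:
print(me_candidates(16))  # 1089
print(me_candidates(32))  # 4225
```

Multiply the per‑frame figure by the number of reference frames and the frame rate, and the gap between encoding and header parsing becomes obvious.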

2. Decoding is largely serial; B‑frames must wait for preceding and following reference frames to be decoded, causing pipeline stalls.

In short, encoding is computationally expensive and decoding involves blocking waits, which is why these two stages dominate total transcoding time.

1. Multi‑Threaded Encoding/Decoding

Multi‑threading offers far higher efficiency than single‑threaded processing. FFmpeg supports multi‑threaded encoding and decoding. The data flow in FFmpeg during transcoding is shown below:

AVPacket stores encoded data (e.g., H.264, H.265, VP8), while AVFrame holds decoded data (YUV). FFmpeg submits decoding tasks per packet, and each thread in the thread list processes a decoding task.

The main thread extracts decoding threads from the list in order and submits tasks. After submission, it also retrieves completed frames from the threads, adjusting for B‑frame ordering when necessary.
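The submit‑then‑drain pattern described above can be sketched with a thread pool. This is a simplified simulation of the data flow, not FFmpeg's actual API; the `decode` stand‑in and all names are hypothetical:

```python
from concurrent.futures import ThreadPoolExecutor

def decode(packet: int) -> str:
    # Stand-in for a worker thread decoding one AVPacket into an AVFrame.
    return f"frame-{packet}"

def transcode(packets, workers: int = 4):
    # The main thread submits one decode task per packet, then drains
    # results in submission order -- mirroring how FFmpeg's frame threads
    # hand frames back with a fixed delay so output order is preserved.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(decode, p) for p in packets]
        return [f.result() for f in futures]

print(transcode(range(5)))  # frames come back in packet order
```

Collecting results in submission order is what lets the remuxer see frames in presentation order even when worker threads finish out of order.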

Performance measurements for transcoding a 1080p video to 720p with different thread counts are summarized below:

1 thread: Total time 42.78 s, CPU utilization 110 % (single core busy, others idle).

2 threads: Total time 25.81 s, CPU utilization 189 %, 39.7 % speedup – initial parallelization of decode/encode pipeline.

4 threads: Total time 19.93 s, CPU utilization 257 %, 53.4 % speedup – physical core saturation, memory bandwidth becomes bottleneck.

8 threads: Total time 15.52 s, CPU utilization 315 %, 63.7 % speedup – diminishing returns.

16 threads: Total time 15–18 s, CPU utilization 320–350 %, roughly 58–65 % speedup – thread contention increases, and some scenarios regress.

FFmpeg’s multi‑threaded transcoding shows clear diminishing returns due to thread synchronization overhead and competition for resources. The machine configuration of a single transcoding unit determines the upper bound of transcoding time, and simply increasing thread count is not an optimal solution when multiple tasks run concurrently.
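Recomputing per‑thread efficiency from the measurements above makes the diminishing returns explicit (a small calculation over the article's own numbers):

```python
# Speedup (baseline / time) and parallel efficiency (speedup / threads)
# from the measured 1080p -> 720p transcoding timings.
baseline = 42.78
timings = {1: 42.78, 2: 25.81, 4: 19.93, 8: 15.52}

for threads, t in timings.items():
    speedup = baseline / t
    efficiency = speedup / threads
    print(f"{threads:2d} threads: {speedup:.2f}x speedup, {efficiency:.0%} efficiency")
```

Efficiency falls from 100 % at one thread to roughly a third at eight, which is the numerical face of the synchronization overhead and resource contention described above.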

2. Using a Transcoding Cluster for Ultra‑Fast Transcoding

2.1 Current Transcoding Cluster

Typical clusters fetch tasks from a queue, download the source video, transcode it, and upload the result to storage, as illustrated below:

Enabling FFmpeg’s multi‑threaded mode and adjusting the maximum task count can speed up processing, but performance remains limited by each individual transcoding unit.
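An invocation like the one assembled below enables multi‑threaded decode and encode. This is an illustrative command builder, not the cluster's actual configuration; filenames and defaults are assumptions:

```python
def build_transcode_cmd(src: str, dst: str, threads: int = 8, height: int = 720):
    # ffmpeg's -threads option sets the thread count for the component it
    # precedes; scale=-2:<height> keeps the aspect ratio with an even width.
    return [
        "ffmpeg", "-y",
        "-threads", str(threads),   # decoder threads
        "-i", src,
        "-vf", f"scale=-2:{height}",
        "-c:v", "libx264",
        "-threads", str(threads),   # encoder threads
        dst,
    ]

cmd = build_transcode_cmd("input.mp4", "output_720p.mp4")
print(" ".join(cmd))
```

Passing the command as a list (e.g. to `subprocess.run`) avoids shell quoting issues when filenames come from user uploads.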

2.2 Rapid Transcoding Cluster Design

Because remuxing is cheap compared with decoding and re‑encoding, we can split the uploaded MP4/FLV files into TS segments as a pure remux and generate an M3U8 index. Each TS segment can then be transcoded independently on separate units, fully utilizing cluster resources.
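The splitting step can be expressed with ffmpeg's segment muxer, which writes numbered TS files and an M3U8 index in one pass. A sketch of such a command (segment duration and filenames are assumptions):

```python
def build_segment_cmd(src: str, seg_time: int = 10):
    # Splitting is a pure remux (-c copy), so no decoding or encoding
    # happens; ffmpeg cuts at keyframes near the requested duration.
    return [
        "ffmpeg", "-y", "-i", src,
        "-c", "copy",                    # remux only; no re-encoding
        "-f", "segment",
        "-segment_time", str(seg_time),
        "-segment_format", "mpegts",
        "-segment_list", "index.m3u8",   # M3U8 index over the segments
        "seg%05d.ts",
    ]

print(" ".join(build_segment_cmd("input.mp4")))
```

Because stream‑copy splitting can only cut at keyframes, the actual segment durations vary slightly around the requested value.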

After remuxing, the TS files are uploaded to storage.

The master creates sub‑task lists in Redis, each containing the TS segment URL, transcoding parameters, and task status. Slaves poll the list, check CPU usage, and only idle units fetch sub‑tasks.
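The master/slave protocol above can be sketched as follows. A plain in‑memory list stands in for the Redis sub‑task list here, and the record schema and CPU threshold are assumptions for illustration:

```python
# In-memory stand-in for the Redis sub-task list (hypothetical schema).
task_list = []

def create_subtasks(ts_urls, params):
    # Master: one sub-task per TS segment, with URL, parameters, and status.
    for url in ts_urls:
        task_list.append({"ts_url": url, "params": params, "status": "pending"})

def fetch_subtask(cpu_usage, cpu_threshold=0.8):
    # Slave: poll for work only when the unit is idle enough.
    if cpu_usage >= cpu_threshold:
        return None
    for task in task_list:
        if task["status"] == "pending":
            task["status"] = "running"
            return task
    return None

create_subtasks(["seg00000.ts", "seg00001.ts"], {"height": 720})
print(fetch_subtask(cpu_usage=0.3))   # claims the first pending segment
print(fetch_subtask(cpu_usage=0.95))  # None: unit too busy
```

With real Redis, claiming a sub‑task would need an atomic operation (e.g. a list pop or a transaction) so two slaves cannot grab the same segment; the linear scan here is only for readability.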

Compared with a single‑unit approach, the cluster adds three small storage I/O operations (master uploads TS segments, slaves download segments, slaves upload transcoded segments), which have negligible impact on overall latency.

2.3 Data Comparison

With 30 compute units processing 30 TS segments in parallel, the theoretical transcoding time drops from ~600 s (serial) to about 20 s. In testing, compared with standard single‑unit transcoding, the measured speed‑up ratios were 24.5× at 480p, 27.79× at 720p, and 36.72× at 1080p.

These findings demonstrate that multi‑threaded FFmpeg combined with a well‑designed clustering strategy can dramatically reduce video transcoding latency, meeting stringent user experience requirements.

Tags: Performance Optimization, Multithreading, FFmpeg, Video Transcoding, Cluster Computing
Written by 360 Zhihui Cloud Developer

360 Zhihui Cloud is an enterprise open service platform that aims to "aggregate data value and empower an intelligent future," leveraging 360's extensive product and technology resources to deliver platform services to customers.
