Live Streaming Network Transmission: Protocols, Encoding, Decoding, and Synchronization
This article explains the end‑to‑end live‑streaming workflow, covering how a broadcaster pushes video and audio to a server, the various streaming protocols (RTMP, HTTP‑FLV, HLS, RTP), encoding formats, FFmpeg‑based decoding, hardware vs software decoding, and audio‑video synchronization techniques.
3. Network Transmission
Network transmission is the third stage of the entire live‑streaming process.
The transmission stage consists of three parts: the broadcaster pushes the stream to the server, the server distributes the stream, and the audience pulls the stream.
The diagram below shows the architecture of video live streaming.
Let's look at the general flow of broadcaster push and audience pull.
First, a broadcaster starts a live session and must install a push‑stream SDK.
Then they request a push URL from the live‑cloud server – essentially a room number for the stream.
Without a room, there is nothing to stream.
The broadcaster also needs a push address to send the live data to the server.
For example, the streamer "XuXuBaobao" uses room number 99999, which maps to a fixed set of push addresses.
Why multiple addresses? A single stream is usually assigned several push URLs, typically pointing at different ingest nodes, so the broadcaster's device can pick the nearest one or fail over if one is unreachable.
Next, the broadcaster captures video using a phone camera or dedicated recording equipment.
The raw video data is generated in RGB or YUV (usually YUV) format.
Audio is captured via microphone and generated in raw PCM format.
Then each is encoded:
Video: YUV → H.264 (or H.265)
Audio: PCM → AAC
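The gap these encoders close can be quantified with quick arithmetic. A minimal sketch, assuming typical illustrative parameters (1080p at 30 fps in YUV420p, CD-quality stereo PCM, and common target bitrates of 4 Mbps for H.264 and 128 kbps for AAC — none of these figures come from this article):

```python
# Back-of-the-envelope data rates showing why raw YUV/PCM cannot be
# pushed over the network directly. All parameters are typical values.

def yuv420p_bitrate(width: int, height: int, fps: int) -> int:
    """Raw YUV420p rate in bits per second (1.5 bytes per pixel)."""
    bytes_per_frame = width * height * 3 // 2
    return bytes_per_frame * 8 * fps

def pcm_bitrate(sample_rate: int, bits: int, channels: int) -> int:
    """Raw PCM rate in bits per second."""
    return sample_rate * bits * channels

raw_video = yuv420p_bitrate(1920, 1080, 30)   # ~746 Mbps
raw_audio = pcm_bitrate(44_100, 16, 2)        # ~1.41 Mbps

print(f"raw video : {raw_video / 1e6:.0f} Mbps")
print(f"raw audio : {raw_audio / 1e6:.2f} Mbps")
print(f"H.264 shrinks video ~{raw_video / 4e6:.0f}x at a 4 Mbps target")
print(f"AAC shrinks audio ~{raw_audio / 128e3:.0f}x at a 128 kbps target")
```

Even at these modest settings, raw video alone would need roughly 746 Mbps of sustained uplink, which is why encoding always happens before transmission.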
These steps were described earlier.
After encoding, the video and audio streams are multiplexed into FLV, TS, or RTMP packets, depending on the transmission protocol used.
The resulting files are called media stream files.
This introduces two concepts: media streaming and transmission protocols.
3.1 A Quick Primer: Media Streaming
Media streaming refers to compressing a continuous series of media data, segmenting it for online transmission, and delivering audio‑video in real time for immediate viewing; the data packets flow like a stream.
Why is media streaming useful?
If you want to watch a movie without streaming, you must download the entire file first, which is impractical for live content.
Without streaming, you would have to wait until a broadcast ends to view the recorded video.
Streaming breaks this limitation by delivering live or pre‑stored media instantly.
When the viewer receives the media, a playback program starts playing it immediately.
In short: you send as much as you have, and the player plays it.
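The "send as much as you have, play as it arrives" idea can be modeled with a Python generator: the source yields chunks as they become available, and a streaming player consumes each one immediately, whereas a download-then-play approach blocks until the source ends. All names here are illustrative, not a real player API.

```python
from typing import Iterator, List

def media_source(total_chunks: int) -> Iterator[bytes]:
    """Simulated encoder output: chunks become available one at a time."""
    for i in range(total_chunks):
        yield f"chunk-{i}".encode()

def streaming_play(source: Iterator[bytes]) -> List[str]:
    """Streaming: render each chunk the moment it arrives."""
    played = []
    for chunk in source:
        played.append(f"playing {chunk.decode()}")  # render immediately
    return played

def download_then_play(source: Iterator[bytes]) -> List[str]:
    """No streaming: wait for the whole file first - impossible for live."""
    whole_file = list(source)  # blocks until the source is exhausted
    return [f"playing {c.decode()}" for c in whole_file]

print(streaming_play(media_source(3))[0])  # first chunk plays right away
```

Both functions produce the same playback list; the difference that matters for live content is *when* playback starts, which only the streaming version can do before the source ends.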
3.2 Transmission Protocol – Push
What is a transmission protocol ?
The broadcaster’s device performs a series of operations: video capture, noise reduction, image processing, flow control, beauty filters, etc. These are handled by the live‑stream SDK, not manually by the broadcaster.
The SDK generates the media stream files.
After the data is ready, it must be pushed to the server.
The method of pushing depends on the chosen protocol.
Common domestic live-stream protocols include RTMP, HLS, HDL (HTTP-FLV), and RTP.
RTMP:
Originally Adobe's proprietary protocol; most CDNs outside China no longer support it.
CDN (Content Delivery Network) is a network designed to efficiently deliver rich media over traditional IP networks.
Think of it as a network accelerator.
RTMP is popular in China for several reasons:
1. Strong support from open‑source software and libraries (e.g., OBS, librtmp, nginx‑rtmp plugin).
OBS is an open‑source PC push‑stream tool for Windows and macOS, free and extensible, suitable for game and show streaming.
librtmp is an open‑source library that supports RTMP push.
2. Historically high client penetration – any browser with Flash Player could play RTMP streams directly.
3. Low latency – typical RTMP live streams see 2–5 seconds of delay, better than HLS's 5–20 seconds. Note, however, that RTMP's multi-step TCP handshake can add over 100 ms before playback starts, and network loss can increase the delay further.
Since Adobe ended Flash support at the end of 2020 and browsers removed the plugin, RTMP can no longer be played directly in the browser.
A common workaround is to repackage the RTMP stream as HTTP-FLV and play it with flv.js.
Both RTMP and HTTP‑FLV use the FLV container; they differ only in transport protocol.
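Because the container is shared, the same FLV byte layout appears on both transports: a 9-byte file header ("FLV", version, flags, header size), then tags, each with a one-byte type (8 = audio, 9 = video, 18 = script data), a 24-bit payload size, and a 24+8-bit timestamp. A minimal sketch of building and parsing one synthetic tag (the payload bytes are made up, not real H.264 data):

```python
import struct

# Synthetic FLV data: 9-byte header, PreviousTagSize0, one video tag.
header = b"FLV" + bytes([1, 0b00000101]) + struct.pack(">I", 9)
payload = b"\x17\x01\x00\x00\x00"                   # fake video payload
tag = bytes([9]) + len(payload).to_bytes(3, "big")  # type 9 = video
tag += (0).to_bytes(3, "big") + b"\x00"             # timestamp + extension
tag += (0).to_bytes(3, "big") + payload             # stream id + data
flv = header + struct.pack(">I", 0) + tag

def parse_first_tag(data: bytes):
    """Return (type, payload size, timestamp) of the first FLV tag."""
    assert data[:3] == b"FLV", "not an FLV stream"
    offset = struct.unpack(">I", data[5:9])[0] + 4  # header + PrevTagSize0
    tag_type = data[offset]
    size = int.from_bytes(data[offset + 1:offset + 4], "big")
    timestamp = int.from_bytes(data[offset + 4:offset + 7], "big")
    return tag_type, size, timestamp

print(parse_first_tag(flv))  # (9, 5, 0): a 5-byte video tag at t=0
```

This is why flv.js can play an HTTP-FLV stream: the bytes it receives over HTTP are the same tags an RTMP client would receive, just without the RTMP handshake and chunking around them.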
Now let's look at HTTP-FLV (HDL):
It delivers the same FLV stream over plain HTTP, which is simpler and free of Adobe's proprietary handshake.
Latency is likewise 2–5 seconds, and because it rides on ordinary HTTP it passes through firewalls easily and starts playback quickly.
HLS:
Apple's HTTP-based streaming protocol. It has higher latency (5–20 seconds) but tolerates network fluctuations well, and it can be played directly in HTML5 browsers without extra apps.
This makes HLS ideal for social live‑stream apps where sharing a link is essential.
HLS delivers an .m3u8 playlist that references small .ts video segments.
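An .m3u8 playlist is plain text: each segment URI is preceded by an #EXTINF line giving its duration. A minimal sketch of extracting the segments, using a made-up playlist (the segment names, sequence number, and durations are illustrative):

```python
# A hypothetical HLS media playlist, as a server might serve it.
SAMPLE_M3U8 = """\
#EXTM3U
#EXT-X-VERSION:3
#EXT-X-TARGETDURATION:10
#EXT-X-MEDIA-SEQUENCE:100
#EXTINF:9.8,
seg100.ts
#EXTINF:10.0,
seg101.ts
#EXTINF:9.5,
seg102.ts
"""

def parse_playlist(text: str):
    """Return (duration_seconds, uri) pairs from a media playlist."""
    segments, duration = [], None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("#EXTINF:"):
            duration = float(line[len("#EXTINF:"):].rstrip(","))
        elif line and not line.startswith("#") and duration is not None:
            segments.append((duration, line))
            duration = None  # each EXTINF applies to exactly one URI
    return segments

print(parse_playlist(SAMPLE_M3U8))
```

A live player refetches this playlist periodically and downloads each new .ts segment as it appears, which is also where HLS's extra latency comes from: the stream is always at least a few segment durations behind real time.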
RTP:
Widely used in video surveillance, video conferencing, and IP telephony where real‑time delivery is critical and occasional packet loss is acceptable.
Other protocols such as WebRTC also exist.
The chosen push protocol determines how the live data is packaged before being sent to the cloud server.
After the server receives the stream, it generates pull URLs and distributes the stream via CDN nodes.
The server also provides additional functions such as high‑definition transcoding, recording, screenshot, and content moderation.
3.3 Transmission Protocol – Pull
When a user wants to watch a live stream, they need a pull URL on their device (phone, computer, etc.).
The pull protocol can be the same as or different from the push protocol; servers may perform protocol conversion to accommodate various clients.
For example, a broadcaster may push via RTMP while viewers pull via HLS.
4. Decoding and Playback
Decoding and playback are tightly coupled processes.
At the heart of most video players is FFmpeg , an open‑source multimedia framework.
4.1 FFmpeg
FFmpeg provides video/audio decoding, encoding, transcoding, and many post‑processing features. It supports virtually all codecs and containers, and can operate over HTTP, FTP, SMB, etc.
Many popular players (e.g., Bilibili’s ijkplayer) are built on FFmpeg/LAV.
Because FFmpeg is GPL/LGPL licensed, software that incorporates its code must comply with the license, which some commercial players have ignored.
FFmpeg-based SDKs first demux the received container packets (FLV, TS) into H.264 video and AAC audio elementary streams, then decode H.264 to YUV/RGB and AAC to PCM.
4.2 Hardware vs Software Decoding
Software decoding runs entirely on the CPU, which can become a bottleneck for high‑resolution video (e.g., 1080p). Hardware decoding offloads repetitive, data‑intensive tasks to the GPU, while the CPU still assists.
Advantages of software decoding: easier to implement, flexible, and better quality at low bitrates. Advantages of hardware decoding: higher performance and lower power consumption, though quality at low bitrates may suffer unless advanced algorithms are ported to the GPU. Typical latency for hardware-decoded streams is comparable to software-decoded streams (2–5 seconds).
4.3 Audio-Video Synchronization
After decoding, we obtain separate video frames (YUV/RGB) and audio samples (PCM). They must be synchronized; otherwise the viewer experiences lip-sync errors.
Each frame carries two timestamps: DTS (Decoding Time Stamp) and PTS (Presentation Time Stamp). DTS tells the decoder when to decode a frame; PTS tells the renderer when to display it.
Video frames come in three types: I-frames (intra-coded, independently decodable), P-frames (predictive, depending on previous frames), and B-frames (bidirectional, depending on both previous and future frames). Because B-frames are decoded out of display order, DTS and PTS are essential for reordering frames correctly. Audio streams typically have identical DTS and PTS.
Synchronization is usually achieved by using audio as the master clock: audio playback runs at a steady rate, and the video rendering thread adjusts its timing to match the audio timestamps. If video lags behind audio, the renderer drops frames; if video runs ahead, it pauses briefly.
The acceptable audio-video timestamp difference is usually cited from ITU-R BT.1359: differences between –100 ms and +25 ms are imperceptible; beyond that, users may notice; larger deviations are unacceptable.
4.4 Final Step
After synchronization, the audio data is sent to the audio output device and the video data to the video output device, producing the final live-stream picture the user sees.
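The audio-master policy above boils down to a per-frame decision: compare the video frame's PTS with the current audio clock and choose to drop, display, or wait. A minimal sketch; the 40 ms threshold and the function name are illustrative choices, not standardized values.

```python
def sync_action(video_pts_ms: float, audio_clock_ms: float,
                threshold_ms: float = 40.0) -> str:
    """Decide what the video renderer does with the next frame,
    using audio as the master clock."""
    diff = video_pts_ms - audio_clock_ms
    if diff < -threshold_ms:
        return "drop"     # video is behind audio: skip the frame
    if diff > threshold_ms:
        return "wait"     # video is ahead: delay rendering briefly
    return "display"      # close enough: show it now

# With the audio clock at 1000 ms:
print(sync_action(900.0, 1000.0))   # frame is 100 ms late  -> "drop"
print(sync_action(1010.0, 1000.0))  # within threshold      -> "display"
print(sync_action(1100.0, 1000.0))  # frame is 100 ms early -> "wait"
```

Real players refine this with scaled delays rather than a hard wait, but the asymmetry is the same: audio is never stretched or skipped, because the ear notices audio glitches far more readily than a dropped video frame.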
New Oriental Technology