GPU Rendering Pipeline and Hardware Architecture Overview

The article surveys GPU rendering pipelines and hardware architectures for desktop and mobile, explains classic stages, compares Immediate Mode, Tile‑Based and Tile‑Based Deferred rendering, details PowerVR, Mali and Adreno components, and offers optimization advice on draw calls, depth pre‑passes, shader efficiency, and render ordering.

Tencent Cloud Developer

Nov 29, 2022

This article provides a comprehensive overview of GPU rendering pipelines and hardware architectures, covering both desktop and mobile platforms. It begins with an introduction to the classic rendering pipeline stages—application, vertex processing, rasterization, fragment processing, and per-pixel operations—explaining how data flows from the CPU to the GPU and how each stage transforms the data.
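
The stage-by-stage data flow can be sketched as a toy software renderer. This is only an illustration under simplifying assumptions (2D triangles with a constant depth, no clipping, one flat color per triangle); the function names are hypothetical and do not correspond to any real GPU or graphics API.

```python
# Toy software model of the classic stages. Real GPUs run these in parallel
# hardware; every name here is hypothetical.

def vertex_stage(vertices, scale):
    """Vertex processing: map unit-square positions to pixel coordinates."""
    return [(x * scale, y * scale, z) for x, y, z in vertices]


def edge(a, b, p):
    """Edge function: twice the signed area of triangle (a, b, p)."""
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])


def rasterize(tri, width, height):
    """Rasterization: yield (x, y, depth) for each covered pixel center."""
    a, b, c = tri
    area = edge(a, b, c)
    if area <= 0:
        return  # zero or negative signed area: culled in this sketch
    for y in range(height):
        for x in range(width):
            p = (x + 0.5, y + 0.5)
            w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Barycentric weights interpolate the per-vertex depth.
                yield x, y, (w0 * a[2] + w1 * b[2] + w2 * c[2]) / area


def fragment_stage(color):
    """Fragment processing: a real shader would compute lighting here."""
    return color


def draw(triangles, size=8):
    """Application stage: submit triangles, then resolve them per pixel."""
    color_buf = [[None] * size for _ in range(size)]
    depth_buf = [[float("inf")] * size for _ in range(size)]
    for vertices, color in triangles:
        tri = vertex_stage(vertices, size)
        for x, y, z in rasterize(tri, size, size):
            shaded = fragment_stage(color)
            if z < depth_buf[y][x]:  # per-pixel operation: depth test
                depth_buf[y][x] = z
                color_buf[y][x] = shaded
    return color_buf


def unit_triangle(z):
    """Hypothetical helper: a triangle covering the lower-left half."""
    return [(0.0, 0.0, z), (1.0, 0.0, z), (0.0, 1.0, z)]


frame = draw([(unit_triangle(0.9), "red"), (unit_triangle(0.1), "blue")])
```

In this sketch the nearer blue triangle wins the depth test wherever the two overlap, regardless of submission order, while the red fragments are still shaded before being discarded. That wasted shading is the overdraw that the later sections on render ordering and hidden-surface removal aim to reduce.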

The text then discusses the main rendering architectures: Immediate Mode Rendering (IMR), used on desktops; Tile-Based Rendering (TBR), common on mobile GPUs; and Tile-Based Deferred Rendering (TBDR), which adds hidden-surface removal. It weighs the advantages and disadvantages of each approach in terms of bandwidth usage, power consumption, and latency.
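
The binning step that distinguishes tile-based architectures from IMR can be illustrated roughly as follows. This is a minimal sketch, assuming a 16x16-pixel tile and conservative bounding-box binning; actual tile sizes and binning strategies differ by vendor, and `bin_triangles` is a hypothetical name.

```python
import math


def bin_triangles(triangles, width, height, tile=16):
    """Map each tile coordinate to the indices of triangles that may touch it.

    Uses the triangle's screen-space bounding box, so a tile can be listed
    even if the triangle only overlaps the box and not the tile itself.
    """
    tiles_x = math.ceil(width / tile)
    tiles_y = math.ceil(height / tile)
    bins = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Clamp the bounding box to the tile grid covering the screen.
        first_tx = max(math.floor(min(xs) / tile), 0)
        last_tx = min(math.floor(max(xs) / tile), tiles_x - 1)
        first_ty = max(math.floor(min(ys) / tile), 0)
        last_ty = min(math.floor(max(ys) / tile), tiles_y - 1)
        for ty in range(first_ty, last_ty + 1):
            for tx in range(first_tx, last_tx + 1):
                bins.setdefault((tx, ty), []).append(index)
    return bins


triangles = [
    [(1, 1), (10, 1), (1, 10)],      # fits inside one tile
    [(10, 10), (40, 10), (10, 20)],  # spans several tiles
]
bins = bin_triangles(triangles, width=64, height=32)
```

Once the per-tile lists exist, each tile can be rendered entirely in on-chip memory and written out to main memory once, which is where the bandwidth and power savings come from. The cost is that no fragment work can start until all geometry has been binned, which is the latency trade-off mentioned above.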

Detailed sections describe the hardware components of modern GPUs from major vendors: PowerVR, Mali, and Adreno. For each architecture, the article explains core units such as Unified Shading Clusters, Execution Engines, ALUs, cache hierarchies, on‑chip memory, and specialized features like Hidden Surface Removal (HSR), Low‑Resolution Z (LRZ), Forward Pixel Kill (FPK), and SIMD/SIMT execution models. It also outlines the evolution of Mali GPUs from Utgard to Valhall, noting changes in warp size, scalar vs. vector processing, and super‑scalar capabilities.
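
Why warp size matters under a SIMT execution model can be shown with a deliberately simplified cost model. The sketch below assumes that a warp executes in lockstep and pays for both sides of a branch whenever its lanes disagree; the cycle costs and the function name `warp_cycles` are illustrative assumptions, not measurements from any GPU.

```python
def warp_cycles(lane_takes_branch, cost_if, cost_else, warp_size):
    """Estimate total cycles for an if/else executed by lockstep warps.

    lane_takes_branch holds one boolean per thread. A warp pays for a side
    of the branch if at least one of its lanes needs that side.
    """
    total = 0
    for start in range(0, len(lane_takes_branch), warp_size):
        warp = lane_takes_branch[start:start + warp_size]
        if any(warp):
            total += cost_if
        if not all(warp):
            total += cost_else
    return total


coherent = [True] * 4 + [False] * 4   # neighbors agree on the branch
interleaved = [True, False] * 4       # neighbors disagree

coherent_cost = warp_cycles(coherent, cost_if=10, cost_else=4, warp_size=4)
interleaved_cost = warp_cycles(interleaved, cost_if=10, cost_else=4, warp_size=4)
```

Under these assumptions the coherent pattern costs 14 cycles and the interleaved pattern 28, because both warps in the interleaved case execute both sides of the branch.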

Practical performance topics are covered, including the impact of draw calls, AlphaTest vs. AlphaBlend, sorting of opaque and transparent objects, the usefulness of depth pre‑passes (PreZ), and the effects of shader branching, multi‑compile, and register spilling. Recommendations for using AlphaTest, PreZ, and proper render order on both desktop and mobile GPUs are provided.
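
One common way to realize the render-order advice is to sort draws into three groups before submission. The sketch below assumes the widely used convention of opaque first (front to back), then AlphaTest, then transparent (back to front); the `Draw` type and `order_draws` function are hypothetical, and the best ordering on a given GPU depends on its hidden-surface removal hardware.

```python
from dataclasses import dataclass


@dataclass
class Draw:
    name: str
    kind: str      # "opaque", "alpha_test", or "transparent"
    depth: float   # distance from the camera


def order_draws(draws):
    """Return draws in opaque, AlphaTest, transparent order."""
    def group(kind, far_first=False):
        members = [d for d in draws if d.kind == kind]
        return sorted(members, key=lambda d: d.depth, reverse=far_first)

    # Near-to-far lets depth testing reject hidden opaque fragments early;
    # far-to-near is needed for correct blending of transparent surfaces.
    return (group("opaque")
            + group("alpha_test")
            + group("transparent", far_first=True))


scene = [
    Draw("glass_near", "transparent", 2.0),
    Draw("wall_far", "opaque", 9.0),
    Draw("fence", "alpha_test", 5.0),
    Draw("glass_far", "transparent", 7.0),
    Draw("crate_near", "opaque", 3.0),
]
ordered = [d.name for d in order_draws(scene)]
```

For this scene the resulting order is `crate_near`, `wall_far`, `fence`, `glass_far`, `glass_near`.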

Additional optimization techniques are presented, such as load/store actions, memoryless render targets, avoiding frequent render‑target switches, minimizing CPU‑GPU readbacks, and leveraging pixel‑local storage. The article also examines the performance implications of MSAA, Alpha‑to‑Coverage, and shader instruction costs, offering concrete advice for writing efficient shaders (e.g., preferring MAD, avoiding expensive functions, using half precision, and reducing register pressure).
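
The effect of load/store actions can be estimated with simple arithmetic. The sketch below assumes a tile-based GPU on which a "load" action reads the whole attachment from main memory once and a "store" action writes it back once; the function name and the 1920x1080 targets are illustrative assumptions, and real traffic also depends on compression and caching.

```python
def attachment_traffic_bytes(width, height, bytes_per_pixel, load, store):
    """Main-memory traffic for one attachment in one render pass."""
    size = width * height * bytes_per_pixel
    return size * (int(load) + int(store))


# Color target whose previous contents are loaded and then stored.
load_and_store = attachment_traffic_bytes(1920, 1080, 4, load=True, store=True)

# Same target cleared on-chip instead of loaded, then stored.
clear_and_store = attachment_traffic_bytes(1920, 1080, 4, load=False, store=True)

# Depth buffer that never leaves tile memory (the memoryless case).
memoryless_depth = attachment_traffic_bytes(1920, 1080, 4, load=False, store=False)
```

Under these assumptions, replacing the load with a clear halves the traffic for the color target from 16,588,800 to 8,294,400 bytes per pass, and a depth buffer that is neither loaded nor stored contributes none. The same reasoning explains why frequent render-target switches are costly: each switch can force a store of the current target and a load of the next.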

Finally, the article summarizes key takeaways: warp sizes differ across GPUs, modern mobile GPUs use scalar architectures with super‑scalar execution, hidden‑surface removal techniques vary by vendor, and careful ordering of opaque, AlphaTest, and transparent passes is essential for optimal performance.

Republication Notice

This article has been distilled and summarized from source material, then republished for learning and reference. If you believe it infringes your rights, please contact us and we will review it promptly.
