GPU Rendering Pipeline and Hardware Architecture Overview
The article surveys GPU rendering pipelines and hardware architectures for desktop and mobile, explains classic stages, compares Immediate Mode, Tile‑Based and Tile‑Based Deferred rendering, details PowerVR, Mali and Adreno components, and offers optimization advice on draw calls, depth pre‑passes, shader efficiency, and render ordering.
This article provides a comprehensive overview of GPU rendering pipelines and hardware architectures, covering both desktop and mobile platforms. It begins with an introduction to the classic rendering pipeline stages—application, vertex processing, rasterization, fragment processing, and per-pixel operations—explaining how data flows from the CPU to the GPU and how each stage transforms the data.
The article then compares three rendering architectures: Immediate Mode Rendering (IMR), used on desktop GPUs; Tile-Based Rendering (TBR), common on mobile GPUs; and Tile-Based Deferred Rendering (TBDR), which adds hidden‑surface removal on top of tiling. It weighs the advantages and disadvantages of each approach in terms of bandwidth usage, power consumption, and latency.
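To make the tiling step of TBR concrete, here is a minimal Python sketch of binning: each screen-space triangle is assigned to every tile that its bounding box overlaps, so that each tile can later be shaded from on-chip memory. The 32-pixel tile size, the function name `bin_triangles`, and the bounding-box test are illustrative assumptions, not a description of any vendor's hardware.

```python
import math
from collections import defaultdict

TILE_SIZE = 32  # assumed tile edge in pixels; real tile sizes vary by GPU


def bin_triangles(triangles, width, height, tile_size=TILE_SIZE):
    """Map each tile (tx, ty) to the indices of triangles whose
    screen-space bounding box overlaps that tile (conservative test)."""
    tiles_x = (width + tile_size - 1) // tile_size
    tiles_y = (height + tile_size - 1) // tile_size
    bins = defaultdict(list)
    for index, tri in enumerate(triangles):
        xs = [p[0] for p in tri]
        ys = [p[1] for p in tri]
        # Convert the bounding box to tile coordinates, clamped to the screen.
        min_tx = max(math.floor(min(xs)) // tile_size, 0)
        max_tx = min(math.floor(max(xs)) // tile_size, tiles_x - 1)
        min_ty = max(math.floor(min(ys)) // tile_size, 0)
        max_ty = min(math.floor(max(ys)) // tile_size, tiles_y - 1)
        # Off-screen triangles produce empty ranges and land in no tile.
        for ty in range(min_ty, max_ty + 1):
            for tx in range(min_tx, max_tx + 1):
                bins[(tx, ty)].append(index)
    return dict(bins)


if __name__ == "__main__":
    tris = [
        ((0, 0), (10, 0), (0, 10)),      # fits inside tile (0, 0)
        ((40, 40), (70, 40), (40, 70)),  # spans several tiles
    ]
    print(bin_triangles(tris, width=128, height=64))
```

A bounding-box test can list a triangle in tiles it does not actually touch; that is acceptable for a sketch, since the per-tile rasterizer would reject those pixels anyway.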
Detailed sections describe the hardware components of modern GPUs from major vendors: PowerVR, Mali, and Adreno. For each architecture, the article explains core units such as Unified Shading Clusters, Execution Engines, ALUs, cache hierarchies, on‑chip memory, and specialized features like Hidden Surface Removal (HSR), Low‑Resolution Z (LRZ), Forward Pixel Kill (FPK), and SIMD/SIMT execution models. It also outlines the evolution of Mali GPUs from Utgard to Valhall, noting changes in warp size, scalar vs. vector processing, and super‑scalar capabilities.
Practical performance topics are covered, including the impact of draw calls, AlphaTest vs. AlphaBlend, sorting of opaque and transparent objects, the usefulness of depth pre‑passes (PreZ), and the effects of shader branching, multi‑compile, and register spilling. Recommendations for using AlphaTest, PreZ, and proper render order on both desktop and mobile GPUs are provided.
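The render-order advice can be sketched as a sort key. The example below follows one common convention, which is an assumption rather than the article's exact recommendation: opaque draws first, sorted front to back so early depth rejection can discard hidden fragments; AlphaTest draws next; transparent draws last, sorted back to front so blending composes correctly. The `Draw` type and the queue names are hypothetical.

```python
from dataclasses import dataclass

# Assumed pass order: opaque, then alpha-tested, then blended geometry.
QUEUE_ORDER = {"opaque": 0, "alpha_test": 1, "transparent": 2}


@dataclass
class Draw:
    name: str
    queue: str    # one of the keys in QUEUE_ORDER
    depth: float  # distance from the camera; larger means farther away


def sort_draws(draws):
    """Return draws ordered by pass, then by depth within each pass."""
    def key(draw):
        rank = QUEUE_ORDER[draw.queue]
        # Transparent geometry is drawn back to front, so negate its depth;
        # opaque and alpha-tested geometry is drawn front to back.
        depth_key = -draw.depth if draw.queue == "transparent" else draw.depth
        return (rank, depth_key)
    return sorted(draws, key=key)


if __name__ == "__main__":
    frame = [
        Draw("glass_near", "transparent", 2.0),
        Draw("wall_far", "opaque", 9.0),
        Draw("fence", "alpha_test", 5.0),
        Draw("glass_far", "transparent", 8.0),
        Draw("wall_near", "opaque", 1.0),
    ]
    print([d.name for d in sort_draws(frame)])
```

On GPUs with hardware hidden-surface removal, the front-to-back sort of opaque draws may matter less, but grouping draws by pass still applies.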
Additional optimization techniques are presented, such as load/store actions, memoryless render targets, avoiding frequent render‑target switches, minimizing CPU‑GPU readbacks, and leveraging pixel‑local storage. The article also examines the performance implications of MSAA, Alpha‑to‑Coverage, and shader instruction costs, offering concrete advice for writing efficient shaders (e.g., preferring MAD, avoiding expensive functions, using half precision, and reducing register pressure).
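A rough way to see why load/store actions and memoryless render targets matter is to estimate the external-memory traffic of a single attachment per render pass. The model below is a deliberate simplification and an assumption of this sketch: it counts one full-surface read if the attachment is loaded and one full-surface write if it is stored, ignoring compression and partial-tile effects. The function name is hypothetical.

```python
def attachment_traffic_bytes(width, height, bytes_per_pixel, load, store):
    """Estimate bytes moved between tile memory and external memory
    for one attachment in one render pass (simplified model)."""
    surface_bytes = width * height * bytes_per_pixel
    transfers = int(load) + int(store)  # one full pass over the surface each
    return surface_bytes * transfers


if __name__ == "__main__":
    w, h = 1920, 1080
    # Color target, 4 bytes per pixel, loaded and stored every pass.
    print(attachment_traffic_bytes(w, h, 4, load=True, store=True))
    # Same target cleared instead of loaded: only the store remains.
    print(attachment_traffic_bytes(w, h, 4, load=False, store=True))
    # Depth buffer that is neither loaded nor stored, as a memoryless
    # target would be: no external traffic in this model.
    print(attachment_traffic_bytes(w, h, 4, load=False, store=False))
```

Multiplying the per-pass figure by the number of passes and the frame rate gives a first-order bandwidth estimate, which is also why frequent render-target switches are costly on tile-based GPUs.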
Finally, the article summarizes key takeaways: warp sizes differ across GPUs, modern mobile GPUs use scalar architectures with super‑scalar execution, hidden‑surface removal techniques vary by vendor, and careful ordering of opaque, AlphaTest, and transparent passes is essential for optimal performance.
Tencent Cloud Developer
Official Tencent Cloud community account that brings together developers, shares practical tech insights, and fosters an influential tech exchange community.