
Inside Intel GPU Render Engine: How 3D Rendering Works at the Hardware Level

This article explains the architecture and workflow of Intel's GPU render engine, covering the 3D pipeline, command streamer, fixed‑function units, execution units, URB handling, thread dispatch, shader stages, sampler state, and the Mesa driver implementation that translates OpenGL commands into hardware instructions.

ByteDance SYS Tech

Preface

A GPU (Graphics Processing Unit) is a processor specialized for graphics computation in PCs, workstations, consoles, and mobile devices. With the rise of AI, GPUs are also used for parallel training and inference workloads, which is why many servers are now equipped with them.

Terminology

3D Pipeline : A sequence of stages that processes 3D commands, combining fixed‑function (FF) units with programmable Execution Units (EUs).

FFID : Unique identifier for a fixed‑function unit.

CS (Command Streamer) : Parses commands written by the driver into a ring buffer and forwards them to the next stage of the 3D pipeline.

VF (Vertex Fetcher) : The first FF unit that reads vertex data from memory and passes it to the Vertex Shader (VS) stage.

TD (Thread Dispatcher) : Arbitrates thread start requests from FF units and instantiates threads on EUs.

Render Engine Overview

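Intel's render engine can run more than one pipeline: the 3D pipeline for rendering, the media pipeline for codec work, and, as the Mesa excerpt below shows, a GPGPU pipeline for compute. They share the same pipeline selection mechanism, and the driver switches between them with the PIPELINE_SELECT command.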

<code>// In Mesa code for compute (GPGPU)
emit_pipeline_select(batch, GPGPU);
// In render mode
emit_pipeline_select(batch, _3D);</code>

Both user‑space and kernel‑space drivers write commands into a ring buffer, which the hardware then executes according to the selected pipeline.
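As a rough illustration of the ring‑buffer idea, the sketch below models a command ring with a producer (the driver) and a consumer (the hardware). The type `cmd_ring` and the helpers `ring_emit`, `ring_fetch`, and `ring_pending` are hypothetical names invented for this example and are not taken from Mesa or the kernel driver; real hardware tracks head and tail in registers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define RING_DWORDS 8u  /* hypothetical size; must be a power of two */

/* Hypothetical software model of a command ring. */
struct cmd_ring {
    uint32_t dw[RING_DWORDS];
    uint32_t head;  /* next DWord the consumer ("hardware") reads */
    uint32_t tail;  /* next DWord the producer ("driver") writes  */
};

/* Number of DWords written but not yet consumed. */
static uint32_t ring_pending(const struct cmd_ring *r)
{
    return (r->tail - r->head) & (RING_DWORDS - 1u);
}

/* Producer side: append one DWord. One slot always stays empty so that
 * a full ring can be told apart from an empty one. */
static bool ring_emit(struct cmd_ring *r, uint32_t dword)
{
    assert(r != NULL);
    if (ring_pending(r) == RING_DWORDS - 1u)
        return false;                       /* ring is full */
    r->dw[r->tail] = dword;
    r->tail = (r->tail + 1u) & (RING_DWORDS - 1u);
    return true;
}

/* Consumer side: fetch one DWord if any is pending. */
static bool ring_fetch(struct cmd_ring *r, uint32_t *out)
{
    assert(r != NULL && out != NULL);
    if (r->head == r->tail)
        return false;                       /* ring is empty */
    *out = r->dw[r->head];
    r->head = (r->head + 1u) & (RING_DWORDS - 1u);
    return true;
}
```

The producer only ever advances `tail` and the consumer only ever advances `head`, which is what lets the two sides work on the same buffer without overwriting unread commands.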

Render pipeline command flow

Hardware Introduction and Analysis

Command Streamer

User‑space drivers such as Mesa write commands into batch buffers. The kernel driver submits a batch by placing a reference to it in the ring buffer; the Command Streamer then fetches and parses the commands and dispatches them to the appropriate hardware blocks.

Command Streamer diagram

GPU render commands are broadly classified as:

Memory interface commands – operate on memory.

3D state commands – set up the 3D pipeline state (e.g., vertex surface state).

Pipe Control commands – configure synchronization and parallel execution.

3D Primitive commands – submit primitives (topology and vertex range) to the pipeline for drawing.
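The classification above can be sketched as a decoder for the first DWord of a command. The bit positions used here (command type in bits 31:29, and for render commands sub‑type in 28:27 and opcode in 26:24) are an assumption based on the public PRMs and should be checked against the documentation for a specific generation. The enum `cmd_class` and the functions `bits` and `classify_header` are hypothetical names for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical names for the four command classes listed above. */
enum cmd_class {
    CMD_MEMORY_INTERFACE,
    CMD_3D_STATE,
    CMD_PIPE_CONTROL,
    CMD_3D_PRIMITIVE,
    CMD_UNKNOWN,
};

/* Extract the inclusive bit range [hi:lo] from a DWord. */
static uint32_t bits(uint32_t dw, unsigned hi, unsigned lo)
{
    assert(hi >= lo && hi < 32u);
    return (dw >> lo) & (uint32_t)((1ull << (hi - lo + 1u)) - 1u);
}

/* Assumed layout: bits 31:29 = command type; for type 3 (render),
 * bits 28:27 = sub-type and bits 26:24 = opcode. */
static enum cmd_class classify_header(uint32_t header)
{
    uint32_t type = bits(header, 31, 29);

    if (type == 0u)
        return CMD_MEMORY_INTERFACE;        /* MI_* commands */
    if (type != 3u || bits(header, 28, 27) != 3u)
        return CMD_UNKNOWN;                 /* outside this sketch */

    switch (bits(header, 26, 24)) {
    case 0u: return CMD_3D_STATE;           /* pipelined state */
    case 1u: return CMD_3D_STATE;           /* non-pipelined state */
    case 2u: return CMD_PIPE_CONTROL;
    case 3u: return CMD_3D_PRIMITIVE;
    default: return CMD_UNKNOWN;
    }
}
```

Under these assumptions a header of `0x7A000004` decodes as a Pipe Control command and `0x7B000005` as a 3D Primitive command; the low bits carry the command length and are ignored by the classifier.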

Fixed‑Function Units (FF)

FF units manage most of the processing for vertex and pixel data on the EU threads. They handle thread dispatching, URB entry management, and various control information.

An EU is a multi‑threaded processor within the multi‑processor system. Each EU contains instruction fetch/decode, register files, SIMD ALU, etc.

Execution Unit (EU)

EUs are programmable cores that execute shader and kernel code. Each EU contains a General Register File (GRF) and an Architecture Register File (ARF). On recent generations (Gen11/Gen12), an EU supports seven hardware threads, and each thread issues SIMD instructions such as the following swizzled add:

<code>add dst.xyz src0.yxzw src1.zwxy</code>
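To make the swizzle and write‑mask notation in that instruction concrete, here is a minimal scalar emulation in C. It is only a sketch of the semantics (one four‑channel vector, no SIMD width, no register regions); the function `add_swizzled` and the `SWZ_*` and `MASK_*` names are hypothetical.

```c
#include <assert.h>

/* Channel indices used to describe a swizzle. */
enum { SWZ_X = 0, SWZ_Y = 1, SWZ_Z = 2, SWZ_W = 3 };

/* Write-mask bits: one per destination channel. */
#define MASK_X 1u
#define MASK_Y 2u
#define MASK_Z 4u
#define MASK_W 8u

/* dst.<mask> = src0.<swz0> + src1.<swz1>, evaluated channel by channel.
 * dst must not alias src0 or src1 in this simple version. */
static void add_swizzled(float dst[4], unsigned write_mask,
                         const float src0[4], const int swz0[4],
                         const float src1[4], const int swz1[4])
{
    for (int ch = 0; ch < 4; ch++) {
        assert(swz0[ch] >= SWZ_X && swz0[ch] <= SWZ_W);
        assert(swz1[ch] >= SWZ_X && swz1[ch] <= SWZ_W);
        if (write_mask & (1u << ch))
            dst[ch] = src0[swz0[ch]] + src1[swz1[ch]];
        /* channels outside the mask keep their previous value */
    }
}
```

With `src0 = {1, 2, 3, 4}` and `src1 = {10, 20, 30, 40}`, the swizzles `yxzw` and `zwxy` select `{2, 1, 3, 4}` and `{30, 40, 10, 20}`, so the `xyz` channels of `dst` become `{32, 41, 13}` while `dst.w` is left untouched.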

Unified Return Buffer (URB)

URB is an on‑chip memory shared by FF units to pass data between threads and fixed‑function stages. Threads read/write URB entries via messages.
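The handle‑based hand‑off can be pictured with a small software model. This is a sketch under simplifying assumptions: entries have a fixed size, a handle is just an index, and allocation is first‑fit. The type `urb_model` and all function names are hypothetical and do not correspond to hardware messages or driver code.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define URB_ENTRIES      4u          /* hypothetical pool size  */
#define URB_ENTRY_DWORDS 8u          /* hypothetical entry size */
#define URB_NO_HANDLE    UINT32_MAX  /* returned when the pool is exhausted */

struct urb_model {
    uint32_t data[URB_ENTRIES][URB_ENTRY_DWORDS];
    uint8_t  in_use[URB_ENTRIES];
};

/* Allocate a free entry and return its handle (here simply the index). */
static uint32_t urb_alloc(struct urb_model *urb)
{
    for (uint32_t h = 0; h < URB_ENTRIES; h++) {
        if (!urb->in_use[h]) {
            urb->in_use[h] = 1;
            return h;
        }
    }
    return URB_NO_HANDLE;  /* the producer has to wait for a free entry */
}

/* "URB write": copy a payload into the entry named by the handle. */
static void urb_write(struct urb_model *urb, uint32_t handle,
                      const uint32_t *payload, uint32_t dwords)
{
    assert(handle < URB_ENTRIES && urb->in_use[handle]);
    assert(dwords <= URB_ENTRY_DWORDS);
    memcpy(urb->data[handle], payload, dwords * sizeof(uint32_t));
}

/* "URB read": the consumer needs only the handle to find the data. */
static const uint32_t *urb_read(const struct urb_model *urb, uint32_t handle)
{
    assert(handle < URB_ENTRIES && urb->in_use[handle]);
    return urb->data[handle];
}

/* Release the entry once the consuming stage is done with it. */
static void urb_free(struct urb_model *urb, uint32_t handle)
{
    assert(handle < URB_ENTRIES && urb->in_use[handle]);
    urb->in_use[handle] = 0;
}
```

The point of the model is that stages exchange small handles rather than the data itself: a producer allocates and writes an entry, passes the handle downstream, and the entry is recycled after the consumer has read it.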

URB layout

Thread Dispatching

When a pipeline stage requests a thread, the Thread Dispatcher allocates register space on an EU, loads control information from URB, and starts execution.
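A toy model of this step is sketched below. It assumes seven thread slots per EU (the Gen11/Gen12 figure mentioned earlier) and a first‑fit search; the real arbitration policy is not described in this article, so first‑fit is only a placeholder. The types `eu_model` and `dispatch_result` and both functions are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_EUS        4u  /* hypothetical EU count */
#define THREADS_PER_EU 7u  /* assumed thread slots per EU */

struct eu_model {
    uint8_t busy_mask;  /* bit n set => thread slot n is occupied */
};

struct dispatch_result {
    int eu;      /* -1 when no slot is free and the request must wait */
    int thread;
};

/* Pick the first EU with a free thread slot and mark that slot busy. */
static struct dispatch_result dispatch_thread(struct eu_model eus[NUM_EUS])
{
    struct dispatch_result res = { -1, -1 };

    for (unsigned e = 0; e < NUM_EUS; e++) {
        for (unsigned t = 0; t < THREADS_PER_EU; t++) {
            if (!(eus[e].busy_mask & (1u << t))) {
                eus[e].busy_mask |= (uint8_t)(1u << t);
                res.eu = (int)e;
                res.thread = (int)t;
                return res;
            }
        }
    }
    return res;  /* every slot on every EU is occupied */
}

/* Called when a thread terminates: its slot becomes available again. */
static void retire_thread(struct eu_model eus[NUM_EUS], int eu, int thread)
{
    assert(eu >= 0 && (unsigned)eu < NUM_EUS);
    assert(thread >= 0 && (unsigned)thread < THREADS_PER_EU);
    eus[eu].busy_mask &= (uint8_t)~(1u << thread);
}
```

A fuller model would also reserve register space and copy the thread payload before starting execution; the sketch covers only the slot bookkeeping.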

Thread dispatch flow

Shader Stages

Vertex Shader (VS)

After VF writes vertex data to URB, the VS reads it via URB handles, launches EU threads, and runs the compiled shader. The driver emits 3DSTATE_VS with kernel start pointer, binding table size, scratch space, etc.

<code>#define INIT_THREAD_DISPATCH_FIELDS(pkt, prefix, stage) \
   pkt.KernelStartPointer = KSP(shader);               \
   pkt.BindingTableEntryCount = shader->bt.size_bytes / 4; \
   pkt.FloatingPointMode = prog_data->use_alt_mode;   \
   pkt.DispatchGRFStartRegisterForURBData = prog_data->dispatch_grf_start_reg; \
   pkt.prefix##URBEntryReadLength = vue_prog_data->urb_read_length; \
   pkt.prefix##URBEntryReadOffset = 0; \
   pkt.StatisticsEnable = true; \
   pkt.Enable = true;

static void iris_store_vs_state(struct iris_context *ice,
                                 const struct gen_device_info *devinfo,
                                 struct iris_compiled_shader *shader) {
   struct brw_stage_prog_data *prog_data = shader->prog_data;
   struct brw_vue_prog_data *vue_prog_data = (void *) prog_data;
   iris_pack_command(GENX(3DSTATE_VS), shader->derived_data, vs) {
      INIT_THREAD_DISPATCH_FIELDS(vs, Vertex, MESA_SHADER_VERTEX);
      vs.MaximumNumberofThreads = devinfo->max_vs_threads - 1;
      vs.SIMD8DispatchEnable = true;
      vs.UserClipDistanceCullTestEnableBitmask = vue_prog_data->cull_distance_mask;
   }
}</code>

Sampler State

The sampler provides filtered texture values to the EU. Sampler state objects are created via OpenGL calls (e.g., glGenSamplers) and translated by Mesa into hardware SAMPLER_STATE entries.

<code>iris_pack_state(GENX(SAMPLER_STATE), cso->sampler_state, samp) {
   samp.TCXAddressControlMode = wrap_s;
   samp.TCYAddressControlMode = wrap_t;
   samp.TCZAddressControlMode = wrap_r;
   samp.CubeSurfaceControlMode = state->seamless_cube_map;
   samp.NonnormalizedCoordinateEnable = !state->normalized_coords;
   samp.MinModeFilter = state->min_img_filter;
   samp.MagModeFilter = mag_img_filter;
   samp.MipModeFilter = translate_mip_filter(state->min_mip_filter);
   samp.MaximumAnisotropy = RATIO21;
   if (state->max_anisotropy >= 2) {
      if (state->min_img_filter == PIPE_TEX_FILTER_LINEAR) {
         samp.MinModeFilter = MAPFILTER_ANISOTROPIC;
         samp.AnisotropicAlgorithm = EWAApproximation;
      }
      if (state->mag_img_filter == PIPE_TEX_FILTER_LINEAR)
         samp.MagModeFilter = MAPFILTER_ANISOTROPIC;
      samp.MaximumAnisotropy = MIN2((state->max_anisotropy - 2) / 2, RATIO161);
   }
}</code>
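The anisotropy arithmetic in the excerpt is easier to follow in isolation. The sketch below assumes the MaximumAnisotropy field is encoded as RATIO21 = 0 (2:1) up to RATIO161 = 7 (16:1), with each step adding 2 to the ratio; that encoding is an assumption to verify against the hardware headers. The functions `encode_max_anisotropy` and `decode_max_anisotropy` are hypothetical helpers, not part of Mesa.

```c
#include <assert.h>

/* Assumed field encoding: 0 means 2:1, 1 means 4:1, ... 7 means 16:1. */
#define RATIO21  0u
#define RATIO161 7u

#define MIN2(a, b) ((a) < (b) ? (a) : (b))

/* Same arithmetic as the excerpt: map an API-level max anisotropy
 * (2, 4, 8, 16, ...) to the field value, clamping at 16:1. */
static unsigned encode_max_anisotropy(unsigned max_anisotropy)
{
    if (max_anisotropy < 2u)
        return RATIO21;  /* default value; anisotropic filtering is not enabled */
    return MIN2((max_anisotropy - 2u) / 2u, RATIO161);
}

/* Inverse mapping: field value back to the N in "N:1". */
static unsigned decode_max_anisotropy(unsigned field)
{
    assert(field <= RATIO161);
    return 2u * field + 2u;
}
```

Under this encoding a request for 8x anisotropy yields field value 3, 16x yields 7, and anything above 16x is clamped to 7. Odd requests round down, so 5x is stored as 4:1.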

Mesa 3D Driver Implementation

Mesa translates OpenGL state into GPU commands. For example, vertex buffers are emitted with 3DSTATE_VERTEX_BUFFERS, the index buffer with 3DSTATE_INDEX_BUFFER, and binding tables with the per‑stage 3DSTATE_BINDING_TABLE_POINTERS_* commands. The driver allocates buffers in specific memory zones (e.g., IRIS_MEMZONE_BINDER) and writes their virtual addresses into the batch buffer.

<code>/** Memory zones. When allocating a buffer, you can request a specific region of the virtual address space (PPGTT). */
enum iris_memory_zone {
   IRIS_MEMZONE_SHADER,
   IRIS_MEMZONE_BINDER,
   IRIS_MEMZONE_SCRATCH,
   IRIS_MEMZONE_SURFACE,
   IRIS_MEMZONE_DYNAMIC,
   IRIS_MEMZONE_OTHER,
   IRIS_MEMZONE_BORDER_COLOR_POOL,
};

static void iris_set_vertex_buffers(struct pipe_context *ctx,
                                   unsigned start_slot, unsigned count,
                                   unsigned unbind_num_trailing_slots,
                                   bool take_ownership,
                                   const struct pipe_vertex_buffer *buffers) {
   // ... pack VERTEX_BUFFER_STATE ...
   iris_pack_state(GENX(VERTEX_BUFFER_STATE), vb) {
      vb.VertexBufferIndex = start_slot + i;
      vb.AddressModifyEnable = true;
      vb.BufferPitch = buffer->stride;
      if (res) {
         vb.BufferSize = res->base.b.width0 - (int)buffer->buffer_offset;
         vb.BufferStartingAddress = ro_bo(NULL, res->bo->address + (int)buffer->buffer_offset);
         vb.MOCS = iris_mocs(res->bo, &screen->isl_dev, ISL_SURF_USAGE_VERTEX_BUFFER_BIT);
      } else {
         vb.NullVertexBuffer = true;
         vb.MOCS = iris_mocs(NULL, &screen->isl_dev, ISL_SURF_USAGE_VERTEX_BUFFER_BIT);
      }
   }
}
</code>
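One plausible reason for carving the address space into zones is that some hardware fields hold a 32‑bit offset from a base address rather than a full pointer, so objects of one kind are kept inside a single window. The sketch below illustrates that idea with an invented layout of four consecutive 4 GiB zones; the boundaries, the enum `zone`, and the functions are hypothetical, and the real values live in the driver source.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical zone table: four fixed zones, then a catch-all. */
enum zone { ZONE_SHADER, ZONE_BINDER, ZONE_SURFACE, ZONE_DYNAMIC, ZONE_OTHER };

#define ZONE_SIZE (1ull << 32)  /* assumed: 4 GiB per fixed zone */

/* Map a GPU virtual address to the zone that contains it. */
static enum zone zone_for_address(uint64_t addr)
{
    uint64_t index = addr / ZONE_SIZE;
    return index < (uint64_t)ZONE_OTHER ? (enum zone)index : ZONE_OTHER;
}

/* Base address of a fixed zone. */
static uint64_t zone_base(enum zone z)
{
    assert(z != ZONE_OTHER);  /* the catch-all zone has no 32-bit window */
    return (uint64_t)z * ZONE_SIZE;
}

/* 32-bit offset of an address within its zone, the form an
 * offset-based command field would store. */
static uint32_t zone_offset(uint64_t addr)
{
    enum zone z = zone_for_address(addr);
    return (uint32_t)(addr - zone_base(z));
}
```

In this layout the address `0x100002000` falls in the binder zone at offset `0x2000`. Buffers that are referenced by full 48‑bit addresses, such as the vertex buffers in the excerpt above, can live in the catch‑all zone.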

The driver also manages state binding tables, sampler tables, and surface state tables, emitting the corresponding 3DSTATE_* commands to bind them to the hardware.

Overall driver flow
