
An Introduction to GPU Computing and CUDA Architecture

This article provides a concise overview of GPU computing fundamentals, covering GPU hardware components, memory hierarchy, parallel execution models, and the CUDA programming framework, illustrating how CPUs and GPUs cooperate in heterogeneous computing environments.

TAL Education Technology

GPU computing has become essential for deep learning and many other fields; this article introduces the basic concepts and operation principles of GPUs to help readers understand how they work.

GPU (Graphics Processing Unit) originally handled graphics rendering, but since 2003 the concept of GPGPU (General‑Purpose GPU) has enabled GPUs to perform general‑purpose parallel computation.

Compared with CPUs, GPUs contain thousands of simple ALU cores (SPs) and relatively few complex control units, making them ideal for massive data-parallel tasks. In heterogeneous computing, the CPU (host) manages control flow and data movement, while the GPU (device) executes compute-intensive kernels; the two communicate over the PCIe bus.

The typical GPU hardware hierarchy consists of Streaming Processors (SP, also called CUDA cores) grouped into Streaming Multiprocessors (SM). Several SMs may be further organized into Texture Processing Clusters (TPC) in some architectures. Each SM can host hundreds of threads concurrently.
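These hardware properties can be inspected at runtime. The following is a minimal sketch using the CUDA runtime API (cudaGetDeviceProperties) to print the SM count and related limits of device 0:

```cuda
// Sketch: query the SM count and per-SM limits of the first CUDA device
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // device 0
    printf("Device: %s\n", prop.name);
    printf("SMs (multiProcessorCount): %d\n", prop.multiProcessorCount);
    printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
    printf("Warp size: %d\n", prop.warpSize);
    return 0;
}
```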

Within an SM, resources include an instruction cache, L1 cache, shared memory, and a register file. Shared memory and registers are fast but limited (kilobyte scale), so efficient usage is crucial for high occupancy. The Load/Store unit (LD/ST) handles memory accesses, while the Special Function Unit (SFU) implements fast intrinsics such as __cosf(). Warps (groups of 32 threads) are the scheduling unit; all threads in a warp execute the same instruction, and divergence reduces performance.
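Divergence arises whenever threads of the same warp take different branches, forcing the warp to execute both paths serially. A small illustrative kernel (hypothetical, written here as a sketch):

```cuda
// Sketch of warp divergence: adjacent threads in one warp disagree on a branch
__global__ void divergent(float *out) {
    int tid = threadIdx.x;
    // Even and odd threads within the same 32-thread warp take different
    // paths, so the warp runs both paths one after the other.
    if (tid % 2 == 0)
        out[tid] = __cosf(0.5f);   // fast cosine intrinsic, handled by the SFU
    else
        out[tid] = 2.0f * tid;
}
```

Branching on a per-warp quantity instead (e.g. tid / 32) would avoid the serialization.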

GPU memory hierarchy includes global memory (the visible VRAM), shared memory (per‑SM), and registers (per‑SP). Global memory is large but slower; registers and shared memory are fast but scarce. Efficient kernel design balances usage of these tiers to maximize parallelism.
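A common pattern that exploits this hierarchy is staging data from slow global memory into fast per-SM shared memory before reusing it. Below is a sketch of a per-block sum reduction; the kernel name and the 256-thread block size are assumptions for illustration:

```cuda
// Sketch: stage global data in shared memory, then reduce within the block.
// Assumes the kernel is launched with blockDim.x == 256.
__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                      // fast per-SM shared memory
    int tid = threadIdx.x;
    tile[tid] = in[blockIdx.x * blockDim.x + tid];   // global -> shared (one load each)
    __syncthreads();

    // Tree reduction entirely within shared memory
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) tile[tid] += tile[tid + s];
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = tile[0];         // one partial sum per block
}
```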

On the software side, CUDA provides a programming model that maps computation tasks onto the GPU hardware. A task is represented as a Grid, which is divided into Blocks, each containing many Threads. Threads run on SPs, Blocks are scheduled on SMs, and Grids occupy the whole GPU.
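Inside a kernel, each thread derives a unique global index from its block and thread coordinates. A minimal sketch (kernel name and parameters are illustrative):

```cuda
// Sketch: mapping Grid/Block/Thread coordinates to a flat array index
__global__ void scale(float *data, float factor, int n) {
    // blockIdx.x selects the block, blockDim.x is its size, and
    // threadIdx.x is this thread's position within the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)              // guard: the grid may contain more threads than n
        data[i] *= factor;
}
```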

CUDA kernels are declared with the __global__ qualifier, indicating they are called from the host but execute on the device. Functions that run only on the device use __device__. CUDA source files have the .cu extension and are compiled with nvcc, which separates host and device code.
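The two qualifiers often appear together in one .cu file, with a __global__ kernel calling a __device__ helper; the names below are hypothetical:

```cuda
// Sketch: a device-only helper called from a host-launchable kernel
__device__ float square(float x) {               // callable only from device code
    return x * x;
}

__global__ void squareAll(float *data, int n) {  // launched from the host
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = square(data[i]);
}
```

Such a file would be compiled with something like nvcc squareall.cu, letting nvcc split the host and device portions.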

Typical CUDA program flow:

1. Define a kernel with __global__.
2. Allocate device memory with cudaMalloc.
3. Transfer input data to the device using cudaMemcpy.
4. Launch the kernel, specifying grid and block dimensions, e.g.: dim3 Grid(3, 2); dim3 Block(5, 3); kernel_fun<<<Grid, Block>>>(params...);
5. Retrieve results with cudaMemcpy (synchronous) or cudaMemcpyAsync (asynchronous).
6. Free device memory with cudaFree.
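The steps above can be sketched end to end as a minimal vector-addition program (sizes and names are illustrative, and error checking is omitted for brevity):

```cuda
// Sketch of the full flow: allocate, copy in, launch, copy out, free
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

__global__ void vecAdd(const float *a, const float *b, float *c, int n) {  // step 1
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);
    float *ha = (float *)malloc(bytes), *hb = (float *)malloc(bytes),
          *hc = (float *)malloc(bytes);
    for (int i = 0; i < n; ++i) { ha[i] = 1.0f; hb[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, bytes); cudaMalloc(&db, bytes); cudaMalloc(&dc, bytes);  // step 2
    cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);                       // step 3
    cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

    dim3 block(256);
    dim3 grid((n + block.x - 1) / block.x);
    vecAdd<<<grid, block>>>(da, db, dc, n);                                  // step 4
    cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);                       // step 5

    printf("c[0] = %f\n", hc[0]);   // 1.0 + 2.0 per element
    cudaFree(da); cudaFree(db); cudaFree(dc);                                // step 6
    free(ha); free(hb); free(hc);
    return 0;
}
```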

CUDA also supports streams, which are sequences of commands that execute in order; multiple streams can run concurrently, allowing overlapping of computation and data transfer.
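The overlap pattern can be sketched as a fragment that splits work across two streams. This assumes the device buffers d0/d1, host buffers h0/h1 (allocated as pinned memory, e.g. via cudaHostAlloc, so the copies are truly asynchronous), and a kernel are already defined:

```cuda
// Sketch: overlapping transfers and computation with two streams
cudaStream_t s0, s1;
cudaStreamCreate(&s0);
cudaStreamCreate(&s1);

// Each stream copies its half in, runs the kernel, and copies its half out;
// commands within a stream run in order, but the two streams may overlap.
cudaMemcpyAsync(d0, h0, half, cudaMemcpyHostToDevice, s0);
cudaMemcpyAsync(d1, h1, half, cudaMemcpyHostToDevice, s1);
kernel<<<grid, block, 0, s0>>>(d0);
kernel<<<grid, block, 0, s1>>>(d1);
cudaMemcpyAsync(h0, d0, half, cudaMemcpyDeviceToHost, s0);
cudaMemcpyAsync(h1, d1, half, cudaMemcpyDeviceToHost, s1);

cudaStreamSynchronize(s0);   // wait for each stream's work to finish
cudaStreamSynchronize(s1);
cudaStreamDestroy(s0);
cudaStreamDestroy(s1);
```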

In summary, understanding GPU hardware (SP, SM, Warp, memory tiers) and the CUDA programming model enables developers to harness massive parallelism for AI, scientific computing, and other compute‑intensive workloads.

Tags: CUDA, Parallel Computing, GPU, CUDA Programming, GPU Architecture
Written by

TAL Education Technology

TAL Education is a technology-driven education company committed to the mission of 'making education better through love and technology'. The TAL technology team has always been dedicated to educational technology research and innovation. This is the external platform of the TAL technology team, sharing weekly curated technical articles and recruitment information.
