
Introduction to OpenCL Programming for Mobile GPU Computing

As mobile CPUs plateau, developers increasingly use OpenCL to harness Android GPUs such as Qualcomm's Adreno and Arm's Mali for heterogeneous computing, leveraging its platform, execution, and memory models to write portable kernels—illustrated by a simple array‑addition example that demonstrates device initialization, kernel creation, buffer management, and parallel execution.

Baidu App Technology

With the continuous improvement of mobile chip performance, performing compute‑intensive tasks such as computer graphics and deep‑learning inference on mobile devices is no longer a luxury. On mobile platforms, GPUs—thanks to their strong floating‑point performance and good API compatibility—have become a crucial compute unit for heterogeneous computing. Currently, Qualcomm's Adreno and Arm's Mali (the GPU family used in Huawei's Kirin chips, among others) dominate the Android GPU market, and both provide solid OpenCL support.

Baidu's APP has already applied GPU acceleration to deep‑model inference and other compute‑heavy services. This article introduces the basic concepts of OpenCL and demonstrates a simple OpenCL programming example.

Heterogeneous Computing refers to using processors with different instruction sets and architectures (CPU, GPU, DSP, ASIC, FPGA, etc.) together in one system to perform computations.

GPU (Graphics Processing Unit) is a specialized microprocessor for parallel graphics and compute workloads. Compared with CPUs, GPUs offer higher performance‑per‑area and performance‑per‑watt ratios, making them ideal for parallel tasks.

OpenCL (Open Computing Language) is an open standard managed by the Khronos Group. It provides a C99‑based kernel language and a platform‑agnostic API that can target CPUs, GPUs, DSPs, FPGAs, and other accelerators.

The OpenCL programming model consists of three core components: the platform model, the execution model, and the memory model.

Platform Model : Describes the topology of compute resources. On Android, the host is the CPU, and each GPU compute device contains multiple compute units, each of which contains many processing elements (PEs). For GPUs, a compute unit corresponds roughly to a shader core (a streaming multiprocessor, in NVIDIA terms).

Execution Model : Kernels are launched with clEnqueueNDRangeKernel , allowing N‑dimensional data‑parallel execution. For a 2‑D image, each pixel can be processed by a separate work‑item, achieving massive parallelism.

Memory Model : OpenCL defines four memory regions:

Host Memory – directly accessible by the CPU.

Global/Constant Memory – visible to all compute units.

Local Memory – shared among work‑items within a work‑group.

Private Memory – private to each work‑item.
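In OpenCL C, the first three regions appear directly as address-space qualifiers on kernel parameters, while private memory is simply a work-item's local variables. The kernel below is a hypothetical illustration (the names and the scaling operation are not from the article):

```c
// Illustrative only: each qualifier maps to one region of the memory model.
__kernel void scale(__global const float* in,   // global memory: all work-items
                    __constant float* coeff,    // constant memory: read-only
                    __local float* tile,        // local memory: per work-group
                    __global float* out) {
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float v = in[gid];                // private memory: per work-item
    tile[lid] = v * coeff[0];        // staged in local memory
    barrier(CLK_LOCAL_MEM_FENCE);    // sync work-items within the group
    out[gid] = tile[lid];
}
```

Host memory, the fourth region, lives on the CPU side; data moves between it and global memory through calls such as clEnqueueWriteBuffer and clEnqueueReadBuffer.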

Below is a concise example that adds two arrays using OpenCL. The code is adapted from an open‑source example, array_add.cpp .

1. Initialize the OpenCL environment (device, context, command queue):

cl_int status;
// init device
runtime.device = init_device();
// create context
runtime.context = clCreateContext(nullptr, 1, &runtime.device, nullptr, nullptr, &status);
// create queue
runtime.queue = clCreateCommandQueue(runtime.context, runtime.device, 0, &status); // OpenCL 2.0+ prefers clCreateCommandQueueWithProperties

2. Build the program and create the kernel:

cl_int status;
// init program
runtime.program = build_program(runtime.context, runtime.device, PROGRAM_FILE);
// create kernel
runtime.kernel = clCreateKernel(runtime.program, KERNEL_FUNC, &status);
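The article does not show the contents of PROGRAM_FILE, but a minimal kernel matching the three buffers bound in the next step might look like this (the name `array_add` is an assumption for what KERNEL_FUNC refers to):

```c
// Hypothetical contents of PROGRAM_FILE; KERNEL_FUNC would name "array_add".
__kernel void array_add(__global const float* input,
                        __global const float* bias,
                        __global float* output) {
    int i = get_global_id(0);          // one work-item per array element
    output[i] = input[i] + bias[i];
}
```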

3. Prepare input/output buffers and set kernel arguments:

// init data
float input_data[ARRAY_SIZE];
float bias_data[ARRAY_SIZE];
float output_data[ARRAY_SIZE];
for (int i = 0; i < ARRAY_SIZE; i++) {
    input_data[i] = 1.f * (float)i;
    bias_data[i] = 10000.f;
}
// create buffers
runtime.input_buffer = clCreateBuffer(runtime.context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, ARRAY_SIZE * sizeof(float), input_data, &status);
runtime.bias_buffer = clCreateBuffer(runtime.context, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR, ARRAY_SIZE * sizeof(float), bias_data, &status);
runtime.output_buffer = clCreateBuffer(runtime.context, CL_MEM_WRITE_ONLY, ARRAY_SIZE * sizeof(float), nullptr, &status); // output is written by the kernel, so no host copy is needed
// set kernel args
status = clSetKernelArg(runtime.kernel, 0, sizeof(cl_mem), &runtime.input_buffer);
status |= clSetKernelArg(runtime.kernel, 1, sizeof(cl_mem), &runtime.bias_buffer);
status |= clSetKernelArg(runtime.kernel, 2, sizeof(cl_mem), &runtime.output_buffer);

4. Enqueue the kernel, read back the results, and process them:

// launch kernel
size_t global_size = ARRAY_SIZE; // clEnqueueNDRangeKernel expects a pointer to size_t
status = clEnqueueNDRangeKernel(runtime.queue, runtime.kernel, 1, nullptr, &global_size, nullptr, 0, nullptr, nullptr);
// read output
status = clEnqueueReadBuffer(runtime.queue, runtime.output_buffer, CL_TRUE, 0, sizeof(output_data), output_data, 0, nullptr, nullptr);
// further processing of output_data
...

In summary, as CPU performance reaches its limits, programming GPUs and other accelerators becomes an essential skill. This article covered the fundamentals of OpenCL programming on mobile devices. Follow the Baidu APP technology public account for future articles on mobile heterogeneous computing.
