Understanding NumPy Array Memory Layout and Accelerating Image Resizing with Unsafe Python Techniques

This article explains how NumPy stride differences caused a 100× slowdown when resizing images from a pygame Surface, demonstrates how to reinterpret the underlying memory layout using ctypes to achieve a 100× speedup with OpenCV, and discusses the safety implications of such low‑level Python tricks.

Python Programming Learning Circle

Unsafe Python and Image Resizing

While resizing images in pygame, the author found that cv2.resize is much faster than pygame.transform.smoothscale on a plain NumPy array, yet far slower when it is fed the surface's pixels directly. The slowdown was traced to NumPy array stride differences caused by the memory layout of the pygame Surface data.

Benchmarking the Original Approaches

Using a 1920×1080 surface, the following benchmark was performed (repeat = 10):

<code>from contextlib import contextmanager
import time
import pygame as pg
import numpy as np
import cv2

@contextmanager
def Timer(name):
    # perf_counter is a monotonic, high-resolution clock meant for timing
    start = time.perf_counter()
    yield
    finish = time.perf_counter()
    print(f'{name} took {finish-start:.4f} sec')

IW = 1920
IH = 1080
OW = IW // 2
OH = IH // 2
repeat = 10

isurf = pg.Surface((IW,IH), pg.SRCALPHA)
with Timer('pg.Surface with smoothscale'):
    for i in range(repeat):
        pg.transform.smoothscale(isurf, (OW,OH))

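def cv2_resize(image):
    # Arrays here are indexed [x, y, channel], so cv2 sees IW rows and IH columns;
    # dsize is (columns, rows), hence (OH, OW).
    return cv2.resize(image, (OH,OW), interpolation=cv2.INTER_AREA)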

i1 = np.zeros((IW,IH,3), np.uint8)
with Timer('np.zeros with cv2'):
    for i in range(repeat):
        o1 = cv2_resize(i1)
</code>

Output:

<code>pg.Surface with smoothscale took 0.2002 sec
np.zeros with cv2 took 0.0145 sec</code>

When accessing the surface pixels via pygame.surfarray.pixels3d (zero‑copy) and then calling cv2.resize, the runtime increased dramatically:

<code>i2 = pg.surfarray.pixels3d(isurf)
with Timer('pixels3d with cv2'):
    for i in range(repeat):
        o2 = cv2_resize(i2)
</code>
<code>pixels3d with cv2 took 1.3625 sec</code>

Why the Stride Difference?

Both arrays have the same .shape and .dtype, but their .strides differ:

<code>print('input strides',i1.strides,i2.strides)
print('output strides',o1.strides,o2.strides)
</code>
<code>input strides (3240, 3, 1) (4, 7680, -1)
output strides (1620, 3, 1) (1620, 3, 1)</code>

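Each stride in the pixels3d array reflects how the pygame Surface stores its pixels. The x‑stride is 4 instead of 3 because every pixel occupies four bytes (BGRA, including the alpha channel). The y‑stride is 7680 = 1920 × 4 because rows are contiguous in memory, so the [x, y] indexing of pixels3d is a transposed view of a height‑major image. The z‑stride is -1 because the base pointer is offset to the red component, and stepping backwards from there yields R, G, B.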
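The strides reported above can be reproduced with plain NumPy, which makes the layout easier to experiment with. The sketch below is only an illustration: the `bgra` array is an assumed stand‑in for the surface's pixel memory, not something pygame exposes under that name.

```python
import numpy as np

W, H = 1920, 1080

# Assumed stand-in for the surface's pixel memory: H rows of W BGRA pixels.
bgra = np.zeros((H, W, 4), np.uint8)

# Reverse the first three channels (B,G,R -> R,G,B), then swap the row and
# column axes so the array is indexed as [x, y, channel], like pixels3d.
view = bgra[:, :, 2::-1].transpose(1, 0, 2)

print(view.shape)                  # (1920, 1080, 3)
print(view.strides)                # (4, 7680, -1)
print(view.flags['C_CONTIGUOUS'])  # False
# The view starts at the R byte, two bytes into the first BGRA pixel.
print(view.ctypes.data - bgra.ctypes.data)  # 2
```

The view shares memory with `bgra`, so nothing is copied; only the shape, strides, and start address change.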

Re‑interpreting the Memory Layout

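By obtaining the raw C pointer of the surface and wrapping it in a NumPy array of shape (height, width, 4) with the default (height‑major) strides, the layout can be treated as a regular RGBA image. This effectively transposes the data and swaps R↔B without copying. The channel order does not matter here, because resizing processes every channel the same way: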

<code>import ctypes

def arr_params(surface):
    """Return (pointer, width, height, is_bgr) for the surface's pixel memory."""
    pixels = pg.surfarray.pixels3d(surface)
    width, height, depth = pixels.shape
    assert depth == 3
    xstride, ystride, zstride = pixels.strides
    oft = 0
    bgr = 0
    if zstride == -1:  # BGR layout: the view starts at the R byte, 2 bytes into the pixel
        oft = -2
        zstride = 1
        bgr = 1
    # Only 4-byte pixels in tightly packed rows are supported
    assert xstride == 4
    assert zstride == 1
    assert ystride == width*4
    base = pixels.ctypes.data_as(ctypes.c_void_p)
    ptr = ctypes.c_void_p(base.value + oft)
    return ptr, width, height, bgr

def rgba_buffer(p, w, h):
    # Reinterpret the raw pointer as a ctypes array of w*h*4 bytes (no copy)
    buf_type = ctypes.c_uint8 * (w * h * 4)
    return ctypes.cast(p, ctypes.POINTER(buf_type)).contents

def cv2_resize_surface(src, dst):
    iptr, iw, ih, ibgr = arr_params(src)
    optr, ow, oh, obgr = arr_params(dst)
    assert ibgr == obgr
    ibuf = rgba_buffer(iptr, iw, ih)
    iarr = np.ndarray((ih, iw, 4), np.uint8, buffer=ibuf)
    obuf = rgba_buffer(optr, ow, oh)
    oarr = np.ndarray((oh, ow, 4), np.uint8, buffer=obuf)
    # dsize is (width, height); passing oarr as dst writes into the destination surface
    cv2.resize(iarr, (ow, oh), oarr, interpolation=cv2.INTER_AREA)
</code>
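The same reinterpretation can be tried without pygame or OpenCV. This is a minimal sketch under stated assumptions: `backing` simulates the surface's BGRA memory and `view` simulates what pixels3d returns; both names are hypothetical. It steps the pointer back by two bytes, as `arr_params` does, and checks that the resulting array addresses the same bytes in height‑major order.

```python
import ctypes
import numpy as np

W, H = 8, 4

# Simulated BGRA pixel memory with distinct byte values (0..127).
backing = np.arange(H * W * 4, dtype=np.uint8).reshape(H, W, 4)
# Simulated pixels3d view: indexed [x, y, channel] with channels reversed.
view = backing[:, :, 2::-1].transpose(1, 0, 2)
assert view.strides == (4, W * 4, -1)

# Step back from the R byte to the B byte of the first pixel.
base = view.ctypes.data - 2
# Unsafe step: from_address trusts us that W*H*4 bytes are valid at `base`.
buf = (ctypes.c_uint8 * (W * H * 4)).from_address(base)
rgba = np.ndarray((H, W, 4), np.uint8, buffer=buf)

print(np.shares_memory(rgba, backing))   # True
print(np.array_equal(rgba, backing))     # True
# Element [y, x, 2] of the new array is element [x, y, 0] of the view.
print(rgba[1, 3, 2] == view[3, 1, 0])    # True
```

Nothing is copied at any point; `rgba` is just a second description of the memory that `backing` owns, and it is only valid while `backing` stays alive.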

Benchmarking this approach shows a speedup on the order of 100× over the original pixels3d path (0.0097 s versus 1.3625 s), and it is only somewhat slower than operating on a freshly allocated np.zeros RGBA array.

<code>attached RGBA with cv2 took 0.0097 sec
np.zeros RGBA with cv2 took 0.0066 sec</code>

Safety Considerations

The technique relies on raw C pointers obtained via ctypes. While the surrounding Python code remains safe, the helper functions (arr_params and rgba_buffer) are inherently unsafe: misuse can lead to crashes or memory corruption, mirroring Rust’s unsafe blocks.

Python itself has no unsafe keyword, but the combination of ctypes and C libraries provides a similar escape hatch for performance‑critical sections.
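One way to keep such an escape hatch contained is to put the pointer arithmetic in a single function that validates the layout first. The sketch below is an assumption‑laden illustration, not the article's code: `attach_rgba` is a hypothetical helper that takes a pixels3d‑style view and raises ValueError on an unexpected layout. It uses explicit checks rather than assert because assert statements are removed when Python runs with -O.

```python
import ctypes
import numpy as np

def attach_rgba(view):
    """Hypothetical helper: reinterpret a (W, H, 3) view of 4-byte pixels
    as an (H, W, 4) array over the same memory, after validating the layout."""
    if view.ndim != 3 or view.shape[2] != 3 or view.dtype != np.uint8:
        raise ValueError('expected a (W, H, 3) uint8 view')
    width, height, _ = view.shape
    xstride, ystride, zstride = view.strides
    if (xstride, ystride) != (4, width * 4) or zstride not in (1, -1):
        raise ValueError(f'unsupported strides {view.strides}')
    # With a reversed channel axis the view starts 2 bytes into the first pixel.
    start = view.ctypes.data - (2 if zstride == -1 else 0)
    # Unsafe step: nothing verifies that these bytes stay valid. The caller
    # must keep the object that owns the memory alive while using the result.
    buf = (ctypes.c_uint8 * (width * height * 4)).from_address(start)
    return np.ndarray((height, width, 4), np.uint8, buffer=buf)

# Usage with simulated pixel memory (an assumption; no pygame involved).
backing = np.zeros((4, 8, 4), np.uint8)
view = backing[:, :, 2::-1].transpose(1, 0, 2)
arr = attach_rgba(view)
arr[1, 3, 0] = 7            # writes through to the backing memory
print(backing[1, 3, 0])     # 7
```

The checks reject layouts the pointer arithmetic was not written for, but they cannot detect a buffer that has already been freed, so the lifetime rule in the comment remains the caller's responsibility.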

Conclusion

Understanding NumPy’s stride mechanics and the memory layout of external libraries such as pygame can reveal hidden performance bottlenecks. By re‑interpreting the underlying buffer with correct strides, one can achieve dramatic speed improvements without rewriting the C code, albeit at the cost of introducing unsafe memory handling that must be carefully managed.
