Why Use Zarr? Storing and Accessing Large NumPy Arrays with mmap and Zarr
Zarr provides a modern, chunked, compressed storage format that lets you work with massive NumPy arrays as if they were in-memory objects. It offers on-demand loading, flexible back-ends (local disk, S3, zip files), resizing, and concurrent reads and writes, and in many workloads it outperforms traditional mmap-based memmap files.
When a NumPy array is too large to fit into memory at once, you can process it in chunks, either transparently or by explicitly loading one block at a time from disk. In either case, the array must be stored on disk, and there are two main options:
Use mmap() via the numpy.memmap() API to treat a file on disk as if it were fully in memory.
Use Zarr or HDF5, which are similar storage formats that allow on‑demand loading and storage of compressed array chunks.
Cache Mechanism Advantages
Improved read speed: When data is first read from disk into memory, a copy is cached by the OS. Subsequent reads can be served from the cache, greatly speeding up access.
Reduced disk access: Cached data avoids frequent disk I/O, lowering disk load and wear.
Better system responsiveness: Caching accelerates data access, making the system more responsive, especially for I/O‑heavy applications.
Automatic memory management: The cache automatically evicts data based on policies when memory is needed for other tasks.
Method 1: mmap
mmap() maps a file on disk into memory, allowing you to treat the file as a NumPy array. The OS transparently serves reads from the page cache when possible and from disk otherwise.
If the data is in cache, you can access it directly.
If the data resides only on disk, access is slower, but you do not need to manage the loading yourself.
NumPy provides built-in support for mmap() via np.memmap():

import numpy as np

# Map an existing file on disk as a read-only 1024 x 1024 int16 array
array = np.memmap("mydata/myarray.arr", mode="r", dtype=np.int16, shape=(1024, 1024))

Running this code yields an array that transparently returns data from the cache or reads from disk as needed.
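Note that mode="r" requires the file to already exist. A minimal sketch of creating such a file first with mode="w+" (the filename here is illustrative):

```python
import numpy as np

# Create (or overwrite) a writable memory-mapped file; "w+" allocates it on disk.
arr = np.memmap("myarray_demo.arr", mode="w+", dtype=np.int16, shape=(1024, 1024))
arr[:] = 1                # writes land in the page cache first
arr.flush()               # force dirty pages out to the file

# Reopen read-only, as in the example above.
ro = np.memmap("myarray_demo.arr", mode="r", dtype=np.int16, shape=(1024, 1024))
```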
Limitations of mmap()
Data must reside on a traditional file system; it cannot be loaded from object stores such as AWS S3.
Disk I/O can become a bottleneck when loading large amounts of data, because disks are orders of magnitude slower than RAM.
For N‑dimensional arrays, only slices aligned with the default storage layout are fast; other slices may require extensive disk reads.
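To see why, consider the memory layout: in NumPy's default C (row-major) order, a row slice is one contiguous run of bytes, while a column slice touches a small piece of every row. A small illustration using strides (an in-memory array has the same layout a memmap of the same shape and dtype would):

```python
import numpy as np

a = np.zeros((1024, 1024), dtype=np.int16)  # C order, 2 bytes per element

# A full row is contiguous: consecutive elements are 2 bytes apart.
print(a[0, :].strides)   # (2,)

# A full column skips an entire 2048-byte row between elements,
# so reading it from a memmap touches every row of the underlying file.
print(a[:, 0].strides)   # (2048,)
```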
Method 2: Zarr
Zarr is a more modern and flexible format, though it has fewer language bindings outside Python. It is generally a better choice than HDF5 unless you need multi‑language support.
Zarr stores data in compressed chunks that can be loaded into memory on demand and written back to disk.
Example of loading an array with Zarr:
import zarr
import numpy as np
z = zarr.open('example.zarr', mode='a', shape=(1024, 1024), chunks=(512, 512), dtype=np.int16)
type(z)
# <class 'zarr.core.Array'>
type(z[100:200])
# <class 'numpy.ndarray'>

Zarr overcomes the limitations of mmap():
You can store chunks on local disk, AWS S3, or any key/value store.
Chunk size and shape are user‑defined, allowing efficient reads across multiple axes (also applicable to HDF5).
Chunks can be compressed, similar to HDF5.
Zarr Introduction
Zarr is a format for storing chunked, compressed N‑dimensional arrays, inspired by HDF5, h5py, and bcolz.
https://zarr.readthedocs.io/en/stable/index.html
Create N‑dimensional arrays from any NumPy data.
Chunk arrays along any dimension.
Store arrays in memory, on disk, inside zip files, on S3, etc.
Concurrent reads from multiple threads or processes.
Concurrent writes from multiple threads or processes.
Zarr Usage
https://zarr.readthedocs.io/en/stable/tutorial.html
Creating Arrays
Zarr provides several functions for array creation.
import zarr
z = zarr.zeros((10000, 10000), chunks=(1000, 1000), dtype='i4')
print(z)

The above creates a 2-D array of 10,000 × 10,000 32-bit integers, chunked into 1,000 × 1,000 blocks (100 blocks total).
For a full list of creation routines, see the zarr.creation module.
Reading and Writing Data
Zarr arrays support a NumPy‑like interface for I/O. You can fill the entire array with a scalar:
z[:] = 42

Or write to specific regions:
import numpy as np
z[0, :] = np.arange(10000)
z[:, 0] = np.arange(10000)

Retrieving data via slicing loads the requested region into memory as a NumPy array:
print(z[0, 0])
print(z[-1, -1])
print(z[0, :])
print(z[:, 0])
print(z[:])

Persisting Arrays
By default, chunks are kept in memory. To persist data across sessions, store the array on a filesystem:
z1 = zarr.open('data/example.zarr', mode='w', shape=(10000, 10000),
               chunks=(1000, 1000), dtype='i4')

The directory data/example.zarr holds metadata and compressed chunk data. The open function does not require an explicit close; data is flushed automatically.
Reading back the persisted array:
z2 = zarr.open('data/example.zarr', mode='r')
print(np.all(z1[:] == z2[:]))

Resizing and Appending
Zarr arrays can be resized, adding or removing length along any dimension:
z = zarr.zeros(shape=(10000, 10000), chunks=(1000, 1000))
z[:] = 42
z.resize(20000, 10000)
print(z.shape)

When resized smaller, blocks outside the new shape are deleted.
The append() method adds data along a specified axis:
a = np.arange(10000000, dtype='i4').reshape(10000, 1000)
z = zarr.array(a, chunks=(1000, 100))
print(z.shape)
z.append(a)
print(z.shape)
z.append(np.vstack([a, a]), axis=1)
print(z.shape)

Compressors
Zarr supports various compressors via the compressor keyword.
from numcodecs import Blosc
compressor = Blosc(cname='zstd', clevel=3, shuffle=Blosc.BITSHUFFLE)
data = np.arange(100000000, dtype='i4').reshape(10000, 10000)
z = zarr.array(data, chunks=(1000, 1000), compressor=compressor)
print(z.compressor)

By default Zarr uses Blosc. Other compressors such as Zstandard or LZMA can be selected, and custom filter pipelines (e.g., delta filtering) are also possible.
Filters
Applying transformations before compression can improve compression ratios. Example using a delta filter with Blosc:
from numcodecs import Blosc, Delta
filters = [Delta(dtype='i4')]
compressor = Blosc(cname='zstd', clevel=1, shuffle=Blosc.SHUFFLE)
data = np.arange(100000000, dtype='i4').reshape(10000, 10000)
z = zarr.array(data, chunks=(1000, 1000), filters=filters, compressor=compressor)
print(z.info)

Groups
Zarr supports hierarchical organization of arrays via groups, similar to HDF5.
root = zarr.group()
print(root)
foo = root.create_group('foo')
bar = foo.create_group('bar')
z1 = bar.zeros('baz', shape=(10000, 10000), chunks=(1000, 1000), dtype='i4')
print(z1)

Groups can be accessed using path syntax (e.g., root['foo/bar']) and can be opened with zarr.open for convenient filesystem-based hierarchies.
Chunk Optimization
Chunks of at least ~1 MiB uncompressed often give better performance with Blosc. The optimal chunk shape depends on access patterns; choose chunking that aligns with the dimensions you slice most frequently.
Parallel Computing
Zarr arrays are designed for parallel read/write scenarios. Multiple threads or processes can read concurrently, and concurrent writes are possible when each writer updates a distinct region. Compression/decompression typically releases the GIL, so Zarr does not block other Python threads.
Sohu Tech Products
A knowledge-sharing platform for Sohu's technology products.