tl;dr

This post tells you why and how to use the Zarr format to save your NumPy arrays. It walks you through the code to read and write large NumPy arrays in parallel using Zarr and Dask.

Here's the code if you want to jump right in. If you have questions about the code, reach out to me on Twitter.

Common Ways to Save NumPy Arrays

Three common ways to store NumPy arrays are as .csv files, as .txt files, or as .npy files.

Each of these has important limitations:

  • CSV and TXT files are human-readable formats that can't store NumPy arrays with more than 2 dimensions.
  • The native NPY binary file format doesn't support parallel read/write operations.

Let's see this in action below.

We'll start by creating a dummy NumPy array to work with. We'll use np.random.rand to generate two arrays populated with random numbers, one with 2 dimensions and one with 3 dimensions:

import numpy as np 
array_XS = np.random.rand(3,2) 
array_L = np.random.rand(1000, 1000, 100)

Storing the 3-dimensional array_L as .txt or .csv will throw a ValueError:

np.savetxt('array_L.txt', array_L, delimiter=" ")
np.savetxt('array_L.csv', array_L, delimiter=",")
ValueError: Expected 1D or 2D array, got 3D array instead

You could store the 3-dimensional array as a .npy file. This works, but it doesn't scale to larger-than-memory datasets or to situations in which you want to read and/or write in parallel.

np.save('array_L.npy', array_L)

Save NumPy Arrays with Zarr

Instead of the three options above, consider saving your NumPy arrays to Zarr, a format for the storage of chunked, compressed, N-dimensional arrays.

The three most important benefits of Zarr are that it:

1. Has multiple compression options and levels built-in

2. Supports multiple backend data stores (zip, S3, etc.; see the sketch after this list)

3. Can read and write data in parallel* in n-dimensional compressed chunks

Zarr has also been widely adopted across PyData libraries like Dask, TensorStore, and Xarray, which means you will see significant performance gains when using this format together with supported libraries.

* Zarr supports concurrent reads and concurrent writes separately but not concurrent reads and writes at the same time.
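
To illustrate the second benefit, here's a minimal sketch (assuming the Zarr v2 API; the archive name is just illustrative) that writes the same array into a single Zip file instead of a directory of chunk files:

import zarr
# write array_L into one zip archive rather than a directory store
store = zarr.ZipStore('array_L.zip', mode='w')
zarr.save_array(store, array_L)
store.close()

# read it back the same way
store = zarr.ZipStore('array_L.zip', mode='r')
array_from_zip = zarr.load(store)
store.close()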

Compress NumPy Arrays with Zarr

Let's see Zarr's compression options in action. Below, we'll save the small and large arrays to .zarr and check the resulting file sizes.

import zarr 
# save small NumPy array to zarr 
zarr.save('array_XS.zarr', array_XS) 
# get the size (in bytes) of the stored .zarr file 
! stat -f '%z' array_XS.zarr 
>> 128 
# save large NumPy array to zarr 
zarr.save('array_L.zarr', array_L) 
# get the size of the stored .zarr directory 
! du -h array_L.zarr
>> 693M	array_L.zarr

Storing array_L as Zarr leads to a significant reduction in file size (~15% for array_L) even with just the default out-of-the-box compression settings. Check out the accompanying notebook for more compression options you can tweak to increase performance.
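
For example, here's a minimal sketch (assuming the Zarr v2 API with numcodecs installed; the chunk shape, compression level, and output name are illustrative) that stores array_L with an explicit Zstandard-based compressor instead of the default:

from numcodecs import Blosc

# pick an explicit compressor and chunk shape (values here are illustrative)
compressor = Blosc(cname='zstd', clevel=5, shuffle=Blosc.BITSHUFFLE)
z = zarr.open(
    'array_L_zstd.zarr',
    mode='w',
    shape=array_L.shape,
    chunks=(250, 250, 50),
    dtype=array_L.dtype,
    compressor=compressor,
)
# write the data into the compressed, chunked store
z[:] = array_L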

Loading NumPy Arrays with Zarr

You can load arrays stored as .zarr back into your Python session using zarr.load().

# load in array from zarr 
array_zarr = zarr.load('array_L.zarr')

It will be loaded in as a regular NumPy array.

type(array_zarr) 
>>> numpy.ndarray
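
Note that zarr.load pulls the entire array into memory at once. If you only need a slice, zarr.open gives you a lazy handle that reads only the chunks you index (a minimal sketch, assuming the Zarr v2 API):

# zarr.open returns a lazy zarr.Array instead of a NumPy array
z = zarr.open('array_L.zarr', mode='r')
print(z.shape, z.chunks, z.dtype)  # metadata only, no chunk data read yet
block = z[:100, :100, :10]         # reads just the chunks covering this slice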

Zarr supports multiple backend data stores. This means you can also easily load .zarr files from cloud-based data stores, like Amazon S3:

# load small zarr array from S3 
array_S = zarr.load(
"s3://coiled-datasets/synthetic-data/array-random-390KB.zarr"
)

Read and Write NumPy Arrays in Parallel with Zarr and Dask

If you're working with data stored in the cloud, chances are that your data is larger than your local machine memory. In that case, you can use Dask to read and write your large Zarr arrays in parallel.

Below we try to load a 370GB .zarr file into our Python session directly:

array_XL = zarr.load(
"s3://coiled-datasets/synthetic-data/array-random-370GB.zarr"
)

This fails with the following error:

MemoryError: Unable to allocate 373. GiB for an array with shape (10000, 10000, 500) and data type float64

Loading the same 370GB .zarr file into a Dask array works fine:

import dask.array as da

dask_array = da.from_zarr(
    "s3://coiled-datasets/synthetic-data/array-random-370GB.zarr"
)

This is because Dask evaluates lazily. The array is not read into memory until you specifically instruct Dask to perform a computation on the dataset. Read more about the basics of Dask here.

This means you can perform some computations on this dataset locally. But loading the entire array into local memory will still fail because your machine does not have enough memory.
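
For example, a reduction over a slice only pulls the chunks it needs from S3 and returns a small result, so it runs comfortably on a laptop (a sketch; the slice bounds are just illustrative):

# nothing is read from S3 until .compute() is called
subset_mean = dask_array[:100, :100, :100].mean()
print(subset_mean)            # still a lazy Dask graph
print(subset_mean.compute())  # reads only the chunks covering the slice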

NOTE: Even if your machine technically has the storage resources to spill this dataset to disk, doing so will significantly reduce performance.

Scale to Dask Cluster with Coiled

We'll need to run this on a Dask cluster in the cloud to access additional hardware resources.

To do this:

  1. Spin up a Coiled Cluster
import coiled

cluster = coiled.Cluster(
    name="create-synth-array",
    software="coiled-examples/numpy-zarr",
    n_workers=50,
    worker_cpu=4,
    worker_memory="24GiB",
    backend_options={"spot": True},
)

2. Connect Dask to this cluster

from distributed import Client 
client = Client(cluster)

3. Then run computations across the entire cluster comfortably.

# load data into a Dask array
da_1 = da.from_zarr(
    "s3://coiled-datasets/synthetic-data/array-random-370GB.zarr"
)
# define the computation over the entire array (a transpose); this stays lazy until we write
da_2 = da_1.T
da_2

%%time
# write the transposed array back to S3
da_2.to_zarr(
    "s3://coiled-datasets/synthetic-data/array-random-370GB-T.zarr"
)

CPU times: user 2.26 s, sys: 233 ms, total: 2.49 s
Wall time: 1min 32s

Our Coiled cluster has 50 Dask workers with 24GB RAM each, all running a pre-compiled software environment containing the necessary dependencies. This means we have enough resources to comfortably transpose the array and write it back to S3.

Dask is able to do all of this for us in parallel and without ever loading the array into our local memory. It has loaded, transformed, and saved an array of 372GB back to S3 in less than 2 minutes.
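
As a quick sanity check (a sketch, assuming the write above succeeded and you have read access to the bucket), you can point Dask back at the result and confirm the transposed shape without loading any data:

# lazily open the transposed result and inspect its metadata
da_check = da.from_zarr(
    "s3://coiled-datasets/synthetic-data/array-random-370GB-T.zarr"
)
print(da_check.shape)  # (500, 10000, 10000) after the transpose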

Saving NumPy Arrays Summary

Let's recap:

  • There are important limitations to many of the common ways of storing NumPy arrays.
  • The Zarr file format offers powerful compression options, supports multiple data store backends, and can read/write your NumPy arrays in parallel.
  • Dask allows you to use these parallel read/write capabilities to their full potential.
  • Connecting Dask to an on-demand Coiled cluster allows for efficient computations over larger-than-memory datasets.

Follow me on LinkedIn for regular data science and machine learning updates and hacks.

Originally published at https://coiled.io on January 5, 2022.