Deadlock when trying to open same zarr file with multiple processes #2868
-
Hello everyone, I have a question about a somewhat specific multiprocessing problem. Long story short, I run into deadlocks when trying to open the same zarr file with multiple processes. Here is some minimal code:

```python
import multiprocessing as mp

import zarr


def worker(i):
    print(f"Started worker {i}")
    z = zarr.open("data.zarr", mode="r+")
    print(f"Opened store for {i} | {dict(z.attrs)}")
    a = z.attrs["done"]
    a.append(i)
    z.attrs["done"] = a


def main():
    z = zarr.create(
        shape=(10, 10),
        chunks=(5, 5),
        store="data.zarr",
        overwrite=True,
    )
    z.attrs["done"] = []
    p1 = mp.Process(target=worker, args=(1,))
    p2 = mp.Process(target=worker, args=(2,))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    z = zarr.open("data.zarr", mode="r")
    print(z.attrs["done"])


if __name__ == "__main__":
    main()
```

This outputs

```
Started worker 1
Started worker 2
```

and then stops doing anything. The reason for this seems to be the

Is this intentional behavior? What are possible workarounds or better solutions / approaches?

**About my concrete use case**

I want to build a pipeline capable of both multiprocessing and multithreading (the user should be able to choose; ideally this also makes the pipeline compatible with e.g. dask or ray) which uses the same underlying datacube as auxiliary data. Maybe this helps to understand my vision step-by-step:
This approach works very well in a non-multiprocessed version, and I was already able to get it running with multiple threads. However, since threads in Python are only useful for (network) IO-bound tasks and not for compute-bound tasks, I also want to be able to use multiprocessing. I currently plan to turn this functionality into a library; of course I will share it when it's ready. :)
-
Zarr by itself is not capable of providing safe concurrent modification of metadata from multiple uncoordinated processes, as in your example. There are inevitable race conditions and deadlocks. It's up to the user's code to avoid these situations.
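For illustration, here is a minimal sketch of one such coordination scheme (my own sketch, not part of zarr itself, and assuming all workers are spawned from a single parent): share one `multiprocessing.Lock` and hold it across the entire read-modify-write of the attribute, so no two processes ever touch the metadata at the same time.

```python
import multiprocessing as mp

import zarr


def worker(i, lock):
    # Hold the lock across the whole read-modify-write so only one
    # process opens and updates the store's metadata at a time.
    with lock:
        z = zarr.open("data.zarr", mode="r+")
        done = z.attrs["done"]
        done.append(i)
        z.attrs["done"] = done


def main():
    z = zarr.create(shape=(10, 10), chunks=(5, 5), store="data.zarr", overwrite=True)
    z.attrs["done"] = []
    lock = mp.Lock()  # one lock shared by every worker process
    procs = [mp.Process(target=worker, args=(i, lock)) for i in (1, 2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    print(zarr.open("data.zarr", mode="r").attrs["done"])


if __name__ == "__main__":
    main()
```

Note that this only covers workers launched from one parent; fully independent processes would need some external coordination mechanism, which is where the suggestion below comes in.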
I would highly recommend exploring Icechunk for this scenario. Icechunk augments Zarr with a transactional storage engine. With Icechunk as your store, each process can commit its changes in a safe way via an ACID transaction.
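To make that concrete, here is a minimal sketch of your repro on top of Icechunk. The repository path and commit messages are illustrative, and the calls (`local_filesystem_storage`, `Repository.create`/`Repository.open`, `writable_session`, `session.store`, `session.commit`) follow the Icechunk Python API as I understand it; check the Icechunk docs for the authoritative details.

```python
import multiprocessing as mp

import icechunk
import zarr


def worker(i):
    # Each process opens the repository and starts its own writable session.
    storage = icechunk.local_filesystem_storage("data.icechunk")
    repo = icechunk.Repository.open(storage)
    session = repo.writable_session("main")
    z = zarr.open(session.store, mode="r+")
    done = z.attrs["done"]
    done.append(i)
    z.attrs["done"] = done
    # commit() is an ACID transaction against the "main" branch; if
    # another process committed first, it raises a conflict that can be
    # rebased and retried instead of deadlocking or corrupting metadata.
    session.commit(f"worker {i} done")


def main():
    storage = icechunk.local_filesystem_storage("data.icechunk")
    repo = icechunk.Repository.create(storage)
    session = repo.writable_session("main")
    z = zarr.create(
        shape=(10, 10),
        chunks=(5, 5),
        store=session.store,
        overwrite=True,
    )
    z.attrs["done"] = []
    session.commit("initialize")

    procs = [mp.Process(target=worker, args=(i,)) for i in (1, 2)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()


if __name__ == "__main__":
    main()
```

Because each process writes through its own session and publishes its changes with an explicit commit, the failure mode becomes a well-defined conflict you can handle, rather than a race on the underlying files.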