StreamingDataset gives FileExistsError when called with multiprocessing #884

@nicolasj92

Description

Environment

  • OS: Ubuntu 22.04
  • Hardware (GPU, or instance type): V100

To reproduce

Steps to reproduce the behavior:

import os
import multiprocessing
from streaming import Stream, StreamingDataset

# Placeholder: point this at an existing local Streaming dataset directory.
DATASET_PATH = '/path/to/local/dataset'

def process_task(_):
    # Each worker process builds its own StreamingDataset over the same local data.
    streams_list = [Stream(local=os.fspath(DATASET_PATH))]
    stream_dataset = StreamingDataset(streams=streams_list)

if __name__ == '__main__':
    num_iterations = 20  # Number of dataset instantiations spread across the pool
    with multiprocessing.Pool(processes=4) as pool:
        pool.map(process_task, range(num_iterations))

The code above is only illustrative: it reproduces what happens when multiple tests that instantiate StreamingDataset run in parallel.

Expected behavior

Instantiating StreamingDataset in parallel should not raise FileExistsError: [Errno 17] File exists: '/000002_locals'.
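For context, the leading slash in the reported name looks like a POSIX shared-memory name, and Python's standard library raises this same error when a named block is created exclusively twice. The sketch below shows only that generic failure mode with multiprocessing.shared_memory. It is not Streaming's actual code, and the block name is a made-up one for illustration.

```python
import errno
import os
from multiprocessing import shared_memory

# Hypothetical name, made unique per process so the demo cannot clash with a real block.
name = f'demo_{os.getpid()}_locals'

first = shared_memory.SharedMemory(name=name, create=True, size=16)
try:
    try:
        # A second exclusive create under the same name fails, as it would if
        # two processes raced to claim the same name.
        shared_memory.SharedMemory(name=name, create=True, size=16)
        collided = False
    except FileExistsError as err:
        collided = err.errno == errno.EEXIST
finally:
    first.close()
    first.unlink()  # remove the block so the name can be reused
```

If two processes pick the same name before either has created it, whichever creates second would hit this exception.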

Additional context

This happens for us because several tests in our repository initialise instances of StreamingDataset. They run under pytest-xdist with multiple workers and therefore sometimes raise the FileExistsError above. The integer prefixes of the shared memory files appear to collide in a race condition when this happens. I am not sure whether this is considered a bug in Streaming or simply unsupported usage.
The only workaround I have found is to run the tests sequentially, which is not ideal.
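A possible middle ground, untested against this issue, would be to serialize only the dataset construction across workers rather than the whole test run. The sketch below uses an advisory file lock from the standard library (POSIX only). It assumes the collision happens during construction, which has not been confirmed. LOCK_PATH, interprocess_lock and build_serialized are hypothetical names, not part of Streaming or pytest-xdist.

```python
import fcntl
import os
import tempfile
from contextlib import contextmanager

# Hypothetical lock-file location; any path visible to all workers would do.
LOCK_PATH = os.path.join(tempfile.gettempdir(), 'streaming_dataset_init.lock')

@contextmanager
def interprocess_lock(path=LOCK_PATH):
    """Hold an exclusive advisory lock on `path` for the duration of the block."""
    with open(path, 'w') as lock_file:
        fcntl.flock(lock_file, fcntl.LOCK_EX)  # blocks while another process holds the lock
        try:
            yield
        finally:
            fcntl.flock(lock_file, fcntl.LOCK_UN)

def build_serialized(factory):
    """Call `factory()` under the lock so only one process constructs at a time."""
    with interprocess_lock():
        return factory()
```

In a test this could wrap the constructor, for example build_serialized(lambda: StreamingDataset(streams=streams_list)). The lock must not be nested within one process, since a second flock on a fresh file handle would block forever.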

This is likely related to (but not identical to):
#767
#717

Labels: bug (Something isn't working)