Description
Environment
- OS: Ubuntu
- Hardware (GPU, or instance type): 4xH200, 4xH100, 8xH100 (runpod instances)
Issue
A race condition in get_shm_prefix causes both of the following errors to be raised because the OS has not finished synchronizing shared memory across processes:
streaming/streaming/base/shared/prefix.py, lines 234 to 246 at 223be8f:

```python
try:
    shm = SharedMemory(name, False)
except FileNotFoundError:
    raise RuntimeError(f'Internal error: shared memory prefix={prefix_int} was not ' +
                       f'registered by local leader. This may be because you specified ' +
                       f'different ``local`` parameters from different ranks.')
their_locals, their_prefix_int = _unpack_locals(bytes(shm.buf))
if streams_local != their_locals or prefix_int != their_prefix_int:
    raise RuntimeError(f'Internal error: shared memory registered does not match ' +
                       f'local leader as streams_local or prefix_int not match. ' +
                       f'local leader: {their_locals} and {their_prefix_int}. ' +
                       f'expected: {streams_local} and {prefix_int}.')
```
Related: #717, #332, #824, #884, and possibly #767
I'm pasting the contents of my comment from #824 below (it contains most of the context and info):
I've been running into the same issue, but I'm only using a single StreamingDataset. I think it's a race condition with the SharedMemory that gets created by the leader here and then accessed by non-leader processes in the next if block:
streaming/streaming/base/shared/prefix.py, lines 222 to 230 at 223be8f:

```python
if world.is_local_leader:
    name = _get_path(prefix_int, LOCALS)
    data = _pack_locals(streams_local, prefix_int)
    shm = SharedMemory(name, True, len(data))
    shm.buf[:len(data)] = data

if dist.is_available() and dist.is_initialized():
    dist.barrier()
```
I believe this race condition happens because there is some latency for the operating system to fully synchronize shared memory across different processes. The barrier doesn't fix this, because the leader isn't doing anything to wait for the OS-level propagation to happen.
The reason I think this is the cause is that I was able to fix the issue by inserting the following sleep call (at line 230) to give the OS time to synchronize:
```python
if world.is_local_leader:
    name = _get_path(prefix_int, LOCALS)
    data = _pack_locals(streams_local, prefix_int)
    shm = SharedMemory(name, True, len(data))
    shm.buf[:len(data)] = data

if dist.is_available() and dist.is_initialized():
    dist.barrier()

sleep(1)  # this sleep gives time for the OS to update the shm on all processes
```

(The added `sleep` call needs `from time import sleep` at the top of the module.)

More ideal solutions would not sleep here and would instead wait until the OS has actually propagated the shared memory to all processes. We could also retry the non-local-leader block up to some timeout if it fails; a rough sketch of that approach is below.
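As an illustration of the retry idea (not a patch against the library's internals), the sketch below uses the stdlib multiprocessing.shared_memory.SharedMemory rather than streaming's own SharedMemory wrapper, and `attach_with_timeout` plus its timeout/poll defaults are invented for the example:

```python
import time
from multiprocessing.shared_memory import SharedMemory  # stdlib stand-in for streaming's wrapper


def attach_with_timeout(name: str, timeout: float = 30.0, poll: float = 0.07) -> SharedMemory:
    """Attach to an existing shared memory segment, retrying until `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create=False attaches to an existing segment; it raises
            # FileNotFoundError if the leader's segment is not visible yet.
            return SharedMemory(name=name, create=False)
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                # Give up with the same kind of error the current code raises immediately.
                raise RuntimeError(f'Shared memory {name!r} was not visible within {timeout}s; '
                                   f'it may not have been registered by the local leader.')
            time.sleep(poll)
```

The non-leader path would then call something like this instead of a bare `SharedMemory(name, False)`, so a transient propagation delay becomes a short wait, while a segment that really was never registered still fails with a clear error once the timeout expires.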
Edit to add:
I may have been hitting the same exceptions as @schopra8 but for different reasons: in my case the error was happening with a single StreamingDataset, and commenting out the dist.destroy_process_group() call had no impact.
However, I think it's still fairly likely the issues are related; the second dataset or the process group destruction may have been increasing the latency of shared memory propagation.
Also:
This is on Linux with Streaming 0.12.0. Downgrading to 0.9.0 did not affect the issue for me.
Note on a possible general SharedMemory issue
There are other open issues describing similar failures in places other than get_shm_prefix. I haven't had time to look into them, so I'm not sure, but it seems likely they share a common cause if other code uses SharedMemory the way get_shm_prefix does: creating and writing to SharedMemory does not block until the data have synchronized to all processes. In get_shm_prefix it is used as though it were blocking, and I suspect it is used the same way in other parts of the codebase, causing similar race conditions. If so, making writes to SharedMemory block until the data are fully propagated would solve multiple issues at once.
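To sketch what a "blocking" write could look like (hypothetical helper name, arbitrary timeout, and the stdlib SharedMemory again standing in for the library's wrapper): the leader creates the segment, writes the payload, and then polls a freshly attached handle until the bytes read back match before returning.

```python
import time
from multiprocessing.shared_memory import SharedMemory  # stdlib stand-in for streaming's wrapper


def create_and_publish(name: str, data: bytes, timeout: float = 10.0,
                       poll: float = 0.05) -> SharedMemory:
    """Create a segment, write `data`, and return only once a freshly
    attached handle can read the same bytes back."""
    shm = SharedMemory(name=name, create=True, size=len(data))
    shm.buf[:len(data)] = data
    deadline = time.monotonic() + timeout
    while True:
        try:
            probe = SharedMemory(name=name, create=False)
            visible = bytes(probe.buf[:len(data)]) == data
            probe.close()
            if visible:
                return shm
        except FileNotFoundError:
            pass  # segment not attachable yet; keep polling
        if time.monotonic() >= deadline:
            raise RuntimeError(f'Shared memory {name!r} was written but could not be '
                               f'read back within {timeout}s.')
        time.sleep(poll)
```

This only confirms visibility from the writing process's side, so it is a heuristic rather than a true cross-process guarantee, but combined with a reader-side retry it would centralize the waiting in one place instead of scattering sleeps through the codebase.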