Description
Environment
- OS: Ubuntu
- Hardware (GPU, or instance type): 4xH200, 4xH100, 8xH100 (runpod instances)
Issue
A race condition in get_shm_prefix causes both of the following errors to be raised because the OS has not finished synchronizing shared memory across processes:
streaming/streaming/base/shared/prefix.py, lines 234 to 246 at 223be8f:

```python
try:
    shm = SharedMemory(name, False)
except FileNotFoundError:
    raise RuntimeError(f'Internal error: shared memory prefix={prefix_int} was not ' +
                       f'registered by local leader. This may be because you specified ' +
                       f'different ``local`` parameters from different ranks.')
their_locals, their_prefix_int = _unpack_locals(bytes(shm.buf))
if streams_local != their_locals or prefix_int != their_prefix_int:
    raise RuntimeError(f'Internal error: shared memory registered does not match ' +
                       f'local leader as streams_local or prefix_int not match. ' +
                       f'local leader: {their_locals} and {their_prefix_int}. ' +
                       f'expected: {streams_local} and {prefix_int}.')
```
Related: #717, #332, #824, #884, and possibly #767
I'm pasting the contents of my comment from #824 below (it contains most of the context and info):
I've been running into the same issue, but I'm only using a single StreamingDataset. I think it's a race condition with the SharedMemory that gets created by the leader here and then accessed by non-leader processes in the next if block:
streaming/streaming/base/shared/prefix.py, lines 222 to 230 at 223be8f:

```python
if world.is_local_leader:
    name = _get_path(prefix_int, LOCALS)
    data = _pack_locals(streams_local, prefix_int)
    shm = SharedMemory(name, True, len(data))
    shm.buf[:len(data)] = data

if dist.is_available() and dist.is_initialized():
    dist.barrier()
```
I believe this race condition happens because there is some latency for the operating system to fully synchronize shared memory across different processes. The barrier doesn't fix this, because the leader isn't doing anything to wait for the OS-level propagation to happen.
The reason I think this is the cause is that I was able to fix the issue by inserting the following sleep call (at line 230) to give the OS time to synchronize:
```python
if world.is_local_leader:
    name = _get_path(prefix_int, LOCALS)
    data = _pack_locals(streams_local, prefix_int)
    shm = SharedMemory(name, True, len(data))
    shm.buf[:len(data)] = data

if dist.is_available() and dist.is_initialized():
    dist.barrier()

sleep(1)  # this sleep gives time for the OS to update the shm on all processes
```

(The added `sleep` call needs `from time import sleep` at the top of the module.)

More ideal solutions would not sleep here and would instead wait until the OS has actually propagated the shared memory to all processes. We could also retry the non-local-leader block up to some timeout if it fails; a rough sketch of that approach is below.
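As an illustration of the retry idea (not a patch against the library's internals), the sketch below uses the stdlib multiprocessing.shared_memory.SharedMemory rather than streaming's own SharedMemory wrapper, and `attach_with_timeout` plus its timeout/poll defaults are invented for the example:

```python
import time
from multiprocessing.shared_memory import SharedMemory  # stdlib stand-in for streaming's wrapper


def attach_with_timeout(name: str, timeout: float = 30.0, poll: float = 0.07) -> SharedMemory:
    """Attach to an existing shared memory segment, retrying until `timeout` expires."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create=False attaches to an existing segment; it raises
            # FileNotFoundError if the leader's segment is not visible yet.
            return SharedMemory(name=name, create=False)
        except FileNotFoundError:
            if time.monotonic() >= deadline:
                # Give up with the same kind of error the current code raises immediately.
                raise RuntimeError(f'Shared memory {name!r} was not visible within {timeout}s; '
                                   f'it may not have been registered by the local leader.')
            time.sleep(poll)
```

The non-leader path would then call something like this instead of a bare `SharedMemory(name, False)`, so a transient propagation delay becomes a short wait, while a segment that really was never registered still fails with a clear error once the timeout expires.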
Edit to add:
I may have been hitting the same exceptions as @schopra8 but for different reasons: in my case the error was happening with a single StreamingDataset, and commenting out the dist.destroy_process_group() call had no impact.
However, I think it's still fairly likely the issues are related; the second dataset or the process group destruction may have been increasing the latency of shared memory propagation.
Also:
This is on Linux with Streaming 0.12.0. Downgrading to 0.9.0 did not affect the issue for me.
Note on a possible general SharedMemory issue
There are other open issues describing similar failures in places other than get_shm_prefix. I haven't had time to look into them, so I'm not sure, but it seems likely they share a common cause if other code uses SharedMemory the way get_shm_prefix does: creating and writing to SharedMemory does not block until the data have synchronized to all processes. In get_shm_prefix it is used as though it were blocking, and I suspect it is used the same way in other parts of the codebase, causing similar race conditions. If so, making writes to SharedMemory block until the data are fully propagated would solve multiple issues at once.
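To sketch what a "blocking" write could look like (hypothetical helper name, arbitrary timeout, and the stdlib SharedMemory again standing in for the library's wrapper): the leader creates the segment, writes the payload, and then polls a freshly attached handle until the bytes read back match before returning.

```python
import time
from multiprocessing.shared_memory import SharedMemory  # stdlib stand-in for streaming's wrapper


def create_and_publish(name: str, data: bytes, timeout: float = 10.0,
                       poll: float = 0.05) -> SharedMemory:
    """Create a segment, write `data`, and return only once a freshly
    attached handle can read the same bytes back."""
    shm = SharedMemory(name=name, create=True, size=len(data))
    shm.buf[:len(data)] = data
    deadline = time.monotonic() + timeout
    while True:
        try:
            probe = SharedMemory(name=name, create=False)
            visible = bytes(probe.buf[:len(data)]) == data
            probe.close()
            if visible:
                return shm
        except FileNotFoundError:
            pass  # segment not attachable yet; keep polling
        if time.monotonic() >= deadline:
            raise RuntimeError(f'Shared memory {name!r} was written but could not be '
                               f'read back within {timeout}s.')
        time.sleep(poll)
```

This only confirms visibility from the writing process's side, so it is a heuristic rather than a true cross-process guarantee, but combined with a reader-side retry it would centralize the waiting in one place instead of scattering sleeps through the codebase.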