utils: implement windows epoll/eventfd #609
lstocchi wants to merge 2 commits into containers:whpx-wip
Conversation
This introduces an epoll-compatible polling abstraction for Windows by leveraging I/O Completion Ports (IOCP) as a central event multiplexer. By bypassing the standard `WaitForMultipleObjects` 64-handle limit, this architecture gives us true O(1) wake-ups with no handle-count limitations.

When an event is added to the epoll, we heap-allocate a `Watch` struct and attach a Wait Completion Packet (WCP) to it. The raw memory pointer of this struct is passed to the kernel as the completion key. When the event signals, the kernel pushes the WCP to the IOCP queue. The waiting thread pops the packet and reads the pointer, allowing us to process events with zero heap allocations and a completely lock-free hot path.

When deleting an event, the WCP is closed and the `Watch` memory is moved to a "zombie" list with a 5-second garbage collection delay. Because the IOCP queue is managed asynchronously by the Windows kernel, a deleted event might still have a completion packet in-flight to a worker thread. This GC window safely drains those ghost packets before freeing the memory, preventing use-after-free segfaults.

One calculated tradeoff is the reliance on the undocumented Windows NT native API (`NtAssociateWaitCompletionPacket`). While unofficial, it has been stable in the Windows kernel for over a decade and is currently the only way to achieve VMM-grade, O(1) polling performance for arbitrary handles.

Additional details:
- EventSet bit values mirror the macOS implementation for cross-platform portability.
- Adds `windows-sys` as a Windows-only dependency for OS APIs.

Signed-off-by: lstocchi <lstocchi@redhat.com>
Emulate eventfd on Windows using a manual-reset kernel Event object paired with a Mutex-protected counter.

To maximize VMM throughput, the `write` path only triggers the event when the counter transitions from `0 -> non-zero`. If a virtual device rapid-fires multiple interrupts before the vCPU wakes up, we accumulate the data in user-space RAM and skip the redundant kernel syscalls entirely.

`read` and `wait_timeout` maintain strict level-triggered synchronization. The kernel event is only reset (`ResetEvent`) when the internal counter is fully drained, preventing the IOCP epoll loop from entering an infinite busy-wait cycle.

Signed-off-by: lstocchi <lstocchi@redhat.com>
cc @germag
```rust
let entries_slice = unsafe {
    std::slice::from_raw_parts(entries.as_ptr().cast::<OVERLAPPED_ENTRY>(), count as usize)
};
```
This is a bit cleaner:

```diff
- std::slice::from_raw_parts(entries.as_ptr().cast::<OVERLAPPED_ENTRY>(), count as usize)
+ entries[..count as usize].assume_init_ref()
```
(Requires recent Rust 1.93 though, but that shouldn't be a problem)
Good one 👍 Just want to mention that this is going into a WIP branch to test how it works, but it can definitely be enhanced before being pushed to main.
```rust
if !watch.is_active.load(Ordering::Acquire) {
    continue;
}

let current_events = watch.events.load(Ordering::Acquire);
let event_set = EventSet::from_bits_truncate(current_events);

events[result_count] = EpollEvent {
    events: (event_set & (EventSet::IN | EventSet::OUT)).bits(),
    u64: watch.data.load(Ordering::Acquire),
};
result_count += 1;

if !event_set.contains(EventSet::EDGE_TRIGGERED) {
    // Level-triggered: re-associate the WCP so the next signal
    // on this handle produces another completion packet.
    let _ = associate_wcp(watch.wcp, iocp_handle, watch.fd, watch_ptr as *mut _);
}
```
Isn't there a race between `CloseHandle(watch.wcp)` in `Epoll::ctl(ControlOperation::Delete, ...)` and `associate_wcp(watch.wcp, ...)` in `Epoll::wait()`?

In `wait()`, we first check `watch.is_active` and afterwards may call `associate_wcp(watch.wcp, ...)`, but that check does not protect the wcp handle itself. Another thread can enter `Epoll::ctl(ControlOperation::Delete, ...)` in between those two steps, mark the watch inactive, and call `CloseHandle(watch.wcp)`. At that point, the `wait()` thread still uses the old `watch.wcp` value and may pass it to `associate_wcp()` even though the handle has already been closed.
Thanks for reviewing it @mtjhrc :)
Yes, you're right, there is a race there, but it's a calculated (maybe wrongly) one. The theory is that if we get to `associate_wcp` with a closed wcp handle, `NtAssociateWaitCompletionPacket` will just return `STATUS_INVALID_HANDLE` and we silently drop it. Another thing that may happen is that we close the handle and it gets reused by Windows just before we call `NtAssociateWaitCompletionPacket`, but even in that case, if the handle is assigned to any object type other than a WCP, it silently fails with `STATUS_OBJECT_TYPE_MISMATCH`.

But, as the event was deleted, we don't care about a failure, as it should not be queued in any case. WDYT?

I haven't found a better way to keep `wait()` lock-free without any trade-off.
> Thanks for reviewing it @mtjhrc :)
I know this is going into a WIP branch, but I've never programmed Windows, so I was curious what this looks like! 😃 Also, I was thinking of cleaning up the EventFd/Epoll abstraction and the macOS shim, so I was curious whether this changes things.
> ... but even in that case if the handle is assigned to any other object different to WCP it silently fails with STATUS_OBJECT_TYPE_MISMATCH.
That makes sense, but still - couldn't another thread create a new wait-completion-packet object and get the same recycled handle value? In that case the type would still match so NtAssociateWaitCompletionPacket() would NOT fail, right? In that case we would end up using a different wcp.
I am pretty sure there must be a low overhead (compared to the syscalls) way to do this e.g. some slightly more complex state transition with atomics (possibly using WaitOnAddress/WakeByAddress).
It's fine for now, but it should really have a comment explaining the calculated race.
> That makes sense, but still - couldn't another thread create a new wait-completion-packet object and get the same recycled handle value? In that case the type would still match so `NtAssociateWaitCompletionPacket()` would NOT fail, right? In that case we would end up using a different wcp.
Yes, that would cause misbehavior, but in this PR I think we should be safe.

- Thread A is in wait and is in the middle of associating the wcp.
- Thread B deletes the wcp handle.
- Thread C creates a new wcp and Windows assigns the same handle.
- Thread A associates the wcp to the watch object again and it succeeds.
- Thread C tries to associate the wcp to its watch; it fails, and it deletes the WCP handle it created. -> https://github.com/containers/libkrun/pull/609/changes#diff-c76c1e69b53fd8bb8ced70bad6a1bb8d17eceedc7a8c82f8fb1ddb977f1f8c7cR272-R278
- At this point, Thread C failed and Thread A associated the WCP. The kernel will delete the WCP as it has a reference count of 0 (the handle was deleted by Thread C, and it was never pushed to the IOCP queue because the event never triggered -> a delete was requested by Thread B).

The only thing we lose is the event Thread C tried to add, and hopefully it will be requeued, but nothing else should mess up the system.
> I am pretty sure there must be a low overhead (compared to the syscalls) way to do this e.g. some slightly more complex state transition with atomics (possibly using WaitOnAddress/WakeByAddress).
My concern is with the GC. I'd be super happy if we can remove it somehow, and I guess it would solve some edge cases.
> It's fine for now, but it should really have a comment explaining the calculated race.
+1, I'll add a comment.
This PR is part of the effort to make libkrun work on Windows.
It includes 2 commits implementing epoll and eventfd.
Epoll is backed by IOCP and leverages Wait Completion Packets to let the kernel handle events.
Eventfd associates an event with a counter to replicate what's done on other platforms.
More details in each commit.