utils: implement windows epoll/eventfd #609

Open
lstocchi wants to merge 2 commits into containers:whpx-wip from lstocchi:windows_epoll

Conversation

@lstocchi

This PR is part of the effort to make libkrun work on Windows.

It includes two commits implementing epoll and eventfd.

Epoll is backed by IOCP and leverages Wait Completion Packets so the kernel handles event delivery.
Eventfd associates an event object with a counter to replicate what other platforms do.

More details in each commit.

This introduces an epoll-compatible polling abstraction for Windows by
leveraging I/O Completion Ports (IOCP) as a central event multiplexer.
By bypassing the standard `WaitForMultipleObjects` 64-handle limit, this
architecture gives us true O(1) wake-ups with no handle-count limitations.

When an event is added to the epoll, we heap-allocate a `Watch` struct
and attach a Wait Completion Packet (WCP) to it. The raw memory pointer
of this struct is passed to the kernel as the completion key. When the
event signals, the kernel pushes the WCP to the IOCP queue. The waiting
thread pops the packet and reads the pointer, allowing us to process
events with zero heap allocations and a completely lock-free hot path.
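The pointer-as-completion-key pattern can be sketched cross-platform. This is a minimal illustration only — `Watch`, `register`, `on_completion`, and `unregister` are invented names, and the plain `usize` key stands in for the completion key the real code hands to the kernel via the WCP association:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the per-event bookkeeping struct described above.
struct Watch {
    data: AtomicU64, // user payload (epoll_data.u64)
}

// ADD time: heap-allocate once; the raw pointer doubles as the
// kernel completion key.
fn register(data: u64) -> usize {
    Box::into_raw(Box::new(Watch {
        data: AtomicU64::new(data),
    })) as usize
}

// Hot path: turn the key back into a pointer and read the payload.
// No allocation, no lock.
fn on_completion(key: usize) -> u64 {
    let watch = unsafe { &*(key as *const Watch) };
    watch.data.load(Ordering::Acquire)
}

// DELETE time: reclaim the allocation made in `register`.
fn unregister(key: usize) {
    drop(unsafe { Box::from_raw(key as *mut Watch) });
}

fn main() {
    let key = register(42);
    assert_eq!(on_completion(key), 42);
    unregister(key);
    println!("ok");
}
```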

When deleting an event, the WCP is closed and the `Watch` memory is moved
to a "zombie" list with a 5-second garbage collection delay. Because the
IOCP queue is managed asynchronously by the Windows kernel, a deleted
event might still have a completion packet in-flight to a worker thread.
This GC window safely drains those ghost packets before freeing the memory,
preventing Use-After-Free segfaults.
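The zombie scheme can be sketched as follows, with the GC delay parameterized instead of hard-coded to 5 seconds. `ZombieList`, `bury`, and `collect` are hypothetical names, not the PR's actual API:

```rust
use std::time::{Duration, Instant};

// `key` plays the role of the raw Watch pointer handed to the kernel
// as completion key.
struct Zombie {
    key: usize,
    died_at: Instant,
}

struct ZombieList {
    delay: Duration,
    zombies: Vec<Zombie>,
}

impl ZombieList {
    fn new(delay: Duration) -> Self {
        ZombieList { delay, zombies: Vec::new() }
    }

    // ctl(Delete): don't free yet — an in-flight completion packet may
    // still carry this pointer; park the allocation instead.
    fn bury(&mut self, key: usize) {
        self.zombies.push(Zombie { key, died_at: Instant::now() });
    }

    // Called opportunistically (e.g. from wait()): free only entries
    // older than the GC delay. Returns how many were freed.
    fn collect(&mut self) -> usize {
        let delay = self.delay;
        let before = self.zombies.len();
        self.zombies.retain(|z| {
            if z.died_at.elapsed() >= delay {
                // Reclaim the allocation made at ADD time.
                drop(unsafe { Box::from_raw(z.key as *mut u64) });
                false
            } else {
                true
            }
        });
        before - self.zombies.len()
    }
}

fn main() {
    let mut gc = ZombieList::new(Duration::from_millis(0));
    gc.bury(Box::into_raw(Box::new(0u64)) as usize);
    assert_eq!(gc.collect(), 1);
    println!("ok");
}
```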

One calculated tradeoff is the reliance on the undocumented Windows NT
native API (`NtAssociateWaitCompletionPacket`). While unofficial, it has
been stable in the Windows kernel for over a decade and is currently the
only way to achieve VMM-grade, O(1) polling performance for arbitrary handles.

Additional details:
- EventSet bit values mirror the macOS implementation for cross-platform portability.
- Adds `windows-sys` as a Windows-only dependency for OS APIs.

Signed-off-by: lstocchi <lstocchi@redhat.com>

Emulate eventfd on Windows using a manual-reset kernel Event object paired with a Mutex-protected counter.

To maximize VMM throughput, the `write` path only triggers the event when the counter transitions from `0 -> non-zero`.
If a virtual device rapid-fires multiple interrupts before the vCPU wakes up,
we accumulate the data in user-space RAM and skip the redundant kernel syscalls entirely.

`read` and `wait_timeout` maintain strict level-triggered synchronization.
The kernel event is only reset (`ResetEvent`) when the internal counter is fully drained,
preventing the IOCP epoll loop from entering an infinite busy-wait cycle.

Signed-off-by: lstocchi <lstocchi@redhat.com>
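These eventfd semantics can be sketched cross-platform, with a `Condvar` standing in for the manual-reset kernel Event and a boolean modeling its set/reset state. All names are illustrative, not the PR's actual code:

```rust
use std::sync::{Condvar, Mutex};

struct EventFd {
    state: Mutex<(u64, bool)>, // (counter, event signaled)
    cond: Condvar,
}

impl EventFd {
    fn new() -> Self {
        EventFd { state: Mutex::new((0, false)), cond: Condvar::new() }
    }

    fn write(&self, n: u64) {
        let mut s = self.state.lock().unwrap();
        let was_zero = s.0 == 0;
        s.0 += n;
        // Only "SetEvent" on the 0 -> non-zero transition; later writes
        // just accumulate in user space with no kernel round-trip.
        if was_zero && !s.1 {
            s.1 = true;
            self.cond.notify_all();
        }
    }

    fn read(&self) -> u64 {
        let mut s = self.state.lock().unwrap();
        while s.0 == 0 {
            s = self.cond.wait(s).unwrap();
        }
        let val = s.0;
        s.0 = 0;
        // Counter fully drained: "ResetEvent" so a poller does not spin.
        s.1 = false;
        val
    }
}

fn main() {
    let efd = EventFd::new();
    efd.write(1);
    efd.write(2); // coalesced: no second signal needed
    assert_eq!(efd.read(), 3);
    println!("ok");
}
```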
@lstocchi
Author

cc @germag

let entries_slice = unsafe {
    std::slice::from_raw_parts(entries.as_ptr().cast::<OVERLAPPED_ENTRY>(), count as usize)
};
Collaborator

This is a bit cleaner:

Suggested change
std::slice::from_raw_parts(entries.as_ptr().cast::<OVERLAPPED_ENTRY>(), count as usize)
entries[..count as usize].assume_init_ref()

(This requires Rust 1.93, but that shouldn't be a problem.)

Author


Good one 👍 Just want to mention that this is going to a WIP branch to test how it works, but it can definitely be enhanced before being pushed to main.

Comment on lines +391 to +408
if !watch.is_active.load(Ordering::Acquire) {
    continue;
}

let current_events = watch.events.load(Ordering::Acquire);
let event_set = EventSet::from_bits_truncate(current_events);

events[result_count] = EpollEvent {
    events: (event_set & (EventSet::IN | EventSet::OUT)).bits(),
    u64: watch.data.load(Ordering::Acquire),
};
result_count += 1;

if !event_set.contains(EventSet::EDGE_TRIGGERED) {
    // Level-triggered: re-associate the WCP so the next signal
    // on this handle produces another completion packet.
    let _ = associate_wcp(watch.wcp, iocp_handle, watch.fd, watch_ptr as *mut _);
}
Collaborator


Isn't there a race between CloseHandle(watch.wcp) in Epoll::ctl(ControlOperation::Delete, ...) and associate_wcp(watch.wcp, ...) in Epoll::wait()?

In wait(), we first check watch.is_active and afterwards may call associate_wcp(watch.wcp, ...), but that does not protect the wcp handle itself. Another thread can enter Epoll::ctl(ControlOperation::Delete, ...) in between those two steps, mark the watch inactive, and call CloseHandle(watch.wcp). At that point, the wait() thread still holds the old watch.wcp value and may pass it to associate_wcp() even though the handle has already been closed.

Author


Thanks for reviewing it @mtjhrc :)

Yes, you're right, there is a race there, but it's a calculated (maybe wrongly) one. The theory is that if we get to associate_wcp with a closed wcp handle, NtAssociateWaitCompletionPacket will just return STATUS_INVALID_HANDLE and we silently drop it. Another thing that may happen is that we close the handle and Windows reuses it just before we call NtAssociateWaitCompletionPacket, but even in that case, if the handle now refers to any object other than a WCP, the call silently fails with STATUS_OBJECT_TYPE_MISMATCH.
But since the event was deleted, we don't care about a failure, as it should not be queued in any case. WDYT?

I haven't found a better way to keep the wait() lock-free without any trade-off.

Collaborator

@mtjhrc Mar 27, 2026


Thanks for reviewing it @mtjhrc :)

I know this is going into a WIP branch, but I've never programmed on Windows, so I was curious what this looks like! 😃 Also, I was thinking of cleaning up the EventFd/Epoll abstraction and the macOS shim, so I was curious whether this changes things.

... but even in that case if the handle is assigned to any other object different to WCP it silently fails with STATUS_OBJECT_TYPE_MISMATCH.

That makes sense, but still - couldn't another thread create a new wait-completion-packet object and get the same recycled handle value? In that case the type would still match, so NtAssociateWaitCompletionPacket() would NOT fail, right? We would then end up using a different wcp.

I am pretty sure there must be a low overhead (compared to the syscalls) way to do this e.g. some slightly more complex state transition with atomics (possibly using WaitOnAddress/WakeByAddress).

It's fine for now, but it should really have a comment explaining the calculated race.
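One way such an atomic state transition could look, purely as an illustration — the states and names here are invented, and a real fix would still need to integrate with the IOCP loop (and possibly WaitOnAddress/WakeByAddress for blocking):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const ACTIVE: u8 = 0;  // normal state, wcp handle valid
const PINNED: u8 = 1;  // wait() is re-associating the wcp
const DELETED: u8 = 2; // ctl(Delete) ran; wcp must be closed

struct WatchState(AtomicU8);

impl WatchState {
    fn new() -> Self {
        WatchState(AtomicU8::new(ACTIVE))
    }

    // wait(): pin the watch before touching watch.wcp. Fails if the
    // watch was already deleted, so the closed handle is never reused.
    fn try_pin(&self) -> bool {
        self.0
            .compare_exchange(ACTIVE, PINNED, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    // wait(): unpin after associate_wcp(). Returns true if a Delete
    // arrived meanwhile, meaning the waiter must close the wcp itself.
    fn unpin(&self) -> bool {
        self.0
            .compare_exchange(PINNED, ACTIVE, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
    }

    // ctl(Delete): returns true if the caller may CloseHandle(wcp) now;
    // false means a waiter is pinned and will close it on unpin.
    fn delete(&self) -> bool {
        self.0.swap(DELETED, Ordering::AcqRel) == ACTIVE
    }
}

fn main() {
    let s = WatchState::new();
    assert!(s.try_pin()); // wait() pins before re-associating
    assert!(!s.delete()); // Delete arrives mid-pin: defer the close
    assert!(s.unpin());   // waiter learns it must close the wcp
    println!("ok");
}
```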

Author


That makes sense, but still - couldn't another thread create a new wait-completion-packet object and get the same recycled handle value? In that case the type would still match so NtAssociateWaitCompletionPacket() would NOT fail, right? In that case we would end up using a different wcp.

Yes, that would cause misbehavior, but in this PR I think we should be safe.

  1. Thread A is in wait() and is in the middle of associating the wcp.
  2. Thread B deletes the wcp handle.
  3. Thread C creates a new wcp and Windows assigns the same handle value.
  4. Thread A associates the wcp with the watch object again and it succeeds.
  5. Thread C tries to associate the wcp with its own watch; it fails and deletes the WCP handle it created. -> https://github.com/containers/libkrun/pull/609/changes#diff-c76c1e69b53fd8bb8ced70bad6a1bb8d17eceedc7a8c82f8fb1ddb977f1f8c7cR272-R278
  6. At this point, Thread C failed and Thread A associated the WCP. The kernel will delete the WCP since its reference count is 0 (the handle was deleted by Thread C, and the packet was never pushed to the IOCP queue because the event never triggered -> a delete was requested by Thread B).

The only thing we lose is the event Thread C tried to add; hopefully it will be requeued, but nothing else should mess up the system.

I am pretty sure there must be a low overhead (compared to the syscalls) way to do this e.g. some slightly more complex state transition with atomics (possibly using WaitOnAddress/WakeByAddress).

My concern is with the GC. I'd be super happy if we could remove it somehow; I guess it would solve some edge cases.

It's fine for now, but it should really have a comment explaining the calculated race.

+1, I'll add a comment
