utils: implement windows epoll/eventfd #609

Open
lstocchi wants to merge 2 commits into containers:whpx-wip from lstocchi:windows_epoll

Conversation

@lstocchi

This PR is part of the effort to make libkrun work on Windows.

It includes two commits implementing epoll and eventfd.

Epoll is backed by IOCP and leverages Wait Completion Packets so the kernel handles event delivery.
Eventfd associates an event object with a counter to replicate what other platforms do.

More details in each commit.

This introduces an epoll-compatible polling abstraction for Windows by
leveraging I/O Completion Ports (IOCP) as a central event multiplexer.
By bypassing the standard `WaitForMultipleObjects` 64-handle limit, this
architecture gives us true O(1) wake-ups with no handle-count limitations.

When an event is added to the epoll, we heap-allocate a `Watch` struct
and attach a Wait Completion Packet (WCP) to it. The raw memory pointer
of this struct is passed to the kernel as the completion key. When the
event signals, the kernel pushes the WCP to the IOCP queue. The waiting
thread pops the packet and reads the pointer, allowing us to process
events with zero heap allocations and a completely lock-free hot path.
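The pointer-as-completion-key pattern can be sketched cross-platform. This is a minimal illustration only — `Watch`, `register`, `on_completion`, and `unregister` are invented names, and the plain `usize` key stands in for the completion key the real code hands to the kernel via the WCP association:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Stand-in for the per-event bookkeeping struct described above.
struct Watch {
    data: AtomicU64, // user payload (epoll_data.u64)
}

// ADD time: heap-allocate once; the raw pointer doubles as the
// kernel completion key.
fn register(data: u64) -> usize {
    Box::into_raw(Box::new(Watch {
        data: AtomicU64::new(data),
    })) as usize
}

// Hot path: turn the key back into a pointer and read the payload.
// No allocation, no lock.
fn on_completion(key: usize) -> u64 {
    let watch = unsafe { &*(key as *const Watch) };
    watch.data.load(Ordering::Acquire)
}

// DELETE time: reclaim the allocation made in `register`.
fn unregister(key: usize) {
    drop(unsafe { Box::from_raw(key as *mut Watch) });
}

fn main() {
    let key = register(42);
    assert_eq!(on_completion(key), 42);
    unregister(key);
    println!("ok");
}
```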

When deleting an event, the WCP is closed and the `Watch` memory is moved
to a "zombie" list with a 5-second garbage collection delay. Because the
IOCP queue is managed asynchronously by the Windows kernel, a deleted
event might still have a completion packet in-flight to a worker thread.
This GC window safely drains those ghost packets before freeing the memory,
preventing Use-After-Free segfaults.
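The zombie scheme can be sketched as follows, with the GC delay parameterized instead of hard-coded to 5 seconds. `ZombieList`, `bury`, and `collect` are hypothetical names, not the PR's actual API:

```rust
use std::time::{Duration, Instant};

// `key` plays the role of the raw Watch pointer handed to the kernel
// as completion key.
struct Zombie {
    key: usize,
    died_at: Instant,
}

struct ZombieList {
    delay: Duration,
    zombies: Vec<Zombie>,
}

impl ZombieList {
    fn new(delay: Duration) -> Self {
        ZombieList { delay, zombies: Vec::new() }
    }

    // ctl(Delete): don't free yet — an in-flight completion packet may
    // still carry this pointer; park the allocation instead.
    fn bury(&mut self, key: usize) {
        self.zombies.push(Zombie { key, died_at: Instant::now() });
    }

    // Called opportunistically (e.g. from wait()): free only entries
    // older than the GC delay. Returns how many were freed.
    fn collect(&mut self) -> usize {
        let delay = self.delay;
        let before = self.zombies.len();
        self.zombies.retain(|z| {
            if z.died_at.elapsed() >= delay {
                // Reclaim the allocation made at ADD time.
                drop(unsafe { Box::from_raw(z.key as *mut u64) });
                false
            } else {
                true
            }
        });
        before - self.zombies.len()
    }
}

fn main() {
    let mut gc = ZombieList::new(Duration::from_millis(0));
    gc.bury(Box::into_raw(Box::new(0u64)) as usize);
    assert_eq!(gc.collect(), 1);
    println!("ok");
}
```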

One calculated tradeoff is the reliance on the undocumented Windows NT
native API (`NtAssociateWaitCompletionPacket`). While unofficial, it has
been stable in the Windows kernel for over a decade and is currently the
only way to achieve VMM-grade, O(1) polling performance for arbitrary handles.

Additional details:
- EventSet bit values mirror the macOS implementation for cross-platform portability.
- Adds `windows-sys` as a Windows-only dependency for OS APIs.

Signed-off-by: lstocchi <lstocchi@redhat.com>

Emulate eventfd on Windows using a manual-reset kernel Event object paired with a Mutex-protected counter.

To maximize VMM throughput, the `write` path only triggers the event when the counter transitions from `0 -> non-zero`.
If a virtual device rapid-fires multiple interrupts before the vCPU wakes up,
we accumulate the data in user-space RAM and skip the redundant kernel syscalls entirely.

`read` and `wait_timeout` maintain strict level-triggered synchronization.
The kernel event is only reset (`ResetEvent`) when the internal counter is fully drained,
preventing the IOCP epoll loop from entering an infinite busy-wait cycle.

Signed-off-by: lstocchi <lstocchi@redhat.com>
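These eventfd semantics can be sketched cross-platform, with a `Condvar` standing in for the manual-reset kernel Event and a boolean modeling its set/reset state. All names are illustrative, not the PR's actual code:

```rust
use std::sync::{Condvar, Mutex};

struct EventFd {
    state: Mutex<(u64, bool)>, // (counter, event signaled)
    cond: Condvar,
}

impl EventFd {
    fn new() -> Self {
        EventFd { state: Mutex::new((0, false)), cond: Condvar::new() }
    }

    fn write(&self, n: u64) {
        let mut s = self.state.lock().unwrap();
        let was_zero = s.0 == 0;
        s.0 += n;
        // Only "SetEvent" on the 0 -> non-zero transition; later writes
        // just accumulate in user space with no kernel round-trip.
        if was_zero && !s.1 {
            s.1 = true;
            self.cond.notify_all();
        }
    }

    fn read(&self) -> u64 {
        let mut s = self.state.lock().unwrap();
        while s.0 == 0 {
            s = self.cond.wait(s).unwrap();
        }
        let val = s.0;
        s.0 = 0;
        // Counter fully drained: "ResetEvent" so a poller does not spin.
        s.1 = false;
        val
    }
}

fn main() {
    let efd = EventFd::new();
    efd.write(1);
    efd.write(2); // coalesced: no second signal needed
    assert_eq!(efd.read(), 3);
    println!("ok");
}
```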
@lstocchi
Author

cc @germag

let entries_slice = unsafe {
    std::slice::from_raw_parts(entries.as_ptr().cast::<OVERLAPPED_ENTRY>(), count as usize)
};
Collaborator

This is a bit cleaner:

Suggested change
std::slice::from_raw_parts(entries.as_ptr().cast::<OVERLAPPED_ENTRY>(), count as usize)
entries[..count as usize].assume_init_ref()

(This requires Rust 1.93, but that shouldn't be a problem.)

Author


Good one 👍 Just want to mention that this is going to a WIP branch to test how it works, but it can definitely be enhanced before being pushed to main.

Comment on lines +391 to +408
if !watch.is_active.load(Ordering::Acquire) {
    continue;
}

let current_events = watch.events.load(Ordering::Acquire);
let event_set = EventSet::from_bits_truncate(current_events);

events[result_count] = EpollEvent {
    events: (event_set & (EventSet::IN | EventSet::OUT)).bits(),
    u64: watch.data.load(Ordering::Acquire),
};
result_count += 1;

if !event_set.contains(EventSet::EDGE_TRIGGERED) {
    // Level-triggered: re-associate the WCP so the next signal
    // on this handle produces another completion packet.
    let _ = associate_wcp(watch.wcp, iocp_handle, watch.fd, watch_ptr as *mut _);
}
Collaborator


Isn't there a race between CloseHandle(watch.wcp) in Epoll::ctl(ControlOperation::Delete, ...) and associate_wcp(watch.wcp, ...) in Epoll::wait()?

In wait(), we first check watch.is_active and afterwards may call associate_wcp(watch.wcp, ...), but that does not protect the wcp handle itself. Another thread can enter Epoll::ctl(ControlOperation::Delete, ...) in between those two steps, mark the watch inactive, and call CloseHandle(watch.wcp). At that point, the wait() thread still holds the old watch.wcp value and may pass it to associate_wcp() even though the handle has already been closed.

Author


Thanks for reviewing it @mtjhrc :)

Yes, you're right, there is a race there, but it's a calculated (maybe wrongly) one. The theory is that if we get to associate_wcp with a closed wcp handle, NtAssociateWaitCompletionPacket will just return STATUS_INVALID_HANDLE and we silently drop it. Another thing that may happen is that we close the handle and Windows reuses it just before we call NtAssociateWaitCompletionPacket, but even in that case, if the handle now refers to any object other than a WCP, the call silently fails with STATUS_OBJECT_TYPE_MISMATCH.
But since the event was deleted, we don't care about a failure, as it should not be queued in any case. WDYT?

I haven't found a better way to keep the wait() lock-free without any trade-off.

Collaborator

@mtjhrc Mar 27, 2026


Thanks for reviewing it @mtjhrc :)

I know this is going into a WIP branch, but I've never programmed on Windows, so I was curious what this looks like! 😃 Also, I was thinking of cleaning up the EventFd/Epoll abstraction and the macOS shim, so I was curious whether this changes things.

... but even in that case if the handle is assigned to any other object different to WCP it silently fails with STATUS_OBJECT_TYPE_MISMATCH.

That makes sense, but still - couldn't another thread create a new wait-completion-packet object and get the same recycled handle value? In that case the type would still match, so NtAssociateWaitCompletionPacket() would NOT fail, right? We would then end up using a different wcp.

I am pretty sure there must be a low overhead (compared to the syscalls) way to do this e.g. some slightly more complex state transition with atomics (possibly using WaitOnAddress/WakeByAddress).

It's fine for now, but it should really have a comment explaining the calculated race.
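One way such an atomic state transition could look, purely as an illustration — the states and names here are invented, and a real fix would still need to integrate with the IOCP loop (and possibly WaitOnAddress/WakeByAddress for blocking):

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const ACTIVE: u8 = 0;  // normal state, wcp handle valid
const PINNED: u8 = 1;  // wait() is re-associating the wcp
const DELETED: u8 = 2; // ctl(Delete) ran; wcp must be closed

struct WatchState(AtomicU8);

impl WatchState {
    fn new() -> Self {
        WatchState(AtomicU8::new(ACTIVE))
    }

    // wait(): pin the watch before touching watch.wcp. Fails if the
    // watch was already deleted, so the closed handle is never reused.
    fn try_pin(&self) -> bool {
        self.0
            .compare_exchange(ACTIVE, PINNED, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    // wait(): unpin after associate_wcp(). Returns true if a Delete
    // arrived meanwhile, meaning the waiter must close the wcp itself.
    fn unpin(&self) -> bool {
        self.0
            .compare_exchange(PINNED, ACTIVE, Ordering::AcqRel, Ordering::Acquire)
            .is_err()
    }

    // ctl(Delete): returns true if the caller may CloseHandle(wcp) now;
    // false means a waiter is pinned and will close it on unpin.
    fn delete(&self) -> bool {
        self.0.swap(DELETED, Ordering::AcqRel) == ACTIVE
    }
}

fn main() {
    let s = WatchState::new();
    assert!(s.try_pin()); // wait() pins before re-associating
    assert!(!s.delete()); // Delete arrives mid-pin: defer the close
    assert!(s.unpin());   // waiter learns it must close the wcp
    println!("ok");
}
```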

Author


That makes sense, but still - couldn't another thread create a new wait-completion-packet object and get the same recycled handle value? In that case the type would still match so NtAssociateWaitCompletionPacket() would NOT fail, right? In that case we would end up using a different wcp.

Yes, that would cause misbehavior, but in this PR I think we should be safe.

  1. Thread A is in wait() and is in the middle of associating the wcp.
  2. Thread B deletes the wcp handle.
  3. Thread C creates a new wcp and Windows assigns the same handle value.
  4. Thread A associates the wcp with the watch object again and it succeeds.
  5. Thread C tries to associate the wcp with its own watch; it fails and deletes the WCP handle it created. -> https://github.com/containers/libkrun/pull/609/changes#diff-c76c1e69b53fd8bb8ced70bad6a1bb8d17eceedc7a8c82f8fb1ddb977f1f8c7cR272-R278
  6. At this point, Thread C failed and Thread A associated the WCP. The kernel will delete the WCP since its reference count is 0 (the handle was deleted by Thread C, and the packet was never pushed to the IOCP queue because the event never triggered -> a delete was requested by Thread B).

The only thing we lose is the event Thread C tried to add; hopefully it will be requeued, but nothing else should mess up the system.

I am pretty sure there must be a low overhead (compared to the syscalls) way to do this e.g. some slightly more complex state transition with atomics (possibly using WaitOnAddress/WakeByAddress).

My concern is with the GC. I'd be super happy if we could remove it somehow; I guess it would solve some edge cases.

It's fine for now, but it should really have a comment explaining the calculated race.

+1, I'll add a comment
