Background and motivation
For dynamic, pointer-chasing collections (Dictionary<,>, ImmutableDictionary<,>) that are large enough not to fit in cache, the dominant cost of a bulk operation is waiting for external memory, not executing instructions. The technique this proposal builds on is to interleave the execution of several operations together: instead of running one lookup (or one sub-traversal) to completion before starting the next, you advance a batch of them in lock-step, and whenever one is about to touch a memory location that will likely miss cache you issue a software prefetch for it and move on to the next. This keeps many memory accesses in flight at once, so their latencies overlap instead of being paid one after another.
This proposal is to add two opt-in batch entry points that apply this idea: a batched (vectorized) hash-find and a prefetched enumeration. The existing data structures, their layout, and every current API stay exactly as they are; the new methods sit alongside them.
A reference implementation and reproducible benchmarks are here: https://github.com/touitou-dan/BatchPrefetchDictionaries
This issue is about the collection-specific batch-prefetch prototype only. The separate, lower-level question of exposing easy-to-use .NET prefetch APIs (today only available as architecture-specific unsafe intrinsics) is already being discussed in #106860 and #106863; the work here would be a consumer of such an API. I am opening this per the suggestion to move the collection-prototype discussion onto GitHub where anyone interested can weigh in.
To be clear up front: this does not propose adding prefetching to the default collection paths. That would be risky, because the default collections are used in many ways and prefetching is not a win on average, and our own measurements agree (see "When it does not help" below). The proposal is strictly additive: opt-in batch APIs for the throughput-shaped, memory-bound, read-heavy workloads where it is a large win. The existing one-at-a-time paths are untouched.
Why software prefetch helps here, when it usually does not
Software prefetch has a deserved reputation for being unhelpful, and it is usually dismissed for two reasons. It is worth addressing both, because the batched form sidesteps them.
First, the objection that out-of-order and speculative execution already issue the loads in parallel for free. That holds when the next memory address is independent of the previous load. It does not hold inside a single dependent pointer chase: walking a hash chain dereferences entry.next, walking a tree dereferences node.Left, and the address of the next load is the result of the previous one, so the CPU cannot compute it until the prior cache-missing load has returned from DRAM. A single lookup lane therefore exposes very little memory-level parallelism. A scalar loop over many independent keys can expose some parallelism on its own (independent bucket loads across iterations, hashing, key-array reads), up to the limits of the reorder buffer and load buffers, but in a memory-bound regime that is usually not enough to saturate the available memory parallelism. Batching exposes that parallelism deliberately, and issues the prefetches far enough ahead, which is the part the hardware cannot arrange by itself.
Second, the objection that software prefetch requires getting the timing right, the gap between the prefetch and the use, which is hard to tune. In the usual single-loop setting this is real: to hide the latency the prefetch has to be issued far enough ahead of the access, and in a serial pointer chase there is often no useful work to put in between, so there is nowhere to place the prefetch far enough ahead. Interleaving reduces the sensitivity to a single tuned distance. We process the batch in N lanes round-robin: when we come back to a lane and read its address we may still stall on that read, but during the wait the other N-1 reads we already prefetched are resolving in parallel and are warm by the time we reach them. N is better understood as an amortization depth than as a per-CPU prefetch distance to calibrate. It is still a heuristic window that interacts with DRAM latency, the number of outstanding-miss buffers, cache pressure, and bandwidth, but the data shows a broad plateau (typically N around 8 to 32 on the hardware we measured), so it is far less fragile than a single hand-tuned look-ahead constant.
Prior production use: Valkey and Redis
This technique was put into production in the Valkey and Redis in-memory databases. Applying it to Valkey's main dictionary lookup (lookupKey, a chained hash) by batching incoming commands and interleaving (with prefetch) the table to dictEntry to robj chains for all their keys reduced time spent in lookupKey by more than 80% and lifted overall throughput past 1.19M RPS. The engineering write-up is here:
Valkey blog, "Unlock 1 Million RPS: Experience Triple the Speed with Valkey (part 2)": https://valkey.io/blog/unlock-one-million-rps-part2/
That is related production evidence that the pattern is effective on server-side dictionaries, not a claim that it transfers unchanged to the BCL. Valkey is C with a non-moving allocator; the .NET version has to deal with a moving GC and managed references (see Risks).
Measured results
Measured on .NET 10, 10M entries, AMD EPYC 7763 (one machine, x86-64). Each scenario compares the batch+prefetch path against an equally-shaped baseline, with bit-for-bit checksum parity between the two. Throughput is in millions per second (M/s): for the enumeration rows that means elements enumerated per second, and for the hash-find rows it means key lookups (operations) per second.
| Scenario |
Baseline (M/s) |
Batch + prefetch (M/s) |
Speedup |
| ImmutableDictionary<int, T>, enumeration (elements/s) |
7.7 |
24.4 |
3.18x |
| ImmutableDictionary<string, string>, enumeration (elements/s) |
7.7 |
19.7 |
2.55x |
| ImmutableDictionary<int, T>, hash-find (lookups/s) |
1.00 |
2.93 |
2.92x |
| ImmutableDictionary<string, …>, hash-find (lookups/s) |
0.83 |
2.45 |
2.95x |
| Dictionary<int, T>, hash-find (lookups/s) |
10.6 |
24.6 |
2.32x |
| Dictionary<string, …>, hash-find (lookups/s) |
5.0 |
11.1 |
2.24x |
Notes on the baselines, so the numbers are not read as more than they are:
- The hash-find baseline is a scalar loop of
TryGetValue over the same set of keys, with the value-object prefetch issued symmetrically on both sides (so the measured gain is the bucket/entry chase strategy, not value-dereference prefetching). It is a fair scalar baseline, but it is not the strongest conceivable one: a hand-unrolled or grouped scalar loop can expose some memory parallelism by itself, so against such a baseline the win would be somewhat smaller than against a plain loop. The numbers above should be read as "batched-prefetched vs a straightforward scalar loop", not as a universal 2 to 3x for every hash-find workload.
- The enumeration baseline is the dictionary's own struct enumerator (not a yield/IEnumerable iterator, which would add interface-dispatch overhead unrelated to prefetching). Both the baseline and the candidate are struct enumerators doing identical per-element work (accumulating the value only); the only difference is that the candidate issues software prefetches as it walks, for the tree nodes it is about to visit and for each element's value object. So the enumeration speedup is "the same walk, with prefetch, versus without".
All numbers are reproducible from the repo (--scenario <name> --size 10000000), with --validate asserting baseline/candidate count and checksum parity for every scenario. (Checksum parity is a sanity check on count and accumulated value, not a proof of exact element ordering; a runtime-quality test suite would compare exact contents and pin down any ordering guarantee.)
When it does not help (measured)
This is the crux of why the proposal is opt-in batch APIs rather than a change to default behavior.
- Already-linear scans do not benefit. Regular
Dictionary<,> enumeration is a tight linear walk over a contiguous entries array; the hardware prefetcher already handles sequential access, so batching or lane-splitting it gives no benefit. This is why the enumeration scenarios above are ImmutableDictionary<,>, whose AVL-tree walk is a pointer chase, and not Dictionary<,>.
- Small working sets that fit in L2/L3. When memory latency is not the bottleneck, the prefetch and lane overhead dominates and it can be a slight regression.
- Single-element, latency-sensitive lookups. This is a throughput optimization; one isolated
TryGetValue gets no benefit and should not pay for a batch context.
- Already-parallel workloads. The measurements are single-threaded. When many threads already saturate memory bandwidth, the per-thread memory parallelism this exposes has less headroom to convert into throughput.
I am happy to break this into per-collection issues, contribute the implementation, or adapt the benchmark harness to whatever measurement methodology the team prefers.
API Proposal
The exact surface is the main open question (see Alternative Designs), so the following is a shape sketch rather than a finished signature set. It illustrates the two capabilities.
namespace System.Collections.Generic;
public partial class Dictionary<TKey, TValue>
{
// Looks up a batch of keys in one call, interleaving the bucket/entry-chain
// walks and prefetching one hop ahead per lane. Writes each value into
// 'values' and whether it was found into 'found' (positionally aligned with
// 'keys'). Returns the number of keys found.
public int FindBatch(ReadOnlySpan<TKey> keys, Span<TValue> values, Span<bool> found);
}
namespace System.Collections.Immutable;
public partial class ImmutableDictionary<TKey, TValue>
{
public int FindBatch(ReadOnlySpan<TKey> keys, Span<TValue> values, Span<bool> found);
// Prefetched enumeration: walks the tree in N independent lanes and prefetches
// ahead. Sketch only; the consumer shape (struct foreach as below, or a
// Span-filling MoveNextBatch) is open.
public PrefetchedPairEnumerable EnumeratePrefetched(int lanes = 16);
}
API Usage
// Batched lookup: resolve many keys with a single call.
ReadOnlySpan<int> keys = GetKeysToLookUp();
Span<Payload> values = new Payload[keys.Length];
Span<bool> found = new bool[keys.Length];
int hits = dictionary.FindBatch(keys, values, found);
for (int i = 0; i < keys.Length; i++)
{
if (found[i])
Process(values[i]);
}
// Prefetched enumeration over a large ImmutableDictionary.
foreach (KeyValuePair<int, Payload> kv in immutable.EnumeratePrefetched(lanes: 16))
{
Accumulate(kv.Value);
}
Alternative Designs
Alternative approaches to the same goal (faster bulk operations on large, memory-bound dictionaries), both of which we tried in C in the past. We have not reproduced these two alternatives in .NET, so the disadvantages noted below are observations from the C experiments rather than verified .NET results:
- Implement the lanes as coroutines. Instead of explicit lane arrays and a hand-written round-robin, express each operation as a coroutine that yields at every likely-miss memory access, and let a scheduler interleave a batch of them (the idea is developed in "Exploiting Coroutines to Attack the Killer Nanoseconds", VLDB 2018, https://www.vldb.org/pvldb/vol11/p1702-jonathan.pdf; a .NET-side example of fiber-style scheduling is AsyncFiberWorks). The advantage is that the code is simpler to implement, more generic, and more readable. The disadvantage is performance: the coroutine and scheduling overhead eats into the win, so it does not reach the throughput of the explicit-lane version.
- Touch the object instead of prefetching it. Rather than issuing a software prefetch for the next address, dereference one of the object's fields (a real load) before yielding to the next lane, so the demand load is already in flight while the other lanes run. This needs no prefetch intrinsic. The disadvantage is that a real load only overlaps with the other lanes if the compiler does not reorder or eliminate it and the hardware's out-of-order and speculative execution actually keeps it outstanding while the next lane runs; a software prefetch states that intent explicitly and does not depend on those. As a result this variant is more sensitive to compiler optimization and, more importantly, to the processor's speculative-execution capability: it does not reach the performance of a true prefetch, and on some hardware the improvement is very limited.
Risks
- GC-safe prefetch. The prototype obtains an address from a managed reference and prefetches it. For the BCL this must be done with a prefetch primitive that is safe under a moving GC and across safepoints (taking a
ref/byref rather than a raw nint), so that no unmanaged object-address is held across a point where the object could move. This is the main implementation risk and ties into #106860 / #106863.
- Cross-platform. The measurements are x86 only; the reference uses
System.Runtime.Intrinsics.X86 prefetch. .NET does not currently expose a managed ARM64 prefetch intrinsic at all (the ARM64 ISA has PRFM, but there is no .NET API for it yet; adding one is exactly what #106860 / #106863 propose). So on ARM today the only option is the lane/interleave path with no prefetch step. That restructuring already wins on its own; the prefetch instruction adds the final increment on top, and the implementation needs a clean no-op fallback on platforms where no prefetch intrinsic is available. ARM64 behavior, once such an intrinsic exists, should be benchmarked separately.
- Concurrency. The batch path has the same thread-safety contract as the existing scalar API: concurrent mutation is unsupported in both, and the batched implementation keeps
Dictionary's existing concurrent-write detection (version check) and corrupted-chain guards, so misuse fails the same safe, detectable way it does today. It introduces no new sharing or shared mutable state of its own.
- Regression outside the sweet spot. As measured, small or cache-resident collections can regress slightly. Keeping the API opt-in (the caller only reaches for it when their workload matches) is the mitigation.
Background and motivation
For dynamic, pointer-chasing collections (
Dictionary<,>,ImmutableDictionary<,>) that are large enough not to fit in cache, the dominant cost of a bulk operation is waiting for external memory, not executing instructions. The technique this proposal builds on is to interleave the execution of several operations together: instead of running one lookup (or one sub-traversal) to completion before starting the next, you advance a batch of them in lock-step, and whenever one is about to touch a memory location that will likely miss cache you issue a software prefetch for it and move on to the next. This keeps many memory accesses in flight at once, so their latencies overlap instead of being paid one after another.This proposal is to add two opt-in batch entry points that apply this idea: a batched (vectorized) hash-find and a prefetched enumeration. The existing data structures, their layout, and every current API stay exactly as they are; the new methods sit alongside them.
A reference implementation and reproducible benchmarks are here: https://github.com/touitou-dan/BatchPrefetchDictionaries
This issue is about the collection-specific batch-prefetch prototype only. The separate, lower-level question of exposing easy-to-use .NET prefetch APIs (today only available as architecture-specific unsafe intrinsics) is already being discussed in #106860 and #106863; the work here would be a consumer of such an API. I am opening this per the suggestion to move the collection-prototype discussion onto GitHub where anyone interested can weigh in.
To be clear up front: this does not propose adding prefetching to the default collection paths. That would be risky, because the default collections are used in many ways and prefetching is not a win on average, and our own measurements agree (see "When it does not help" below). The proposal is strictly additive: opt-in batch APIs for the throughput-shaped, memory-bound, read-heavy workloads where it is a large win. The existing one-at-a-time paths are untouched.
Why software prefetch helps here, when it usually does not
Software prefetch has a deserved reputation for being unhelpful, and it is usually dismissed for two reasons. It is worth addressing both, because the batched form sidesteps them.
First, the objection that out-of-order and speculative execution already issue the loads in parallel for free. That holds when the next memory address is independent of the previous load. It does not hold inside a single dependent pointer chase: walking a hash chain dereferences
entry.next, walking a tree dereferencesnode.Left, and the address of the next load is the result of the previous one, so the CPU cannot compute it until the prior cache-missing load has returned from DRAM. A single lookup lane therefore exposes very little memory-level parallelism. A scalar loop over many independent keys can expose some parallelism on its own (independent bucket loads across iterations, hashing, key-array reads), up to the limits of the reorder buffer and load buffers, but in a memory-bound regime that is usually not enough to saturate the available memory parallelism. Batching exposes that parallelism deliberately, and issues the prefetches far enough ahead, which is the part the hardware cannot arrange by itself.Second, the objection that software prefetch requires getting the timing right, the gap between the prefetch and the use, which is hard to tune. In the usual single-loop setting this is real: to hide the latency the prefetch has to be issued far enough ahead of the access, and in a serial pointer chase there is often no useful work to put in between, so there is nowhere to place the prefetch far enough ahead. Interleaving reduces the sensitivity to a single tuned distance. We process the batch in N lanes round-robin: when we come back to a lane and read its address we may still stall on that read, but during the wait the other N-1 reads we already prefetched are resolving in parallel and are warm by the time we reach them. N is better understood as an amortization depth than as a per-CPU prefetch distance to calibrate. It is still a heuristic window that interacts with DRAM latency, the number of outstanding-miss buffers, cache pressure, and bandwidth, but the data shows a broad plateau (typically N around 8 to 32 on the hardware we measured), so it is far less fragile than a single hand-tuned look-ahead constant.
Prior production use: Valkey and Redis
This technique was put into production in the Valkey and Redis in-memory databases. Applying it to Valkey's main dictionary lookup (
lookupKey, a chained hash) by batching incoming commands and interleaving (with prefetch) the table to dictEntry to robj chains for all their keys reduced time spent inlookupKeyby more than 80% and lifted overall throughput past 1.19M RPS. The engineering write-up is here:Valkey blog, "Unlock 1 Million RPS: Experience Triple the Speed with Valkey (part 2)": https://valkey.io/blog/unlock-one-million-rps-part2/
That is related production evidence that the pattern is effective on server-side dictionaries, not a claim that it transfers unchanged to the BCL. Valkey is C with a non-moving allocator; the .NET version has to deal with a moving GC and managed references (see Risks).
Measured results
Measured on .NET 10, 10M entries, AMD EPYC 7763 (one machine, x86-64). Each scenario compares the batch+prefetch path against an equally-shaped baseline, with bit-for-bit checksum parity between the two. Throughput is in millions per second (M/s): for the enumeration rows that means elements enumerated per second, and for the hash-find rows it means key lookups (operations) per second.
Notes on the baselines, so the numbers are not read as more than they are:
TryGetValueover the same set of keys, with the value-object prefetch issued symmetrically on both sides (so the measured gain is the bucket/entry chase strategy, not value-dereference prefetching). It is a fair scalar baseline, but it is not the strongest conceivable one: a hand-unrolled or grouped scalar loop can expose some memory parallelism by itself, so against such a baseline the win would be somewhat smaller than against a plain loop. The numbers above should be read as "batched-prefetched vs a straightforward scalar loop", not as a universal 2 to 3x for every hash-find workload.All numbers are reproducible from the repo (
--scenario <name> --size 10000000), with--validateasserting baseline/candidate count and checksum parity for every scenario. (Checksum parity is a sanity check on count and accumulated value, not a proof of exact element ordering; a runtime-quality test suite would compare exact contents and pin down any ordering guarantee.)When it does not help (measured)
This is the crux of why the proposal is opt-in batch APIs rather than a change to default behavior.
Dictionary<,>enumeration is a tight linear walk over a contiguous entries array; the hardware prefetcher already handles sequential access, so batching or lane-splitting it gives no benefit. This is why the enumeration scenarios above areImmutableDictionary<,>, whose AVL-tree walk is a pointer chase, and notDictionary<,>.TryGetValuegets no benefit and should not pay for a batch context.I am happy to break this into per-collection issues, contribute the implementation, or adapt the benchmark harness to whatever measurement methodology the team prefers.
API Proposal
The exact surface is the main open question (see Alternative Designs), so the following is a shape sketch rather than a finished signature set. It illustrates the two capabilities.
API Usage
Alternative Designs
Alternative approaches to the same goal (faster bulk operations on large, memory-bound dictionaries), both of which we tried in C in the past. We have not reproduced these two alternatives in .NET, so the disadvantages noted below are observations from the C experiments rather than verified .NET results:
Risks
ref/byref rather than a rawnint), so that no unmanaged object-address is held across a point where the object could move. This is the main implementation risk and ties into #106860 / #106863.System.Runtime.Intrinsics.X86prefetch. .NET does not currently expose a managed ARM64 prefetch intrinsic at all (the ARM64 ISA hasPRFM, but there is no .NET API for it yet; adding one is exactly what #106860 / #106863 propose). So on ARM today the only option is the lane/interleave path with no prefetch step. That restructuring already wins on its own; the prefetch instruction adds the final increment on top, and the implementation needs a clean no-op fallback on platforms where no prefetch intrinsic is available. ARM64 behavior, once such an intrinsic exists, should be benchmarked separately.Dictionary's existing concurrent-write detection (version check) and corrupted-chain guards, so misuse fails the same safe, detectable way it does today. It introduces no new sharing or shared mutable state of its own.