Drop max(1, ...) floor in issueOutboundFanout (#730) #730

Open
excelle08 wants to merge 17 commits into facebookresearch:v2-beta from excelle08:export-D105119225-to-v2-beta

Conversation

excelle08 (Contributor) commented Jul 2, 2026

Summary:

issueOutboundFanout previously computed the per-method call count as std::max(1, round(perSessionCounts[i] * scale)). At the default --rpc_fanout_scale=0.025, any method with perSessionCounts[i] < 20 (the long-tail outbound RPCs that production hits about once per 200 sessions) rounded to zero and was then floored to one call per session. That inflated slow-but-infrequent methods relative to their production frequency, and the inflated tail samples dominated the aggregate fanout latency distribution observed by issueOutboundFanout's collectAll.

Replace the floor with a clean drop: round to the nearest integer, then continue if the result is zero. Methods whose expected per-session count is below 0.5/scale (= 20 at default scale) are skipped entirely instead of being over-represented at one call per session.
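
The difference between the two rounding rules can be sketched as follows. This is a minimal illustration and not the actual `issueOutboundFanout` body; the function names `callsWithFloor`, `callsWithDrop`, and `totalCalls`, and any counts passed to them, are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// Old rule: every method is issued at least once per session.
inline long callsWithFloor(double perSessionCount, double scale) {
  return std::max(1L, std::lround(perSessionCount * scale));
}

// New rule: round to the nearest integer; zero means "skip this method".
inline long callsWithDrop(double perSessionCount, double scale) {
  assert(scale > 0.0);
  return std::lround(perSessionCount * scale);
}

// Total calls issued per session across all methods under the new rule.
inline long totalCalls(const std::vector<double>& counts, double scale) {
  long total = 0;
  for (double c : counts) {
    const long n = callsWithDrop(c, scale);
    if (n == 0) {
      continue;  // long-tail method: dropped instead of floored to 1
    }
    total += n;
  }
  return total;
}
```

With a scale of 0.025, a hypothetical count of 4 yields one call under the old rule and none under the new one, while a count of 200 yields five calls under both.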

Reviewed By: YifanYuan3

Differential Revision: D105119225

excelle08 added 7 commits July 2, 2026 13:08
Summary:
Add mock feature extraction pipeline to FeedSim with large-scale code
generation for I-cache and frontend pressure. 27 genuinely diverse code
patterns (derived from studying 696 production feature extractors) generate
~700 variants × 1000 copies = ~700K unique functions at install time.

Key components:
- 6 hand-written extractors based on production leaf function profiling
- 27 pattern-specific code generators (P01-P27) producing genuinely
  different instruction sequences (different branch topologies, loop
  nesting, data access patterns, code sizes from 10 to 2300 lines)
- Flat shuffled dispatch: all copy function pointers shuffled into one
  vector, iterated sequentially per request for maximum I-cache pressure
- DLRM medium/large model generation on-server during install
- Configurable via --num_stories, --extractors_per_story, --feature_complexity
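
The flat shuffled dispatch listed above can be illustrated with a small sketch. It assumes a uniform function-pointer signature and uses three stand-in functions; `ExtractorFn`, `copyA`/`copyB`/`copyC`, `buildDispatchTable`, and `runRequest` are hypothetical names, and the generated extractors in the diff are far more numerous and varied.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

using ExtractorFn = std::uint64_t (*)(std::uint64_t);

// Three stand-in "copies"; a real generator would emit many more.
inline std::uint64_t copyA(std::uint64_t x) { return x * 3 + 1; }
inline std::uint64_t copyB(std::uint64_t x) { return x ^ 0x9e3779b97f4a7c15ULL; }
inline std::uint64_t copyC(std::uint64_t x) { return (x << 7) | (x >> 57); }

// Put every copy's pointer into one vector and shuffle it once, so that
// consecutive calls jump between unrelated code addresses.
inline std::vector<ExtractorFn> buildDispatchTable(std::uint32_t seed) {
  std::vector<ExtractorFn> table = {copyA, copyB, copyC};
  std::mt19937 rng(seed);
  std::shuffle(table.begin(), table.end(), rng);
  return table;
}

// Walk the table sequentially for a fixed number of calls per request.
inline std::uint64_t runRequest(const std::vector<ExtractorFn>& table,
                                std::uint64_t input,
                                std::size_t callsPerRequest) {
  assert(!table.empty());
  std::uint64_t acc = 0;
  for (std::size_t i = 0; i < callsPerRequest; ++i) {
    acc += table[i % table.size()](input + i);
  }
  return acc;
}
```

Shuffling once at startup keeps the per-request loop cheap; the instruction-cache pressure comes from the table being much larger than the cache, which this three-entry sketch does not reproduce.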

Results on T1_BGM (Bergamo, 176 cores):
  500K calls/req: L1 I-Cache MPKI 21.34 (prod target 21), IPC 0.69 (prod 0.6-0.8)
  100K calls/req: Frontend Bound 23.5%, IPC 1.22, QPS 242

Results on T11_GRC_ARM (Grace, 72 cores):
  100K calls/req: IPC 0.52, L1 I-Cache MPKI 15.91
  Medium DLRM + 100K calls: IPC 1.03 (prod target 1.05)

Differential Revision: D97022149
Summary:
Replace scalar FP transforms with integer hash operations, add data-dependent
conditional branches, eliminate FP division with integer reciprocal approximation,
deepen MockHashTable::find() call chain from 1 to 5 levels, and increase basic
block sizes with MurmurHash-style computation chains.
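
One common way to replace a division with an integer reciprocal is a precomputed multiply-and-shift. The sketch below is an illustration of that general technique under stated assumptions, not the code from this diff: it restricts both operands to 16 bits, which makes the result exact, and the names `Reciprocal`, `makeReciprocal`, and `divide` are hypothetical.

```cpp
#include <cassert>
#include <cstdint>

// Precomputed reciprocal of a 16-bit divisor d: floor(2^32 / d) + 1.
struct Reciprocal {
  std::uint64_t magic;
};

inline Reciprocal makeReciprocal(std::uint16_t d) {
  assert(d != 0);
  return Reciprocal{(std::uint64_t{1} << 32) / d + 1};
}

// x / d computed as one multiply and one shift. The result is exact
// because x * d < 2^32 for all 16-bit x and d, which bounds the error
// introduced by rounding the reciprocal up.
inline std::uint32_t divide(std::uint16_t x, Reciprocal r) {
  return static_cast<std::uint32_t>((x * r.magic) >> 32);
}
```

The reciprocal is computed once per divisor, so the one remaining hardware divide moves out of the hot loop.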

Results (4 of the 6 instruction-mix targets listed below met on CPL):
- Scalar FP: 8% → 0.46% (target <3.5%) ✓
- Conditional branches: 5.28% → 10.73% (target >15%) ✗
- Near call/return: 3.93% → 0.75% (target <1.5%) ✓
- Memory (ld+st): 51.56% → 40.72% (target 40-46%) ✓
- Divider active: 11.50% → 1.13% (target <2%) ✓
- Avg BB size: 7.3 → 13.7 (LBR, target >18) ✗

QPS impact: CPL -2.2%, BGM -6.3%, Grace -3.8%.

Differential Revision: D99494831
Summary:
Replace oldisim framework internals (libevent server, pthreads, boost::lockfree) with
folly-based equivalents for LeafNodeRank and DriverNodeRank. ParentNodeRank retains
oldisim dependency.

New files:
- FeedSimProtocol.h: Wire protocol types (binary compatible with oldisim)
- FeedSimServer.{h,cc}: Server using folly::AsyncServerSocket + folly::EventBase
- FeedSimDriver.{h,cc}: Client driver with libevent for timer precision

Modified:
- LeafNodeRank.cc: Use feedsim::FeedSimServer, feedsim::RequestContext
- DriverNodeRank.cc: Use feedsim::FeedSimDriver, feedsim::TestDriver
- CMakeLists.txt: Add FeedSimFramework library, replace OLDISim link dep
- utils.h: Remove oldisim DIE() dependency
- run.sh: Change readiness check from HTTP monitor port to TCP data port

Differential Revision: D99498073
Summary:
Phase 3 of FeedSim v2 refactor. Client loads the Silesia compression corpus
(203MB, 12 files) via mmap, picks random snippets as "stories," and sends them
to the server via thrift. Server uses story content to derive data-dependent
feature extraction inputs and DLRM features instead of random data.

Changes:
- Add StoryContent/StoryBatch thrift types and story_batch field on RankingRequest
- Add SilesiaLoader.h: mmap-based corpus loader with random snippet serving
- Update DriverNodeRank to load Silesia at startup, populate stories per request
- Update LeafNodeRank to extract stories, derive features from content bytes
  (byte frequency histogram -> dense features, rolling bigram hash -> sparse)
- Rewrite DLRMRequestHandler from sync to async with folly futures
  (I/O simulation + compression + pointer chase, pipelined)
- Add storyContent/storyContentLength fields to CopyContext for extractors
- Fix feature_suite missing from ThreadData (lost during rebase)
- Fix $feature_opts not passed to LeafNodeRank in run.sh
- Fix runFeatureExtraction() never called from request handlers
- Fix ICacheBuster SIGSEGV: init moved outside PAGERANK block
- Remove ICacheBuster from DLRMRequestHandler (DLRM inference is own workload)
- Add --silesia-dir, --stories-per-request, --story-size-min/max CLI options
- Download Silesia corpus during install (x86 + aarch64)
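
The content-derived features mentioned in the LeafNodeRank item above (byte frequency histogram to dense features, rolling bigram hash to sparse features) might look roughly like this. The bucket count of 16, the FNV-style hash constants, and the names `denseFromBytes` and `sparseFromBytes` are assumptions made for illustration; the diff does not state them.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Dense features: normalized byte-frequency histogram with 16 buckets,
// one bucket per high nibble of the byte value.
inline std::array<float, 16> denseFromBytes(const std::string& story) {
  std::array<float, 16> hist{};
  if (story.empty()) {
    return hist;
  }
  for (unsigned char b : story) {
    hist[b >> 4] += 1.0f;
  }
  for (float& h : hist) {
    h /= static_cast<float>(story.size());
  }
  return hist;
}

// Sparse features: a rolling hash over consecutive byte pairs, reduced
// modulo the embedding table size to produce lookup indices.
inline std::vector<std::uint32_t> sparseFromBytes(const std::string& story,
                                                  std::uint32_t tableSize) {
  assert(tableSize > 0);
  std::vector<std::uint32_t> ids;
  if (story.size() < 2) {
    return ids;
  }
  ids.reserve(story.size() - 1);
  std::uint32_t h = 2166136261u;  // FNV-1a offset basis
  for (std::size_t i = 1; i < story.size(); ++i) {
    const std::uint32_t bigram =
        (static_cast<std::uint32_t>(static_cast<unsigned char>(story[i - 1])) << 8) |
        static_cast<unsigned char>(story[i]);
    h = (h ^ bigram) * 16777619u;  // FNV prime
    ids.push_back(h % tableSize);
  }
  return ids;
}
```

Because both functions are pure functions of the story bytes, the same snippet always produces the same features, which is what makes the server's work data-dependent rather than random.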

Differential Revision: D104076734
Summary:
FeedSim's profile shows RPC at 4-5% vs production's 30-34% — partly because the benchmark sends tiny requests with no resemblance to production's payload size distribution. This diff lets the client sample a target serialized request size from a JSON percentile distribution and pad the request to hit that size.

Changes:
- Add `optional binary padding` field to `RankingRequest` thrift struct
- New `RequestSizeSampler` (header-only) loads a JSON file of percentile data (`req_size_min`, `req_size_p05`, ... `req_size_max`) and samples target sizes via inverse-CDF with linear interpolation between percentiles
- Add `--req_size_dist <json>` flag to DriverNodeRank. When present, each request is built normally, then padded with Silesia bytes (or random bytes if Silesia not loaded) to reach the sampled target size
- Plumb `--req-size-dist` through `run.sh` with auto-detection of `feed_aggregator_req_sizes.json` next to `run.sh`
- Bundle `feed_aggregator_req_sizes.json` and `feed_aggregator_resp_sizes.json` (production data from ServiceRouter) and copy them in install scripts
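
Inverse-CDF sampling with linear interpolation between percentiles, as described for `RequestSizeSampler` above, can be sketched without the JSON loading. The class name `PercentileTable` and its interface are hypothetical, and the knots are supplied directly rather than parsed from a file.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <utility>
#include <vector>

// Knots are (cumulative probability, value) pairs sorted by probability,
// e.g. {0.0, min}, {0.05, p05}, ..., {1.0, max}.
class PercentileTable {
 public:
  explicit PercentileTable(std::vector<std::pair<double, double>> knots)
      : knots_(std::move(knots)) {
    assert(knots_.size() >= 2);
  }

  // Map u in [0, 1] to a value by interpolating between the two knots
  // that bracket u.
  double valueAt(double u) const {
    if (u <= knots_.front().first) {
      return knots_.front().second;
    }
    for (std::size_t i = 1; i < knots_.size(); ++i) {
      if (u <= knots_[i].first) {
        const auto& lo = knots_[i - 1];
        const auto& hi = knots_[i];
        const double t = (u - lo.first) / (hi.first - lo.first);
        return lo.second + t * (hi.second - lo.second);
      }
    }
    return knots_.back().second;
  }

  // Draw u uniformly and push it through the inverse CDF.
  template <class Rng>
  double sample(Rng& rng) const {
    std::uniform_real_distribution<double> uni(0.0, 1.0);
    return valueAt(uni(rng));
  }

 private:
  std::vector<std::pair<double, double>> knots_;
};
```

A caller would seed one `std::mt19937` per thread and call `sample(rng)` once per request to obtain the target size to pad to.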

Differential Revision: D102693799
…onse generator

Summary:
Several cleanups in LeafNodeRank, all motivated by the leaf-only hot-func breakdown, which surfaced ~15% of CPU on server-side response RNG and the misleadingly named dlrmInferenceServerSideDataGeneration:

1. Split DLRM inference into two functions, both async (return folly::Future<int>):
   - dlrmInferenceServerSide: inference path where features are generated inside DLRM::infer (server-side)
   - dlrmInferenceClientSide: inference path that uses DLRM::inferWithFeatures with client-provided dense+sparse features

   The old name dlrmInferenceServerSideDataGeneration hid the actual ML inference call (this_thread.dlrm_ranker->infer) inside a function whose name suggested it was just generating feature data. The new names are honest. A shared shardInferences() helper distributes work across cpu_threads_arg.

2. Rewrite DLRMRequestHandler to be fully async with a single future chain (DLRM inference -> I/O sleep -> compression -> pointer chase -> generate+send response). The previous code blocked synchronously on the inference future via .get() before starting the rest of the pipeline. Now the handler thread returns immediately after kicking off the chain.

3. Pick the right inference function based on what the client sent: if request.dlrm_features() is set, use dlrmInferenceClientSide (no server RNG for inputs); otherwise dlrmInferenceServerSide.

4. Remove the dead `else if (g_workload_type == DLRM)` branches from PageRankRequestHandler and AsyncPageRankRequestHandler. DLRM workload requests use kDLRMRequestType, which routes to DLRMRequestHandler (registered separately in main()), so the DLRM branch in PageRank handlers was unreachable.

5. New Silesia-backed server response generator (generators/SilesiaResponseGenerator.h). When the server is started with --silesia_dir, response RankingObjects/RankingStorys are populated by slicing bytes out of the mmap'd Silesia corpus instead of running xor128() RNG. Replaces ~15% of leaf CPU previously spent on RNG (mersenne_twister, generateRandomString, xor128) with cheap memcpy from a hot mmap. The bytes have realistic entropy for downstream ZSTD compression.

6. Added LeafNodeRank --silesia_dir option and plumbed through run.sh. The same --silesia-dir flag now provides bytes both to DriverNodeRank's story_batch and to the server's response generator.

Differential Revision: D103100125
…KER, GlobalCPUThread

Summary:
LeafNodeRank's existing thread pools were anonymous from Strobelight's point of view, so the prod-vs-bench thread-pool breakdown landed almost entirely in the `Unknown` / framework-noise bucket on CPL and BGM. Per the multifeed_aggregator prod profile (~/feedsim_v2/profiles/multifeed_aggregator_main_prod/), the four hot pools are `ThriftSrv.IO`, `SREventBase{N}`, `RANKER-{N}`, and `GlobalCPUThread` (see ~/feedsim_v2/docs/phase4_researcher_notes.md section 1).

This diff is the Programmer-A half of Phase 4 (thread pools only). It renames the four existing pools to match the prod names that Strobelight categorizes, and adds two new pools (`SREventBase`, `RANKER`) that Phase 5 will start dispatching outbound RPC fanout onto. Programmer-B's diff (thrift schema + 5 new method registrations) lands as a sibling commit; via() callsites stay on their existing pool aliases so this rename is a no-op behaviorally.

Changes:
- `cpuThreadPool` is now backed by `folly::getGlobalCPUExecutor()` (the folly singleton already exposes its threads as `GlobalCPUThread` via `NamedThreadFactory("GlobalCPUThreadPool")` in `folly/executors/GlobalExecutor.cpp:65`). DO NOT instantiate a second CPU pool — that would double-count the prod `GlobalCPUThread` category. The `ThreadData::cpuThreadPool` field type changed from `shared_ptr<CPUThreadPoolExecutor>` to `shared_ptr<folly::Executor>` so the same field can hold the global singleton via an aliasing shared_ptr that owns a `folly::Executor::KeepAlive<>`. All existing `folly::via(this_thread.cpuThreadPool.get(), ...)` callsites continue to compile because `folly::via` accepts an `Executor*`.
- `srvCPUThreadPool` is now `NamedThreadFactory("RANKER")`, sized `max(1, nproc/2)` by default (matches CPL prod: 26 RANKER threads on a 52-logical-core host). Tunable via the new `--ranker_threads` flag.
- `ioThreadPool` is now `NamedThreadFactory("ThriftSrv.IO")`, sized `--io_threads`. Was previously anonymous (the kernel just labeled the threads with the executable name).
- New `srEventBasePool` (`folly::IOThreadPoolExecutor`, `NamedThreadFactory("SREventBase")`), sized `max(1, nproc * 3 / 4)` (matches CPL prod: 39 SREventBase threads on a 52-logical-core host). Tunable via the new `--sr_event_base_threads` flag. Idle in Phase 4; Phase 5 wires the outbound mock_services fanout onto it. Threads are warmed up at server start so Strobelight sees them in steady state.
- The legacy `srvIOThreadPool` (compression dispatch) is preserved for now and retired in Phase 6 once compression callsites move onto `GlobalCPUThread` per the researcher notes.
- New CLI flags `--ranker_threads` and `--sr_event_base_threads` in `LeafNodeRankCmdline.ggo`. Default `0` means "auto-compute from `folly::available_concurrency()` per the formulas above".
- Includes added: `<folly/executors/GlobalExecutor.h>` and `<folly/system/HardwareConcurrency.h>`. Both are already on `${FOLLY_INCLUDE_DIR}` in the existing CMake target so no `CMakeLists.txt` edits were required.
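
The auto-sizing rules for the new flags can be sketched as plain integer arithmetic. The function names are hypothetical, and the three-quarters ratio for the SREventBase pool is an assumption taken from the 39-threads-on-52-cores figure quoted above.

```cpp
#include <algorithm>
#include <cassert>

// RANKER default: half the logical cores, never fewer than one thread.
inline unsigned defaultRankerThreads(unsigned nproc) {
  return std::max(1u, nproc / 2);
}

// SREventBase default: three quarters of the logical cores (assumed
// from the 39-of-52 prod ratio), never fewer than one thread.
inline unsigned defaultSrEventBaseThreads(unsigned nproc) {
  return std::max(1u, nproc * 3 / 4);
}

// A flag value of 0 means "auto-compute"; any other value wins.
inline unsigned resolveThreads(unsigned flagValue, unsigned autoValue) {
  return flagValue != 0 ? flagValue : std::max(1u, autoValue);
}
```

On a 52-logical-core host these defaults give 26 and 39 threads, and passing a nonzero flag value overrides either one.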

Differential Revision: D103766488
meta-cla Bot added the CLA Signed label Jul 2, 2026

meta-codesync Bot commented Jul 2, 2026

@excelle08 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D105119225.

excelle08 added 10 commits July 2, 2026 17:09
Summary:
Phase 5 of the FeedSim v2 refactor needs LeafNodeRank to issue real outbound RPCs against a separate Thrift server so Strobelight categorizes the resulting CPU samples into the same rpc-stack/serialization/transport buckets as production multifeed_aggregator. This diff adds the mock_services binary that stands in for the 20 outbound RPC types observed in the production profile.

The server is built on real apache::thrift::ThriftServer, not the FeedSimServer hand-rolled AsyncServerSocket loop, so loopback dispatch goes through Cpp2Worker, RocketServerConnection, RequestRpcMetadata, CompactProtocolWriter, etc. exactly as it would in prod. The 20 thrift methods all share the same wire signature `binary <method>(1: binary request, 2: i32 latency_us)` (the design from phase5_researcher_notes section 1, option (c)) and dispatch to a single shared handler body. Distinct method names exist purely so Strobelight per-method attribution lines up with prod.

Wire contract: caller writes a uint32 big-endian response_size in the first 4 bytes of `request` and then opaque padding sized to the request percentile. The server sleeps/spins for `latency_us` and replies with `response_size` bytes copied from the Silesia corpus. Short-tail latencies (<200us) burn the IO thread to keep rpc-stack samples on-CPU; longer latencies hop to the global timekeeper.
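
The request framing in this wire contract can be sketched as below. This is an illustration only: `buildRequestBody`, `parseResponseSize`, and `kHeaderBytes` are hypothetical names, and the padding is zero-filled here rather than taken from a corpus.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr std::size_t kHeaderBytes = 4;

// Client side: 4-byte big-endian response_size, then padding up to
// requestSize. The body is never shorter than the header itself.
inline std::string buildRequestBody(std::uint32_t responseSize,
                                    std::size_t requestSize) {
  std::string body(std::max(requestSize, kHeaderBytes), '\0');
  body[0] = static_cast<char>((responseSize >> 24) & 0xFF);
  body[1] = static_cast<char>((responseSize >> 16) & 0xFF);
  body[2] = static_cast<char>((responseSize >> 8) & 0xFF);
  body[3] = static_cast<char>(responseSize & 0xFF);
  return body;
}

// Server side: recover response_size from the header. Returns false if
// the body is too short to contain one.
inline bool parseResponseSize(const std::string& body, std::uint32_t& out) {
  if (body.size() < kHeaderBytes) {
    return false;
  }
  out = (static_cast<std::uint32_t>(static_cast<unsigned char>(body[0])) << 24) |
        (static_cast<std::uint32_t>(static_cast<unsigned char>(body[1])) << 16) |
        (static_cast<std::uint32_t>(static_cast<unsigned char>(body[2])) << 8) |
        static_cast<std::uint32_t>(static_cast<unsigned char>(body[3]));
  return true;
}
```

Writing the bytes explicitly, rather than copying a host-order integer, keeps the header identical on little-endian and big-endian hosts.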

Files added: MockService.thrift (IDL, 20 methods), MockServiceHandler.{h,cc} (single shared runSimulatedRpc body, 20 trivial wrappers behind a macro), MockServiceMain.cc (folly::Init + ThriftServer setup), BUCK (thrift_library + cpp_binary, with a -I flag pulling SilesiaLoader.h from the parent ranking/ dir since that dir has no BUCK file), CMakeLists.txt (open-source build path; mirrors the parent ranking/ pattern). Parent ranking/CMakeLists.txt picks up the new dir via add_subdirectory. The binary ships in the cea.chips.benchpress fbpkg automatically via the existing buck_filegroup glob over packages/feedsim/**.

Programmer B (sibling diff in this stack) wires LeafNodeRank's MockServiceAsyncClient and the issueOutboundFanout switch; Programmer C migrates compression to ManagedCompression. No file conflicts with this diff.

Differential Revision: D103766817
Summary:
Phase 4 programmer-B: introduce production-shaped multifeed aggregator thrift schema and dispatch IDs so the FeedSim leaf node can be exercised by per-method driver traffic in Phase 6. Builds on Phase 4-A (`11e9bc9a3431` — pool rename to ThriftSrv.IO/SREventBase/RANKER/GlobalCPUThread) and Phase 5-A (`31948cd0d579` — mock_services binary).

Three changes, all additive (no existing struct/handler is removed — Phase 6 deletes RankingRequest/RankingResponse and the legacy kPageRank/kDLRM type IDs):

1. `if/ranking.thrift` — five new request/response struct pairs sized to the p50 wire targets from `~/feedsim_v2/profiles/rpc_dist.json`:
   - CreateAndPrimeSessionRequest/Response (379 B / 44 B)
   - GetStoriesRequest/Response (2.13 MB / 171 KB) — also adds shared helpers GetStoriesResponseStats and RankedStoryInfo
   - GetAllStoriesRequest/Response (55 B / 1.47 MB)
   - StreamDataRequest/Response (58 KB / 4 B) plus StreamingUseCase enum
   - StreamIfrPriorityRankingRequest/Response (949 KB / 4 B)

   Each request struct mirrors prod field counts and types (including primitive vs container vs binary) per `~/feedsim_v2/docs/phase4_researcher_notes.md` section 3, so CompactProtocol serialization cost is realistic. Bulk wire size lives in named `binary` fields (e.g. `settings_compressed`, `serialized_payload`, `ifr_objects_serialized`) that the Phase 6 driver populates by sampling from the percentile table.

2. `RequestTypes.h` — five new uint32_t constants `0x10..0x14` for the new methods. Existing `kPageRankRequestType` (0x00) and `kDLRMRequestType` (0x01) stay so the in-flight stack keeps working.

3. `LeafNodeRank.cc` — five new shim handler functions and matching `registerQueryCallback` calls:
   - Heavy methods (`getStoriesUncompressed`, `getAllStories`) deserialize the new struct then route to the existing `DLRMRequestHandler`. Phase 4 CPU profile is unchanged for those.
   - Light methods (`createAndPrimeSession`, `streamData`, `streamIfrPriorityRanking`) deserialize, then send a small fixed-size response (44 B / 4 B / 4 B) without invoking `DLRMRequestHandler`. Production p50 latencies for these are 3-13 ms with 4-44 B responses, so attributing DLRM CPU to them in Phase 4 testing would distort the profile. Phase 6 replaces these shims with real per-method handlers (session bookkeeping, ack-only paths, IFR scoring).

Sizing methodology: targets are p50 wire sizes from `rpc_dist.json`. Computed sizes are CompactProtocol overhead (1 byte per short field tag, 2 bytes for tags >15, varint length + N bytes data for binary, ~1 byte stop) plus the binary field contents the driver supplies:

   | Method                        | Target p50 | Size source                                                          |
   |-------------------------------|-----------:|----------------------------------------------------------------------|
   | CreateAndPrimeSessionRequest  |      379 B | ~110 B field overhead + ~270 B `session_init_blob`                    |
   | CreateAndPrimeSessionResponse |       44 B | ~7 B field overhead + 32-char hex `session_id` (~36 B) + 4 B status   |
   | GetStoriesRequest             |   2.13 MB  | ~150 B fixed fields + 5 binary blobs (driver fills to ~2.07 MB total) |
   | GetStoriesResponse            |   171 KB   | ~50 B fixed + ~100 stories x ~1.5 KB story_payload + ~10 KB debug    |
   | GetAllStoriesRequest          |       55 B | 36 B session_id + 8 B query_id + ~10 B caller_id + ~6 B overhead     |
   | GetAllStoriesResponse         |   1.47 MB  | ~50 B fixed + ~500-1000 stories x ~1.5 KB + ~10 KB debug             |
   | StreamDataRequest             |    58 KB   | ~50 B fixed + driver-sampled `serialized_payload` (bimodal in prod)  |
   | StreamDataResponse            |        4 B | 1 B field header + 1 B i32 zigzag + 1 B stop = 3-4 B                 |
   | StreamIfrPriorityRankingReq   |   949 KB   | ~80 B fixed + driver-sampled `ifr_objects_serialized` etc.           |
   | StreamIfrPriorityRankingResp  |        4 B | identical encoding to StreamDataResponse                              |

Because the binary fields are sampled per-request from the percentile table (in Phase 6), every method can hit not just p50 but the entire prod distribution (p05/p25/p75/p95). The structs themselves carry no binary defaults.
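
The per-field byte counts used in the sizing table follow from CompactProtocol's zigzag and varint encodings. A rough size estimator, assuming every field uses the 1-byte short-form header, might look like this; the function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Zigzag maps small negative and positive integers to small unsigned
// values: 0 -> 0, -1 -> 1, 1 -> 2, -2 -> 3, ...
inline std::uint32_t zigzag32(std::int32_t n) {
  return (static_cast<std::uint32_t>(n) << 1) ^
         static_cast<std::uint32_t>(n >> 31);
}

// Number of bytes a value occupies as a base-128 varint.
inline std::size_t varintSize(std::uint64_t v) {
  std::size_t bytes = 1;
  while (v >= 0x80) {
    v >>= 7;
    ++bytes;
  }
  return bytes;
}

// i32 field: 1-byte short-form header plus the zigzag varint.
inline std::size_t i32FieldSize(std::int32_t value) {
  return 1 + varintSize(zigzag32(value));
}

// binary field: 1-byte header, varint length, then the payload bytes.
inline std::size_t binaryFieldSize(std::size_t payloadBytes) {
  return 1 + varintSize(payloadBytes) + payloadBytes;
}

// Struct holding a single i32 status field, plus the 1-byte stop marker.
inline std::size_t ackResponseSize(std::int32_t status) {
  return i32FieldSize(status) + 1;
}
```

Under these assumptions a status of 0 serializes to 3 bytes and a status of 1000 to 4 bytes, which is the 3-4 B range given for the ack-only responses.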

Generated `gen-cpp2/ranking_types.h` is regenerated by CMake at fbpkg-install time; the checked-in copy predates `RankingRequest`/`DLRMFeatures`/`StoryBatch` and is also missing those, confirming it is rebuilt out-of-tree.

Differential Revision: D103767023
Summary:
Migrate the three ZSTD compression callsites in `LeafNodeRank.cc` from raw `folly::compression::getCodec(CodecType::ZSTD)` to ManagedCompression, the documented Meta standard for application-level compression in fbcode (per `fbcode/.llms/rules/managed_compression.md` and the `managed_compression_integration` skill). ManagedCompression handles dictionary training, parameter tuning, and rollout via infrastructure rather than hard-coded codec choice/level.

Three callsites migrated, two categories:
- `compressPayload` (line ~399) — `leaf_random_string` category
- `decompressPayload` (line ~409) — `leaf_random_string` category (same as compressPayload, since both operate on the same pseudo-random payload bytes — required so ManagedCompression serves the right dictionary on decompress)
- `compressThrift` (line ~416) — `leaf_thrift_payload` category (serialized RankingResponse / CompactProtocol — distinct payload shape)

Following the canonical pattern from `common/managed_compression/examples/ManagedCompressionExample.cpp`:
- One process-wide `folly::Singleton<ManagedCompressionFactory>` keyed by oncall=`chips_dcperf` and project=`feedsim`. Constructing a factory per call is explicitly discouraged.
- `getCachedCodec(category)` per category, preferred over `getCodec()` for hot paths.
- 2 categories, both clearly distinct payload shapes (random bytes vs thrift CompactProtocol). Per the skill, category count is kept modest.

Open-source build path: the benchpress repo is open-sourced and ManagedCompression is internal-only, so the include and use sites are gated behind `#ifdef BENCHPRESS_INTERNAL`. When the gate is undefined (the current OSS / CMake build path that all install scripts use today), the original raw folly ZSTD code remains as the fallback so the OSS build still works. `LeafNodeRank.cc` has no Buck build target — it is built only by CMake at fbpkg-install time — so no Buck-side wiring is needed in this commit. A follow-up (Phase 6) can add `-DBENCHPRESS_INTERNAL=1` to the internal CMake invocation to flip the gate on.

Stack position: depends on `bebd655f6d` (Phase 4-B thrift structs). Sibling of Phase 5-A mock_services (`bd2517aa83`).

Differential Revision: D103768051
Summary:
Phase 5 closes the loop on the prod-shaped RPC stack: LeafNodeRank now issues real outbound Thrift RPCs to the mock_services server stood up in 5/1, replacing the synthetic `folly::futures::sleep(io_latency_ms)` callsites in the request handlers. This is the change that actually puts the RPC stack on-CPU during a request, which is what the multifeed_aggregator profile spends most of its time in.

# Generalize percentile sampling: PercentileSampler + RpcDistRegistry

`RequestSizeSampler` was hard-coded to one prefixed distribution per JSON file. We now need 60 distributions (20 outbound methods x {request_size, response_size, latency_us}). `PercentileSampler.h` is the generalized inverse-CDF sampler:
- `load(path, prefix)` keeps the legacy prefixed-keys shape used by DriverNodeRank's `--request_size_distribution` flag.
- `loadFromDynamic(obj)` accepts a bare `{min,p05,...,max}` object — the shape used by every per-method sub-object in `rpc_dist.json`.
- `sample(rng)` and `sampleI64(rng)` cover both size and latency distributions.

`RequestSizeSampler.h` is now a one-line `using` alias so DriverNodeRank keeps building unmodified.

`RpcDistRegistry.h` loads `rpc_dist.json` once at startup and exposes the 60 outbound samplers via either `MethodIdx` enum or string-keyed accessor. The `kPerSessionCounts` table is hard-coded from the researcher notes (not parsed from the JSON) so a missing or stale `rpc_dist.json` cannot silently change the fanout calibration. See ~/feedsim_v2/docs/phase5_researcher_notes.md §4 for the per-method numbers.

# Per-thread Thrift client: MockServicesClient

`MockServicesClient` wraps the generated `MockServiceAsyncClient` with a compile-time switch over `MethodIdx`. We deliberately use the named `semifuture_<method>()` calls rather than a single dynamic-name dispatch — preserving distinct `AsyncClient::send_<method>` symbols in Strobelight, which is the entire reason `MockService.thrift` declares 20 methods rather than one generic `call()`.

Thread-safety strategy: one client per LeafNodeRank worker thread, each pinned to one EventBase from the SREventBase pool (the outbound-EventBase pool added in Phase 4). `MockServicesClient`'s constructor and destructor both `runInEventBaseThreadAndWait` to keep the AsyncClient + RocketClientChannel on their owning EventBase thread.

Wire contract (matches what mock_services from 5/1 expects): the first 4 bytes of the request body are a big-endian `uint32_t response_size`. The server reads that header and sizes its response accordingly, so client and server stay in sync without an out-of-band agreement.

# Fanout integration in LeafNodeRank

Four new CLI flags:
- `--rpc_dist_path`: path to `rpc_dist.json`. Default empty — when unset, the legacy `folly::futures::sleep` path is preserved verbatim (regression-safety A/B comparison).
- `--mock_services_host` / `--mock_services_port`: target (defaults `127.0.0.1:21222`).
- `--rpc_fanout_scale`: scale factor on per-session counts. Default `0.025` yields ~94 RPCs per inbound session (vs ~3742 at scale=1.0); the table in §4 of the researcher notes documents the calibration tradeoff.

`issueOutboundFanout(td, scale)` iterates the 20 methods, computes `n = max(1, round(per_session_count * scale))` for each, samples request size / response size / latency from the registry, builds a request body (4-byte BE header + Silesia bytes if available, else zero-filled padding), and dispatches via the per-thread `MockServicesClient`. The Future<int> resolves once `folly::collectAll` of all per-call futures completes.

`simulateIoOrFanout(td, ...)` is the drop-in replacement for `folly::futures::sleep`. When `td.mock_client` is non-null (i.e. `--rpc_dist_path` was set), it fans out; otherwise it sleeps. The three callsites that get this treatment are the I/O simulation in `AsyncPageRankRequestHandler`, `DLRMRequestHandler`, and the legacy sync `PageRankRequestHandler`. The 1ms inter-stage breather around line 1735 is intentionally left as `folly::futures::sleep` per the researcher notes — that's not modeled I/O.

ThreadStartup populates `td.rpc_registry`, `td.rpc_silesia`, `td.mock_client`, and a per-thread `std::mt19937 rpc_rng`. If the connection to mock_services fails at startup, the code logs and falls back to the legacy sleep path for that thread rather than aborting (defensive — a handful of slow startups shouldn't take down the whole benchmark, and the warning will surface in install logs).

# Build wiring

CMake: `LeafNodeRank` now compiles `MockServicesClient.cc` and links `MockService-cpp2`. Added the corresponding `add_dependencies(LeafNodeRank MockService-cpp2-target)` so the thrift bindings are generated before LeafNodeRank starts compiling. `MockServicesClient.cc` includes `mock_services/gen-cpp2/MockServiceAsyncClient.h` directly; the existing `${CMAKE_CURRENT_SOURCE_DIR}` include path makes that visible.

# Test calibration

Summed over the 20 methods at scale=0.025, the fanout is exactly **94 calls per session** (matches the researcher table; verified independently by recomputing the ceil/round). Per-method breakdown is in researcher notes §4.

Differential Revision: D103772853
Summary:
Programmer-A scope of Phase 6 per `~/feedsim_v2/docs/phase6_researcher_notes.md` sections 1, 2, 3, and 6. Server-side handlers (§4) and legacy-code deletion (§5) are split into Phase 6/2 (Programmer-B) and Phase 6/3 (Programmer-C) commits.

Four changes:

1. `FeedSimDriver` SemiFuture API (`FeedSimDriver.h`, `FeedSimDriver.cc`). New `TestDriver::sendRequestAndAwait(type, payload, length)` returns a `folly::SemiFuture<std::string>` that fulfills with the response payload bytes when the matching `ResponsePacketHeader` arrives. Implementation: per-`TestDriver` `folly::F14FastMap<uint64_t, folly::Promise<std::string>> pending_promises` keyed by `request_id`; `next_request_id` promoted from a plain `uint64_t` to `std::atomic<uint64_t>` so callers from arbitrary threads can mint IDs without contention. `event_base_once` hops the actual write onto the libevent thread (the only thread that may touch the bufferevent state). `readCb` parses the response header, copies the payload out of the libevent buffer, looks up the `request_id` under `pending_mutex`, moves the promise out, drops the lock, and calls `setValue`. The legacy fire-and-forget `sendRequest` stays in place — Phase 6-C deletes it.

2. `RunSession` orchestration (`DriverNodeRank.cc`). New `RunSession(thread_id, driver, thread_data)` builds the 4-step session pipeline that mirrors prod multifeed_aggregator: createAndPrimeSession (await) -> getStoriesUncompressed (HOLD future) + streamData * N (parallel) + optional streamIfrPriorityRanking coin flip -> await all streamData -> await getStoriesUncompressed and record first-story latency -> getAllStories (await) -> done. Encoders `encodeCreateAndPrime`, `encodeGetStories`, `encodeStreamData`, `encodeStreamIfrPriority`, `encodeGetAllStories` populate the typed thrift structs from Phase 4-B with realistic field values, sample target wire sizes from `RpcDistRegistry::inboundRequestSize()`, and pad the dominant binary field with Silesia bytes (compression-realistic). The driver sets `query_id = (thread_id << 32) | session_counter++` so that the leaf's `query_id >> 32` shard derivation lands all four to six RPCs for one session on the same RANKER worker. The new `StartSessionLoop` callback dispatches `RunSession` on a dedicated `CPUThreadPoolExecutor("DriverSession", num_threads)` and chains the pacing-timer rearm (`TestDriver::scheduleNextSession`) onto the SemiFuture completion. Session mode is gated by `--rpc_dist_json`; without it, the legacy `MakeRequest` path remains for backward compat during the migration.

3. First-story latency histogram (`FeedSimDriver.h`, `FeedSimDriver.cc`). Second `LatencySampler` added to `DriverStats` (`first_story_sampler_`). `recordFirstStoryLatencyNs(uint64_t)` on `TestDriver` forwards into it; `recordSessionComplete()` increments a session counter. `printStats()` emits a new "Stats for first-story latency" block with `fs_count`, `fs_sessions`, and `fs_min/avg/50p/90p/95p/99p/99.9p` lines. The `fs_*` prefix keeps `search_qps.sh`'s per-percentile greps unambiguous.

4. `search_qps.sh` parsing (`packages/feedsim/third_party/src/scripts/search_qps.sh`). Per-response latency greps are anchored on `  ` (leading whitespace) so they don't accidentally match the new first-story block. Four new CSV columns appended (`fs_50p_ms, fs_90p_ms, fs_95p_ms, fs_99p_ms`). New `fs_<percentile>` latency-type targets supported alongside the existing `<percentile>` ones (use `-s fs_95p:500` to search against the prod-equivalent first-story SLA per researcher §6).
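
The request/response matching described in item 1 above (atomic ID minting, a mutex-protected pending map, and completion outside the lock) can be sketched with the standard library alone. Plain callbacks stand in for `folly::Promise` to keep the sketch dependency-free, and `PendingRequests` with its methods is a hypothetical name, not the diff's API.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <mutex>
#include <string>
#include <unordered_map>
#include <utility>

class PendingRequests {
 public:
  using Callback = std::function<void(std::string)>;

  // Mint a request ID (safe from any thread) and register its callback.
  std::uint64_t add(Callback cb) {
    const std::uint64_t id = next_id_.fetch_add(1, std::memory_order_relaxed);
    std::lock_guard<std::mutex> guard(mu_);
    pending_.emplace(id, std::move(cb));
    return id;
  }

  // Called when a response arrives: move the callback out under the
  // lock, release the lock, then invoke it with the payload.
  bool complete(std::uint64_t id, std::string payload) {
    Callback cb;
    {
      std::lock_guard<std::mutex> guard(mu_);
      auto it = pending_.find(id);
      if (it == pending_.end()) {
        return false;  // unknown or already-completed request
      }
      cb = std::move(it->second);
      pending_.erase(it);
    }
    cb(std::move(payload));
    return true;
  }

  std::size_t size() const {
    std::lock_guard<std::mutex> guard(mu_);
    return pending_.size();
  }

 private:
  std::atomic<std::uint64_t> next_id_{0};
  mutable std::mutex mu_;
  std::unordered_map<std::uint64_t, Callback> pending_;
};
```

Invoking the callback after the lock is released means a continuation that issues another request cannot deadlock on the same mutex.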

Driver-side `RpcDistRegistry` inbound exposure (`RpcDistRegistry.h`) — added `InboundIdx` enum (5 methods), `inboundMethodNames()`, `inboundRequestSize/Response/LatencyUs(InboundIdx)` accessors, parallel `inbound_req_/resp_/lat_` arrays, and updated `load()` to populate them from the `inbound` JSON section. Programmer-B independently introduced a near-identical enum with slightly different naming (`kCreateAndPrimeSession` vs `CREATE_AND_PRIME`); kept B's naming and consumed it from `DriverNodeRank.cc` so Programmer-C can keep a single canonical version when deduplicating.

CLI flags added (`DriverNodeRankCmdline.ggo`): `--rpc_dist_json` (path, gates session mode), `--streamdata_per_session` (int, default=2; 0 randomizes uniform[1,3]), `--stream_ifr_probability` (float, default=0.045 to match prod ratio per researcher §2).
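
The `query_id` packing used by `RunSession` (thread ID in the upper 32 bits, session counter in the lower 32) can be sketched as follows. The helper names and the 32-bit counter width are assumptions for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Pack the driver thread ID into the high 32 bits and the per-thread
// session counter into the low 32 bits.
inline std::int64_t makeQueryId(std::uint32_t threadId,
                                std::uint32_t sessionCounter) {
  const std::uint64_t packed =
      (static_cast<std::uint64_t>(threadId) << 32) | sessionCounter;
  return static_cast<std::int64_t>(packed);
}

// Leaf-side shard derivation: every RPC of one session carries the same
// query_id, so they all map to the same shard.
inline std::uint32_t shardOf(std::int64_t queryId) {
  return static_cast<std::uint32_t>(static_cast<std::uint64_t>(queryId) >> 32);
}

inline std::uint32_t sessionOf(std::int64_t queryId) {
  return static_cast<std::uint32_t>(static_cast<std::uint64_t>(queryId) &
                                    0xFFFFFFFFu);
}
```

Shifting on the unsigned representation avoids relying on the behavior of right-shifting a negative signed value when the thread ID has its top bit set.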

Differential Revision: D103795176
Summary:
Programmer-B scope of Phase 6 per `~/feedsim_v2/docs/phase6_researcher_notes.md` section 4. Replaces the Phase 4-B shim handlers in `LeafNodeRank.cc` with five real per-method handlers that mirror the production multifeed_aggregator pipeline (deserialize -> session lookup -> orchestrate on RANKER -> fan out work to GlobalCPUThread / SREventBase -> compress -> sendResponse). Programmer-A's parallel diff (D103795176) added the driver-side `RpcDistRegistry::InboundIdx` enum + accessors with the `kCreateAndPrimeSession` naming chosen here, so the leaf consumes them directly with no further `RpcDistRegistry` changes needed. Programmer-C will delete the legacy `DLRMRequestHandler` / `PageRankRequestHandler` / `AsyncPageRankRequestHandler` and the no-longer-needed thrift/CLI flags in Phase 6/3.

Five changes in `LeafNodeRank.cc`:

1. `SessionState` struct + per-`ThreadData` `folly::F14FastMap<int64_t, SessionState> sessions` map. The driver mints `query_id = (thread_id << 32) | session_counter` so all 4-6 inbound RPCs for one session land on the same RANKER worker — no lock needed, since each `ThreadData` is owned by a single dispatcher thread. `SessionState` carries `query_id`, `user_id`, `created_at_ns`, `session_id`, `mobile_app_version`, plus capped `stream_payloads` / `ifr_payloads` vectors so we pay the prod-equivalent memory cost without unbounded growth.

2. `CreateAndPrimeSessionRequestHandler` — synchronous on `ThriftSrv.IO`. Deserialize the typed request, mint a 32-char hex `session_id` via `makeSessionId(query_id, rng)`, insert into the per-thread `sessions` map, and return a 44 B `CreateAndPrimeSessionResponse`. No DLRM, no fanout, no compression. Latency naturally lands near the prod p50 of 3 ms (`rpc_dist.json`) from the deserialize + map insert; no artificial sleep.

3. `GetStoriesUncompressedRequestHandler` — async. Deserialize on `ThriftSrv.IO`, snapshot `mobile_app_version` into the session, then `folly::via(rankerPool)` to orchestrate. From the orchestrator we (a) `runFeatureExtraction(this_thread, story_contents)` on `GlobalCPUThread` via `folly::via(globalCpu)`, (b) `dlrmInferenceServerSide` on `GlobalCPUThread`, and (c) `issueOutboundFanout(this_thread, args.rpc_fanout_scale_arg)` on `SREventBase` (~94 RPCs at default scale=0.025). All three resolve via `folly::collectAll` before we build the response. New `generateGetStoriesResponse(query_id, num_stories, target_bytes, silesia, rng)` builds a `GetStoriesResponse` with ~100 `RankedStoryInfo`s padded with Silesia bytes so the serialized size hits the rpc_dist.json p50 of 171 KB (or the per-request sample when the inbound section is loaded). Compression runs on `GlobalCPUThread` via `serializeAndCompress`. The wire payload is the uncompressed serialized form (matching the "uncompressed" name of this method); the compressed bytes are computed for cost accounting only.

4. `GetAllStoriesRequestHandler` — async, mirrors handler 3 but with no DLRM or feature extraction (already paid by `getStoriesUncompressed` earlier in the session per researcher §4 row 3) and `issueOutboundFanout(this_thread, args.rpc_fanout_scale_arg * 0.5)` for half-scale fanout. New `generateGetAllStoriesResponse` produces ~900 `RankedStoryInfo`s targeting the rpc_dist.json p50 of 1.47 MB.

5. `StreamDataRequestHandler` — synchronous on `ThriftSrv.IO`. Deserialize, decompress `serialized_payload` (tolerates decompress failures by falling back to the raw bytes — both pay the size cost), append the decompressed bytes to `sessions[query_id].stream_payloads` (capped at 8 entries), and return a 4 B `StreamDataResponse{ack_code=0}`. `StreamIfrPriorityRankingRequestHandler` is async with an analogous shape: `folly::via(rankerPool)` -> parallel decompress on `GlobalCPUThread` + small fanout (`scale * 0.1`) on `SREventBase` -> `collectAll` -> stash bytes -> 4 B ack.
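A minimal sketch of the `query_id` layout from item 1, assuming the thread id and session counter each fit in 32 bits. `makeQueryId`, `threadIdOf`, `sessionCounterOf`, and `workerFor` are hypothetical names, and the routing rule (thread id modulo worker count) is an assumption; the description only states that all RPCs of a session reach the same RANKER worker.

```cpp
#include <cassert>
#include <cstdint>

// Assumed layout: high 32 bits = driver thread id, low 32 bits = per-thread
// session counter. Unsigned arithmetic avoids shifting into the sign bit.
inline int64_t makeQueryId(uint32_t threadId, uint32_t sessionCounter) {
  return static_cast<int64_t>(
      (static_cast<uint64_t>(threadId) << 32) | sessionCounter);
}

inline uint32_t threadIdOf(int64_t queryId) {
  return static_cast<uint32_t>(static_cast<uint64_t>(queryId) >> 32);
}

inline uint32_t sessionCounterOf(int64_t queryId) {
  return static_cast<uint32_t>(static_cast<uint64_t>(queryId) & 0xFFFFFFFFu);
}

// Hypothetical routing rule: any deterministic function of query_id keeps a
// session on one worker, because query_id is constant for the whole session.
inline uint32_t workerFor(int64_t queryId, uint32_t numWorkers) {
  assert(numWorkers > 0);
  return threadIdOf(queryId) % numWorkers;
}
```

Because the worker is a pure function of `query_id`, the per-`ThreadData` `sessions` map is only ever touched by one thread, which is what makes the lock-free design in item 1 safe.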

Thread-pool routing for handler 3 (`GetStoriesUncompressedRequestHandler`) mirrors `~/feedsim_v2/docs/phase6_researcher_notes.md` §4 (and the `multifeed_aggregator` strobelight breakdown):
```
ThriftSrv.IO   --[deserialize, lookup session]-->
RANKER         --[orchestrate]-->
                 ├── GlobalCPUThread [runFeatureExtraction]
                 ├── GlobalCPUThread [DLRM inference]
                 └── SREventBase     [issueOutboundFanout (~94 RPCs)]
                 (await collectAll)
RANKER         --[response struct, serialize]-->
GlobalCPUThread --[compress]-->
ThriftSrv.IO   --[sendResponse]
```

Helper additions (anonymous namespace): `sendThriftResponse<T>(context, response)` (serialize+coalesce+sendResponse), `makeSessionId(query_id, rng)` (32-char hex), `inboundResponseSizeOrDefault(td, idx, fallback)` (samples `RpcDistRegistry::inboundResponseSize(idx)` when loaded, falls back to the prod p50 otherwise — works in OSS builds without an `rpc_dist.json`), `generateGetStoriesResponse` / `generateGetAllStoriesResponse` (build a typed response with Silesia-backed `story_payload` blobs sized to hit a target serialized byte count), and `serializeAndCompress<T>(resp)` (Thrift CompactSerializer + ManagedCompression ZSTD path migrated in Phase 5-C).
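One way a 32-character hex id could be produced is sketched below. This is not the diff's implementation: only the `makeSessionId(query_id, rng)` call shape and the 32-char hex output come from the description. The choice to encode `query_id` in the first 16 characters and one RNG draw in the last 16, and the `std::mt19937_64` RNG type, are assumptions.

```cpp
#include <cstdint>
#include <random>
#include <string>

// Sketch: 16 hex chars from query_id followed by 16 hex chars of randomness.
inline std::string makeSessionId(int64_t queryId, std::mt19937_64& rng) {
  static constexpr char kHex[] = "0123456789abcdef";
  const uint64_t halves[2] = {static_cast<uint64_t>(queryId), rng()};
  std::string out;
  out.reserve(32);
  for (uint64_t half : halves) {
    // Emit the 64-bit value most-significant nibble first.
    for (int shift = 60; shift >= 0; shift -= 4) {
      out.push_back(kHex[(half >> shift) & 0xF]);
    }
  }
  return out;
}
```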

Deviation from spec: the `RpcDistRegistry::InboundIdx` enum + `inboundRequestSize` / `inboundResponseSize` / `inboundLatencyUs` accessors that the spec asks Programmer-B to add are already present in the head (D103795176) — Programmer-A added them and explicitly kept the `kCreateAndPrimeSession` naming convention specified for B. No additional `RpcDistRegistry` changes are made by this diff; the leaf consumes the existing accessors from D103795176 directly.

Legacy handlers (`DLRMRequestHandler`, `PageRankRequestHandler`, `AsyncPageRankRequestHandler`) and their `kPageRankRequestType` / `kDLRMRequestType` registrations are left in place per spec — Programmer-C deletes them in Phase 6/3.

Differential Revision: D103796241
Summary:
Adds a small thread-safe LatencyHistogram (log2 buckets, atomic counters) and instruments four hot paths so we can see whether issueOutboundFanout is actually parallelizing across srEventBasePool and how closely the mock_services-side per-request delay matches the requested latency:

1. LeafNodeRank::issueOutboundFanout — wall time from "start of fanout dispatch" to "all per-RPC futures resolved". If parallelism is healthy, this should be ≈ max(per-RPC dispatch latency).
2. LeafNodeRank per-dispatch round-trip — wall time of each td.mock_client->dispatchByEnum call from issue to its .thenValue/thenError continuation firing.
3. LeafNodeRank sampled latency — what we picked from rpc_dist.json's latency_us percentile sampler before passing to the RPC.
4. MockServiceHandler::runSimulatedRpc — both the requested latency_us (sampled by the leaf) and the actual elapsed wall time inside the handler, including spin/sleep + response generation.

Each binary's main() spawns a background thread that dumps all histograms to stderr every 10 seconds with avg / p50 / p95 / p99 (bucket upper-bound, log2) / max. A final dump runs when server.run() / serve() returns, so the last state is visible even if the periodic dump was mid-sleep.
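For readers unfamiliar with the shape of such a histogram, here is a minimal sketch assuming 65 power-of-two buckets over microsecond values. The class layout and the method names `record`, `count`, and `percentileUpperBound` are illustrative assumptions, not the code in this diff; avg and max tracking are omitted for brevity.

```cpp
#include <array>
#include <atomic>
#include <cmath>
#include <cstddef>
#include <cstdint>

class LatencyHistogram {
 public:
  // Bucket 0 holds the value 0; bucket i (i >= 1) holds [2^(i-1), 2^i).
  void record(uint64_t micros) {
    buckets_[bucketIndex(micros)].fetch_add(1, std::memory_order_relaxed);
  }

  uint64_t count() const {
    uint64_t total = 0;
    for (const auto& b : buckets_) {
      total += b.load(std::memory_order_relaxed);
    }
    return total;
  }

  // Upper bound of the bucket holding the p-quantile (0 < p <= 1);
  // returns 0 when nothing has been recorded.
  uint64_t percentileUpperBound(double p) const {
    const uint64_t total = count();
    if (total == 0) {
      return 0;
    }
    uint64_t target = static_cast<uint64_t>(std::ceil(p * total));
    if (target < 1) target = 1;
    if (target > total) target = total;
    uint64_t seen = 0;
    for (size_t i = 0; i < kBuckets; ++i) {
      seen += buckets_[i].load(std::memory_order_relaxed);
      if (seen >= target) {
        return upperBound(i);
      }
    }
    return upperBound(kBuckets - 1);
  }

 private:
  static constexpr size_t kBuckets = 65;

  // Number of significant bits in v (0 for v == 0), i.e. the bucket index.
  static size_t bucketIndex(uint64_t v) {
    size_t i = 0;
    while (v != 0) {
      v >>= 1;
      ++i;
    }
    return i;
  }

  static uint64_t upperBound(size_t i) {
    return i >= 64 ? UINT64_MAX : (uint64_t{1} << i);
  }

  std::array<std::atomic<uint64_t>, kBuckets> buckets_{};
};
```

Relaxed atomics suffice here because each counter is independent and the dump only needs an approximate snapshot. The trade-off is that reported percentiles are bucket upper bounds, so a value can be overstated by up to 2x.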

Differential Revision: D105119227
Summary:
The debug histograms added in the previous commit show that rpc_dist.json contains very long-tail per-call latencies (p99 ~8s, max ~28s on the BGM measurement). With the leaf fanning out N RPCs per request and waiting for collectAll to resolve, the slowest call dominates the fanout latency and tanks throughput even at the lowest reasonable fanout scale (39x slowdown observed at scale=0.001 vs no-mock_services baseline).

This adds three tunable knobs to mock_services that let us shape the simulated delay so the head-of-line blocking goes away while preserving the per-method latency MIX (which is what we want for uArch realism):

* --latency_cap_us (default 200000 = 200ms): clamp the requested per-call latency before any spin/sleep. The 200ms default is roughly 2x the 95p of the rpc_dist.json sampler, so the median and the long body of the distribution are unchanged but the multi-second tail is removed.
* --latency_offset_us (default 0): subtract a fixed amount from the requested latency to compensate for intrinsic RPC-stack overhead the client already pays for (the leaf's dispatch_per_rpc histogram shows a ~130ms p50 vs sampled p50 of 1ms, suggesting ~100ms of per-RPC stack overhead). Defaults to 0 so the shaping is purely a cap until we tune empirically.
* --latency_skip_threshold_us (default 100): skip spin and sleep entirely when the post-cap, post-offset latency falls below this value. Avoids paying for spin-wait jitter on requests whose modeled budget is already comparable to the natural RPC round trip.
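The three knobs compose in the order the bullets describe: cap, then offset, then skip threshold. A minimal sketch of that arithmetic follows, assuming the offset result is clamped at zero (the description does not say what happens when the offset exceeds the capped latency). `LatencyShaping` and `effectiveLatencyUs` are hypothetical names.

```cpp
#include <algorithm>
#include <cstdint>

// Defaults mirror the flag defaults listed above.
struct LatencyShaping {
  int64_t capUs = 200000;         // --latency_cap_us
  int64_t offsetUs = 0;           // --latency_offset_us
  int64_t skipThresholdUs = 100;  // --latency_skip_threshold_us
};

// Returns the delay to simulate, or 0 when spin/sleep should be skipped.
inline int64_t effectiveLatencyUs(int64_t requestedUs, const LatencyShaping& s) {
  int64_t shaped = std::min(requestedUs, s.capUs);      // 1. clamp the tail
  shaped = std::max<int64_t>(0, shaped - s.offsetUs);   // 2. subtract overhead
  return shaped < s.skipThresholdUs ? 0 : shaped;       // 3. skip tiny delays
}
```

With the defaults, a 28 s sample from the tail becomes 200 ms, a 1 ms sample is unchanged, and a 50 µs sample is skipped entirely.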

Wired through run-feedsim-multi.sh as MOCK_LATENCY_CAP_US / MOCK_LATENCY_OFFSET_US / MOCK_LATENCY_SKIP_THRESHOLD_US env vars so the operator can sweep the knobs without rebuilding.

Also adds g_handler_effective_us histogram so the dump shows the post-shaping latency alongside requested and actual.

Differential Revision: D105119224
Summary:
The original rpc_dist.json (sampled from prod multifeed_aggregator) only carried percentiles up to p99 plus max. With per-method max latencies as high as 28 seconds (mock_handler_actual measurements), linear interpolation between p99 and max severely overstates the latency in the 99th-99.99th percentile band: for streamData, p99=6.8ms but max=26.5s, so a uniform sample at p=0.995 lands on ~13s instead of the actual ~10ms territory. The result is that the leaf-side fanout (issueOutboundFanout collectAll) ends up dominated by these inflated tail samples on a sizable fraction of requests, which is what the v1 mock_services experiments observed (fanout_total p50 ≤ 16.7s on BGM).

This commit:
* Replaces rpc_dist.json with rpc_dist_v2.json, which carries explicit p99_9 and p99_99 buckets for every distribution. The narrower interpolation bands give a much more faithful reproduction of the prod tail without being dominated by single-digit outlier samples.
* Adds "p99_9" (0.9990) and "p99_99" (0.9999) to PercentileSampler::kPercentiles() so the loader picks them up. Order is preserved ascending so std::lower_bound continues to work without re-sorting.
* Updates the doc comment to reflect the new keys and explain the rationale.
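The effect of the extra knots can be sketched with a small piecewise-linear interpolator. This illustrates the idea rather than reproducing `PercentileSampler`: `Knots` and `sampleAt` are hypothetical names, and apart from the streamData p99 (6.8 ms) and max (26.5 s) quoted above, every latency value used with it below is made up for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// (percentile, latency_us) pairs, sorted ascending by percentile.
using Knots = std::vector<std::pair<double, double>>;

// Linear interpolation between the two knots that bracket p.
inline double sampleAt(const Knots& knots, double p) {
  assert(!knots.empty());
  auto hi = std::lower_bound(
      knots.begin(), knots.end(), p,
      [](const std::pair<double, double>& k, double q) { return k.first < q; });
  if (hi == knots.begin()) {
    return knots.front().second;  // below the first knot
  }
  if (hi == knots.end()) {
    return knots.back().second;  // above the last knot
  }
  auto lo = hi - 1;
  const double t = (p - lo->first) / (hi->first - lo->first);
  return lo->second + t * (hi->second - lo->second);
}
```

With only `{0.99, 6800}` and `{1.0, 26500000}` bracketing the tail, `sampleAt(knots, 0.995)` lands halfway between them at roughly 13.25 s. Adding knots at 0.999 and 0.9999 confines that same draw to the p99..p99_9 band, so it stays close to the p99 value unless the sample truly falls in the last 0.01%.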

Differential Revision: D105119228
Summary:
Pull Request resolved: facebookresearch#730

issueOutboundFanout previously rounded the per-method call count via std::max(1, round(perSessionCounts[i] * scale)). At the default --rpc_fanout_scale=0.025, methods with perSessionCounts() < 20 (the long-tail outbound RPCs production hits about once per 200 sessions) ended up appearing once per session. That inflated the per-method ratio of slow-but-infrequent methods relative to their production frequency, and the inflated tail samples dominated the aggregate fanout latency distribution observed by issueOutboundFanout's collectAll.

Replace the floor with a clean drop: round to the nearest integer, then continue if the result is zero. Methods whose expected per-session count is below 0.5/scale (= 20 at default scale) are skipped entirely instead of being over-represented at one call per session.
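The before/after rounding can be sketched as below. The function names `callsWithFloor`, `callsWithDrop`, and `totalFanoutCalls` are hypothetical, and the surrounding loop in `issueOutboundFanout` is only approximated; the arithmetic follows the description above.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Old behavior: every method is issued at least once per session.
inline long callsWithFloor(double perSessionCount, double scale) {
  return std::max(1L, std::lround(perSessionCount * scale));
}

// New behavior: round to nearest; a result of 0 means "skip this method".
inline long callsWithDrop(double perSessionCount, double scale) {
  return std::lround(perSessionCount * scale);
}

// Approximation of the per-session loop: methods that round to zero are
// skipped with `continue` instead of being issued once.
inline long totalFanoutCalls(const std::vector<double>& perSessionCounts,
                             double scale) {
  long total = 0;
  for (double count : perSessionCounts) {
    const long calls = callsWithDrop(count, scale);
    if (calls == 0) {
      continue;
    }
    total += calls;
  }
  return total;
}
```

At scale 0.025, a method with a per-session count of 10 scales to 0.25: the old code issued it once per session (4x its intended rate), while the new code drops it.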

Reviewed By: YifanYuan3

Differential Revision: D105119225
@excelle08 excelle08 force-pushed the export-D105119225-to-v2-beta branch from 9b3b7f2 to 7f67cf2 Compare July 3, 2026 00:36
@meta-codesync meta-codesync Bot changed the title Drop max(1, ...) floor in issueOutboundFanout Drop max(1, ...) floor in issueOutboundFanout (#730) Jul 3, 2026
Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported
