Skip to content

Wire LeafNodeRank to mock_services for outbound RPC fanout (#723)#723

Closed
excelle08 wants to merge 11 commits into
facebookresearch:v2-betafrom
excelle08:export-D103772853-to-v2-beta
Closed

Wire LeafNodeRank to mock_services for outbound RPC fanout (#723)#723
excelle08 wants to merge 11 commits into
facebookresearch:v2-betafrom
excelle08:export-D103772853-to-v2-beta

Conversation

@excelle08

@excelle08 excelle08 commented Jul 2, 2026

Copy link
Copy Markdown
Contributor

Summary:

Phase 5 closes the loop on the prod-shaped RPC stack: LeafNodeRank now issues real outbound Thrift RPCs to the mock_services server stood up in 5/1, replacing the synthetic folly::futures::sleep(io_latency_ms) callsites in the request handlers. This is the change that actually puts the RPC stack on-CPU during a request, which is what the multifeed_aggregator profile spends most of its time in.

Generalize percentile sampling: PercentileSampler + RpcDistRegistry

RequestSizeSampler was hard-coded to one prefixed distribution per JSON file. We now need 60 distributions (20 outbound methods x {request_size, response_size, latency_us}). PercentileSampler.h is the generalized inverse-CDF sampler:

  • load(path, prefix) keeps the legacy prefixed-keys shape used by DriverNodeRank's --request_size_distribution flag.
  • loadFromDynamic(obj) accepts a bare {min,p05,...,max} object — the shape used by every per-method sub-object in rpc_dist.json.
  • sample(rng) and sampleI64(rng) cover both size and latency distributions.

RequestSizeSampler.h is now a one-line using alias so DriverNodeRank keeps building unmodified.

RpcDistRegistry.h loads rpc_dist.json once at startup and exposes the 60 outbound samplers via either MethodIdx enum or string-keyed accessor. The kPerSessionCounts table is hard-coded from the researcher notes (not parsed from the JSON) so a missing or stale rpc_dist.json cannot silently change the fanout calibration. See ~/feedsim_v2/docs/phase5_researcher_notes.md §4 for the per-method numbers.

Per-thread Thrift client: MockServicesClient

MockServicesClient wraps the generated MockServiceAsyncClient with a compile-time switch over MethodIdx. We deliberately use the named semifuture_<method>() calls rather than a single dynamic-name dispatch — preserving distinct AsyncClient::send_<method> symbols in Strobelight, which is the entire reason MockService.thrift declares 20 methods rather than one generic call().

Thread-safety strategy: one client per LeafNodeRank worker thread, each pinned to one EventBase from the SREventBase pool (the outbound-EventBase pool added in Phase 4). MockServicesClient's constructor and destructor both runInEventBaseThreadAndWait to keep the AsyncClient + RocketClientChannel on their owning EventBase thread.

Wire contract (matches what mock_services from 5/1 expects): the first 4 bytes of the request body are a big-endian uint32_t response_size. The server reads that header and sizes its response accordingly, so client and server stay in sync without an out-of-band agreement.

Fanout integration in LeafNodeRank

Three new CLI flags:

  • --rpc_dist_path: path to rpc_dist.json. Default empty — when unset, the legacy folly::futures::sleep path is preserved verbatim (regression-safety A/B comparison).
  • --mock_services_host / --mock_services_port: target (defaults 127.0.0.1:21222).
  • --rpc_fanout_scale: scale factor on per-session counts. Default 0.025 yields ~94 RPCs per inbound session (vs ~3742 at scale=1.0); the table in §4 of the researcher notes documents the calibration tradeoff.

issueOutboundFanout(td, scale) iterates the 20 methods, computes n = max(1, round(per_session_count * scale)) for each, samples request size / response size / latency from the registry, builds a request body (4-byte BE header + Silesia bytes if available, else zero-filled padding), and dispatches via the per-thread MockServicesClient. The Future resolves once folly::collectAll of all per-call futures completes.

simulateIoOrFanout(td, ...) is the drop-in replacement for folly::futures::sleep. When td.mock_client is non-null (i.e. --rpc_dist_path was set), it fans out; otherwise it sleeps. The three callsites that get this treatment are the I/O simulation in AsyncPageRankRequestHandler, DLRMRequestHandler, and the legacy sync PageRankRequestHandler. The 1ms inter-stage breather around line 1735 is intentionally left as folly::futures::sleep per the researcher notes — that's not modeled I/O.

ThreadStartup populates td.rpc_registry, td.rpc_silesia, td.mock_client, and a per-thread std::mt19937 rpc_rng. If the connection to mock_services fails at startup, the code logs and falls back to the legacy sleep path for that thread rather than aborting (defensive — a handful of slow startups shouldn't take down the whole benchmark, and the warning will surface in install logs).

Build wiring

CMake: LeafNodeRank now compiles MockServicesClient.cc and links MockService-cpp2. Added the corresponding add_dependencies(LeafNodeRank MockService-cpp2-target) so the thrift bindings are generated before LeafNodeRank starts compiling. MockServicesClient.cc includes mock_services/gen-cpp2/MockServiceAsyncClient.h directly; the existing ${CMAKE_CURRENT_SOURCE_DIR} include path makes that visible.

Test calibration

20 methods × scale=0.025 = exactly 94 calls per session (matches the researcher table — verified independently by recomputing the ceil/round). Per-method breakdown is in researcher notes §4.

Reviewed By: charles-typ

Differential Revision: D103772853

@meta-cla meta-cla Bot added the CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. label Jul 2, 2026
@meta-codesync

meta-codesync Bot commented Jul 2, 2026

Copy link
Copy Markdown

@excelle08 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D103772853.

excelle08 added a commit to excelle08/DCPerf-1 that referenced this pull request Jul 3, 2026
…esearch#723)

Summary:
Pull Request resolved: facebookresearch#723

Phase 5 closes the loop on the prod-shaped RPC stack: LeafNodeRank now issues real outbound Thrift RPCs to the mock_services server stood up in 5/1, replacing the synthetic `folly::futures::sleep(io_latency_ms)` callsites in the request handlers. This is the change that actually puts the RPC stack on-CPU during a request, which is what the multifeed_aggregator profile spends most of its time in.

# Generalize percentile sampling: PercentileSampler + RpcDistRegistry

`RequestSizeSampler` was hard-coded to one prefixed distribution per JSON file. We now need 60 distributions (20 outbound methods x {request_size, response_size, latency_us}). `PercentileSampler.h` is the generalized inverse-CDF sampler:
- `load(path, prefix)` keeps the legacy prefixed-keys shape used by DriverNodeRank's `--request_size_distribution` flag.
- `loadFromDynamic(obj)` accepts a bare `{min,p05,...,max}` object — the shape used by every per-method sub-object in `rpc_dist.json`.
- `sample(rng)` and `sampleI64(rng)` cover both size and latency distributions.

`RequestSizeSampler.h` is now a one-line `using` alias so DriverNodeRank keeps building unmodified.

`RpcDistRegistry.h` loads `rpc_dist.json` once at startup and exposes the 60 outbound samplers via either `MethodIdx` enum or string-keyed accessor. The `kPerSessionCounts` table is hard-coded from the researcher notes (not parsed from the JSON) so a missing or stale `rpc_dist.json` cannot silently change the fanout calibration. See ~/feedsim_v2/docs/phase5_researcher_notes.md §4 for the per-method numbers.

# Per-thread Thrift client: MockServicesClient

`MockServicesClient` wraps the generated `MockServiceAsyncClient` with a compile-time switch over `MethodIdx`. We deliberately use the named `semifuture_<method>()` calls rather than a single dynamic-name dispatch — preserving distinct `AsyncClient::send_<method>` symbols in Strobelight, which is the entire reason `MockService.thrift` declares 20 methods rather than one generic `call()`.

Thread-safety strategy: one client per LeafNodeRank worker thread, each pinned to one EventBase from the SREventBase pool (the outbound-EventBase pool added in Phase 4). `MockServicesClient`'s constructor and destructor both `runInEventBaseThreadAndWait` to keep the AsyncClient + RocketClientChannel on their owning EventBase thread.

Wire contract (matches what mock_services from 5/1 expects): the first 4 bytes of the request body are a big-endian `uint32_t response_size`. The server reads that header and sizes its response accordingly, so client and server stay in sync without an out-of-band agreement.

# Fanout integration in LeafNodeRank

Three new CLI flags:
- `--rpc_dist_path`: path to `rpc_dist.json`. Default empty — when unset, the legacy `folly::futures::sleep` path is preserved verbatim (regression-safety A/B comparison).
- `--mock_services_host` / `--mock_services_port`: target (defaults `127.0.0.1:21222`).
- `--rpc_fanout_scale`: scale factor on per-session counts. Default `0.025` yields ~94 RPCs per inbound session (vs ~3742 at scale=1.0); the table in §4 of the researcher notes documents the calibration tradeoff.

`issueOutboundFanout(td, scale)` iterates the 20 methods, computes `n = max(1, round(per_session_count * scale))` for each, samples request size / response size / latency from the registry, builds a request body (4-byte BE header + Silesia bytes if available, else zero-filled padding), and dispatches via the per-thread `MockServicesClient`. The Future<int> resolves once `folly::collectAll` of all per-call futures completes.

`simulateIoOrFanout(td, ...)` is the drop-in replacement for `folly::futures::sleep`. When `td.mock_client` is non-null (i.e. `--rpc_dist_path` was set), it fans out; otherwise it sleeps. The three callsites that get this treatment are the I/O simulation in `AsyncPageRankRequestHandler`, `DLRMRequestHandler`, and the legacy sync `PageRankRequestHandler`. The 1ms inter-stage breather around line 1735 is intentionally left as `folly::futures::sleep` per the researcher notes — that's not modeled I/O.

ThreadStartup populates `td.rpc_registry`, `td.rpc_silesia`, `td.mock_client`, and a per-thread `std::mt19937 rpc_rng`. If the connection to mock_services fails at startup, the code logs and falls back to the legacy sleep path for that thread rather than aborting (defensive — a handful of slow startups shouldn't take down the whole benchmark, and the warning will surface in install logs).

# Build wiring

CMake: `LeafNodeRank` now compiles `MockServicesClient.cc` and links `MockService-cpp2`. Added the corresponding `add_dependencies(LeafNodeRank MockService-cpp2-target)` so the thrift bindings are generated before LeafNodeRank starts compiling. `MockServicesClient.cc` includes `mock_services/gen-cpp2/MockServiceAsyncClient.h` directly; the existing `${CMAKE_CURRENT_SOURCE_DIR}` include path makes that visible.

# Test calibration

20 methods × scale=0.025 = exactly **94 calls per session** (matches the researcher table — verified independently by recomputing the ceil/round). Per-method breakdown is in researcher notes §4.

Reviewed By: charles-typ

Differential Revision: D103772853
@excelle08 excelle08 force-pushed the export-D103772853-to-v2-beta branch from ceab9eb to 8679e4f Compare July 3, 2026 00:36
@meta-codesync meta-codesync Bot changed the title Wire LeafNodeRank to mock_services for outbound RPC fanout Wire LeafNodeRank to mock_services for outbound RPC fanout (#723) Jul 3, 2026
excelle08 added 11 commits July 3, 2026 11:13
Summary:
Add mock feature extraction pipeline to FeedSim with large-scale code
generation for I-cache and frontend pressure. 27 genuinely diverse code
patterns (derived from studying 696 production feature extractors) generate
~700 variants × 1000 copies = ~700K unique functions at install time.

Key components:
- 6 hand-written extractors based on production leaf function profiling
- 27 pattern-specific code generators (P01-P27) producing genuinely
  different instruction sequences (different branch topologies, loop
  nesting, data access patterns, code sizes from 10 to 2300 lines)
- Flat shuffled dispatch: all copy function pointers shuffled into one
  vector, iterated sequentially per request for maximum I-cache pressure
- DLRM medium/large model generation on-server during install
- Configurable via --num_stories, --extractors_per_story, --feature_complexity

Results on T1_BGM (Bergamo, 176 cores):
  500K calls/req: L1 I-Cache MPKI 21.34 (prod target 21), IPC 0.69 (prod 0.6-0.8)
  100K calls/req: Frontend Bound 23.5%, IPC 1.22, QPS 242

Results on T11_GRC_ARM (Grace, 72 cores):
  100K calls/req: IPC 0.52, L1 I-Cache MPKI 15.91
  Medium DLRM + 100K calls: IPC 1.03 (prod target 1.05)

Differential Revision: D97022149
Summary:
Replace scalar FP transforms with integer hash operations, add data-dependent
conditional branches, eliminate FP division with integer reciprocal approximation,
deepen MockHashTable::find() call chain from 1 to 5 levels, and increase basic
block sizes with MurmurHash-style computation chains.

Results (5/7 instruction mix targets met on CPL):
- Scalar FP: 8% → 0.46% (target <3.5%) ✓
- Conditional branches: 5.28% → 10.73% (target >15%) ✗
- Near call/return: 3.93% → 0.75% (target <1.5%) ✓
- Memory (ld+st): 51.56% → 40.72% (target 40-46%) ✓
- Divider active: 11.50% → 1.13% (target <2%) ✓
- Avg BB size: 7.3 → 13.7 (LBR, target >18) ✗

QPS impact: CPL -2.2%, BGM -6.3%, Grace -3.8%.

Differential Revision: D99494831
Summary:
Replace oldisim framework internals (libevent server, pthreads, boost::lockfree) with
folly-based equivalents for LeafNodeRank and DriverNodeRank. ParentNodeRank retains
oldisim dependency.

New files:
- FeedSimProtocol.h: Wire protocol types (binary compatible with oldisim)
- FeedSimServer.{h,cc}: Server using folly::AsyncServerSocket + folly::EventBase
- FeedSimDriver.{h,cc}: Client driver with libevent for timer precision

Modified:
- LeafNodeRank.cc: Use feedsim::FeedSimServer, feedsim::RequestContext
- DriverNodeRank.cc: Use feedsim::FeedSimDriver, feedsim::TestDriver
- CMakeLists.txt: Add FeedSimFramework library, replace OLDISim link dep
- utils.h: Remove oldisim DIE() dependency
- run.sh: Change readiness check from HTTP monitor port to TCP data port

Differential Revision: D99498073
Summary:
Phase 3 of FeedSim v2 refactor. Client loads the Silesia compression corpus
(203MB, 12 files) via mmap, picks random snippets as "stories," and sends them
to the server via thrift. Server uses story content to derive data-dependent
feature extraction inputs and DLRM features instead of random data.

Changes:
- Add StoryContent/StoryBatch thrift types and story_batch field on RankingRequest
- Add SilesiaLoader.h: mmap-based corpus loader with random snippet serving
- Update DriverNodeRank to load Silesia at startup, populate stories per request
- Update LeafNodeRank to extract stories, derive features from content bytes
  (byte frequency histogram -> dense features, rolling bigram hash -> sparse)
- Rewrite DLRMRequestHandler from sync to async with folly futures
  (I/O simulation + compression + pointer chase, pipelined)
- Add storyContent/storyContentLength fields to CopyContext for extractors
- Fix feature_suite missing from ThreadData (lost during rebase)
- Fix $feature_opts not passed to LeafNodeRank in run.sh
- Fix runFeatureExtraction() never called from request handlers
- Fix ICacheBuster SIGSEGV: init moved outside PAGERANK block
- Remove ICacheBuster from DLRMRequestHandler (DLRM inference is own workload)
- Add --silesia-dir, --stories-per-request, --story-size-min/max CLI options
- Download Silesia corpus during install (x86 + aarch64)

Differential Revision: D104076734
Summary:
FeedSim's profile shows RPC at 4-5% vs production's 30-34% — partly because the benchmark sends tiny requests with no resemblance to production's payload size distribution. This diff lets the client sample a target serialized request size from a JSON percentile distribution and pad the request to hit that size.

Changes:
- Add `optional binary padding` field to `RankingRequest` thrift struct
- New `RequestSizeSampler` (header-only) loads a JSON file of percentile data (`req_size_min`, `req_size_p05`, ... `req_size_max`) and samples target sizes via inverse-CDF with linear interpolation between percentiles
- Add `--req_size_dist <json>` flag to DriverNodeRank. When present, each request is built normally, then padded with Silesia bytes (or random bytes if Silesia not loaded) to reach the sampled target size
- Plumb `--req-size-dist` through `run.sh` with auto-detection of `feed_aggregator_req_sizes.json` next to `run.sh`
- Bundle `feed_aggregator_req_sizes.json` and `feed_aggregator_resp_sizes.json` (production data from ServiceRouter) and copy them in install scripts

Differential Revision: D102693799
…onse generator

Summary:
Several cleanups in LeafNodeRank, all motivated by the leaf-only hot-func breakdown which surfaced ~15% of CPU on server-side response RNG and the misleading-named dlrmInferenceServerSideDataGeneration:

1. Split DLRM inference into two functions, both async (return folly::Future<int>):
   - dlrmInferenceServerSide: inference path where features are generated inside DLRM::infer (server-side)
   - dlrmInferenceClientSide: inference path that uses DLRM::inferWithFeatures with client-provided dense+sparse features

   The old name dlrmInferenceServerSideDataGeneration hid the actual ML inference call (this_thread.dlrm_ranker->infer) inside a function whose name suggested it was just generating feature data. The new names are honest. A shared shardInferences() helper distributes work across cpu_threads_arg.

2. Rewrite DLRMRequestHandler to be fully async with a single future chain (DLRM inference -> I/O sleep -> compression -> pointer chase -> generate+send response). The previous code blocked synchronously on the inference future via .get() before starting the rest of the pipeline. Now the handler thread returns immediately after kicking off the chain.

3. Pick the right inference function based on what the client sent: if request.dlrm_features() is set, use dlrmInferenceClientSide (no server RNG for inputs); otherwise dlrmInferenceServerSide.

4. Remove the dead `else if (g_workload_type == DLRM)` branches from PageRankRequestHandler and AsyncPageRankRequestHandler. DLRM workload requests use kDLRMRequestType, which routes to DLRMRequestHandler (registered separately in main()), so the DLRM branch in PageRank handlers was unreachable.

5. New Silesia-backed server response generator (generators/SilesiaResponseGenerator.h). When the server is started with --silesia_dir, response RankingObjects/RankingStorys are populated by slicing bytes out of the mmap'd Silesia corpus instead of running xor128() RNG. Replaces ~15% of leaf CPU previously spent on RNG (mersenne_twister, generateRandomString, xor128) with cheap memcpy from a hot mmap. The bytes have realistic entropy for downstream ZSTD compression.

6. Added LeafNodeRank --silesia_dir option and plumbed through run.sh. The same --silesia-dir flag now provides bytes both to DriverNodeRank's story_batch and to the server's response generator.

Differential Revision: D103100125
…KER, GlobalCPUThread

Summary:
LeafNodeRank's existing thread pools were anonymous from Strobelight's point of view, so the prod-vs-bench thread-pool breakdown landed almost entirely in the `Unknown` / framework-noise bucket on CPL and BGM. Per the multifeed_aggregator prod profile (~/feedsim_v2/profiles/multifeed_aggregator_main_prod/), the four hot pools are `ThriftSrv.IO`, `SREventBase{N}`, `RANKER-{N}`, and `GlobalCPUThread` (see ~/feedsim_v2/docs/phase4_researcher_notes.md section 1).

This diff is the Programmer-A half of Phase 4 (thread pools only). It renames the four existing pools to match the prod names that Strobelight categorizes, and adds two new pools (`SREventBase`, `RANKER`) that Phase 5 will start dispatching outbound RPC fanout onto. Programmer-B's diff (thrift schema + 5 new method registrations) lands as a sibling commit; via() callsites stay on their existing pool aliases so this rename is a no-op behaviorally.

Changes:
- `cpuThreadPool` is now backed by `folly::getGlobalCPUExecutor()` (the folly singleton already exposes its threads as `GlobalCPUThread` via `NamedThreadFactory("GlobalCPUThreadPool")` in `folly/executors/GlobalExecutor.cpp:65`). DO NOT instantiate a second CPU pool — that would double-count the prod `GlobalCPUThread` category. The `ThreadData::cpuThreadPool` field type changed from `shared_ptr<CPUThreadPoolExecutor>` to `shared_ptr<folly::Executor>` so the same field can hold the global singleton via an aliasing shared_ptr that owns a `folly::Executor::KeepAlive<>`. All existing `folly::via(this_thread.cpuThreadPool.get(), ...)` callsites continue to compile because `folly::via` accepts an `Executor*`.
- `srvCPUThreadPool` is now `NamedThreadFactory("RANKER")`, sized `max(1, nproc/2)` by default (matches CPL prod: 26 RANKER threads on a 52-logical-core host). Tunable via the new `--ranker_threads` flag.
- `ioThreadPool` is now `NamedThreadFactory("ThriftSrv.IO")`, sized `--io_threads`. Was previously anonymous (the kernel just labeled the threads with the executable name).
- New `srEventBasePool` (`folly::IOThreadPoolExecutor`, `NamedThreadFactory("SREventBase")`), sized `max(1, nproc * 7 / 10)` (matches CPL prod: 39 SREventBase threads on a 52-logical-core host). Tunable via the new `--sr_event_base_threads` flag. Idle in Phase 4 — Phase 5 wires the outbound mock_services fanout onto it. Threads are warmed up at server start so Strobelight sees them in steady state.
- The legacy `srvIOThreadPool` (compression dispatch) is preserved for now and retired in Phase 6 once compression callsites move onto `GlobalCPUThread` per the researcher notes.
- New CLI flags `--ranker_threads` and `--sr_event_base_threads` in `LeafNodeRankCmdline.ggo`. Default `0` means "auto-compute from `folly::available_concurrency()` per the formulas above".
- Includes added: `<folly/executors/GlobalExecutor.h>` and `<folly/system/HardwareConcurrency.h>`. Both are already on `${FOLLY_INCLUDE_DIR}` in the existing CMake target so no `CMakeLists.txt` edits were required.

Differential Revision: D103766488
Summary:
Phase 5 of the FeedSim v2 refactor needs LeafNodeRank to issue real outbound RPCs against a separate Thrift server so Strobelight categorizes the resulting CPU samples into the same rpc-stack/serialization/transport buckets as production multifeed_aggregator. This diff adds the mock_services binary that stands in for the 20 outbound RPC types observed in the production profile.

The server is built on real apache::thrift::ThriftServer, not the FeedSimServer hand-rolled AsyncServerSocket loop, so loopback dispatch goes through Cpp2Worker, RocketServerConnection, RequestRpcMetadata, CompactProtocolWriter, etc. exactly as it would in prod. The 20 thrift methods all share the same wire signature `binary <method>(1: binary request, 2: i32 latency_us)` (the design from phase5_researcher_notes section 1, option (c)) and dispatch to a single shared handler body. Distinct method names exist purely so Strobelight per-method attribution lines up with prod.

Wire contract: caller writes a uint32 big-endian response_size in the first 4 bytes of `request` and then opaque padding sized to the request percentile. The server sleeps/spins for `latency_us` and replies with `response_size` bytes copied from the Silesia corpus. Short-tail latencies (<200us) burn the IO thread to keep rpc-stack samples on-CPU; longer latencies hop to the global timekeeper.

Files added: MockService.thrift (IDL, 20 methods), MockServiceHandler.{h,cc} (single shared runSimulatedRpc body, 20 trivial wrappers behind a macro), MockServiceMain.cc (folly::Init + ThriftServer setup), BUCK (thrift_library + cpp_binary, with a -I flag pulling SilesiaLoader.h from the parent ranking/ dir since that dir has no BUCK file), CMakeLists.txt (open-source build path; mirrors the parent ranking/ pattern). Parent ranking/CMakeLists.txt picks up the new dir via add_subdirectory. The binary ships in the cea.chips.benchpress fbpkg automatically via the existing buck_filegroup glob over packages/feedsim/**.

Programmer B (sibling diff in this stack) wires LeafNodeRank's MockServiceAsyncClient and the issueOutboundFanout switch; Programmer C migrates compression to ManagedCompression. No file conflicts with this diff.

Differential Revision: D103766817
Summary:
Phase 4 programmer-B: introduce production-shaped multifeed aggregator thrift schema and dispatch IDs so the FeedSim leaf node can be exercised by per-method driver traffic in Phase 6. Builds on Phase 4-A (`11e9bc9a3431` — pool rename to ThriftSrv.IO/SREventBase/RANKER/GlobalCPUThread) and Phase 5-A (`31948cd0d579` — mock_services binary).

Three changes, all additive (no existing struct/handler is removed — Phase 6 deletes RankingRequest/RankingResponse and the legacy kPageRank/kDLRM type IDs):

1. `if/ranking.thrift` — five new request/response struct pairs sized to the p50 wire targets from `~/feedsim_v2/profiles/rpc_dist.json`:
   - CreateAndPrimeSessionRequest/Response (379 B / 44 B)
   - GetStoriesRequest/Response (2.13 MB / 171 KB) — also adds shared helpers GetStoriesResponseStats and RankedStoryInfo
   - GetAllStoriesRequest/Response (55 B / 1.47 MB)
   - StreamDataRequest/Response (58 KB / 4 B) plus StreamingUseCase enum
   - StreamIfrPriorityRankingRequest/Response (949 KB / 4 B)

   Each request struct mirrors prod field counts and types (including primitive vs container vs binary) per `~/feedsim_v2/docs/phase4_researcher_notes.md` section 3, so CompactProtocol serialization cost is realistic. Bulk wire size lives in named `binary` fields (e.g. `settings_compressed`, `serialized_payload`, `ifr_objects_serialized`) that the Phase 6 driver populates by sampling from the percentile table.

2. `RequestTypes.h` — five new uint32_t constants `0x10..0x14` for the new methods. Existing `kPageRankRequestType` (0x00) and `kDLRMRequestType` (0x01) stay so the in-flight stack keeps working.

3. `LeafNodeRank.cc` — five new shim handler functions and matching `registerQueryCallback` calls:
   - Heavy methods (`getStoriesUncompressed`, `getAllStories`) deserialize the new struct then route to the existing `DLRMRequestHandler`. Phase 4 CPU profile is unchanged for those.
   - Light methods (`createAndPrimeSession`, `streamData`, `streamIfrPriorityRanking`) deserialize, then send a small fixed-size response (44 B / 4 B / 4 B) without invoking `DLRMRequestHandler`. Production p50 latencies for these are 3-13 ms with 4-44 B responses, so attributing DLRM CPU to them in Phase 4 testing would distort the profile. Phase 6 replaces these shims with real per-method handlers (session bookkeeping, ack-only paths, IFR scoring).

Sizing methodology: targets are p50 wire sizes from `rpc_dist.json`. Computed sizes are CompactProtocol overhead (1 byte per short field tag, 2 bytes for tags >15, varint length + N bytes data for binary, ~1 byte stop) plus the binary field contents the driver supplies:

   | Method                        | Target p50 | Size source                                                          |
   |-------------------------------|-----------:|----------------------------------------------------------------------|
   | CreateAndPrimeSessionRequest  |      379 B | ~110 B field overhead + ~270 B `session_init_blob`                    |
   | CreateAndPrimeSessionResponse |       44 B | ~7 B field overhead + 32-char hex `session_id` (~36 B) + 4 B status   |
   | GetStoriesRequest             |   2.13 MB  | ~150 B fixed fields + 5 binary blobs (driver fills to ~2.07 MB total) |
   | GetStoriesResponse            |   171 KB   | ~50 B fixed + ~100 stories x ~1.5 KB story_payload + ~10 KB debug    |
   | GetAllStoriesRequest          |       55 B | 36 B session_id + 8 B query_id + ~10 B caller_id + ~6 B overhead     |
   | GetAllStoriesResponse         |   1.47 MB  | ~50 B fixed + ~500-1000 stories x ~1.5 KB + ~10 KB debug             |
   | StreamDataRequest             |    58 KB   | ~50 B fixed + driver-sampled `serialized_payload` (bimodal in prod)  |
   | StreamDataResponse            |        4 B | 1 B field header + 1 B i32 zigzag + 1 B stop = 3-4 B                 |
   | StreamIfrPriorityRankingReq   |   949 KB   | ~80 B fixed + driver-sampled `ifr_objects_serialized` etc.           |
   | StreamIfrPriorityRankingResp  |        4 B | identical encoding to StreamDataResponse                              |

Because the binary fields are sampled per-request from the percentile table (in Phase 6), every method can hit not just p50 but the entire prod distribution (p05/p25/p75/p95). The structs themselves carry no binary defaults.

Generated `gen-cpp2/ranking_types.h` is regenerated by CMake at fbpkg-install time; the checked-in copy predates `RankingRequest`/`DLRMFeatures`/`StoryBatch` and is also missing those, confirming it is rebuilt out-of-tree.

Differential Revision: D103767023
Summary:
Migrate the three ZSTD compression callsites in `LeafNodeRank.cc` from raw `folly::compression::getCodec(CodecType::ZSTD)` to ManagedCompression, the documented Meta standard for application-level compression in fbcode (per `fbcode/.llms/rules/managed_compression.md` and the `managed_compression_integration` skill). ManagedCompression handles dictionary training, parameter tuning, and rollout via infrastructure rather than hard-coded codec choice/level.

Three callsites migrated, two categories:
- `compressPayload` (line ~399) — `leaf_random_string` category
- `decompressPayload` (line ~409) — `leaf_random_string` category (same as compressPayload, since both operate on the same pseudo-random payload bytes — required so ManagedCompression serves the right dictionary on decompress)
- `compressThrift` (line ~416) — `leaf_thrift_payload` category (serialized RankingResponse / CompactProtocol — distinct payload shape)

Following the canonical pattern from `common/managed_compression/examples/ManagedCompressionExample.cpp`:
- One process-wide `folly::Singleton<ManagedCompressionFactory>` keyed by oncall=`chips_dcperf` and project=`feedsim`. Constructing a factory per call is explicitly discouraged.
- `getCachedCodec(category)` per category, preferred over `getCodec()` for hot paths.
- 2 categories, both clearly distinct payload shapes (random bytes vs thrift CompactProtocol). Per the skill, category count is kept modest.

Open-source build path: the benchpress repo is open-sourced and ManagedCompression is internal-only, so the include and use sites are gated behind `#ifdef BENCHPRESS_INTERNAL`. When the gate is undefined (the current OSS / CMake build path that all install scripts use today), the original raw folly ZSTD code remains as the fallback so the OSS build still works. `LeafNodeRank.cc` has no Buck build target — it is built only by CMake at fbpkg-install time — so no Buck-side wiring is needed in this commit. A follow-up (Phase 6) can add `-DBENCHPRESS_INTERNAL=1` to the internal CMake invocation to flip the gate on.

Stack position: depends on `bebd655f6d` (Phase 4-B thrift structs). Sibling of Phase 5-A mock_services (`bd2517aa83`).

Differential Revision: D103768051
…esearch#723)

Summary:
Pull Request resolved: facebookresearch#723

Phase 5 closes the loop on the prod-shaped RPC stack: LeafNodeRank now issues real outbound Thrift RPCs to the mock_services server stood up in 5/1, replacing the synthetic `folly::futures::sleep(io_latency_ms)` callsites in the request handlers. This is the change that actually puts the RPC stack on-CPU during a request, which is what the multifeed_aggregator profile spends most of its time in.

# Generalize percentile sampling: PercentileSampler + RpcDistRegistry

`RequestSizeSampler` was hard-coded to one prefixed distribution per JSON file. We now need 60 distributions (20 outbound methods x {request_size, response_size, latency_us}). `PercentileSampler.h` is the generalized inverse-CDF sampler:
- `load(path, prefix)` keeps the legacy prefixed-keys shape used by DriverNodeRank's `--request_size_distribution` flag.
- `loadFromDynamic(obj)` accepts a bare `{min,p05,...,max}` object — the shape used by every per-method sub-object in `rpc_dist.json`.
- `sample(rng)` and `sampleI64(rng)` cover both size and latency distributions.

`RequestSizeSampler.h` is now a one-line `using` alias so DriverNodeRank keeps building unmodified.

`RpcDistRegistry.h` loads `rpc_dist.json` once at startup and exposes the 60 outbound samplers via either `MethodIdx` enum or string-keyed accessor. The `kPerSessionCounts` table is hard-coded from the researcher notes (not parsed from the JSON) so a missing or stale `rpc_dist.json` cannot silently change the fanout calibration. See ~/feedsim_v2/docs/phase5_researcher_notes.md §4 for the per-method numbers.

# Per-thread Thrift client: MockServicesClient

`MockServicesClient` wraps the generated `MockServiceAsyncClient` with a compile-time switch over `MethodIdx`. We deliberately use the named `semifuture_<method>()` calls rather than a single dynamic-name dispatch — preserving distinct `AsyncClient::send_<method>` symbols in Strobelight, which is the entire reason `MockService.thrift` declares 20 methods rather than one generic `call()`.

Thread-safety strategy: one client per LeafNodeRank worker thread, each pinned to one EventBase from the SREventBase pool (the outbound-EventBase pool added in Phase 4). `MockServicesClient`'s constructor and destructor both `runInEventBaseThreadAndWait` to keep the AsyncClient + RocketClientChannel on their owning EventBase thread.

Wire contract (matches what mock_services from 5/1 expects): the first 4 bytes of the request body are a big-endian `uint32_t response_size`. The server reads that header and sizes its response accordingly, so client and server stay in sync without an out-of-band agreement.

# Fanout integration in LeafNodeRank

Three new CLI flags:
- `--rpc_dist_path`: path to `rpc_dist.json`. Default empty — when unset, the legacy `folly::futures::sleep` path is preserved verbatim (regression-safety A/B comparison).
- `--mock_services_host` / `--mock_services_port`: target (defaults `127.0.0.1:21222`).
- `--rpc_fanout_scale`: scale factor on per-session counts. Default `0.025` yields ~94 RPCs per inbound session (vs ~3742 at scale=1.0); the table in §4 of the researcher notes documents the calibration tradeoff.

`issueOutboundFanout(td, scale)` iterates the 20 methods, computes `n = max(1, round(per_session_count * scale))` for each, samples request size / response size / latency from the registry, builds a request body (4-byte BE header + Silesia bytes if available, else zero-filled padding), and dispatches via the per-thread `MockServicesClient`. The Future<int> resolves once `folly::collectAll` of all per-call futures completes.

`simulateIoOrFanout(td, ...)` is the drop-in replacement for `folly::futures::sleep`. When `td.mock_client` is non-null (i.e. `--rpc_dist_path` was set), it fans out; otherwise it sleeps. The three callsites that get this treatment are the I/O simulation in `AsyncPageRankRequestHandler`, `DLRMRequestHandler`, and the legacy sync `PageRankRequestHandler`. The 1ms inter-stage breather around line 1735 is intentionally left as `folly::futures::sleep` per the researcher notes — that's not modeled I/O.

ThreadStartup populates `td.rpc_registry`, `td.rpc_silesia`, `td.mock_client`, and a per-thread `std::mt19937 rpc_rng`. If the connection to mock_services fails at startup, the code logs and falls back to the legacy sleep path for that thread rather than aborting (defensive — a handful of slow startups shouldn't take down the whole benchmark, and the warning will surface in install logs).

# Build wiring

CMake: `LeafNodeRank` now compiles `MockServicesClient.cc` and links `MockService-cpp2`. Added the corresponding `add_dependencies(LeafNodeRank MockService-cpp2-target)` so the thrift bindings are generated before LeafNodeRank starts compiling. `MockServicesClient.cc` includes `mock_services/gen-cpp2/MockServiceAsyncClient.h` directly; the existing `${CMAKE_CURRENT_SOURCE_DIR}` include path makes that visible.

# Test calibration

20 methods × scale=0.025 = exactly **94 calls per session** (matches the researcher table — verified independently by recomputing the ceil/round). Per-method breakdown is in researcher notes §4.

Reviewed By: charles-typ

Differential Revision: D103772853
@excelle08 excelle08 force-pushed the export-D103772853-to-v2-beta branch from 8679e4f to 8d494bd Compare July 3, 2026 18:20
meta-codesync Bot pushed a commit that referenced this pull request Jul 4, 2026
Summary:
Pull Request resolved: #723

Phase 5 closes the loop on the prod-shaped RPC stack: LeafNodeRank now issues real outbound Thrift RPCs to the mock_services server stood up in 5/1, replacing the synthetic `folly::futures::sleep(io_latency_ms)` callsites in the request handlers. This is the change that actually puts the RPC stack on-CPU during a request, which is what the multifeed_aggregator profile spends most of its time in.

# Generalize percentile sampling: PercentileSampler + RpcDistRegistry

`RequestSizeSampler` was hard-coded to one prefixed distribution per JSON file. We now need 60 distributions (20 outbound methods x {request_size, response_size, latency_us}). `PercentileSampler.h` is the generalized inverse-CDF sampler:
- `load(path, prefix)` keeps the legacy prefixed-keys shape used by DriverNodeRank's `--request_size_distribution` flag.
- `loadFromDynamic(obj)` accepts a bare `{min,p05,...,max}` object — the shape used by every per-method sub-object in `rpc_dist.json`.
- `sample(rng)` and `sampleI64(rng)` cover both size and latency distributions.

`RequestSizeSampler.h` is now a one-line `using` alias so DriverNodeRank keeps building unmodified.

`RpcDistRegistry.h` loads `rpc_dist.json` once at startup and exposes the 60 outbound samplers via either `MethodIdx` enum or string-keyed accessor. The `kPerSessionCounts` table is hard-coded from the researcher notes (not parsed from the JSON) so a missing or stale `rpc_dist.json` cannot silently change the fanout calibration. See ~/feedsim_v2/docs/phase5_researcher_notes.md §4 for the per-method numbers.

# Per-thread Thrift client: MockServicesClient

`MockServicesClient` wraps the generated `MockServiceAsyncClient` with a compile-time switch over `MethodIdx`. We deliberately use the named `semifuture_<method>()` calls rather than a single dynamic-name dispatch — preserving distinct `AsyncClient::send_<method>` symbols in Strobelight, which is the entire reason `MockService.thrift` declares 20 methods rather than one generic `call()`.

Thread-safety strategy: one client per LeafNodeRank worker thread, each pinned to one EventBase from the SREventBase pool (the outbound-EventBase pool added in Phase 4). `MockServicesClient`'s constructor and destructor both `runInEventBaseThreadAndWait` to keep the AsyncClient + RocketClientChannel on their owning EventBase thread.

Wire contract (matches what mock_services from 5/1 expects): the first 4 bytes of the request body are a big-endian `uint32_t response_size`. The server reads that header and sizes its response accordingly, so client and server stay in sync without an out-of-band agreement.

# Fanout integration in LeafNodeRank

Three new CLI flags:
- `--rpc_dist_path`: path to `rpc_dist.json`. Default empty — when unset, the legacy `folly::futures::sleep` path is preserved verbatim (regression-safety A/B comparison).
- `--mock_services_host` / `--mock_services_port`: target (defaults `127.0.0.1:21222`).
- `--rpc_fanout_scale`: scale factor on per-session counts. Default `0.025` yields ~94 RPCs per inbound session (vs ~3742 at scale=1.0); the table in §4 of the researcher notes documents the calibration tradeoff.

`issueOutboundFanout(td, scale)` iterates the 20 methods, computes `n = max(1, round(per_session_count * scale))` for each, samples request size / response size / latency from the registry, builds a request body (4-byte BE header + Silesia bytes if available, else zero-filled padding), and dispatches via the per-thread `MockServicesClient`. The Future<int> resolves once `folly::collectAll` of all per-call futures completes.

`simulateIoOrFanout(td, ...)` is the drop-in replacement for `folly::futures::sleep`. When `td.mock_client` is non-null (i.e. `--rpc_dist_path` was set), it fans out; otherwise it sleeps. The three callsites that get this treatment are the I/O simulation in `AsyncPageRankRequestHandler`, `DLRMRequestHandler`, and the legacy sync `PageRankRequestHandler`. The 1ms inter-stage breather around line 1735 is intentionally left as `folly::futures::sleep` per the researcher notes — that's not modeled I/O.

ThreadStartup populates `td.rpc_registry`, `td.rpc_silesia`, `td.mock_client`, and a per-thread `std::mt19937 rpc_rng`. If the connection to mock_services fails at startup, the code logs and falls back to the legacy sleep path for that thread rather than aborting (defensive — a handful of slow startups shouldn't take down the whole benchmark, and the warning will surface in install logs).

# Build wiring

CMake: `LeafNodeRank` now compiles `MockServicesClient.cc` and links `MockService-cpp2`. Added the corresponding `add_dependencies(LeafNodeRank MockService-cpp2-target)` so the thrift bindings are generated before LeafNodeRank starts compiling. `MockServicesClient.cc` includes `mock_services/gen-cpp2/MockServiceAsyncClient.h` directly; the existing `${CMAKE_CURRENT_SOURCE_DIR}` include path makes that visible.

# Test calibration

20 methods × scale=0.025 = exactly **94 calls per session** (matches the researcher table — verified independently by recomputing the ceil/round). Per-method breakdown is in researcher notes §4.

Reviewed By: charles-typ

Differential Revision: D103772853

fbshipit-source-id: 4f84c6012a40e5f7b39a6529e19afda41a517d41
@excelle08 excelle08 closed this Jul 4, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Facebook bot. Authors need to sign the CLA before a PR can be reviewed. meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant