Channel keepalive eliminates cold-channel latency cliff (#740)
Open
excelle08 wants to merge 30 commits into
Summary: Add mock feature extraction pipeline to FeedSim with large-scale code generation for I-cache and frontend pressure. 27 genuinely diverse code patterns (derived from studying 696 production feature extractors) generate ~700 variants × 1000 copies = ~700K unique functions at install time.
Key components:
- 6 hand-written extractors based on production leaf function profiling
- 27 pattern-specific code generators (P01-P27) producing genuinely different instruction sequences (different branch topologies, loop nesting, data access patterns, code sizes from 10 to 2300 lines)
- Flat shuffled dispatch: all copy function pointers shuffled into one vector, iterated sequentially per request for maximum I-cache pressure
- DLRM medium/large model generation on-server during install
- Configurable via --num_stories, --extractors_per_story, --feature_complexity
Results on T1_BGM (Bergamo, 176 cores):
- 500K calls/req: L1 I-Cache MPKI 21.34 (prod target 21), IPC 0.69 (prod 0.6-0.8)
- 100K calls/req: Frontend Bound 23.5%, IPC 1.22, QPS 242
Results on T11_GRC_ARM (Grace, 72 cores):
- 100K calls/req: IPC 0.52, L1 I-Cache MPKI 15.91
- Medium DLRM + 100K calls: IPC 1.03 (prod target 1.05)
Differential Revision: D97022149
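The flat shuffled dispatch described above can be illustrated with a minimal sketch. Everything here is hypothetical: `ExtractorFn`, `extractorA`, `extractorB`, `buildDispatchTable`, and `runRequest` are illustrative names, and the sketch repeats the same two function pointers where the real pipeline generates a distinct function body per copy.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical extractor signature; the generated functions may look different.
using ExtractorFn = std::uint64_t (*)(std::uint64_t);

// Two stand-in "generated" extractors.
std::uint64_t extractorA(std::uint64_t x) { return x * 31u + 7u; }
std::uint64_t extractorB(std::uint64_t x) { return (x ^ (x >> 3)) + 1u; }

// Put `copies` entries per extractor into one flat vector, then shuffle it so
// consecutive calls jump between unrelated code addresses.
std::vector<ExtractorFn> buildDispatchTable(const std::vector<ExtractorFn>& fns,
                                            std::size_t copies,
                                            std::uint32_t seed) {
  std::vector<ExtractorFn> table;
  table.reserve(fns.size() * copies);
  for (ExtractorFn fn : fns) {
    table.insert(table.end(), copies, fn);
  }
  std::mt19937 rng(seed);
  std::shuffle(table.begin(), table.end(), rng);
  return table;
}

// Per request: walk the shuffled table sequentially, chaining each result into
// the next call so the compiler cannot drop any of them.
std::uint64_t runRequest(const std::vector<ExtractorFn>& table,
                         std::uint64_t input) {
  std::uint64_t acc = input;
  for (ExtractorFn fn : table) {
    acc = fn(acc);
  }
  return acc;
}
```

Shuffling once at startup keeps the per-request loop trivial while still defeating any locality the linker layout would otherwise give to neighboring copies.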
Summary: Replace scalar FP transforms with integer hash operations, add data-dependent conditional branches, eliminate FP division with integer reciprocal approximation, deepen MockHashTable::find() call chain from 1 to 5 levels, and increase basic block sizes with MurmurHash-style computation chains.
Results (5/7 instruction mix targets met on CPL):
- Scalar FP: 8% → 0.46% (target <3.5%) ✓
- Conditional branches: 5.28% → 10.73% (target >15%) ✗
- Near call/return: 3.93% → 0.75% (target <1.5%) ✓
- Memory (ld+st): 51.56% → 40.72% (target 40-46%) ✓
- Divider active: 11.50% → 1.13% (target <2%) ✓
- Avg BB size: 7.3 → 13.7 (LBR, target >18) ✗
QPS impact: CPL -2.2%, BGM -6.3%, Grace -3.8%.
Differential Revision: D99494831
Summary:
Replace oldisim framework internals (libevent server, pthreads, boost::lockfree) with
folly-based equivalents for LeafNodeRank and DriverNodeRank. ParentNodeRank retains
oldisim dependency.
New files:
- FeedSimProtocol.h: Wire protocol types (binary compatible with oldisim)
- FeedSimServer.{h,cc}: Server using folly::AsyncServerSocket + folly::EventBase
- FeedSimDriver.{h,cc}: Client driver with libevent for timer precision
Modified:
- LeafNodeRank.cc: Use feedsim::FeedSimServer, feedsim::RequestContext
- DriverNodeRank.cc: Use feedsim::FeedSimDriver, feedsim::TestDriver
- CMakeLists.txt: Add FeedSimFramework library, replace OLDISim link dep
- utils.h: Remove oldisim DIE() dependency
- run.sh: Change readiness check from HTTP monitor port to TCP data port
Differential Revision: D99498073
Summary: Phase 3 of FeedSim v2 refactor. Client loads the Silesia compression corpus (203MB, 12 files) via mmap, picks random snippets as "stories," and sends them to the server via thrift. Server uses story content to derive data-dependent feature extraction inputs and DLRM features instead of random data.
Changes:
- Add StoryContent/StoryBatch thrift types and story_batch field on RankingRequest
- Add SilesiaLoader.h: mmap-based corpus loader with random snippet serving
- Update DriverNodeRank to load Silesia at startup, populate stories per request
- Update LeafNodeRank to extract stories, derive features from content bytes (byte frequency histogram -> dense features, rolling bigram hash -> sparse)
- Rewrite DLRMRequestHandler from sync to async with folly futures (I/O simulation + compression + pointer chase, pipelined)
- Add storyContent/storyContentLength fields to CopyContext for extractors
- Fix feature_suite missing from ThreadData (lost during rebase)
- Fix $feature_opts not passed to LeafNodeRank in run.sh
- Fix runFeatureExtraction() never called from request handlers
- Fix ICacheBuster SIGSEGV: init moved outside PAGERANK block
- Remove ICacheBuster from DLRMRequestHandler (DLRM inference is own workload)
- Add --silesia-dir, --stories-per-request, --story-size-min/max CLI options
- Download Silesia corpus during install (x86 + aarch64)
Differential Revision: D104076734
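A minimal sketch of deriving features from story bytes, in the spirit of the "byte frequency histogram -> dense features, rolling bigram hash -> sparse" step above. The function names `denseFromBytes` and `sparseFromBytes`, the normalization by story length, and the FNV-1a-style mixing are all assumptions made for illustration; the diff does not state which normalization or hash LeafNodeRank uses.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <string_view>
#include <vector>

// Dense features: a 256-bin byte-frequency histogram, normalized by length
// (the normalization is an assumption).
std::array<float, 256> denseFromBytes(std::string_view story) {
  std::array<float, 256> hist{};
  for (unsigned char c : story) {
    hist[c] += 1.0f;
  }
  if (!story.empty()) {
    for (float& h : hist) {
      h /= static_cast<float>(story.size());
    }
  }
  return hist;
}

// Sparse features: fold each adjacent byte pair into a running hash and map it
// to an embedding-table index. The FNV-1a constants are a stand-in hash.
std::vector<std::uint32_t> sparseFromBytes(std::string_view story,
                                           std::uint32_t tableSize) {
  std::vector<std::uint32_t> ids;
  if (story.size() < 2 || tableSize == 0) {
    return ids;
  }
  ids.reserve(story.size() - 1);
  std::uint32_t h = 2166136261u;  // FNV offset basis
  for (std::size_t i = 0; i + 1 < story.size(); ++i) {
    const std::uint32_t bigram =
        (static_cast<std::uint32_t>(static_cast<unsigned char>(story[i])) << 8) |
        static_cast<unsigned char>(story[i + 1]);
    h = (h ^ bigram) * 16777619u;  // FNV prime; state carries across bigrams
    ids.push_back(h % tableSize);
  }
  return ids;
}
```

Because both outputs depend only on the story bytes, the same Silesia snippet always yields the same features, which keeps branch and memory behavior data-dependent rather than RNG-dependent.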
Summary: FeedSim's profile shows RPC at 4-5% vs production's 30-34%, partly because the benchmark sends tiny requests with no resemblance to production's payload size distribution. This diff lets the client sample a target serialized request size from a JSON percentile distribution and pad the request to hit that size.
Changes:
- Add `optional binary padding` field to `RankingRequest` thrift struct
- New `RequestSizeSampler` (header-only) loads a JSON file of percentile data (`req_size_min`, `req_size_p05`, ... `req_size_max`) and samples target sizes via inverse-CDF with linear interpolation between percentiles
- Add `--req_size_dist <json>` flag to DriverNodeRank. When present, each request is built normally, then padded with Silesia bytes (or random bytes if Silesia not loaded) to reach the sampled target size
- Plumb `--req-size-dist` through `run.sh` with auto-detection of `feed_aggregator_req_sizes.json` next to `run.sh`
- Bundle `feed_aggregator_req_sizes.json` and `feed_aggregator_resp_sizes.json` (production data from ServiceRouter) and copy them in install scripts
Differential Revision: D102693799
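Inverse-CDF sampling with linear interpolation between percentiles can be sketched as follows. This is not the `RequestSizeSampler` source: the `Knots` type and `inverseCdf` function are hypothetical, and the sketch assumes the percentile table has already been parsed into (cumulative probability, value) pairs sorted by probability.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// (cumulative probability, value) pairs sorted by probability, for example
// {0.0, min}, {0.05, p05}, ..., {1.0, max}. Hypothetical representation.
using Knots = std::vector<std::pair<double, double>>;

// Map a uniform draw u in [0, 1] to a value by locating the bracketing
// percentiles and interpolating linearly between them.
double inverseCdf(const Knots& knots, double u) {
  assert(knots.size() >= 2);
  if (u <= knots.front().first) {
    return knots.front().second;
  }
  for (std::size_t i = 1; i < knots.size(); ++i) {
    if (u <= knots[i].first) {
      const auto& [p0, v0] = knots[i - 1];
      const auto& [p1, v1] = knots[i];
      const double t = (u - p0) / (p1 - p0);  // p1 > p0 here since u > p0
      return v0 + t * (v1 - v0);
    }
  }
  return knots.back().second;
}
```

A caller would draw `u` from `std::uniform_real_distribution<double>(0.0, 1.0)` per request and pad the serialized request up to the returned size.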
…onse generator
Summary: Several cleanups in LeafNodeRank, all motivated by the leaf-only hot-func breakdown, which surfaced ~15% of CPU on server-side response RNG and the misleadingly named dlrmInferenceServerSideDataGeneration:
1. Split DLRM inference into two functions, both async (return folly::Future<int>):
   - dlrmInferenceServerSide: inference path where features are generated inside DLRM::infer (server-side)
   - dlrmInferenceClientSide: inference path that uses DLRM::inferWithFeatures with client-provided dense+sparse features
   The old name dlrmInferenceServerSideDataGeneration hid the actual ML inference call (this_thread.dlrm_ranker->infer) inside a function whose name suggested it was just generating feature data. The new names are honest. A shared shardInferences() helper distributes work across cpu_threads_arg.
2. Rewrite DLRMRequestHandler to be fully async with a single future chain (DLRM inference -> I/O sleep -> compression -> pointer chase -> generate+send response). The previous code blocked synchronously on the inference future via .get() before starting the rest of the pipeline. Now the handler thread returns immediately after kicking off the chain.
3. Pick the right inference function based on what the client sent: if request.dlrm_features() is set, use dlrmInferenceClientSide (no server RNG for inputs); otherwise dlrmInferenceServerSide.
4. Remove the dead `else if (g_workload_type == DLRM)` branches from PageRankRequestHandler and AsyncPageRankRequestHandler. DLRM workload requests use kDLRMRequestType, which routes to DLRMRequestHandler (registered separately in main()), so the DLRM branch in PageRank handlers was unreachable.
5. New Silesia-backed server response generator (generators/SilesiaResponseGenerator.h). When the server is started with --silesia_dir, response RankingObjects/RankingStorys are populated by slicing bytes out of the mmap'd Silesia corpus instead of running xor128() RNG. Replaces ~15% of leaf CPU previously spent on RNG (mersenne_twister, generateRandomString, xor128) with cheap memcpy from a hot mmap. The bytes have realistic entropy for downstream ZSTD compression.
6. Added LeafNodeRank --silesia_dir option and plumbed through run.sh. The same --silesia-dir flag now provides bytes both to DriverNodeRank's story_batch and to the server's response generator.
Differential Revision: D103100125
…KER, GlobalCPUThread
Summary:
LeafNodeRank's existing thread pools were anonymous from Strobelight's point of view, so the prod-vs-bench thread-pool breakdown landed almost entirely in the `Unknown` / framework-noise bucket on CPL and BGM. Per the multifeed_aggregator prod profile (~/feedsim_v2/profiles/multifeed_aggregator_main_prod/), the four hot pools are `ThriftSrv.IO`, `SREventBase{N}`, `RANKER-{N}`, and `GlobalCPUThread` (see ~/feedsim_v2/docs/phase4_researcher_notes.md section 1).
This diff is the Programmer-A half of Phase 4 (thread pools only). It renames the four existing pools to match the prod names that Strobelight categorizes, and adds two new pools (`SREventBase`, `RANKER`) that Phase 5 will start dispatching outbound RPC fanout onto. Programmer-B's diff (thrift schema + 5 new method registrations) lands as a sibling commit; via() callsites stay on their existing pool aliases so this rename is a no-op behaviorally.
Changes:
- `cpuThreadPool` is now backed by `folly::getGlobalCPUExecutor()` (the folly singleton already exposes its threads as `GlobalCPUThread` via `NamedThreadFactory("GlobalCPUThreadPool")` in `folly/executors/GlobalExecutor.cpp:65`). DO NOT instantiate a second CPU pool — that would double-count the prod `GlobalCPUThread` category. The `ThreadData::cpuThreadPool` field type changed from `shared_ptr<CPUThreadPoolExecutor>` to `shared_ptr<folly::Executor>` so the same field can hold the global singleton via an aliasing shared_ptr that owns a `folly::Executor::KeepAlive<>`. All existing `folly::via(this_thread.cpuThreadPool.get(), ...)` callsites continue to compile because `folly::via` accepts an `Executor*`.
- `srvCPUThreadPool` is now `NamedThreadFactory("RANKER")`, sized `max(1, nproc/2)` by default (matches CPL prod: 26 RANKER threads on a 52-logical-core host). Tunable via the new `--ranker_threads` flag.
- `ioThreadPool` is now `NamedThreadFactory("ThriftSrv.IO")`, sized `--io_threads`. Was previously anonymous (the kernel just labeled the threads with the executable name).
- New `srEventBasePool` (`folly::IOThreadPoolExecutor`, `NamedThreadFactory("SREventBase")`), sized `max(1, nproc * 7 / 10)` (approximates CPL prod: 39 SREventBase threads on a 52-logical-core host, where the formula yields 36). Tunable via the new `--sr_event_base_threads` flag. Idle in Phase 4; Phase 5 wires the outbound mock_services fanout onto it. Threads are warmed up at server start so Strobelight sees them in steady state.
- The legacy `srvIOThreadPool` (compression dispatch) is preserved for now and retired in Phase 6 once compression callsites move onto `GlobalCPUThread` per the researcher notes.
- New CLI flags `--ranker_threads` and `--sr_event_base_threads` in `LeafNodeRankCmdline.ggo`. Default `0` means "auto-compute from `folly::available_concurrency()` per the formulas above".
- Includes added: `<folly/executors/GlobalExecutor.h>` and `<folly/system/HardwareConcurrency.h>`. Both are already on `${FOLLY_INCLUDE_DIR}` in the existing CMake target so no `CMakeLists.txt` edits were required.
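The two auto-sizing formulas above, written out as a small sketch. The helper names `rankerThreads` and `srEventBaseThreads` are hypothetical; the only behavior taken from the text is "flag value 0 means auto-compute" and the `max(1, nproc/2)` and `max(1, nproc * 7 / 10)` formulas, with integer division assumed.

```cpp
#include <algorithm>

// RANKER pool size: an explicit flag value wins; 0 means "auto" = max(1, nproc/2).
unsigned rankerThreads(unsigned nproc, unsigned flagValue = 0) {
  return flagValue != 0 ? flagValue : std::max(1u, nproc / 2);
}

// SREventBase pool size: 0 means "auto" = max(1, nproc * 7 / 10), using
// integer arithmetic (multiply first, then divide).
unsigned srEventBaseThreads(unsigned nproc, unsigned flagValue = 0) {
  return flagValue != 0 ? flagValue : std::max(1u, nproc * 7 / 10);
}
```

On a 52-logical-core host this gives 26 RANKER threads and, with integer division, 36 SREventBase threads; the `max(1, ...)` floor keeps both pools non-empty on single-core hosts.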
Differential Revision: D103766488
@excelle08 has exported this pull request. If you are a Meta employee, you can view the originating Diff in D106558423.
excelle08 added a commit to excelle08/DCPerf-1 that referenced this pull request Jul 2, 2026
…arch#740)
Summary:
Pull Request resolved: facebookresearch#740
Adds a per-MockServicesClient keepalive timer that fires a fire-and-forget `getStatus()` probe RPC every N milliseconds, keeping Rocket channels warm between sparse session bursts.
After D105903218 distributed outbound RPCs across one MockServicesClient per SREventBase (123 channels on BGM), the t25 QPS-latency sweep surfaced a severe cold-channel anti-pattern: p95 latency at low offered load was up to 14× *worse* than at peak load on BGM (9,488 ms at q=5/inst vs 684 ms at q=35/inst), with throughput collapsing to 31% of requested (1.55 of 5 QPS). Grace and CPL showed milder 2.5-2.8× cliffs.
Root cause (confirmed via perf-record diff at q=5 vs q=35 on BGM): at low offered load, ≈1.5 sessions in-flight × K=16 fanout = 24 RPCs spread across 123 channels, leaving most channels idle for 100-300 ms between bursts. Each cold RPC then pays three compounding penalties:
- Rocket channel re-arm (+2.4pp Folly, +1.9pp RPC-AsyncIO at q=5)
- Deep C-state wake on idle SREventBase cores (Zen4c C6 wake ≈100 µs; `poll_idle` + `acpi_processor_ffh_cstate_enter` in top 14 hot functions)
- Cold allocator/JIT caches (+4.1pp MemAlloc, +2.0pp JIT-Unresolved)
All three penalties share the same proximate cause: channel idle time > 100 ms.
Touching every channel every 150 ms via a cheap getStatus() probe eliminates all three penalties simultaneously. The probe runs on the channel's own EventBase (thread-affine) and is fire-and-forget (drops the returned SemiFuture); transport errors are swallowed silently to avoid keepalive failures propagating to the application path.
Bandwidth cost: ~13K pings/sec/instance × ~100 bytes round-trip = ~1.3 MB/s, which is negligible. mock_services CPU cost ≈0.07 cores per instance.
Knobs:
- New CLI flag `--mock_keepalive_interval_ms` on LeafNodeRank (default 0 = disabled so the anti-pattern stays observable for regression testing).
- New `MOCK_KEEPALIVE_INTERVAL_MS` env override in run.sh so sweep scripts can A/B without rebuilding.
- Recommended starting value: 150 ms (validated). 300-500 ms likely sufficient; tunable.
Reviewed By: YifanYuan3
Differential Revision: D106558423
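The keepalive behavior can be modeled without folly as a timer driven by a simulated clock. This is a sketch under stated assumptions, not the MockServicesClient code: `KeepaliveTimer` and `advanceTo` are hypothetical names, the probe is an arbitrary callable standing in for the `getStatus()` RPC, and the real timer runs on the channel's EventBase rather than being polled.

```cpp
#include <cstdint>
#include <utility>

// Fires a probe once per elapsed interval. An interval of 0 disables the
// timer, mirroring the documented default of --mock_keepalive_interval_ms.
class KeepaliveTimer {
 public:
  explicit KeepaliveTimer(std::int64_t intervalMs)
      : intervalMs_(intervalMs), nextFireMs_(intervalMs) {}

  // Advance the simulated clock to nowMs and run the probe for every interval
  // boundary crossed. Returns the number of probes fired.
  template <class Probe>
  int advanceTo(std::int64_t nowMs, Probe&& probe) {
    if (intervalMs_ <= 0) {
      return 0;  // keepalive disabled
    }
    int fired = 0;
    while (nextFireMs_ <= nowMs) {
      try {
        probe();  // fire-and-forget: the result is ignored
      } catch (...) {
        // Swallow probe failures so they never reach the application path.
      }
      ++fired;
      nextFireMs_ += intervalMs_;
    }
    return fired;
  }

 private:
  std::int64_t intervalMs_;
  std::int64_t nextFireMs_;
};
```

Keeping the probe's failure handling inside the timer is the design point the summary calls out: a dead or slow mock_services endpoint degrades only the keepalive, not the request path.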
9004e34 to faa2d9a
Summary:
Phase 5 of the FeedSim v2 refactor needs LeafNodeRank to issue real outbound RPCs against a separate Thrift server so Strobelight categorizes the resulting CPU samples into the same rpc-stack/serialization/transport buckets as production multifeed_aggregator. This diff adds the mock_services binary that stands in for the 20 outbound RPC types observed in the production profile.
The server is built on real apache::thrift::ThriftServer, not the FeedSimServer hand-rolled AsyncServerSocket loop, so loopback dispatch goes through Cpp2Worker, RocketServerConnection, RequestRpcMetadata, CompactProtocolWriter, etc. exactly as it would in prod. The 20 thrift methods all share the same wire signature `binary <method>(1: binary request, 2: i32 latency_us)` (the design from phase5_researcher_notes section 1, option (c)) and dispatch to a single shared handler body. Distinct method names exist purely so Strobelight per-method attribution lines up with prod.
Wire contract: caller writes a uint32 big-endian response_size in the first 4 bytes of `request` and then opaque padding sized to the request percentile. The server sleeps/spins for `latency_us` and replies with `response_size` bytes copied from the Silesia corpus. Short-tail latencies (<200us) burn the IO thread to keep rpc-stack samples on-CPU; longer latencies hop to the global timekeeper.
Files added: MockService.thrift (IDL, 20 methods), MockServiceHandler.{h,cc} (single shared runSimulatedRpc body, 20 trivial wrappers behind a macro), MockServiceMain.cc (folly::Init + ThriftServer setup), BUCK (thrift_library + cpp_binary, with a -I flag pulling SilesiaLoader.h from the parent ranking/ dir since that dir has no BUCK file), CMakeLists.txt (open-source build path; mirrors the parent ranking/ pattern). Parent ranking/CMakeLists.txt picks up the new dir via add_subdirectory. The binary ships in the cea.chips.benchpress fbpkg automatically via the existing buck_filegroup glob over packages/feedsim/**.
Programmer B (sibling diff in this stack) wires LeafNodeRank's MockServiceAsyncClient and the issueOutboundFanout switch; Programmer C migrates compression to ManagedCompression. No file conflicts with this diff.
Differential Revision: D103766817
Summary: Phase 4 programmer-B: introduce production-shaped multifeed aggregator thrift schema and dispatch IDs so the FeedSim leaf node can be exercised by per-method driver traffic in Phase 6. Builds on Phase 4-A (`11e9bc9a3431`: pool rename to ThriftSrv.IO/SREventBase/RANKER/GlobalCPUThread) and Phase 5-A (`31948cd0d579`: mock_services binary).
Three changes, all additive (no existing struct/handler is removed; Phase 6 deletes RankingRequest/RankingResponse and the legacy kPageRank/kDLRM type IDs):
1. `if/ranking.thrift`: five new request/response struct pairs sized to the p50 wire targets from `~/feedsim_v2/profiles/rpc_dist.json`:
   - CreateAndPrimeSessionRequest/Response (379 B / 44 B)
   - GetStoriesRequest/Response (2.13 MB / 171 KB); also adds shared helpers GetStoriesResponseStats and RankedStoryInfo
   - GetAllStoriesRequest/Response (55 B / 1.47 MB)
   - StreamDataRequest/Response (58 KB / 4 B) plus StreamingUseCase enum
   - StreamIfrPriorityRankingRequest/Response (949 KB / 4 B)
   Each request struct mirrors prod field counts and types (including primitive vs container vs binary) per `~/feedsim_v2/docs/phase4_researcher_notes.md` section 3, so CompactProtocol serialization cost is realistic. Bulk wire size lives in named `binary` fields (e.g. `settings_compressed`, `serialized_payload`, `ifr_objects_serialized`) that the Phase 6 driver populates by sampling from the percentile table.
2. `RequestTypes.h`: five new uint32_t constants `0x10..0x14` for the new methods. Existing `kPageRankRequestType` (0x00) and `kDLRMRequestType` (0x01) stay so the in-flight stack keeps working.
3. `LeafNodeRank.cc`: five new shim handler functions and matching `registerQueryCallback` calls:
   - Heavy methods (`getStoriesUncompressed`, `getAllStories`) deserialize the new struct then route to the existing `DLRMRequestHandler`. Phase 4 CPU profile is unchanged for those.
   - Light methods (`createAndPrimeSession`, `streamData`, `streamIfrPriorityRanking`) deserialize, then send a small fixed-size response (44 B / 4 B / 4 B) without invoking `DLRMRequestHandler`. Production p50 latencies for these are 3-13 ms with 4-44 B responses, so attributing DLRM CPU to them in Phase 4 testing would distort the profile. Phase 6 replaces these shims with real per-method handlers (session bookkeeping, ack-only paths, IFR scoring).
Sizing methodology: targets are p50 wire sizes from `rpc_dist.json`. Computed sizes are CompactProtocol overhead (1 byte per short field tag, 2 bytes for tags >15, varint length + N bytes data for binary, ~1 byte stop) plus the binary field contents the driver supplies:
| Method | Target p50 | Size source |
|-------------------------------|-----------:|----------------------------------------------------------------------|
| CreateAndPrimeSessionRequest | 379 B | ~110 B field overhead + ~270 B `session_init_blob` |
| CreateAndPrimeSessionResponse | 44 B | ~7 B field overhead + 32-char hex `session_id` (~36 B) + 4 B status |
| GetStoriesRequest | 2.13 MB | ~150 B fixed fields + 5 binary blobs (driver fills to ~2.07 MB total) |
| GetStoriesResponse | 171 KB | ~50 B fixed + ~100 stories x ~1.5 KB story_payload + ~10 KB debug |
| GetAllStoriesRequest | 55 B | 36 B session_id + 8 B query_id + ~10 B caller_id + ~6 B overhead |
| GetAllStoriesResponse | 1.47 MB | ~50 B fixed + ~500-1000 stories x ~1.5 KB + ~10 KB debug |
| StreamDataRequest | 58 KB | ~50 B fixed + driver-sampled `serialized_payload` (bimodal in prod) |
| StreamDataResponse | 4 B | 1 B field header + 1 B i32 zigzag + 1 B stop = 3-4 B |
| StreamIfrPriorityRankingReq | 949 KB | ~80 B fixed + driver-sampled `ifr_objects_serialized` etc. |
| StreamIfrPriorityRankingResp | 4 B | identical encoding to StreamDataResponse |
Because the binary fields are sampled per-request from the percentile table (in Phase 6), every method can hit not just p50 but the entire prod distribution (p05/p25/p75/p95). The structs themselves carry no binary defaults.
Generated `gen-cpp2/ranking_types.h` is regenerated by CMake at fbpkg-install time; the checked-in copy predates `RankingRequest`/`DLRMFeatures`/`StoryBatch` and is also missing those, confirming it is rebuilt out-of-tree.
Differential Revision: D103767023
Summary: Migrate the three ZSTD compression callsites in `LeafNodeRank.cc` from raw `folly::compression::getCodec(CodecType::ZSTD)` to ManagedCompression, the documented Meta standard for application-level compression in fbcode (per `fbcode/.llms/rules/managed_compression.md` and the `managed_compression_integration` skill). ManagedCompression handles dictionary training, parameter tuning, and rollout via infrastructure rather than hard-coded codec choice/level.
Three callsites migrated, two categories:
- `compressPayload` (line ~399): `leaf_random_string` category
- `decompressPayload` (line ~409): `leaf_random_string` category (same as compressPayload, since both operate on the same pseudo-random payload bytes; required so ManagedCompression serves the right dictionary on decompress)
- `compressThrift` (line ~416): `leaf_thrift_payload` category (serialized RankingResponse / CompactProtocol, a distinct payload shape)
Following the canonical pattern from `common/managed_compression/examples/ManagedCompressionExample.cpp`:
- One process-wide `folly::Singleton<ManagedCompressionFactory>` keyed by oncall=`chips_dcperf` and project=`feedsim`. Constructing a factory per call is explicitly discouraged.
- `getCachedCodec(category)` per category, preferred over `getCodec()` for hot paths.
- 2 categories, both clearly distinct payload shapes (random bytes vs thrift CompactProtocol). Per the skill, category count is kept modest.
Open-source build path: the benchpress repo is open-sourced and ManagedCompression is internal-only, so the include and use sites are gated behind `#ifdef BENCHPRESS_INTERNAL`. When the gate is undefined (the current OSS / CMake build path that all install scripts use today), the original raw folly ZSTD code remains as the fallback so the OSS build still works. `LeafNodeRank.cc` has no Buck build target (it is built only by CMake at fbpkg-install time), so no Buck-side wiring is needed in this commit. A follow-up (Phase 6) can add `-DBENCHPRESS_INTERNAL=1` to the internal CMake invocation to flip the gate on.
Stack position: depends on `bebd655f6d` (Phase 4-B thrift structs). Sibling of Phase 5-A mock_services (`bd2517aa83`).
Differential Revision: D103768051
Summary:
Phase 5 closes the loop on the prod-shaped RPC stack: LeafNodeRank now issues real outbound Thrift RPCs to the mock_services server stood up in 5/1, replacing the synthetic `folly::futures::sleep(io_latency_ms)` callsites in the request handlers. This is the change that actually puts the RPC stack on-CPU during a request, which is what the multifeed_aggregator profile spends most of its time in.
# Generalize percentile sampling: PercentileSampler + RpcDistRegistry
`RequestSizeSampler` was hard-coded to one prefixed distribution per JSON file. We now need 60 distributions (20 outbound methods x {request_size, response_size, latency_us}). `PercentileSampler.h` is the generalized inverse-CDF sampler:
- `load(path, prefix)` keeps the legacy prefixed-keys shape used by DriverNodeRank's `--request_size_distribution` flag.
- `loadFromDynamic(obj)` accepts a bare `{min,p05,...,max}` object — the shape used by every per-method sub-object in `rpc_dist.json`.
- `sample(rng)` and `sampleI64(rng)` cover both size and latency distributions.
`RequestSizeSampler.h` is now a one-line `using` alias so DriverNodeRank keeps building unmodified.
`RpcDistRegistry.h` loads `rpc_dist.json` once at startup and exposes the 60 outbound samplers via either `MethodIdx` enum or string-keyed accessor. The `kPerSessionCounts` table is hard-coded from the researcher notes (not parsed from the JSON) so a missing or stale `rpc_dist.json` cannot silently change the fanout calibration. See ~/feedsim_v2/docs/phase5_researcher_notes.md §4 for the per-method numbers.
# Per-thread Thrift client: MockServicesClient
`MockServicesClient` wraps the generated `MockServiceAsyncClient` with a compile-time switch over `MethodIdx`. We deliberately use the named `semifuture_<method>()` calls rather than a single dynamic-name dispatch — preserving distinct `AsyncClient::send_<method>` symbols in Strobelight, which is the entire reason `MockService.thrift` declares 20 methods rather than one generic `call()`.
Thread-safety strategy: one client per LeafNodeRank worker thread, each pinned to one EventBase from the SREventBase pool (the outbound-EventBase pool added in Phase 4). `MockServicesClient`'s constructor and destructor both `runInEventBaseThreadAndWait` to keep the AsyncClient + RocketClientChannel on their owning EventBase thread.
Wire contract (matches what mock_services from 5/1 expects): the first 4 bytes of the request body are a big-endian `uint32_t response_size`. The server reads that header and sizes its response accordingly, so client and server stay in sync without an out-of-band agreement.
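The wire contract is simple enough to sketch end to end. The helpers `buildRequest` and `readResponseSize` are hypothetical names for illustration; the only detail taken from the text is that the first 4 bytes of the request body carry a big-endian `uint32_t response_size`, followed by opaque padding (zero-filled here).

```cpp
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Client side: 4-byte big-endian response_size header, then padding.
std::string buildRequest(std::uint32_t responseSize, std::size_t paddingBytes) {
  std::string body;
  body.reserve(4 + paddingBytes);
  body.push_back(static_cast<char>((responseSize >> 24) & 0xFF));
  body.push_back(static_cast<char>((responseSize >> 16) & 0xFF));
  body.push_back(static_cast<char>((responseSize >> 8) & 0xFF));
  body.push_back(static_cast<char>(responseSize & 0xFF));
  body.append(paddingBytes, '\0');
  return body;
}

// Server side: recover response_size, or nullopt if the body is too short to
// contain the header.
std::optional<std::uint32_t> readResponseSize(const std::string& body) {
  if (body.size() < 4) {
    return std::nullopt;
  }
  std::uint32_t value = 0;
  for (int i = 0; i < 4; ++i) {
    value = (value << 8) | static_cast<unsigned char>(body[i]);
  }
  return value;
}
```

Encoding byte by byte with shifts keeps the header layout independent of host endianness, so the same code is correct on both x86 and aarch64 hosts.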
# Fanout integration in LeafNodeRank
Three new CLI flags:
- `--rpc_dist_path`: path to `rpc_dist.json`. Default empty — when unset, the legacy `folly::futures::sleep` path is preserved verbatim (regression-safety A/B comparison).
- `--mock_services_host` / `--mock_services_port`: target (defaults `127.0.0.1:21222`).
- `--rpc_fanout_scale`: scale factor on per-session counts. Default `0.025` yields ~94 RPCs per inbound session (vs ~3742 at scale=1.0); the table in §4 of the researcher notes documents the calibration tradeoff.
`issueOutboundFanout(td, scale)` iterates the 20 methods, computes `n = max(1, round(per_session_count * scale))` for each, samples request size / response size / latency from the registry, builds a request body (4-byte BE header + Silesia bytes if available, else zero-filled padding), and dispatches via the per-thread `MockServicesClient`. The Future<int> resolves once `folly::collectAll` of all per-call futures completes.
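The per-method call count `n = max(1, round(per_session_count * scale))` can be sketched as below. `callsForMethod` and `totalCalls` are hypothetical helper names, and the counts used in any example are made up; the real `kPerSessionCounts` table lives in `RpcDistRegistry.h` and is not reproduced here.

```cpp
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <vector>

// n = max(1, round(per_session_count * scale)) for one method.
std::int64_t callsForMethod(double perSessionCount, double scale) {
  const std::int64_t rounded =
      static_cast<std::int64_t>(std::llround(perSessionCount * scale));
  return std::max<std::int64_t>(1, rounded);
}

// Total outbound RPCs per inbound session across all methods.
std::int64_t totalCalls(const std::vector<double>& perSessionCounts,
                        double scale) {
  std::int64_t total = 0;
  for (double count : perSessionCounts) {
    total += callsForMethod(count, scale);
  }
  return total;
}
```

Note that the floor of 1 means rare methods are still issued once per session, so the total can exceed `scale` times the sum of the per-session counts.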
`simulateIoOrFanout(td, ...)` is the drop-in replacement for `folly::futures::sleep`. When `td.mock_client` is non-null (i.e. `--rpc_dist_path` was set), it fans out; otherwise it sleeps. The three callsites that get this treatment are the I/O simulation in `AsyncPageRankRequestHandler`, `DLRMRequestHandler`, and the legacy sync `PageRankRequestHandler`. The 1ms inter-stage breather around line 1735 is intentionally left as `folly::futures::sleep` per the researcher notes — that's not modeled I/O.
ThreadStartup populates `td.rpc_registry`, `td.rpc_silesia`, `td.mock_client`, and a per-thread `std::mt19937 rpc_rng`. If the connection to mock_services fails at startup, the code logs and falls back to the legacy sleep path for that thread rather than aborting (defensive — a handful of slow startups shouldn't take down the whole benchmark, and the warning will surface in install logs).
# Build wiring
CMake: `LeafNodeRank` now compiles `MockServicesClient.cc` and links `MockService-cpp2`. Added the corresponding `add_dependencies(LeafNodeRank MockService-cpp2-target)` so the thrift bindings are generated before LeafNodeRank starts compiling. `MockServicesClient.cc` includes `mock_services/gen-cpp2/MockServiceAsyncClient.h` directly; the existing `${CMAKE_CURRENT_SOURCE_DIR}` include path makes that visible.
# Test calibration
Summed over the 20 methods at scale=0.025, the per-method `max(1, round(per_session_count * scale))` counts come to exactly **94 calls per session** (matches the researcher table; verified independently by recomputing the ceil/round). Per-method breakdown is in researcher notes §4.
Differential Revision: D103772853
Summary:
Programmer-A scope of Phase 6 per `~/feedsim_v2/docs/phase6_researcher_notes.md` sections 1, 2, 3, and 6. Server-side handlers (§4) and legacy-code deletion (§5) are split into Phase 6/2 (Programmer-B) and Phase 6/3 (Programmer-C) commits.
Four changes:
1. `FeedSimDriver` SemiFuture API (`FeedSimDriver.h`, `FeedSimDriver.cc`). New `TestDriver::sendRequestAndAwait(type, payload, length)` returns a `folly::SemiFuture<std::string>` that fulfills with the response payload bytes when the matching `ResponsePacketHeader` arrives. Implementation: per-`TestDriver` `folly::F14FastMap<uint64_t, folly::Promise<std::string>> pending_promises` keyed by `request_id`; `next_request_id` promoted from a plain `uint64_t` to `std::atomic<uint64_t>` so callers from arbitrary threads can mint IDs without contention. `event_base_once` hops the actual write onto the libevent thread (the only thread that may touch the bufferevent state). `readCb` parses the response header, copies the payload out of the libevent buffer, looks up the `request_id` under `pending_mutex`, moves the promise out, drops the lock, and calls `setValue`. The legacy fire-and-forget `sendRequest` stays in place — Phase 6-C deletes it.
2. `RunSession` orchestration (`DriverNodeRank.cc`). New `RunSession(thread_id, driver, thread_data)` builds the 4-step session pipeline that mirrors prod multifeed_aggregator: createAndPrimeSession (await) -> getStoriesUncompressed (HOLD future) + streamData * N (parallel) + optional streamIfrPriorityRanking coin flip -> await all streamData -> await getStoriesUncompressed and record first-story latency -> getAllStories (await) -> done. Encoders `encodeCreateAndPrime`, `encodeGetStories`, `encodeStreamData`, `encodeStreamIfrPriority`, `encodeGetAllStories` populate the typed thrift structs from Phase 4-B with realistic field values, sample target wire sizes from `RpcDistRegistry::inboundRequestSize()`, and pad the dominant binary field with Silesia bytes (compression-realistic). `query_id = (thread_id << 32) | session_counter++` so the leaf's `query_id >> 32` shard derivation lands all four to six RPCs for one session on the same RANKER worker. The new `StartSessionLoop` callback dispatches `RunSession` on a dedicated `CPUThreadPoolExecutor("DriverSession", num_threads)` and chains the pacing-timer rearm (`TestDriver::scheduleNextSession`) onto the SemiFuture completion. Session mode is gated by `--rpc_dist_json`; without it, the legacy `MakeRequest` path remains for backward compat during the migration.
3. First-story latency histogram (`FeedSimDriver.h`, `FeedSimDriver.cc`). Second `LatencySampler` added to `DriverStats` (`first_story_sampler_`). `recordFirstStoryLatencyNs(uint64_t)` on `TestDriver` forwards into it; `recordSessionComplete()` increments a session counter. `printStats()` emits a new "Stats for first-story latency" block with `fs_count`, `fs_sessions`, and `fs_min/avg/50p/90p/95p/99p/99.9p` lines. The `fs_*` prefix keeps `search_qps.sh`'s per-percentile greps unambiguous.
4. `search_qps.sh` parsing (`packages/feedsim/third_party/src/scripts/search_qps.sh`). Per-response latency greps are anchored on ` ` (leading whitespace) so they don't accidentally match the new first-story block. Four new CSV columns appended (`fs_50p_ms, fs_90p_ms, fs_95p_ms, fs_99p_ms`). New `fs_<percentile>` latency-type targets supported alongside the existing `<percentile>` ones (use `-s fs_95p:500` to search against the prod-equivalent first-story SLA per researcher §6).
Driver-side `RpcDistRegistry` inbound exposure (`RpcDistRegistry.h`) — added `InboundIdx` enum (5 methods), `inboundMethodNames()`, `inboundRequestSize/Response/LatencyUs(InboundIdx)` accessors, parallel `inbound_req_/resp_/lat_` arrays, and updated `load()` to populate them from the `inbound` JSON section. Programmer-B independently introduced a near-identical enum with slightly different naming (`kCreateAndPrimeSession` vs `CREATE_AND_PRIME`); kept B's naming and consumed it from `DriverNodeRank.cc` so Programmer-C can keep a single canonical version when deduplicating.
CLI flags added (`DriverNodeRankCmdline.ggo`): `--rpc_dist_json` (path, gates session mode), `--streamdata_per_session` (int, default=2; 0 randomizes uniform[1,3]), `--stream_ifr_probability` (float, default=0.045 to match prod ratio per researcher §2).
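The `query_id` packing used for session affinity can be sketched as follows. The helper names are hypothetical, and the sketch assumes a 32-bit session counter so that it cannot overflow into the thread bits; the width of the real `session_counter` is not stated here.

```cpp
#include <cstdint>

// Pack the driver thread id into the high 32 bits and the per-thread session
// counter into the low 32 bits.
std::uint64_t makeQueryId(std::uint32_t threadId, std::uint32_t sessionCounter) {
  return (static_cast<std::uint64_t>(threadId) << 32) | sessionCounter;
}

// Leaf side: the shard is the high half, so every RPC of one session maps to
// the same worker.
std::uint32_t shardOf(std::uint64_t queryId) {
  return static_cast<std::uint32_t>(queryId >> 32);
}

// The low half identifies the session within its driver thread.
std::uint32_t sessionOf(std::uint64_t queryId) {
  return static_cast<std::uint32_t>(queryId & 0xFFFFFFFFu);
}
```

Because the shard depends only on the high half, per-session state can live in a per-worker map without locking, as long as each worker is the sole owner of its shard.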
Differential Revision: D103795176
Summary:
Programmer-B scope of Phase 6 per `~/feedsim_v2/docs/phase6_researcher_notes.md` section 4. Replaces the Phase 4-B shim handlers in `LeafNodeRank.cc` with five real per-method handlers that mirror the production multifeed_aggregator pipeline (deserialize -> session lookup -> orchestrate on RANKER -> fan out work to GlobalCPUThread / SREventBase -> compress -> sendResponse). Programmer-A's parallel diff (D103795176) added the driver-side `RpcDistRegistry::InboundIdx` enum + accessors with the `kCreateAndPrimeSession` naming chosen here, so the leaf consumes them directly with no further `RpcDistRegistry` changes needed. Programmer-C will delete the legacy `DLRMRequestHandler` / `PageRankRequestHandler` / `AsyncPageRankRequestHandler` and the no-longer-needed thrift/CLI flags in Phase 6/3.
Five changes in `LeafNodeRank.cc`:
1. `SessionState` struct + per-`ThreadData` `folly::F14FastMap<int64_t, SessionState> sessions` map. The driver mints `query_id = (thread_id << 32) | session_counter` so all 4-6 inbound RPCs for one session land on the same RANKER worker — no lock needed, since each `ThreadData` is owned by a single dispatcher thread. `SessionState` carries `query_id`, `user_id`, `created_at_ns`, `session_id`, `mobile_app_version`, plus capped `stream_payloads` / `ifr_payloads` vectors so we pay the prod-equivalent memory cost without unbounded growth.
2. `CreateAndPrimeSessionRequestHandler` — synchronous on `ThriftSrv.IO`. Deserialize the typed request, mint a 32-char hex `session_id` via `makeSessionId(query_id, rng)`, insert into the per-thread `sessions` map, and return a 44 B `CreateAndPrimeSessionResponse`. No DLRM, no fanout, no compression. Latency naturally lands near the prod p50 of 3 ms (`rpc_dist.json`) from the deserialize + map insert; no artificial sleep.
3. `GetStoriesUncompressedRequestHandler` — async. Deserialize on `ThriftSrv.IO`, snapshot `mobile_app_version` into the session, then `folly::via(rankerPool)` to orchestrate. From the orchestrator we (a) `runFeatureExtraction(this_thread, story_contents)` on `GlobalCPUThread` via `folly::via(globalCpu)`, (b) `dlrmInferenceServerSide` on `GlobalCPUThread`, and (c) `issueOutboundFanout(this_thread, args.rpc_fanout_scale_arg)` on `SREventBase` (~94 RPCs at default scale=0.025). All three resolve via `folly::collectAll` before we build the response. New `generateGetStoriesResponse(query_id, num_stories, target_bytes, silesia, rng)` builds a `GetStoriesResponse` with ~100 `RankedStoryInfo`s padded with Silesia bytes so the serialized size hits the rpc_dist.json p50 of 171 KB (or the per-request sample when the inbound section is loaded). Compression runs on `GlobalCPUThread` via `serializeAndCompress`. The wire payload is the uncompressed serialized form (matching the "uncompressed" name of this method); the compressed bytes are computed for cost accounting only.
4. `GetAllStoriesRequestHandler` — async, mirrors handler 3 but with no DLRM or feature extraction (already paid by `getStoriesUncompressed` earlier in the session per researcher §4 row 3) and `issueOutboundFanout(this_thread, args.rpc_fanout_scale_arg * 0.5)` for half-scale fanout. New `generateGetAllStoriesResponse` produces ~900 `RankedStoryInfo`s targeting the rpc_dist.json p50 of 1.47 MB.
5. `StreamDataRequestHandler` — synchronous on `ThriftSrv.IO`. Deserialize, decompress `serialized_payload` (tolerates decompress failures by falling back to the raw bytes — both pay the size cost), append the decompressed bytes to `sessions[query_id].stream_payloads` (capped at 8 entries), and return a 4 B `StreamDataResponse{ack_code=0}`. `StreamIfrPriorityRankingRequestHandler` is async with an analogous shape: `folly::via(rankerPool)` -> parallel decompress on `GlobalCPUThread` + small fanout (`scale * 0.1`) on `SREventBase` -> `collectAll` -> stash bytes -> 4 B ack.
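The `query_id` layout from change 1 can be sketched as a small bit-packing example. This is an illustrative Python sketch, not the C++ in `LeafNodeRank.cc`; the helper names (`mint_query_id`, `owning_thread`, `counter_of`) and the assumption that both fields fit in 32 bits are hypothetical.

```python
MASK32 = (1 << 32) - 1  # assumption: thread id and counter each fit in 32 bits


def mint_query_id(thread_id: int, session_counter: int) -> int:
    """Pack thread_id into the high 32 bits and the counter into the low 32."""
    return ((thread_id & MASK32) << 32) | (session_counter & MASK32)


def owning_thread(query_id: int) -> int:
    """Recover the thread id, so every RPC of a session maps to one owner."""
    return (query_id >> 32) & MASK32


def counter_of(query_id: int) -> int:
    """Recover the per-thread session counter."""
    return query_id & MASK32
```

Because the owner is derived from the id alone, a per-thread map keyed by `query_id` needs no lock as long as each map is touched by exactly one thread.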
Thread-pool routing for handler 3 (`GetStoriesUncompressedRequestHandler`) mirrors `~/feedsim_v2/docs/phase6_researcher_notes.md` §4 (and the `multifeed_aggregator` strobelight breakdown):
```
ThriftSrv.IO --[deserialize, lookup session]-->
RANKER --[orchestrate]-->
├── GlobalCPUThread [runFeatureExtraction]
├── GlobalCPUThread [DLRM inference]
└── SREventBase [issueOutboundFanout (~94 RPCs)]
(await collectAll)
RANKER --[response struct, serialize]-->
GlobalCPUThread --[compress]-->
ThriftSrv.IO --[sendResponse]
```
Helper additions (anonymous namespace): `sendThriftResponse<T>(context, response)` (serialize+coalesce+sendResponse), `makeSessionId(query_id, rng)` (32-char hex), `inboundResponseSizeOrDefault(td, idx, fallback)` (samples `RpcDistRegistry::inboundResponseSize(idx)` when loaded, falls back to the prod p50 otherwise — works in OSS builds without an `rpc_dist.json`), `generateGetStoriesResponse` / `generateGetAllStoriesResponse` (build a typed response with Silesia-backed `story_payload` blobs sized to hit a target serialized byte count), and `serializeAndCompress<T>(resp)` (Thrift CompactSerializer + ManagedCompression ZSTD path migrated in Phase 5-C).
Deviation from spec: the `RpcDistRegistry::InboundIdx` enum + `inboundRequestSize` / `inboundResponseSize` / `inboundLatencyUs` accessors that the spec asks Programmer-B to add are already present in the head (D103795176) — Programmer-A added them and explicitly kept the `kCreateAndPrimeSession` naming convention specified for B. No additional `RpcDistRegistry` changes are made by this diff; the leaf consumes the existing accessors from D103795176 directly.
Legacy handlers (`DLRMRequestHandler`, `PageRankRequestHandler`, `AsyncPageRankRequestHandler`) and their `kPageRankRequestType` / `kDLRMRequestType` registrations are left in place per spec — Programmer-C deletes them in Phase 6/3.
Differential Revision: D103796241
Summary:
Adds a small thread-safe LatencyHistogram (log2 buckets, atomic counters) and instruments four hot paths so we can see whether issueOutboundFanout is actually parallelizing across srEventBasePool and how the mock_services-side per-request delay matches the requested latency:
1. LeafNodeRank::issueOutboundFanout: wall time from "start of fanout dispatch" to "all per-RPC futures resolved". If parallelism is healthy, this should be ≈ max(per-RPC dispatch latency).
2. LeafNodeRank per-dispatch round-trip: wall time of each td.mock_client->dispatchByEnum call from issue to its .thenValue/thenError continuation firing.
3. LeafNodeRank sampled latency: what we picked from rpc_dist.json's latency_us percentile sampler before passing to the RPC.
4. MockServiceHandler::runSimulatedRpc: both the requested latency_us (sampled by the leaf) and the actual elapsed wall time inside the handler, including spin/sleep + response generation.
Each binary's main() spawns a background thread that dumps all histograms to stderr every 10 seconds with avg / p50 / p95 / p99 (bucket upper-bound, log2) / max. A final dump runs when server.run() / serve() returns, so the last state is visible even if the periodic dump was mid-sleep.
Differential Revision: D105119227
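A log2-bucketed histogram of the kind described above can be sketched as follows. This is a minimal Python illustration, not the C++ `LatencyHistogram` from the diff: the class layout, the bucket count of 40, and the use of a lock in place of atomic counters are all assumptions.

```python
import threading


class LatencyHistogram:
    """Bucket b counts values whose bit length is b, i.e. [2^(b-1), 2^b)."""

    def __init__(self, num_buckets: int = 40):
        self._counts = [0] * num_buckets
        self._lock = threading.Lock()  # stands in for per-bucket atomics

    def record(self, micros: int) -> None:
        # Values too large for the table are clamped into the last bucket.
        bucket = min(max(micros, 0).bit_length(), len(self._counts) - 1)
        with self._lock:
            self._counts[bucket] += 1

    def percentile_upper_bound(self, p: float) -> int:
        """Exclusive upper bound (2^b) of the bucket containing percentile p."""
        with self._lock:
            counts = list(self._counts)
        total = sum(counts)
        if total == 0:
            return 0
        seen = 0
        for bucket, count in enumerate(counts):
            seen += count
            if seen >= p * total:
                return 1 << bucket
        return 1 << (len(counts) - 1)
```

Reporting the bucket upper bound means a printed p99 can overstate the true value by up to 2x, which is the usual trade-off for constant-size, lock-light recording.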
Summary:
The debug histograms added in the previous commit show that rpc_dist.json contains very long-tail per-call latencies (p99 ~8s, max ~28s on the BGM measurement). With the leaf fanning out N RPCs per request and waiting for collectAll to resolve, the slowest call dominates the fanout latency and tanks throughput even at the lowest reasonable fanout scale (39x slowdown observed at scale=0.001 vs no-mock_services baseline).
This commit adds three tunable knobs in mock_services that let us shape the simulated delay so the head-of-line blocking goes away while preserving the per-method latency MIX (which is what we want for uArch realism):
* --latency_cap_us (default 200000 = 200ms): clamp the requested per-call latency before any spin/sleep. The 200ms default is roughly 2x the 95p of the rpc_dist.json sampler, so the median and the long body of the distribution are unchanged but the multi-second tail is removed.
* --latency_offset_us (default 0): subtract a fixed amount from the requested latency to compensate for intrinsic RPC-stack overhead the client already pays for (the leaf's dispatch_per_rpc histogram shows a ~130ms p50 vs sampled p50 of 1ms, suggesting ~100ms of per-RPC stack overhead). Defaults to 0 so the shaping is purely a cap until we tune empirically.
* --latency_skip_threshold_us (default 100): skip spin and sleep entirely when the post-cap, post-offset latency falls below this value. Avoids paying for spin-wait jitter on requests whose modeled budget is already comparable to the natural RPC round trip.
The knobs are wired through run-feedsim-multi.sh as MOCK_LATENCY_CAP_US / MOCK_LATENCY_OFFSET_US / MOCK_LATENCY_SKIP_THRESHOLD_US env vars so the operator can sweep them without rebuilding. Also adds a g_handler_effective_us histogram so the dump shows the post-shaping latency alongside requested and actual.
Differential Revision: D105119224
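The three knobs compose as cap, then offset, then skip. The sketch below shows that ordering in Python; the function name `shape_latency_us` and the clamp to zero after the offset are assumptions, and only the defaults and the "post-cap, post-offset" ordering come from the commit message.

```python
def shape_latency_us(
    requested_us: int,
    cap_us: int = 200_000,
    offset_us: int = 0,
    skip_threshold_us: int = 100,
) -> int:
    """Return the delay to simulate, or 0 when spin/sleep should be skipped."""
    effective = min(requested_us, cap_us)      # --latency_cap_us
    effective = max(effective - offset_us, 0)  # --latency_offset_us (clamp is assumed)
    if effective < skip_threshold_us:          # --latency_skip_threshold_us
        return 0
    return effective
```

With the defaults, a 28 s tail sample is simulated as 200 ms, while a 50 µs sample costs nothing beyond the natural round trip.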
Summary:
The original rpc_dist.json (sampled from prod multifeed_aggregator) only carried percentiles up to p99 plus max. With per-method max latencies as high as 28 seconds (mock_handler_actual measurements), linear interpolation between p99 and max severely overstates the latency in the 99th-99.99th percentile band: for streamData, p99=6.8ms but max=26.5s, so a uniform sample at p=0.995 lands on ~13s instead of the actual ~10ms territory. The result is that the leaf-side fanout (issueOutboundFanout collectAll) ends up dominated by these inflated tail samples on a sizable fraction of requests, which is what the v1 mock_services experiments observed (fanout_total p50 ≤ 16.7s on BGM).
This commit:
* Replaces rpc_dist.json with rpc_dist_v2.json, which carries explicit p99_9 and p99_99 buckets for every distribution. The narrower interpolation bands give a much more faithful reproduction of the prod tail without being dominated by single-digit outlier samples.
* Adds "p99_9" (0.9990) and "p99_99" (0.9999) to PercentileSampler::kPercentiles() so the loader picks them up. Order is preserved ascending so std::lower_bound continues to work without re-sorting.
* Updates the doc comment to reflect the new keys and explain the rationale.
Differential Revision: D105119228
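The interpolation problem can be reproduced with a few lines of Python. This is a sketch of piecewise-linear inverse-CDF sampling under stated assumptions: `sample_latency` is a hypothetical stand-in for `PercentileSampler`, `bisect_left` plays the role of `std::lower_bound`, and the v2 knot values for p99_9 and p99_99 are invented for illustration (the real rpc_dist_v2.json values are not given here).

```python
import bisect


def sample_latency(percentiles, values, u):
    """Linearly interpolate the latency at quantile u between adjacent knots."""
    i = bisect.bisect_left(percentiles, u)
    if i == 0:
        return values[0]
    if i == len(percentiles):
        return values[-1]
    p0, p1 = percentiles[i - 1], percentiles[i]
    v0, v1 = values[i - 1], values[i]
    return v0 + (v1 - v0) * (u - p0) / (p1 - p0)


# v1 knots for streamData (ms): only p99 and max, as in the commit message.
v1 = sample_latency([0.99, 1.0], [6.8, 26_500.0], 0.995)

# v2 knots: the 12.0 and 40.0 values are hypothetical placeholders.
v2 = sample_latency([0.99, 0.999, 0.9999, 1.0], [6.8, 12.0, 40.0, 26_500.0], 0.995)
```

Here `v1` comes out near 13,253 ms, matching the ~13 s figure above, while `v2` stays between the p99 and p99_9 knots, because the sample at 0.995 no longer interpolates toward max.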
Summary: issueOutboundFanout previously rounded the per-method call count via std::max(1, round(perSessionCounts[i] * scale)). At the default --rpc_fanout_scale=0.025, methods with perSessionCounts() < 20 (the long-tail outbound RPCs production hits about once per 200 sessions) ended up appearing once per session. That inflated the per-method ratio of slow-but-infrequent methods relative to their production frequency, and the inflated tail samples dominated the aggregate fanout latency distribution observed by issueOutboundFanout's collectAll. Replace the floor with a clean drop: round to the nearest integer, then continue if the result is zero. Methods whose expected per-session count is below 0.5/scale (= 20 at default scale) are skipped entirely instead of being over-represented at one call per session. Differential Revision: D105119225
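The effect of dropping the floor can be checked with a small Python sketch. The function names are hypothetical; `round_half_up` is used because Python's built-in `round` rounds halves to even, unlike `std::round`.

```python
import math


def round_half_up(x: float) -> int:
    """Round a non-negative value to nearest, halves up (like std::round)."""
    return int(math.floor(x + 0.5))


def calls_with_floor(per_session_count: float, scale: float) -> int:
    """Old behaviour: every method is issued at least once per session."""
    return max(1, round_half_up(per_session_count * scale))


def calls_with_drop(per_session_count: float, scale: float) -> int:
    """New behaviour: a result of 0 means the method is skipped entirely."""
    return round_half_up(per_session_count * scale)
```

At scale=0.025 a method with a per-session count of 4 scales to 0.1 calls: the old rule issued it once per session (a 10x over-representation), the new rule skips it.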
Summary:
The srvIOThread pool was a synthetic CPU placeholder for outbound RPC work back when LeafNodeRank had no real outbound RPCs: each request would fan out N tasks onto srvIOThread, each of which generated a fake response, serialized it, ran ZSTD over half the chain, and discarded the result. With Phase 5 in place, the SREventBase pool now carries real outbound RPC fanout to mock_services (issueOutboundFanout), so the placeholder is redundant and was actively skewing the prod-vs-bench category breakdown by overweighting Compression (8-13% bench vs 3-5% prod).
This commit:
* Drops the srvIOThread pool construction, warmup, and the --srv_io_threads CLI flag.
* Removes the per-request throw-away generateResponse + serializePayload + compressThrift fanout from AsyncPageRankRequestHandler, DLRMRequestHandler (async chain), and PageRankRequestHandler (sync). The single response we actually send back to the driver is still generated and serialized, just once instead of srv_io_threads times.
* Removes the unused dispatcher-thread `compressed = compressPayload(...)` / `decompressPayload(compressed)` pair from the sync PageRankRequestHandler that was paired with the same throw-away pattern.
* Removes srvIOThreadPool from ThreadData and from both ThreadStartup overloads.
* Removes -s / --srv_io_threads handling from packages/feedsim/run.sh and the static --srv_io_threads=36 from start_leaf_node_rank.sh.
The SREventBase pool's comment is updated from "idle in Phase 4" to its actual role (carries outbound fanout), and the startup banner now lists 4 pools instead of 5.
Differential Revision: D105119222
Summary: packages/feedsim/run.sh now honors an RPC_FANOUT_SCALE env var. When set, the value is forwarded to LeafNodeRank as --rpc_fanout_scale=<value>, overriding the binary's compiled-in default of 0.025. Lets sweep scripts A/B different fanout intensities (e.g. 0.025 vs 0.05 to compare RPC overhead) without rebuilding LeafNodeRank. Differential Revision: D105119219
Summary:
DLRMRequestHandler used to call runFeatureExtraction synchronously on the dispatcher (ThriftSrv.IO) thread, blocking the IO loop for the duration of feature extraction. Worse, feedsim_autoscale_dlrm_mini didn't pass --feature-extractors at all, so on the fixed-QPS variant the call was a no-op anyway and FeatureExtraction's CPU share dropped to ~0% (vs production's 30-35%). The recent higher-QPS sweeps showed BGM at 80% idle past saturation — there's plenty of CPU room to host real feature extraction work, and getting the share up is required to match the production multifeed_aggregator microarchitectural profile.
This commit reshapes DLRMRequestHandler's pipeline:
* runFeatureExtraction is dispatched async via the same RANKER -> GlobalCPUThread pattern that GetStoriesUncompressedRequestHandler already uses (folly::via(srvCPUThreadPool, lambda { return folly::via(cpuThreadPool, runFeatureExtraction); })). RANKER orchestrates; GlobalCPUThread runs the generated vc_* extractor functions. The dispatcher returns immediately.
* Feature extraction starts at the head of the async chain in parallel with the DLRM inference future that was kicked off above. Both run on cpuThreadPool (folly::CPUThreadPoolExecutor is multi-threaded so the two stages overlap). After feature extraction completes, the chain waits for inference, then proceeds to simulateIoOrFanout / pointer chase / response generation.
* feedsim_autoscale_dlrm_mini now passes --feature-extractors --feature-complexity=5 --num-stories=100 --extractors-per-story=50 by default. That's 5000 extractor calls per request, the count tuned during Phase 2 to match production's instruction mix. If the resulting CPU share doesn't reach the 30-35% target, the knobs are tunable per-run via -i.
Differential Revision: D105119220
…stories, 8 inferences)
Summary:
The first sweep with async feature extraction (D...c58bc574d94c stack tip prior) confirmed the runFlatExtractors fanout reaches GlobalCPUThread correctly, but the dlrm_mini defaults left two large gaps vs production multifeed_aggregator's strobelight breakdown:

| Category          | Bench % non-idle (BGM q150 s05) | Prod target |
|-------------------|---------------------------------|-------------|
| DLRM-Inference    | 61.6%                           | 7-13%       |
| FeatureExtraction | 1.9%                            | 30-35%      |

Two knob changes get both metrics moving toward prod:
1. dlrm_inferences=64 -> 8. Shrinks the per-request DLRM matmul work 8x. Should drop the DLRM-Inference share from ~55-62% toward prod's 7-13% Ranking-Prediction share.
2. num_stories=100 -> 1800. With extractors_per_story=50 unchanged, that's 90K extractor calls/req (was 5K). Should push FeatureExtraction CPU share from 1.7-1.9% toward prod's 28-32% Ranking-FeatureExtraction share.
The system has plenty of CPU headroom (BGM 80% idle past saturation, CPL 60% idle), so adding 90K extractor calls/req shouldn't break throughput. Both knobs only affect feedsim_autoscale_dlrm_mini. The full feedsim_autoscale_dlrm job keeps its existing config. Knobs remain overridable per-run via -i.
Differential Revision: D105119226
Summary: EVENTBASE_THREADS_DEFAULT in run.sh has been hard-coded at 4 since the original oldisim-based feedsim. That made sense back when the only outbound work was a synthetic folly::futures::sleep — 4 EventBases handling sleep timers were plenty. After Phase 5 (mock_services + issueOutboundFanout) each dispatcher's outbound RPC fanout is pinned to a single MockServicesClient on a single EventBase, so the system effectively had only 4 active EventBases serving all outbound RPCs. With --rpc_fanout_scale=0.10 producing ~376 RPCs per request, those 4 EBs queue-saturate hard: dispatch_per_rpc averaged 753ms while mock_handler_actual averaged 4.2ms — the 749ms gap is pure EventBase queueing. The SREventBase pool is sized 0.7*nproc precisely to host the outbound fanout, but with only 4 dispatcher EventBases owning MockClient instances, only 4 of the 61 SREventBase threads on BGM see actual work. Raising EVENTBASE_THREADS_DEFAULT to nproc spreads dispatcher EventBases across all cores so the per-thread MockServicesClient instances also span the SREventBase pool, fixing the 4-vs-61 mismatch. This is also a long-standing flaw in the original feedsim that becomes visible only once outbound RPC work is realistic — keeping 4 was masking the bottleneck behind synthetic sleeps. Differential Revision: D105119223
…K=32) Summary: issueOutboundFanout used to issue all ~376 RPCs (at --rpc_fanout_scale=0.10) in a tight loop, queueing them all onto the per-thread MockServicesClient EventBase at once. Even with the io_threads=88 fix spreading dispatcher EBs across the SREventBase pool, each individual request still produced a 376-RPC burst on one EB. Debug histograms showed the consequence: dispatch_per_rpc averaged ~500ms while mock_handler_actual averaged ~4ms — the 496ms gap is pure EB queue depth. Switch issueOutboundFanout to folly::window(executor=srEventBasePool, specs, fn, K=32). window() issues K up-front and refills as each completes, capping per-request EB queue depth at K instead of N. To keep the windowed lambda thread-safe without taking RNG locks, the per-RPC sampling (req_size, resp_size, lat_us) and Silesia-padded request body construction now happens up-front on the dispatcher thread; the lambda only does the dispatchByEnum() call and the histogram bookkeeping. K=32 was picked to be small enough to keep the EB queue shallow (~150ms drain at 5ms/RPC) while still amortizing window's per-call coordination overhead. Differential Revision: D105119221
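The bounded-window idea can be illustrated with Python's asyncio. This is an analogy, not the `folly::window` implementation: a semaphore caps the number of in-flight calls at K, whereas `folly::window` issues K up front and refills on completion. `windowed_fanout`, `peak_in_flight`, and the fake RPC are all hypothetical names.

```python
import asyncio


async def windowed_fanout(specs, dispatch, window):
    """Run dispatch(spec) for every spec with at most `window` in flight."""
    sem = asyncio.Semaphore(window)

    async def run_one(spec):
        async with sem:
            return await dispatch(spec)

    return await asyncio.gather(*(run_one(s) for s in specs))


def peak_in_flight(num_rpcs, window):
    """Drive the fanout with a fake RPC and report the peak concurrency."""
    state = {"now": 0, "peak": 0}

    async def fake_rpc(spec):
        state["now"] += 1
        state["peak"] = max(state["peak"], state["now"])
        await asyncio.sleep(0)  # yield to the loop, as a real RPC would
        state["now"] -= 1
        return spec

    results = asyncio.run(windowed_fanout(range(num_rpcs), fake_rpc, window))
    return state["peak"], results
```

Calling `peak_in_flight(376, 32)` never reports more than 32 concurrent calls, which is the property that keeps the per-request queue depth at K instead of N.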
Summary:
Adds two env-gated knobs to FeedSimDriver.
DCPERF_DRIVER_QPS_TRACE=1 spawns a 1Hz trace thread that logs per-thread sent/completed/in_flight counters to /tmp/driver_qps_trace_<pid>.log (or DCPERF_DRIVER_QPS_TRACE_FILE if set). Used to visualize whether the driver is actually pacing smoothly, hitting the connection cap, or sitting on completion gaps. Writes only to the file — no stderr output (stderr would interleave with cout's fully-buffered `final requested_qps=X measured_qps=Y latency=Z` line and break the search_qps.sh regex parser).
DCPERF_DRIVER_INFLIGHT_CAP=N applies a per-thread soft cap on (sent - completed). When (in_flight >= cap), nextRequestCb skips the firing and re-arms the evtimer for the same delay. **Default 0 = disabled.**
**Important caveat for INFLIGHT_CAP:** empirical testing (2026-05-19 t18 sweep on BGM, 11 iters at qps={60,80,100,120,150}) shows that small cap values like 8 reduce total throughput by ~25% without any latency benefit. QPS-trace inspection shows the cap never actually triggers because the natural per-thread inflight ceiling is `num_conns × depth = 4`, well below cap=8. The throughput penalty appears to come from per-firing atomic-load overhead in nextRequestCb. **Recommendation: leave disabled (cap=0) by default; if used at all, set cap=32+ for explicit latency-shaping experiments only.**
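The gating rule behind the cap is simple enough to state as a predicate. A minimal Python sketch, assuming a hypothetical `should_fire` helper; the real check lives in `nextRequestCb`.

```python
def should_fire(sent: int, completed: int, cap: int) -> bool:
    """Return True when the driver may send another request on this thread."""
    if cap <= 0:  # cap=0 means the feature is disabled
        return True
    return (sent - completed) < cap
```

With a natural per-thread ceiling of 4 in-flight requests, `should_fire` is always true for cap=8, which is consistent with the trace showing the cap never triggering.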
Implementation:
- DriverStats: new completed_count_ counter, bumped in logResponse; new getSentCount() / getCompletedCount() accessors. Aligned uint64 reads from another thread are atomic on x86-64 / aarch64, so no extra synchronization needed for a 1Hz poll.
- FeedSimDriver::Impl: trace_thread + trace_running atomic; joined in shutdown() step 5 after libevent loop break.
Differential Revision: D105659811
Summary: Two small platform-tuning changes: 1. LeafNodeRank.cc: kOutboundFanoutWindow 32 → 16. With nproc-sized ThriftSrv.IO pool (D105119223) and per-MockClient EventBase placement, 32 was overbatching outbound RPCs per request and inflating dispatch_per_rpc tail latency on BGM/Turin (88-176 cores). 16 keeps a sane queue depth on big-SMT hosts without meaningful throughput loss on smaller boxes. 2. run.sh: DRIVER_THREADS unified at nproc/4 across SMT-on and SMT-off. Previously SMT-on used nproc/5, SMT-off used nproc/4 — an inconsistency that under-pinned driver work on big SMT hosts (e.g. BGM 176 logical: 35 SMT-on driver threads vs 44 SMT-off). Using nproc/4 in both branches removes one source of cross-platform variance. Differential Revision: D105659809
Summary:
Two layers are needed to prevent libtorch's thread-pool explosion under concurrent forward() invocations:
1. dlrm.cpp: at::set_num_interop_threads(1) called in the DLRM constructor BEFORE loadModel + warmup. Caps libtorch's native parallel backend pool.
2. run.sh: OMP_NUM_THREADS=1 added to the leaf launch env. Caps libtorch's OpenMP parallel backend pool; our internal libtorch build uses OpenMP for tensor ops, which at::set_num_interop_threads does NOT cover.
Also: per-thread JIT Module clone in dlrm.cpp. The shared pimpl_->model.forward() was racing under concurrent invocation from multiple GlobalCPUThread workers, producing SIGSEGV in je_large_dalloc → torch::autograd::autogradNotImplementedFallbackImpl → at::arange → JIT interpreter. Each ThreadState now owns a deep clone (via Module::clone()), so concurrent forward() touches disjoint interpreter state.
Without these three fixes, on BGM (176 logical cores) the leaf process accumulated 30,976 = nproc^2 threads named "GlobalCPUThread" (folly NamedThreadFactory pool name, comm-truncated to 15 chars), all stuck in __futex_wait, eventually triggering kernel-scheduler thrash and the cascading deadlock that pinned the driver's inflight count at the connection cap (sent_qps=0 forever).
Results:
- t15 measured per-instance thread count drop: 30,976 → 88.
- t17 confirmed zero SIGSEGV across 16 iters at qps=80 (vs 6/16 crashes without the Module clone).
- Multi-iter qps=80 stability went from 1/4 balanced to 2/4 (rtptest3440 s=0.025) and 4/4 (rtptest3424 s=0.10).
Differential Revision: D105659812
Summary:
Prior: run-feedsim-multi.sh started ONE mock_services on port 21222 that all colocated LeafNodeRank instances shared. Two side effects:
1. All outbound RPC fanout from every feedsim instance funnels to one mock_services process, putting all instances on a shared contention queue.
2. mock_services threads ran without CPU affinity, competing freely with the tasksetted feedsim instances for CPU scheduler decisions.
Now: one mock_services per feedsim instance:
- port = MOCK_SERVICES_PORT_BASE + (instance_index - 1), so instance i talks to its own mock on 21222 + (i-1).
- tasksetted to the SAME CPU range as its feedsim instance, so the two processes share L1/L2/L3 + memory bandwidth but DON'T share queue depth with the other instance's mock_services.
- mock_io_threads = cores_per_instance (computed from CORE_RANGE) instead of full nproc, sized to its instance's CPU share.
run.sh threads MOCK_SERVICES_PORT through to LeafNodeRank's --mock_services_port. Defaults to 21222 for back-compat with single-instance manual runs and the existing per-instance default.
Trade-off: 2x mock_services memory (Silesia corpus loaded twice on a 2-instance host). On BGM 251GB this is negligible. The benefit is cross-instance interference isolation, which eliminates one of the shared-resource hypotheses for the per-iter QPS imbalance observed in the t4 sweep.
Differential Revision: D105659810
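The per-instance port and thread sizing can be sketched in Python. The helper names are hypothetical, and the taskset-style `CORE_RANGE` format (comma-separated ranges such as `0-43,88-131`) is an assumption; the actual script is shell.

```python
def mock_services_port(instance_index: int, port_base: int = 21222) -> int:
    """Instance i (1-based) gets its own mock_services port."""
    return port_base + (instance_index - 1)


def cores_in_range(core_range: str) -> int:
    """Count CPUs in a taskset-style list, e.g. '0-43,88-131' -> 88."""
    total = 0
    for part in core_range.split(","):
        lo, _, hi = part.partition("-")
        total += int(hi or lo) - int(lo) + 1  # a bare '5' counts as one core
    return total
```

Sizing `mock_io_threads` from `cores_in_range` rather than from nproc keeps each mock_services proportional to the CPU share it is pinned to.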
…stprocessing logs
Summary:
Three related changes so perfpub can identify the actual benchmarking phase and report metrics filtered to that window (instead of averaging across warmup + drain + idle periods).
1. jobs.yml: add `benchmarks/feedsim/breakdown.csv` to the copymove `after:` list for feedsim_autoscale_dlrm. The other feedsim jobs (search-mode, mini, etc.) already had it; only the DLRM autoscale variant was missing. With this, perfpub's automatic breakdown-based metric filtering kicks in for `./perfpub --no-xdb --no-manifold --dir benchmark_metrics_<id>` (see perfpub/README.md "Automatic Metric Filtering with breakdown.csv").
2. run.sh: add preprocessing/postprocessing log entries around the existing search_qps.sh main_benchmark window. search_qps.sh already logs main_benchmark start/end around `sleep $experiment_time` (the actual measurement window); this commit adds the surrounding phase info so the full timeline is visible in breakdown.csv:
   - preprocessing = run.sh start → driver launch (server bring-up, graph/model load, warmup)
   - main_benchmark = experiment_time window (unchanged)
   - postprocessing = driver exit → run.sh exit (queue drain, leaf shutdown)
   perfpub uses only main_benchmark for default filtering; pre/post are informational for humans inspecting the CSV.
3. runtime_breakdown_utils.sh: make `create_breakdown_csv` idempotent (skip if file exists). Without this, multi-instance feedsim runs (run-feedsim-multi.sh launching N instances of run.sh) race on truncating the CSV: the second instance's create_breakdown_csv call would silently drop the first instance's already-logged entries. Verified t20 q=100/q=120 runs now record both inst1+inst2 preprocessing/main_benchmark/postprocessing entries cleanly. benchpress's copymove hook (is_move: true) moves the file out between iterations, so stale data across runs is not a concern.
Verified on rtptest3440 BGM (qps=100, 120 each at 180s/300s):
- breakdown.csv lands in benchmark_metrics_<id>/ after each run
- `./perfpub --no-xdb --no-manifold --dir benchmark_metrics_<id>` succeeds and emits overall-metrics.csv with the breakdown-filtered window
- All 4 expected operation entries present per instance: preprocessing start/end, main_benchmark start/end, postprocessing start/end (no race-induced drops)
Differential Revision: D105756766
Summary: Each LeafNodeRank dispatcher thread now holds one MockServicesClient per SREventBase (was: one client total, pinned to a single EB). issueOutboundFanout round-robins RPCs across the vector via an atomic counter, so outbound fanout spreads across every EB in the pool instead of funneling through a single Rocket channel. Experimental patch to test the hypothesis that the single-channel-per-dispatcher design is the root cause of the Grace 380ms dispatch_per_rpc vs 4ms mock_handler_actual gap and BGM's low steady-state CPU utilization. If dispatch_per_rpc collapses toward mock_handler_actual after this change, the bottleneck is confirmed as per-channel serialization, not thread count. Differential Revision: D105903218
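Round-robin selection over a vector of clients can be sketched as below. This is a Python illustration with hypothetical names; `itertools.count` stands in for the atomic counter, and the string labels stand in for `MockServicesClient` instances.

```python
import itertools


class RoundRobinClients:
    """Hand out clients in rotation so RPCs spread across every EventBase."""

    def __init__(self, clients):
        self._clients = list(clients)
        self._next = itertools.count()  # stands in for an atomic counter

    def pick(self):
        return self._clients[next(self._next) % len(self._clients)]
```

For example, `RoundRobinClients(["eb0", "eb1", "eb2"])` yields `eb0, eb1, eb2, eb0, ...` on successive `pick()` calls, so no single channel serializes the whole fanout.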
…arch#740)
Summary:
Pull Request resolved: facebookresearch#740
Adds a per-MockServicesClient keepalive timer that fires a fire-and-forget `getStatus()` probe RPC every N milliseconds, keeping Rocket channels warm between sparse session bursts.
After D105903218 distributed outbound RPCs across one MockServicesClient per SREventBase (123 channels on BGM), the t25 QPS-latency sweep surfaced a severe cold-channel anti-pattern: p95 latency at low offered load was up to 14× *worse* than at peak load on BGM (9,488 ms at q=5/inst vs 684 ms at q=35/inst), with throughput collapsing to 31% of requested (1.55 of 5 QPS). Grace and CPL showed milder 2.5-2.8× cliffs.
Root cause (confirmed via perf-record diff at q=5 vs q=35 on BGM): at low offered load, ≈1.5 sessions in-flight × K=16 fanout = 24 RPCs spread across 123 channels, so most channels idle for 100-300 ms between bursts. Each cold RPC then pays three compounding penalties:
- Rocket channel re-arm (+2.4pp Folly, +1.9pp RPC-AsyncIO at q=5);
- deep C-state wake on idle SREventBase cores (Zen4c C6 wake ≈100 µs; `poll_idle` + `acpi_processor_ffh_cstate_enter` in top 14 hot functions);
- cold allocator/JIT caches (+4.1pp MemAlloc, +2.0pp JIT-Unresolved).
All three penalties share the same proximate cause: channel idle time > 100 ms. Touching every channel every 150 ms via a cheap getStatus() probe eliminates all three penalties simultaneously. The probe runs on the channel's own EventBase (thread-affine) and is fire-and-forget (drops the returned SemiFuture); transport errors are swallowed silently to avoid keepalive failures propagating to the application path.
Bandwidth cost: ~13K pings/sec/instance × ~100 bytes round-trip = ~1.3 MB/s, which is negligible. mock_services CPU cost ≈0.07 cores per instance.
Knobs:
- New CLI flag `--mock_keepalive_interval_ms` on LeafNodeRank (default 0 = disabled so the anti-pattern stays observable for regression testing).
- New `MOCK_KEEPALIVE_INTERVAL_MS` env override in run.sh so sweep scripts can A/B without rebuilding.
- Recommended starting value: 150 ms (validated). 300-500 ms likely sufficient; tunable.
Reviewed By: YifanYuan3
Differential Revision: D106558423
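The figures quoted in this summary can be re-derived with a few lines of Python. The function names are hypothetical; the inputs (1.5 sessions, K=16, 123 channels, ~13K pings/sec, ~100 bytes) are the ones stated above.

```python
def in_flight_rpcs(sessions_in_flight: float, window: int) -> float:
    """RPCs concurrently outstanding for a given session count and window K."""
    return sessions_in_flight * window


def busy_channel_fraction(in_flight: float, channels: int) -> float:
    """Upper bound on the share of channels carrying an RPC at any instant."""
    return min(in_flight, channels) / channels


def keepalive_bandwidth_mb_s(pings_per_sec: float, bytes_per_ping: float) -> float:
    """Keepalive traffic in MB/s (1 MB = 1e6 bytes)."""
    return pings_per_sec * bytes_per_ping / 1e6
```

At q=5 at most 24 of 123 channels (about 20%) can be busy at once, so roughly four in five channels sit idle, while the probe traffic that keeps them warm costs about 1.3 MB/s.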
Summary:
Adds a per-MockServicesClient keepalive timer that fires a fire-and-forget `getStatus()` probe RPC every N milliseconds, keeping Rocket channels warm between sparse session bursts.
After D105903218 distributed outbound RPCs across one MockServicesClient per SREventBase (123 channels on BGM), the t25 QPS-latency sweep surfaced a severe cold-channel anti-pattern: p95 latency at low offered load was up to 14× worse than at peak load on BGM (9,488 ms at q=5/inst vs 684 ms at q=35/inst), with throughput collapsing to 31% of requested (1.55 of 5 QPS). Grace and CPL showed milder 2.5-2.8× cliffs.
Root cause (confirmed via perf-record diff at q=5 vs q=35 on BGM): at low offered load, ≈1.5 sessions in-flight × K=16 fanout = 24 RPCs spread across 123 channels, so most channels idle for 100-300 ms between bursts. Each cold RPC then pays three compounding penalties: Rocket channel re-arm (+2.4pp Folly, +1.9pp RPC-AsyncIO at q=5), deep C-state wake on idle SREventBase cores (Zen4c C6 wake ≈100 µs; `poll_idle` + `acpi_processor_ffh_cstate_enter` in top 14 hot functions), and cold allocator/JIT caches (+4.1pp MemAlloc, +2.0pp JIT-Unresolved). All three penalties share the same proximate cause: channel idle time > 100 ms.
Touching every channel every 150 ms via a cheap getStatus() probe eliminates all three penalties simultaneously. The probe runs on the channel's own EventBase (thread-affine) and is fire-and-forget (drops the returned SemiFuture); transport errors are swallowed silently to avoid keepalive failures propagating to the application path.
Bandwidth cost: ~13K pings/sec/instance × ~100 bytes round-trip = ~1.3 MB/s, which is negligible. mock_services CPU cost ≈0.07 cores per instance.
Knobs:
- `--mock_keepalive_interval_ms` on LeafNodeRank (default 0 = disabled so the anti-pattern stays observable for regression testing).
- `MOCK_KEEPALIVE_INTERVAL_MS` env override in run.sh so sweep scripts can A/B without rebuilding.
Reviewed By: YifanYuan3
Differential Revision: D106558423