race/fsdp: real transformer compute + observable per-layer checksum detector #211
Merged
…etector

Builds on PR #210 (shared-weight transformer + int16 checksums). Three changes:

1. compute_type=transformer now runs a REAL transformer block borrowed from aorta.models.RepeatedTransformerBlock (the model llm_determinism uses: real MHA + GLU FFN + LayerNorm) instead of a single gelu(W @ x) matmul. One shared block (fixed seed, rank-identical via fork_rng) + a fixed 3D reference input keeps every layer's output byte-identical, preserving the per-layer checksum invariant while supplying real-model L2/HBM pressure. Explicit all_gather/reduce_scatter (the AINIC data path) are untouched.
2. Observability: a clean run previously emitted no checksum signal, so green was indistinguishable from a no-op. Track layers_verified and layer_checksum_mismatches and surface them (plus compute_type) in WorkloadResult.metrics; the FSDP startup log now names the active compute path so a silent GEMM fallback is greppable.
3. Validation: reject unknown compute_type values (a typo like "transfomer" used to silently fall back to GEMM = false green) and warn when shared_layer_weights is set without compute_type=transformer.

New config fields: num_heads, ffn_size, seq_len, batch_size.

Tests (CPU-only, no GPU/dist): test_race_transformer_smoke.py drives the real block + checksum verifier end to end (clean pass + injected-corruption catch); test_race_checksums.py covers the verifier; test_race.py adds compute_type validation cases. 15 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
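The validation rule in change 3 of the commit message above can be illustrated with a small stdlib-only sketch. The helper name `check_compute_config` and the warning text are hypothetical, not taken from the PR; only the two compute types (`gemm`, `transformer`) and the reject/warn behavior come from the description.

```python
import warnings

# The two compute types named in this PR; treated here as a fixed set (assumption).
VALID_COMPUTE_TYPES = {"gemm", "transformer"}


def check_compute_config(compute_type: str, shared_layer_weights: bool) -> None:
    """Hypothetical helper: reject unknown compute types, warn on inert combos."""
    if compute_type not in VALID_COMPUTE_TYPES:
        # A typo such as "transfomer" must fail loudly instead of falling back.
        raise ValueError(
            f"compute_type must be one of {sorted(VALID_COMPUTE_TYPES)}, "
            f"got {compute_type!r}"
        )
    if shared_layer_weights and compute_type != "transformer":
        # The flag is accepted but does nothing on the GEMM path, so say so.
        warnings.warn(
            "shared_layer_weights has no effect unless compute_type='transformer'",
            stacklevel=2,
        )
```

Raising on the typo and only warning on the inert combination mirrors the distinction the commit draws: the first case would produce a false green, the second merely a setting that does nothing.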
Module docstring note: mode=fsdp simulates the FSDP comm pattern with explicit all_gather/reduce_scatter; it does not use FSDP1/FSDP2. RepeatedTransformerBlock is reused as a compute kernel only, not under fully_shard. Pre-empts the "why not FSDP2 like llm_determinism?" question; real-FSDP is the deferred PR D. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Pull request overview
Updates the race workload’s simulated FSDP mode so compute_type: transformer can run a real transformer block (via aorta.models.RepeatedTransformerBlock), and makes the per-layer checksum detector observable and safer to configure (validation + metrics). This strengthens the AINIC reproducer by ensuring “green” runs prove the detector executed and by increasing compute realism/pressure without switching to real FSDP.
Changes:
- Implement real transformer-block compute for the shared-weight transformer path in `FSDPModeReproducer`, plus checksum observability counters.
- Validate `compute_type` (reject typos) and surface `compute_type` / checksum verification counters into `WorkloadResult.metrics`.
- Add CPU-only unit + smoke tests covering config validation and per-layer checksum verification behavior.
Reviewed changes
Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| `src/aorta/race/modes/fsdp.py` | Runs a real `RepeatedTransformerBlock` on the shared transformer path; records and verifies per-layer checksums; logs active compute path. |
| `src/aorta/race/config.py` | Adds transformer-shape knobs (`num_heads`, `ffn_size`, `seq_len`, `batch_size`) and result counters for detector observability. |
| `src/aorta/race/base.py` | Adds dtype aliases and plumbs checksum observability counters into `ReproducerResult`. |
| `src/aorta/workloads/race.py` | Validates `compute_type`, warns on inert config combos, and exports detector metrics via `WorkloadResult.metrics`. |
| `tests/workloads/test_race.py` | Adds tests for `compute_type` validation and warning behavior. |
| `tests/workloads/test_race_checksums.py` | Unit tests for `_verify_layer_checksums` (clean pass + corruption localization + edge cases). |
| `tests/workloads/test_race_transformer_smoke.py` | CPU smoke test that executes a real transformer block and exercises the checksum verifier end-to-end. |
| `recipes/ainic-gdr-flush-sdc.yaml` | Switches the AINIC recipe to `compute_type: transformer` + shared weights for stronger checksum invariants. |
| `docs/design/race-transformer-compute-support.md` | Adds a design/plan document describing the intended transformer compute + observability approach. |
Comment on lines +187 to +191

    with torch.random.fork_rng(devices=["cuda"]):
        torch.cuda.manual_seed(0)
        self.shared_block = (
            RepeatedTransformerBlock(block_cfg).to("cuda").to(self.dtype)
        )
Comment on lines +184 to +190

    Use a single shared weight matrix for all layers (transformer compute type).

    When True, all num_layers layers use the same weight matrix initialized with
    a fixed seed, and each layer receives the same fixed reference input rather
    than chaining activations. This makes per-layer forward outputs analytically
    identical so that cross-layer activation comparison can serve as a secondary
    corruption signal: any all_gather corruption that propagates through GEMM
Comment on lines +459 to +464

    Four checksums per layer:
    comm_input     -- param shard before all_gather (should be identical:
                      every shard is filled with float(rank))
    comm_output    -- full_param after all_gather
    compute_input  -- reference_input fed to GEMM (constant across layers)
    compute_output -- activation after GELU
Comment on lines +100 to +102

    # and each layer independently receives the same fixed reference input (seed=1)
    # rather than chaining activations layer-to-layer. This makes every layer's
    # forward output analytically identical so _verify_layer_activations() can serve
Comment on lines +1 to +5

    # Plan: Fully support `compute_type: transformer` + shared-weight checksums (PR #210)

    **Branch:** `users/oyazdanb/race-transformer-compute` (off `main`)
    **Status:** DRAFT, for review before implementation
    **Related:** PR #210 (`users/mycpuorg/shared-weight-transformer-checksums`), task `race__cross-rank-and-iter0-detection.md`
num_heads and ffn_size auto-derive when 0 (model_dim//128, model_dim*4), so the config value alone doesn't record what actually ran. Capture the resolved values and surface them in the FSDP startup log (resolved_shape=...) and in WorkloadResult.metrics (eff_num_heads/eff_ffn_size/eff_seq_len/eff_batch_size), so the transformer shape a run used is provable from the result JSON. getattr in base.run keeps it safe for non-FSDP modes. Adds an auto-derive unit test. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
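As a rough illustration of the auto-derive rule in this commit message, the sketch below resolves a zero-valued `num_heads` or `ffn_size` to `model_dim // 128` and `model_dim * 4`. The function name `resolve_shape` is hypothetical; the `eff_*` key names follow the metric names quoted in the commit, and the actual implementation in the PR may differ.

```python
def resolve_shape(model_dim: int, num_heads: int = 0, ffn_size: int = 0) -> dict:
    """Return the effective transformer shape, deriving any field left at 0."""
    # 0 means "not set": fall back to the derivation stated in the commit message.
    eff_num_heads = num_heads if num_heads > 0 else model_dim // 128
    eff_ffn_size = ffn_size if ffn_size > 0 else model_dim * 4
    return {"eff_num_heads": eff_num_heads, "eff_ffn_size": eff_ffn_size}
```

For example, `resolve_shape(4096)` yields 32 heads and an FFN size of 16384, while explicitly configured values pass through unchanged. Recording the resolved values rather than the configured zeros is what lets the result JSON show the shape that actually ran.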
- fsdp.py: also torch.manual_seed(0) inside fork_rng. RepeatedTransformerBlock
inits params on CPU before .to("cuda"), so seeding only the CUDA RNG left the
shared block's weights dependent on each rank's CPU RNG state -- ranks could
diverge while the intra-rank per-layer checksum still passed (false green).
Seeding CPU RNG too restores the rank-identical invariant the comment claims.
- config.py / fsdp.py / recipe: update shared_layer_weights + _verify_layer_checksums
docstrings and the recipe comment that still described the old GEMM (gelu(W@x))
path and the wrong function name (_verify_layer_activations); the shared path
now runs a real RepeatedTransformerBlock.
- remove docs/design/race-transformer-compute-support.md from the PR (it was a
pre-implementation draft; the PR description carries the summary).
Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
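The seeding fix in the first bullet comes down to one principle: inside the forked RNG scope, seed the generator that actually draws the initial values. The sketch below is a stdlib-only analog using Python's `random` module, not torch; `fork_rng` and `build_shared_weights` here are hypothetical stand-ins for `torch.random.fork_rng` and the shared-block construction.

```python
import random
from contextlib import contextmanager


@contextmanager
def fork_rng():
    """Save the global RNG state, yield, then restore it (stdlib analog)."""
    state = random.getstate()
    try:
        yield
    finally:
        random.setstate(state)


def build_shared_weights(n: int, seed: int = 0) -> list:
    """Draw n weights from a fixed seed without disturbing the caller's RNG."""
    with fork_rng():
        # Seed the generator that is actually used for initialization.
        # Seeding a different generator would leave the draws dependent on
        # whatever ambient state each caller ("rank") happens to have.
        random.seed(seed)
        return [random.random() for _ in range(n)]
```

Two callers with different ambient RNG state get identical weights, and their own random streams continue as if the call had never happened. In the PR's setting the equivalent step is calling `torch.manual_seed(0)` alongside `torch.cuda.manual_seed(0)` inside `fork_rng`, because the block's parameters are initialized on CPU.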
Comment on lines +505 to +525

    # Count every cross-layer comparison so a clean (green) run still
    # proves the detector ran: layers_verified > 0.
    self.layers_verified += 1
    for key in ("comm_input", "comm_output", "compute_input", "compute_output"):
        if cmp[key] != ref[key]:
            log.error(
                f"LAYER_CHECKSUM_MISMATCH ({key}): "
                f"rank={self.rank} iter={iteration} "
                f"layer_0={ref[key]} layer_{i}={cmp[key]}"
            )
            self.corruption_details.append({
                "type": f"layer_checksum_mismatch_{key}",
                "rank": self.rank,
                "iteration": iteration,
                "layer_ref": 0,
                "layer_cmp": i,
                "ref_checksum": ref[key],
                "cmp_checksum": cmp[key],
            })
            self.layer_checksum_mismatches += 1
            all_correct = False
Comment on lines +283 to +287

    Shared-weight path: every layer receives the same fixed reference_input
    so that outputs are analytically identical. Input/output checksums are
    recorded for both the comm kernel (all_gather) and the compute kernel
    (GEMM + GELU) so _verify_layer_checksums() can pinpoint whether
    corruption entered during communication or compute.
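One way to read the localization idea in the docstring above: walk the four checksums in pipeline order and report the first stage where a layer disagrees with the reference layer. The sketch below is an interpretation under that assumption; the `localize` helper and the stage labels are hypothetical, and only the four key names come from the PR.

```python
# Checksum keys in the order data flows through a layer (names from the PR).
CHECKSUM_KEYS = ("comm_input", "comm_output", "compute_input", "compute_output")

# Human-readable stage labels; these labels are illustrative assumptions.
STAGES = {
    "comm_input": "shard before all_gather",
    "comm_output": "communication (all_gather)",
    "compute_input": "reference input",
    "compute_output": "compute",
}


def localize(ref: dict, cmp: dict) -> str:
    """Name the earliest stage whose checksum differs from the reference layer."""
    for key in CHECKSUM_KEYS:
        if cmp[key] != ref[key]:
            return STAGES[key]
    return "clean"
```

Reporting the earliest differing stage matters because corruption introduced during communication may also change the compute output; the first mismatch in pipeline order is the better hint at where it entered.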
Comment on lines 223 to 234

    @@ -150,11 +233,23 @@ def setup_buffers(self) -> None:
    dtype=self.dtype, device="cuda",
    )
Comment on lines 164 to +168

    if cfg.simulate_compute:
    dim = self._dim
    self.weight_matrices = [
    torch.randn(
    dim, dim,
    dtype=self.dtype, device="cuda",
    use_shared = (
    cfg.shared_layer_weights and cfg.compute_type == "transformer"
    )
… doc)

- layer_checksum_mismatches now counts once per CORRUPTED LAYER, not once per checksum key. A single bad layer could previously inflate it up to 4x (comm_in/comm_out/compute_in/compute_out), contradicting the per-layer docstring. corruption_details still records every key for localization.
- Validate compute_type in ReproducerConfig.__post_init__ so EVERY entry point is covered (the aorta.race CLI / direct construction), not just the RaceWorkload adapter -- a typo like "transfomer" now raises instead of a silent GEMM false-green.
- Skip the unused dim x dim activation/grad_buffer allocations on the shared-weight transformer path (forward sets activation to the block output; backward re-runs the block) -- avoids needless GPU memory / OOM risk at large model_dim. GEMM/chained path keeps them.
- Fix the _forward_layer docstring that still said "GEMM + GELU" on the shared path (now runs RepeatedTransformerBlock).

Tests: per-layer-count (4 keys bad -> counter==1), direct-construction compute_type rejection. 18 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
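The per-layer counting rule from the first bullet can be sketched as follows. This is a standalone approximation, not the PR's `_verify_layer_checksums`; the function name `verify_layers` and the returned dict layout are hypothetical.

```python
KEYS = ("comm_input", "comm_output", "compute_input", "compute_output")


def verify_layers(layer_checksums: list) -> dict:
    """Compare every layer against layer 0, counting each corrupted layer once."""
    result = {"layers_verified": 0, "layer_checksum_mismatches": 0, "details": []}
    if not layer_checksums:
        return result
    ref = layer_checksums[0]
    for i, cmp in enumerate(layer_checksums[1:], start=1):
        # Every comparison is counted, so a clean run still shows the detector ran.
        result["layers_verified"] += 1
        bad_keys = [key for key in KEYS if cmp[key] != ref[key]]
        for key in bad_keys:
            # Keep one detail record per key so corruption can still be localized.
            result["details"].append({"layer_ref": 0, "layer_cmp": i, "key": key})
        if bad_keys:
            # One increment per corrupted layer, regardless of how many keys differ.
            result["layer_checksum_mismatches"] += 1
    return result
```

With three layers where only the last one differs on all four keys, this reports two layers verified, one mismatch, and four detail records, which matches the "4 keys bad -> counter==1" test described in the commit.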
Comment on lines +294 to +304

    def __post_init__(self) -> None:
        # Validate here (not only in the RaceWorkload adapter) so EVERY entry
        # point is covered -- the aorta.race CLI and any direct reproducer
        # construction. A typo like "transfomer" must error, not silently fall
        # back to the GEMM path (false green).
        valid_compute = {"gemm", "transformer"}
        if self.compute_type not in valid_compute:
            raise ValueError(
                f"compute_type must be one of {sorted(valid_compute)}, "
                f"got {self.compute_type!r}"
            )
Comment on lines +268 to +277

    def _checksum(tensor: torch.Tensor) -> int:
        """
        Bitwise checksum: reinterpret-cast to int16 and sum.

        bf16 (or any 16-bit dtype) is viewed as int16 so every bit pattern
        contributes to the checksum with zero information loss -- no float
        rounding, no abs(), and NaN / denorm bit patterns are included.
        Accumulation is done in int64 to avoid overflow.
        """
        return tensor.view(torch.int16).to(torch.int64).sum().item()
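The reinterpret-and-sum idea in `_checksum` can be shown without torch. The sketch below works on raw bytes and takes the element width as a parameter; `bit_checksum` is a hypothetical name, little-endian layout is an assumption, and fp16 (via `struct`'s `e` format) stands in for bf16, which the standard library cannot pack.

```python
import struct


def bit_checksum(raw: bytes, itemsize: int = 2) -> int:
    """Sum the buffer's elements reinterpreted as signed little-endian integers."""
    if len(raw) % itemsize:
        raise ValueError("buffer length is not a multiple of itemsize")
    # Python ints are arbitrary precision, so the accumulation cannot overflow.
    return sum(
        int.from_bytes(raw[i:i + itemsize], "little", signed=True)
        for i in range(0, len(raw), itemsize)
    )


# Two fp16 values viewed as 16-bit integers: every bit pattern contributes.
halves = struct.pack("<2e", 1.0, -1.0)
# A single flipped bit in the first element changes the checksum.
flipped = bytes([halves[0] ^ 0x01]) + halves[1:]
```

Because the values are never interpreted as floats, a NaN or denormal contributes its bit pattern like any other element. A plain sum is order-insensitive and two offsetting changes can cancel, so it detects a single corrupted element reliably but is not a cryptographic digest.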
Comment on lines +166 to +168

    use_shared = (
        cfg.shared_layer_weights and cfg.compute_type == "transformer"
    )
Comment on lines +54 to +60

    """Per-layer forward + 4 checksums, exactly like _forward_layer's shared path."""
    layer_checksums = []
    for _ in range(NUM_LAYERS):
        comm_input = FSDPModeReproducer._checksum(reference_input)
        comm_output = comm_input  # no real all_gather on CPU; identical by construction
        compute_input = FSDPModeReproducer._checksum(reference_input)
        with torch.no_grad():
…istry, test)

- _checksum is now element-size aware: 2-byte dtypes -> int16, 4-byte (fp32) -> int32, etc. Previously hard-coded torch.int16, which would crash/miscount on float32 (an allowed dtype). bf16 path unchanged.
- compute_type='transformer' with shared_layer_weights=False now warns loudly that it runs the GEMM path (only the shared transformer path is implemented) -- no more silent transformer->GEMM fallback.
- ReproducerConfig.__post_init__ validates compute_type against the pluggable COMPUTE_REGISTRY (single source of truth; "transformer" is already registered) instead of a hard-coded set, so custom register_compute() backends stay valid. Lazy import avoids a circular import (compute.py only TYPE_CHECKING-imports config).
- Smoke test: comm_input/comm_output checksums now use distinct rank-fill shard / all_gather-result buffers (mirroring real _forward_layer) instead of reference_input, so the test actually exercises the comm path. Added a multi-dtype _checksum test (bf16/fp16/fp32).

19 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
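The registry-based validation in the third bullet might look roughly like the sketch below. `COMPUTE_REGISTRY` and `register_compute` are names from the commit message, but their real signatures are not shown in this PR page, so the decorator form, the placeholder backends, and the `ReproducerConfigSketch` class are all assumptions.

```python
from dataclasses import dataclass

# Single source of truth for valid compute types (structure assumed).
COMPUTE_REGISTRY: dict = {}


def register_compute(name: str):
    """Decorator that records a compute backend under `name` (assumed form)."""
    def decorator(fn):
        COMPUTE_REGISTRY[name] = fn
        return fn
    return decorator


@register_compute("gemm")
def gemm_compute(x):
    return x  # placeholder body; the real backend is not shown here


@register_compute("transformer")
def transformer_compute(x):
    return x  # placeholder body; the real backend is not shown here


@dataclass
class ReproducerConfigSketch:
    compute_type: str = "gemm"

    def __post_init__(self) -> None:
        # Validate against the registry so custom backends stay valid.
        if self.compute_type not in COMPUTE_REGISTRY:
            raise ValueError(
                f"compute_type must be one of {sorted(COMPUTE_REGISTRY)}, "
                f"got {self.compute_type!r}"
            )
```

Checking membership at construction time means a backend registered later by user code is accepted without touching the config class, which a hard-coded set could not offer.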
Summary
Builds on #210 (shared-weight transformer + int16 per-layer checksums) and makes its `compute_type: transformer` path actually exercise a real transformer, observable, and validated. This branch was cut from #210's branch, so its diff includes #210's commits; it is intended to supersede / land on top of #210 (coordinate with @mycpuorg on merge order).

Three changes:

1. Real transformer compute. `compute_type: transformer` previously did a single `gelu(weight_matrix @ reference_input)` matmul, which is not a transformer and took ~0.0 ms/step (no real memory pressure). It now runs a real `RepeatedTransformerBlock` (the model `llm_determinism` uses: real MHA + GLU FFN + LayerNorm), borrowed from `aorta.models` (no new model file). One shared block (fixed seed, rank-identical via `fork_rng`) + a fixed 3D reference input keeps every layer's output byte-identical, so #210's per-layer checksum invariant still holds while supplying real-model L2/HBM pressure. The explicit `all_gather`/`reduce_scatter` (the AINIC data path under test) are untouched.
2. Observability. A clean run previously emitted no checksum signal, so a green result was indistinguishable from a no-op. Now `layers_verified` + `layer_checksum_mismatches` (+ `compute_type`) are surfaced in `WorkloadResult.metrics`, and the FSDP startup log names the active compute path (`transformer_block=active layer_checksum_verify=ON`) so a silent GEMM fallback is greppable. `layers_verified > 0` proves the detector ran.
3. Validation. Unknown `compute_type` values now raise (a typo like `transfomer` used to silently fall back to GEMM = false green); `shared_layer_weights` without `compute_type=transformer` warns.

New `ReproducerConfig` fields: `num_heads`, `ffn_size`, `seq_len`, `batch_size`.

Not real FSDP (by design)

This harness simulates the FSDP comm pattern with explicit collectives; it does not use `torch.distributed` FSDP1/FSDP2. `RepeatedTransformerBlock` is a compute kernel only, not under `fully_shard`. Explicit collectives are what give the clean rank-fill + shared-input checksum invariant. Real-FSDP coverage is a separate deferred workload (PR D).

Test plan

CPU-only, no GPU / no torch.distributed, runs in ~2 s:

- `pytest tests/workloads/test_race_transformer_smoke.py -v`: drives the real block + checksum verifier end to end: real forward runs, shared weights make layers byte-identical, injected corruption is caught and localized to compute.
- `pytest tests/workloads/test_race_checksums.py -v`: verifier unit tests (clean pass, comm-corruption localized, compute-corruption localized, edge cases) incl. the new `layers_verified` / `mismatches` counters.
- `pytest tests/workloads/test_race.py -v`: adds `compute_type` validation cases. (15 race tests pass together.)
- With the package installed (`pip install -e .`) on the 5-node MI355X AINIC, run `recipes/ainic-gdr-flush-sdc.yaml`; confirm metrics show `compute_type: transformer`, `layers_verified > 0`, non-zero `avg_step_time_ms`. (in progress)

🤖 Generated with Claude Code
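The on-cluster acceptance check in the last test-plan item could be scripted along these lines. The metric names are the ones quoted in this description; the helper name `detector_ran` and the assumption that the metrics arrive as a flat dict are illustrative only.

```python
def detector_ran(metrics: dict) -> bool:
    """Return True when a run proves the transformer path and detector executed."""
    return (
        # The transformer path was active, not a silent GEMM fallback.
        metrics.get("compute_type") == "transformer"
        # At least one cross-layer comparison was performed.
        and metrics.get("layers_verified", 0) > 0
        # Real compute took measurable time per step.
        and metrics.get("avg_step_time_ms", 0.0) > 0.0
    )
```

A run is then green only when `detector_ran(metrics)` holds and `layer_checksum_mismatches` is zero; checking the first condition separately keeps a no-op run from being mistaken for a clean one.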