race/fsdp: real transformer compute + observable per-layer checksum detector by oyazdanb · Pull Request #211 · ROCm/aorta

oyazdanb · 2026-06-05T16:07:43Z

Summary

Builds on #210 (shared-weight transformer + int16 per-layer checksums) and makes its compute_type: transformer path run a real transformer, with observable detector output and validated configuration. This branch was cut from #210's branch, so its diff includes #210's commits; it is intended to supersede or land on top of #210 (coordinate with @mycpuorg on merge order).

Three changes:

1. Real transformer compute. compute_type: transformer previously did a single gelu(weight_matrix @ reference_input) matmul, which is not a transformer and took ~0.0 ms/step (no real memory pressure). It now runs a real RepeatedTransformerBlock (the model llm_determinism uses: real MHA + GLU FFN + LayerNorm), borrowed from aorta.models (no new model file). One shared block (fixed seed, rank-identical via fork_rng) + a fixed 3D reference input keeps every layer's output byte-identical, so #210's per-layer checksum invariant still holds while the block supplies real-model L2/HBM pressure. The explicit all_gather/reduce_scatter (the AINIC data path under test) are untouched.
2. Observability. A clean run previously emitted no checksum signal, so a green result was indistinguishable from a no-op. Now layers_verified + layer_checksum_mismatches (+ compute_type) are surfaced in WorkloadResult.metrics, and the FSDP startup log names the active compute path (transformer_block=active layer_checksum_verify=ON) so a silent GEMM fallback is greppable. layers_verified > 0 proves the detector ran.
3. Validation. Unknown compute_type values now raise (a typo like transfomer used to silently fall back to GEMM, a false green); shared_layer_weights without compute_type=transformer warns.
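
The observability point can be turned into a post-run gate. The following is a minimal sketch, assuming the metrics arrive as a plain mapping; `detector_ran_clean` is a hypothetical helper, and only the three key names are taken from the description above.

```python
from typing import Any, Mapping


def detector_ran_clean(metrics: Mapping[str, Any]) -> bool:
    """Return True only if the checksum detector demonstrably ran and found nothing.

    Hypothetical helper: the key names come from the PR description, but the
    exact layout of WorkloadResult.metrics is an assumption.
    """
    if metrics.get("compute_type") != "transformer":
        return False  # silent GEMM fallback, or the key was never emitted
    if metrics.get("layers_verified", 0) <= 0:
        return False  # no layer was ever compared: a green result would be a no-op
    return metrics.get("layer_checksum_mismatches", 0) == 0


# Illustrative values only, not results from a real run.
clean = {"compute_type": "transformer", "layers_verified": 31, "layer_checksum_mismatches": 0}
noop = {"compute_type": "transformer", "layers_verified": 0, "layer_checksum_mismatches": 0}
print(detector_ran_clean(clean), detector_ran_clean(noop))
```

A missing key is treated the same as a failed check, so a run that never emitted the counters cannot pass by accident.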

New ReproducerConfig fields: num_heads, ffn_size, seq_len, batch_size.

Not real FSDP (by design)

This harness simulates the FSDP comm pattern with explicit collectives; it does not use torch.distributed FSDP1/FSDP2. RepeatedTransformerBlock is a compute kernel only — not under fully_shard. Explicit collectives are what give the clean rank-fill + shared-input checksum invariant. Real-FSDP coverage is a separate deferred workload (PR D).
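
To illustrate why explicit collectives give a clean rank-fill invariant, here is a stdlib-only sketch that stands in for the real all_gather. All function names are hypothetical, and plain lists replace GPU tensors; the only fact taken from the PR is that each shard is filled with float(rank).

```python
from typing import List


def make_shard(rank: int, shard_len: int) -> List[float]:
    # Rank-fill: every element of a rank's shard equals float(rank).
    return [float(rank)] * shard_len


def simulated_all_gather(shards: List[List[float]]) -> List[float]:
    # Stand-in for all_gather: concatenate the shards in rank order.
    gathered: List[float] = []
    for shard in shards:
        gathered.extend(shard)
    return gathered


def corrupt_source_ranks(full: List[float], world_size: int, shard_len: int) -> List[int]:
    # The expected value at index i is known in advance: float(i // shard_len).
    bad = {i // shard_len for i in range(world_size * shard_len)
           if full[i] != float(i // shard_len)}
    return sorted(bad)


world_size, shard_len = 4, 3
full = simulated_all_gather([make_shard(r, shard_len) for r in range(world_size)])
print(corrupt_source_ranks(full, world_size, shard_len))  # clean gather

full[7] = 99.0  # inject a single corrupted element
print(corrupt_source_ranks(full, world_size, shard_len))  # slot owned by rank 2
```

Because the gathered contents are known before the collective runs, any deviation can be attributed to the communication step rather than to compute.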

Test plan

CPU-only, no GPU / no torch.distributed — runs in ~2 s:

pytest tests/workloads/test_race_transformer_smoke.py -v — drives the real block + checksum verifier end to end: real forward runs, shared weights make layers byte-identical, injected corruption is caught and localized to compute.
pytest tests/workloads/test_race_checksums.py -v — verifier unit tests (clean pass, comm-corruption localized, compute-corruption localized, edge cases) incl. the new layers_verified/mismatches counters.
pytest tests/workloads/test_race.py -v — adds compute_type validation cases. (15 race tests pass together.)
Cluster: install branch (pip install -e .) on the 5-node MI355X AINIC, run recipes/ainic-gdr-flush-sdc.yaml; confirm metrics show compute_type: transformer, layers_verified > 0, non-zero avg_step_time_ms. (in progress)

🤖 Generated with Claude Code

…etector

Builds on PR #210 (shared-weight transformer + int16 checksums). Three changes:

1. compute_type=transformer now runs a REAL transformer block borrowed from aorta.models.RepeatedTransformerBlock (the model llm_determinism uses: real MHA + GLU FFN + LayerNorm) instead of a single gelu(W @ x) matmul. One shared block (fixed seed, rank-identical via fork_rng) + a fixed 3D reference input keeps every layer's output byte-identical, preserving the per-layer checksum invariant while supplying real-model L2/HBM pressure. Explicit all_gather/reduce_scatter (the AINIC data path) are untouched.
2. Observability: a clean run previously emitted no checksum signal, so green was indistinguishable from a no-op. Track layers_verified and layer_checksum_mismatches and surface them (plus compute_type) in WorkloadResult.metrics; the FSDP startup log now names the active compute path so a silent GEMM fallback is greppable.
3. Validation: reject unknown compute_type values (a typo like "transfomer" used to silently fall back to GEMM = false green) and warn when shared_layer_weights is set without compute_type=transformer.

New config fields: num_heads, ffn_size, seq_len, batch_size.

Tests (CPU-only, no GPU/dist): test_race_transformer_smoke.py drives the real block + checksum verifier end to end (clean pass + injected-corruption catch); test_race_checksums.py covers the verifier; test_race.py adds compute_type validation cases. 15 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Module docstring note: mode=fsdp simulates the FSDP comm pattern with explicit all_gather/reduce_scatter; it does not use FSDP1/FSDP2. RepeatedTransformerBlock is reused as a compute kernel only, not under fully_shard. Pre-empts the "why not FSDP2 like llm_determinism?" question; real-FSDP is the deferred PR D. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>

Copilot

Pull request overview

Updates the race workload’s simulated FSDP mode so compute_type: transformer can run a real transformer block (via aorta.models.RepeatedTransformerBlock), and makes the per-layer checksum detector observable and safer to configure (validation + metrics). This strengthens the AINIC reproducer by ensuring “green” runs prove the detector executed and by increasing compute realism/pressure without switching to real FSDP.

Changes:

Implement real transformer-block compute for the shared-weight transformer path in FSDPModeReproducer, plus checksum observability counters.
Validate compute_type (reject typos) and surface compute_type / checksum verification counters into WorkloadResult.metrics.
Add CPU-only unit + smoke tests covering config validation and per-layer checksum verification behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| `src/aorta/race/modes/fsdp.py` | Runs a real `RepeatedTransformerBlock` on the shared transformer path; records and verifies per-layer checksums; logs active compute path. |
| `src/aorta/race/config.py` | Adds transformer-shape knobs (`num_heads`, `ffn_size`, `seq_len`, `batch_size`) and result counters for detector observability. |
| `src/aorta/race/base.py` | Adds dtype aliases and plumbs checksum observability counters into `ReproducerResult`. |
| `src/aorta/workloads/race.py` | Validates `compute_type`, warns on inert config combos, and exports detector metrics via `WorkloadResult.metrics`. |
| `tests/workloads/test_race.py` | Adds tests for `compute_type` validation and warning behavior. |
| `tests/workloads/test_race_checksums.py` | Unit tests for `_verify_layer_checksums` (clean pass + corruption localization + edge cases). |
| `tests/workloads/test_race_transformer_smoke.py` | CPU smoke test that executes a real transformer block and exercises the checksum verifier end-to-end. |
| `recipes/ainic-gdr-flush-sdc.yaml` | Switches the AINIC recipe to `compute_type: transformer` + shared weights for stronger checksum invariants. |
| `docs/design/race-transformer-compute-support.md` | Adds a design/plan document describing the intended transformer compute + observability approach. |


+                with torch.random.fork_rng(devices=["cuda"]):
+                    torch.cuda.manual_seed(0)
+                    self.shared_block = (
+                        RepeatedTransformerBlock(block_cfg).to("cuda").to(self.dtype)
+                    )


+    Use a single shared weight matrix for all layers (transformer compute type).
+
+    When True, all num_layers layers use the same weight matrix initialized with
+    a fixed seed, and each layer receives the same fixed reference input rather
+    than chaining activations.  This makes per-layer forward outputs analytically
+    identical so that cross-layer activation comparison can serve as a secondary
+    corruption signal: any all_gather corruption that propagates through GEMM


+        Four checksums per layer:
+          comm_input    -- param shard before all_gather (should be identical:
+                           every shard is filled with float(rank))
+          comm_output   -- full_param after all_gather
+          compute_input -- reference_input fed to GEMM (constant across layers)
+          compute_output-- activation after GELU


+  # and each layer independently receives the same fixed reference input (seed=1)
+  # rather than chaining activations layer-to-layer.  This makes every layer's
+  # forward output analytically identical so _verify_layer_activations() can serve


+# Plan — Fully support `compute_type: transformer` + shared-weight checksums (PR #210)
+
+**Branch:** `users/oyazdanb/race-transformer-compute` (off `main`)
+**Status:** DRAFT — for review before implementation
+**Related:** PR #210 (`users/mycpuorg/shared-weight-transformer-checksums`), task `race__cross-rank-and-iter0-detection.md`


num_heads and ffn_size auto-derive when 0 (model_dim//128, model_dim*4), so the config value alone doesn't record what actually ran. Capture the resolved values and surface them in the FSDP startup log (resolved_shape=...) and in WorkloadResult.metrics (eff_num_heads/eff_ffn_size/eff_seq_len/eff_batch_size), so the transformer shape a run used is provable from the result JSON. getattr in base.run keeps it safe for non-FSDP modes. Adds an auto-derive unit test. Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
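
The auto-derive rule in this commit message can be sketched as follows. `ShapeConfig` and `resolve_shape` are hypothetical stand-ins rather than the PR's actual code; only the formulas (model_dim//128, model_dim*4) and the eff_* metric names come from the message, and the seq_len/batch_size defaults are illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ShapeConfig:
    # Hypothetical stand-in for the transformer-shape fields of ReproducerConfig.
    model_dim: int
    num_heads: int = 0    # 0 means "auto-derive"
    ffn_size: int = 0     # 0 means "auto-derive"
    seq_len: int = 128    # illustrative default, not taken from the PR
    batch_size: int = 1   # illustrative default, not taken from the PR


def resolve_shape(cfg: ShapeConfig) -> Dict[str, int]:
    """Return the effective shape, replacing 0 with the derived value."""
    num_heads = cfg.num_heads or cfg.model_dim // 128
    ffn_size = cfg.ffn_size or cfg.model_dim * 4
    return {
        "eff_num_heads": num_heads,
        "eff_ffn_size": ffn_size,
        "eff_seq_len": cfg.seq_len,
        "eff_batch_size": cfg.batch_size,
    }


print(resolve_shape(ShapeConfig(model_dim=4096)))
print(resolve_shape(ShapeConfig(model_dim=4096, num_heads=16)))
```

Recording the resolved values instead of the raw config is what makes the shape provable from the result: a config value of 0 says nothing about what ran. Note that under this formula a model_dim below 128 would derive zero heads; how the real code treats that case is not shown here.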

- fsdp.py: also torch.manual_seed(0) inside fork_rng. RepeatedTransformerBlock inits params on CPU before .to("cuda"), so seeding only the CUDA RNG left the shared block's weights dependent on each rank's CPU RNG state -- ranks could diverge while the intra-rank per-layer checksum still passed (false green). Seeding CPU RNG too restores the rank-identical invariant the comment claims.
- config.py / fsdp.py / recipe: update shared_layer_weights + _verify_layer_checksums docstrings and the recipe comment that still described the old GEMM (gelu(W@x)) path and the wrong function name (_verify_layer_activations); the shared path now runs a real RepeatedTransformerBlock.
- remove docs/design/race-transformer-compute-support.md from the PR (it was a pre-implementation draft; the PR description carries the summary).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
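
The fork-and-seed pattern behind the first fix can be illustrated without a GPU. This is a stdlib analogue only: Python's `random` module stands in for the torch RNGs, and `forked_rng`, `init_shared_weights` and `weights_on_rank` are hypothetical names. The point it shows is that the generator actually used for initialization must be the one that gets seeded inside the forked scope.

```python
import random
from contextlib import contextmanager
from typing import Iterator, List


@contextmanager
def forked_rng(seed: int) -> Iterator[None]:
    # Save the ambient RNG state, seed deterministically, restore on exit.
    state = random.getstate()
    random.seed(seed)
    try:
        yield
    finally:
        random.setstate(state)


def init_shared_weights(n: int) -> List[float]:
    # Weights are drawn from the generator that was seeded inside the fork.
    with forked_rng(0):
        return [random.uniform(-1.0, 1.0) for _ in range(n)]


def weights_on_rank(rank: int) -> List[float]:
    random.seed(1000 + rank)  # each simulated rank has different ambient state
    random.random()
    return init_shared_weights(4)


# Same weights on every simulated rank, despite different ambient RNG state.
print(weights_on_rank(0) == weights_on_rank(1))
```

If the initialization drew from a generator that the forked scope did not seed, each rank would get different weights while its own layers still agreed with one another, which is the false green the commit describes.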

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

+            # Count every cross-layer comparison so a clean (green) run still
+            # proves the detector ran: layers_verified > 0.
+            self.layers_verified += 1
+            for key in ("comm_input", "comm_output", "compute_input", "compute_output"):
+                if cmp[key] != ref[key]:
+                    log.error(
+                        f"LAYER_CHECKSUM_MISMATCH ({key}): "
+                        f"rank={self.rank} iter={iteration} "
+                        f"layer_0={ref[key]} layer_{i}={cmp[key]}"
+                    )
+                    self.corruption_details.append({
+                        "type": f"layer_checksum_mismatch_{key}",
+                        "rank": self.rank,
+                        "iteration": iteration,
+                        "layer_ref": 0,
+                        "layer_cmp": i,
+                        "ref_checksum": ref[key],
+                        "cmp_checksum": cmp[key],
+                    })
+                    self.layer_checksum_mismatches += 1
+                    all_correct = False


+        Shared-weight path: every layer receives the same fixed reference_input
+        so that outputs are analytically identical.  Input/output checksums are
+        recorded for both the comm kernel (all_gather) and the compute kernel
+        (GEMM + GELU) so _verify_layer_checksums() can pinpoint whether
+        corruption entered during communication or compute.


@@ -150,11 +233,23 @@ def setup_buffers(self) -> None:
                dtype=self.dtype, device="cuda",
            )


        if cfg.simulate_compute:
            dim = self._dim
-            self.weight_matrices = [
-                torch.randn(
-                    dim, dim,
-                    dtype=self.dtype, device="cuda",
+            use_shared = (
+                cfg.shared_layer_weights and cfg.compute_type == "transformer"
+            )


… doc)

- layer_checksum_mismatches now counts once per CORRUPTED LAYER, not once per checksum key. A single bad layer could previously inflate it up to 4x (comm_in/comm_out/compute_in/compute_out), contradicting the per-layer docstring. corruption_details still records every key for localization.
- Validate compute_type in ReproducerConfig.__post_init__ so EVERY entry point is covered (the aorta.race CLI / direct construction), not just the RaceWorkload adapter -- a typo like "transfomer" now raises instead of a silent GEMM false-green.
- Skip the unused dim x dim activation/grad_buffer allocations on the shared-weight transformer path (forward sets activation to the block output; backward re-runs the block) -- avoids needless GPU memory / OOM risk at large model_dim. GEMM/chained path keeps them.
- Fix the _forward_layer docstring that still said "GEMM + GELU" on the shared path (now runs RepeatedTransformerBlock).

Tests: per-layer-count (4 keys bad -> counter==1), direct-construction compute_type rejection. 18 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
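
The per-layer counting rule from the first bullet can be sketched in plain Python. This is an assumption-laden illustration, not the PR's `_verify_layer_checksums`: the function is free-standing, checksums are plain ints, and the detail records are trimmed to a few fields.

```python
from typing import Dict, List, Tuple

KEYS = ("comm_input", "comm_output", "compute_input", "compute_output")


def verify_layer_checksums(
    layers: List[Dict[str, int]],
) -> Tuple[int, int, List[dict]]:
    """Compare every layer against layer 0.

    Returns (layers_verified, layer_checksum_mismatches, corruption_details).
    """
    layers_verified = 0
    mismatches = 0
    details: List[dict] = []
    if len(layers) < 2:
        return layers_verified, mismatches, details  # nothing to compare
    ref = layers[0]
    for i, cmp in enumerate(layers[1:], start=1):
        layers_verified += 1
        bad_keys = [key for key in KEYS if cmp[key] != ref[key]]
        for key in bad_keys:
            # Every differing key is recorded so corruption can be localized.
            details.append({"key": key, "layer_ref": 0, "layer_cmp": i,
                            "ref_checksum": ref[key], "cmp_checksum": cmp[key]})
        if bad_keys:
            mismatches += 1  # once per corrupted layer, not once per key
    return layers_verified, mismatches, details


good = {"comm_input": 10, "comm_output": 20, "compute_input": 30, "compute_output": 40}
bad = {key: value + 1 for key, value in good.items()}
verified, mismatched, info = verify_layer_checksums([good, good, bad])
print(verified, mismatched, len(info))
```

With one layer corrupted in all four keys, the counter is 1 while the details list still holds four entries, which matches the behaviour the commit's test describes.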

Copilot

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

+    def __post_init__(self) -> None:
+        # Validate here (not only in the RaceWorkload adapter) so EVERY entry
+        # point is covered -- the aorta.race CLI and any direct reproducer
+        # construction. A typo like "transfomer" must error, not silently fall
+        # back to the GEMM path (false green).
+        valid_compute = {"gemm", "transformer"}
+        if self.compute_type not in valid_compute:
+            raise ValueError(
+                f"compute_type must be one of {sorted(valid_compute)}, "
+                f"got {self.compute_type!r}"
+            )
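
A later commit in this PR replaces the hard-coded set with a lookup in the pluggable COMPUTE_REGISTRY. The following is a minimal sketch of that idea, assuming the registry is a dict keyed by name and `register_compute` is a decorator; both shapes are guesses for illustration, and the real module avoids a circular import with a lazy import that is omitted here.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumed shape: name -> compute callable. The real registry may differ.
COMPUTE_REGISTRY: Dict[str, Callable[[], str]] = {}


def register_compute(name: str) -> Callable[[Callable[[], str]], Callable[[], str]]:
    def decorator(fn: Callable[[], str]) -> Callable[[], str]:
        COMPUTE_REGISTRY[name] = fn
        return fn
    return decorator


@register_compute("gemm")
def gemm_compute() -> str:
    return "gemm"


@register_compute("transformer")
def transformer_compute() -> str:
    return "transformer"


@dataclass
class ReproducerConfig:
    # Trimmed stand-in: only the field relevant to validation.
    compute_type: str = "gemm"

    def __post_init__(self) -> None:
        # The registry is the single source of truth, so custom backends
        # registered later remain valid without touching this check.
        if self.compute_type not in COMPUTE_REGISTRY:
            raise ValueError(
                f"compute_type must be one of {sorted(COMPUTE_REGISTRY)}, "
                f"got {self.compute_type!r}"
            )


try:
    ReproducerConfig(compute_type="transfomer")
except ValueError as exc:
    print(exc)
```

Validating in `__post_init__` means the check runs on every construction path, including direct instantiation in tests.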


+    def _checksum(tensor: torch.Tensor) -> int:
+        """
+        Bitwise checksum: reinterpret-cast to int16 and sum.
+
+        bf16 (or any 16-bit dtype) is viewed as int16 so every bit pattern
+        contributes to the checksum with zero information loss -- no float
+        rounding, no abs(), and NaN / denorm bit patterns are included.
+        Accumulation is done in int64 to avoid overflow.
+        """
+        return tensor.view(torch.int16).to(torch.int64).sum().item()


+            use_shared = (
+                cfg.shared_layer_weights and cfg.compute_type == "transformer"
+            )


+    """Per-layer forward + 4 checksums, exactly like _forward_layer's shared path."""
+    layer_checksums = []
+    for _ in range(NUM_LAYERS):
+        comm_input = FSDPModeReproducer._checksum(reference_input)
+        comm_output = comm_input  # no real all_gather on CPU; identical by construction
+        compute_input = FSDPModeReproducer._checksum(reference_input)
+        with torch.no_grad():


…istry, test)

- _checksum is now element-size aware: 2-byte dtypes -> int16, 4-byte (fp32) -> int32, etc. Previously hard-coded torch.int16, which would crash/miscount on float32 (an allowed dtype). bf16 path unchanged.
- compute_type='transformer' with shared_layer_weights=False now warns loudly that it runs the GEMM path (only the shared transformer path is implemented) -- no more silent transformer->GEMM fallback.
- ReproducerConfig.__post_init__ validates compute_type against the pluggable COMPUTE_REGISTRY (single source of truth; "transformer" is already registered) instead of a hard-coded set, so custom register_compute() backends stay valid. Lazy import avoids a circular import (compute.py only TYPE_CHECKING-imports config).
- Smoke test: comm_input/comm_output checksums now use distinct rank-fill shard / all_gather-result buffers (mirroring real _forward_layer) instead of reference_input, so the test actually exercises the comm path. Added a multi-dtype _checksum test (bf16/fp16/fp32).

19 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
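
The element-size-aware checksum can be illustrated with the standard library alone. This sketch is an analogue, not the PR's `_checksum`: raw bytes replace tensors, `memoryview.cast` plays the role of the reinterpret-cast, and Python's arbitrary-precision ints stand in for int64 accumulation. The `checksum` function and its `itemsize` parameter are hypothetical.

```python
import struct

# Signed native integer format per element size in bytes.
_SIGNED_FORMAT = {1: "b", 2: "h", 4: "i", 8: "q"}


def checksum(raw: bytes, itemsize: int) -> int:
    """Reinterpret raw bytes as signed ints of the given width and sum them."""
    fmt = _SIGNED_FORMAT[itemsize]
    if struct.calcsize(fmt) != itemsize:
        raise ValueError(f"platform has no {itemsize}-byte signed int for {fmt!r}")
    # No float arithmetic is involved, so NaN and denormal bit patterns
    # contribute to the sum like any other value.
    return sum(memoryview(raw).cast(fmt))


# Four 16-bit elements; 0x7FC0 is the bf16 bit pattern of a quiet NaN.
two_byte = struct.pack("=4h", 1, -2, 0x7FC0, 3)
# Two fp32 values, viewed as 32-bit ints.
four_byte = struct.pack("=2f", 1.0, 2.0)
print(checksum(two_byte, 2), checksum(four_byte, 4))

# A single flipped bit changes exactly one element, so the sum changes.
flipped = bytearray(two_byte)
flipped[0] ^= 0x01
print(checksum(bytes(flipped), 2) != checksum(two_byte, 2))
```

Choosing the integer width from the element size is what keeps the view valid for fp32 data; forcing a 16-bit view on 4-byte elements would split each value into two halves and double the element count.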

oyazdanb and others added 2 commits June 5, 2026 11:32

Copilot AI review requested due to automatic review settings June 5, 2026 16:07

Copilot started reviewing on behalf of oyazdanb June 5, 2026 16:07

Copilot AI reviewed Jun 5, 2026

oyazdanb and others added 2 commits June 5, 2026 12:17

Copilot AI review requested due to automatic review settings June 5, 2026 16:22

Copilot started reviewing on behalf of oyazdanb June 5, 2026 16:22

Copilot AI reviewed Jun 5, 2026

oyazdanb requested a review from Copilot June 5, 2026 17:04

Copilot started reviewing on behalf of oyazdanb June 5, 2026 17:04

Copilot AI reviewed Jun 5, 2026

oyazdanb merged commit d5e3340 into main Jun 5, 2026
1 check passed

oyazdanb deleted the users/oyazdanb/race-transformer-compute branch June 5, 2026 18:47

oyazdanb mentioned this pull request Jun 5, 2026

race/fsdp: real autograd backward (Option 1) + fast smoke recipe #212

Merged

4 tasks

oyazdanb merged 6 commits into main from users/oyazdanb/race-transformer-compute

2 participants
