race/fsdp: shared-weight transformer with per-kernel int16 checksum verification by mycpuorg · Pull Request #210 · ROCm/aorta

mycpuorg · 2026-06-04T22:10:56Z

Summary

The AINIC GDR flush SDC recipe (ainic-gdr-flush-sdc.yaml) used compute_type: gemm with independently-random weight matrices per layer. This makes cross-layer comparison meaningless — gradient/activation differences across layers could be numerical divergence rather than corruption.

This PR switches to a synthetic identical-layer transformer model: all layers share one weight matrix (fixed seed) and receive the same fixed reference input, so every layer's forward output is analytically identical. Any divergence across layers is a corruption signal, not numerical noise.
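The identical-layer idea can be illustrated without FSDP or a GPU. The following is a minimal pure-Python sketch, not the code in fsdp.py: `gelu`, `matmul`, `random_matrix`, and `layer_forward` are hypothetical helpers, and the 4x4 size and 3-layer count are arbitrary. Only the seeds (0 for the weight, 1 for the reference input) come from the PR description.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def matmul(a, b):
    """Naive matrix product of two lists-of-lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def random_matrix(dim: int, seed: int):
    """Deterministic dim x dim matrix from a fixed seed."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def layer_forward(weight, reference_input):
    """One synthetic layer: GEMM followed by GELU, no activation chaining."""
    return [[gelu(v) for v in row] for row in matmul(weight, reference_input)]


DIM, NUM_LAYERS = 4, 3
shared_weight = random_matrix(DIM, seed=0)
reference_input = random_matrix(DIM, seed=1)

# Each layer holds its own copy of the weight, as if gathered separately.
weights = [[row[:] for row in shared_weight] for _ in range(NUM_LAYERS)]
clean = [layer_forward(w, reference_input) for w in weights]

# Simulate corruption of one element in layer 1's gathered weight.
weights[1][2][3] += 1e-3
corrupted = [layer_forward(w, reference_input) for w in weights]
```

On the clean run every layer performs the same operations on the same values, so the outputs compare equal exactly. After the simulated corruption only layer 1 differs from its peers.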

Per-kernel int16 checksums

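Following the bitwise checksum design from the onsite whiteboard (Jun 2 2026): each tensor is reinterpret-cast from bf16 to int16 and summed in int64. The cast loses no information and involves no float rounding, so any single bit flip changes the sum. The sum itself is a compact fingerprint rather than a lossless encoding, so offsetting multi-bit errors could in principle cancel.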
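As a rough illustration of the reinterpret-and-sum idea, here is a stdlib-only sketch that works on raw bytes instead of a `torch.Tensor`. `checksum16` and `to_bf16_bytes` are hypothetical names, and the bf16 conversion simply truncates float32 to its top 16 bits for demonstration purposes. Python integers are unbounded, so they stand in for the int64 accumulator at these sizes.

```python
import struct


def checksum16(raw: bytes) -> int:
    """Sum the buffer as little-endian signed 16-bit words."""
    if len(raw) % 2:
        raise ValueError("buffer length must be a multiple of 2 bytes")
    return sum(word for (word,) in struct.iter_unpack("<h", raw))


def to_bf16_bytes(values) -> bytes:
    """Truncate float32 values to bf16 by keeping the top 16 bits (illustration only)."""
    out = bytearray()
    for v in values:
        bits32 = struct.unpack("<I", struct.pack("<f", v))[0]
        out += struct.pack("<H", bits32 >> 16)
    return bytes(out)


clean = to_bf16_bytes([1.0, -2.5, 0.15625, 3.0])

# Flip a single bit in the high byte of the second element.
flipped = bytearray(clean)
flipped[3] ^= 0x04

clean_sum = checksum16(clean)
flipped_sum = checksum16(bytes(flipped))
```

A single flipped bit shifts the sum by a power of two (here 1024), so it is always visible. The sum is order-insensitive, which means a permutation of the same words, or two flips that offset each other, would go unnoticed.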

Four checksums are recorded per layer, wrapping both kernels:

| Checkpoint | What it checksums |
|---|---|
| `comm_input` | param shard before all_gather |
| `comm_output` | full_param after all_gather |
| `compute_input` | reference_input before GEMM |
| `compute_output` | activation after GELU |

_verify_layer_checksums() compares all four across layers. The diagnostic pinpoints where corruption entered:

- comm_output diverges, comm_input matches → corruption in the collective (RCCL/NIC path)
- compute_output diverges, comm_output matches → corruption in GPU compute (ALU)
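The two rules above can be sketched as a small classifier. This is not the body of `_verify_layer_checksums()`; `localize` is a hypothetical function, and using the per-key majority value as the reference is an assumption (it presumes most layers are clean). The checksum values in the example are made up.

```python
from collections import Counter

KEYS = ("comm_input", "comm_output", "compute_input", "compute_output")


def localize(layer_checksums):
    """Return {layer_idx: verdict} for layers that disagree with the majority.

    layer_checksums: list of dicts, one per layer, each keyed by KEYS.
    """
    # Assumption: the most common value per key is the uncorrupted one.
    reference = {
        key: Counter(c[key] for c in layer_checksums).most_common(1)[0][0]
        for key in KEYS
    }
    verdicts = {}
    for idx, cks in enumerate(layer_checksums):
        bad = {key for key in KEYS if cks[key] != reference[key]}
        if not bad:
            continue
        if "comm_output" in bad and "comm_input" not in bad:
            verdicts[idx] = "collective (RCCL/NIC path)"
        elif "compute_output" in bad and "comm_output" not in bad:
            verdicts[idx] = "GPU compute (ALU)"
        else:
            # Not covered by the two rules, e.g. the shard was already bad.
            verdicts[idx] = "unclassified: " + ", ".join(sorted(bad))
    return verdicts


clean = {"comm_input": 10, "comm_output": 40, "compute_input": 7, "compute_output": 99}
layers = [dict(clean) for _ in range(4)]
layers[2]["comm_output"] = 41      # bad gather on layer 2 ...
layers[2]["compute_output"] = 123  # ... which propagates into its activation
layers[3]["compute_output"] = 98   # compute-only fault on layer 3
result = localize(layers)
```

Checking the communication rule first means a bad gather is attributed to the collective even though it also perturbs the downstream activation checksum.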

Changes

- src/aorta/race/base.py: Add bf16, fp16, fp32 as dtype short-form aliases
- src/aorta/race/config.py: Add shared_layer_weights: bool field to ReproducerConfig
- src/aorta/race/modes/fsdp.py:
  - _checksum(): tensor.view(int16).to(int64).sum() (bitwise reinterpretation, no float rounding)
  - setup_buffers(): shared weight matrix (seed=0) + fixed reference input (seed=1) when shared_layer_weights=True
  - _forward_layer(): records comm and compute input/output checksums per layer
  - _verify_layer_checksums(): cross-layer comparison of all four checksum keys
- recipes/ainic-gdr-flush-sdc.yaml: Switch to compute_type: transformer, shared_layer_weights: true, dtype: bf16; drop dead gemm_size/gemm_layers fields

Test plan

- `aorta triage run --recipe recipes/ainic-gdr-flush-sdc.yaml --dry-run` passes with no "ignoring unknown" warnings
- Single-node smoke run confirms all layer checksums match (zero corruption on clean hardware)
- Inject synthetic corruption (flip bits in one layer's all_gather output) and verify LAYER_CHECKSUM_MISMATCH (comm_output) is logged with correct layer index
- Existing _verify_all_gather / _verify_reduce_scatter pattern checks still fire independently

The dtype map in BaseReproducer only accepted the verbose PyTorch names (bfloat16, float16, float32). Add the common short-form aliases so recipes and CLI args can use either form.
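A rough sketch of what such an alias table might look like follows. `DTYPE_ALIASES` and `resolve_dtype` are hypothetical names, not the ones in base.py, and the canonical values are plain strings so the example runs without PyTorch installed; the real map presumably points at `torch` dtypes.

```python
# Hypothetical alias table: short and verbose spellings map to one canonical name.
DTYPE_ALIASES = {
    "bfloat16": "bfloat16", "bf16": "bfloat16",
    "float16": "float16", "fp16": "float16",
    "float32": "float32", "fp32": "float32",
}


def resolve_dtype(name: str) -> str:
    """Map a recipe or CLI dtype string to its canonical name."""
    try:
        return DTYPE_ALIASES[name.strip().lower()]
    except KeyError:
        raise ValueError(
            f"unsupported dtype {name!r}; expected one of {sorted(DTYPE_ALIASES)}"
        ) from None
```

Keeping one table as the single source of truth means any separate validation list can be derived from its keys instead of being maintained by hand.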

Add shared_layer_weights config field for the FSDP reproducer. When enabled with compute_type=transformer, all layers share a single weight matrix (fixed seed) and receive the same fixed reference input so every layer's forward output is analytically identical. Per-kernel int16 checksums (reinterpret-cast bf16 -> int16, sum in int64) wrap both the comm kernel (all_gather) and compute kernel (GEMM + GELU) with input/output checksums per layer. Cross-layer checksum comparison in _verify_layer_checksums() pinpoints whether corruption entered during communication (RCCL/NIC) or compute (GPU ALU) -- a second independent signal beyond the collective-buffer pattern check.

Replace compute_type=gemm with compute_type=transformer in the ainic-gdr-flush-sdc recipe. Use shared_layer_weights=true so all 24 layers share one weight matrix and a fixed reference input, enabling cross-layer int16 checksum comparison as a second corruption signal. Also switch dtype from bfloat16 to bf16 (short alias) and drop the now-unused gemm_size/gemm_layers fields.

Copilot

Pull request overview

This PR updates the race FSDP reproducer to make cross-layer divergence a stronger corruption signal by optionally running a synthetic “identical-layer” transformer (shared weights + fixed reference input) and recording per-layer, per-kernel bitwise checksums. It also updates configuration and the AINIC GDR-flush SDC recipe to use the transformer compute path.

Changes:

- Add dtype short-form aliases (bf16, fp16, fp32) to the race reproducer base dtype resolver.
- Add shared_layer_weights to ReproducerConfig and implement shared-weight + fixed-input transformer behavior in FSDP mode with per-layer checksum verification.
- Update ainic-gdr-flush-sdc.yaml to switch from GEMM compute to transformer compute and enable shared_layer_weights.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| `src/aorta/race/modes/fsdp.py` | Adds shared-weight transformer buffers, checksum computation, and cross-layer checksum verification in FSDP mode. |
| `src/aorta/race/config.py` | Introduces `shared_layer_weights` config knob and documentation. |
| `src/aorta/race/base.py` | Adds dtype string aliases for bf16/fp16/fp32. |
| `recipes/ainic-gdr-flush-sdc.yaml` | Switches the recipe to transformer compute + shared weights and updates related documentation/comments. |


+  # shared_layer_weights=true: all 24 layers share one weight matrix (fixed seed=0)
+  # and each layer independently receives the same fixed reference input (seed=1)
+  # rather than chaining activations layer-to-layer.  This makes every layer's
+  # forward output analytically identical so _verify_layer_activations() can serve
+  # as a second independent corruption signal alongside the collective-buffer
+  # pattern check: a mismatch across layers means an all_gather corruption on that
+  # layer propagated through GEMM compute -- something the rank-fill pattern check
+  # on full_param alone cannot catch.


+    When True, all num_layers layers use the same weight matrix initialized with
+    a fixed seed, and each layer receives the same fixed reference input rather
+    than chaining activations.  This makes per-layer forward outputs analytically
+    identical so that cross-layer activation comparison can serve as a secondary
+    corruption signal: any all_gather corruption that propagates through GEMM
+    compute will produce a mismatch between layer outputs.
+


+    @staticmethod
+    def _checksum(tensor: torch.Tensor) -> int:
+        """
+        Bitwise checksum: reinterpret-cast to int16 and sum.
+
+        bf16 (or any 16-bit dtype) is viewed as int16 so every bit pattern
+        contributes to the checksum with zero information loss -- no float
+        rounding, no abs(), and NaN / denorm bit patterns are included.
+        Accumulation is done in int64 to avoid overflow.
+        """
+        return tensor.view(torch.int16).to(torch.int64).sum().item()


Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot


+        use_shared = (
+            self.config.shared_layer_weights
+            and self.config.compute_type == "transformer"
+            and self.reference_input is not None
+        )


        if self.config.simulate_compute and self.weight_matrices:
-            # Use batch_gpu for data dependency on first layer (H2D race opportunity)
-            if layer_idx == 0:
-                dim = self._dim
-                batch_slice = self.batch_gpu[:dim * dim]
-                self.activation = batch_slice.view(dim, dim)
-
-            self.activation = torch.mm(
-                self.weight_matrices[layer_idx], self.activation
-            )
-            self.activation = torch.nn.functional.gelu(self.activation)
+            if use_shared:
+                compute_input_cksum = self._checksum(self.reference_input)
+
+                out = torch.mm(self.weight_matrices[layer_idx], self.reference_input)




_VALID_DTYPES only listed the verbose names (bfloat16, float16, float32). The base.py dtype map already accepts bf16/fp16/fp32 but the RaceWorkload adapter raised ValueError before reaching it, causing every trial to exit during setup() and be classified as did_not_run by the triage runner.


Alkaid-Benetnash · 2026-06-05T19:26:33Z

+        rounding, no abs(), and NaN / denorm bit patterns are included.
+        Accumulation is done in int64 to avoid overflow.
+        """
+        return tensor.view(torch.int16).to(torch.int64).sum().item()


Just to point out that overflow is expected here. Since this checksum only cares about the binary value, not integer semantics, the choice of int16 vs. int64 affects the "hash-collision" rate rather than "overflow".
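The reviewer's point can be made concrete with a small sketch. `wrapping_sum` is a hypothetical helper that emulates a two's-complement accumulator of a given width in pure Python; it is not part of the PR.

```python
def wrapping_sum(words, bits: int) -> int:
    """Sum signed words in a two's-complement accumulator `bits` wide."""
    mask = (1 << bits) - 1
    total = sum(words) & mask
    # Reinterpret the masked value as a signed integer of that width.
    return total - (1 << bits) if total >> (bits - 1) else total


# Wraparound is deterministic: the same input always yields the same value.
wrapped = wrapping_sum([32767, 1], bits=16)

# Two different buffers that collide in a 16-bit accumulator ...
a = [0x4000] * 4  # 4 * 16384 = 65536, which is 0 modulo 2**16
b = [0] * 4
collide_16 = wrapping_sum(a, 16) == wrapping_sum(b, 16)

# ... but are told apart by a 64-bit accumulator.
collide_64 = wrapping_sum(a, 64) == wrapping_sum(b, 64)
```

Because equal inputs always wrap to equal outputs, wraparound never produces a false mismatch. A wider accumulator only reduces the chance that two different buffers share a checksum.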

mycpuorg added 3 commits June 4, 2026 15:06

Add bf16/fp16/fp32 dtype aliases to race reproducer

6765491


Copilot AI review requested due to automatic review settings June 4, 2026 22:10

Copilot started reviewing on behalf of mycpuorg June 4, 2026 22:11

Copilot AI reviewed Jun 4, 2026


Potential fix for pull request finding

9aaa929

Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>

Copilot AI review requested due to automatic review settings June 4, 2026 22:51

Copilot started reviewing on behalf of mycpuorg June 4, 2026 22:51

Copilot AI reviewed Jun 4, 2026


mycpuorg and others added 2 commits June 4, 2026 16:01

Merge remote-tracking branch 'origin/users/mycpuorg/shared-weight-tra…

3a4e75a

…nsformer-checksums' into users/mycpuorg/shared-weight-transformer-checksums

oyazdanb mentioned this pull request Jun 5, 2026

race/fsdp: real transformer compute + observable per-layer checksum detector #211

Merged

4 tasks

Alkaid-Benetnash reviewed Jun 8, 2026

mycpuorg wants to merge 6 commits into main from users/mycpuorg/shared-weight-transformer-checksums


3 participants
