race/fsdp: shared-weight transformer with per-kernel int16 checksum verification #210

Open
mycpuorg wants to merge 6 commits into main from users/mycpuorg/shared-weight-transformer-checksums
mycpuorg (Collaborator) commented Jun 4, 2026

Summary

The AINIC GDR flush SDC recipe (ainic-gdr-flush-sdc.yaml) used compute_type: gemm with independently-random weight matrices per layer. This makes cross-layer comparison meaningless — gradient/activation differences across layers could be numerical divergence rather than corruption.

This PR switches to a synthetic identical-layer transformer model: all layers share one weight matrix (fixed seed) and receive the same fixed reference input, so every layer's forward output is analytically identical. Any divergence across layers is a corruption signal, not numerical noise.

Per-kernel int16 checksums

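Following the bitwise checksum design from the onsite whiteboard (Jun 2 2026): each tensor is reinterpret-cast from bf16 to int16 and summed in int64. The cast loses no information and involves no float rounding, so any single bit flip changes the checksum; as with any additive checksum, multiple flips can still cancel each other out.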
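The checksum described above can be sketched without PyTorch. This is a pure-Python stand-in for `tensor.view(torch.int16).to(torch.int64).sum()`, not the PR's code: `bf16_bits`, `as_int16`, and `checksum` are hypothetical helper names, and the bf16 patterns are produced by truncating a float32, which is an assumption made only to get realistic 16-bit words.

```python
import struct


def bf16_bits(value: float) -> int:
    """Return a 16-bit bf16 pattern for value (truncating the float32, not rounding)."""
    (bits32,) = struct.unpack("<I", struct.pack("<f", value))
    return bits32 >> 16


def as_int16(bits: int) -> int:
    """Reinterpret an unsigned 16-bit pattern as a signed int16."""
    return bits - 0x10000 if bits & 0x8000 else bits


def checksum(words: list[int]) -> int:
    """Sum the int16 reinterpretations; Python ints are unbounded, like a wide accumulator."""
    return sum(as_int16(w) for w in words)


# NaN is included on purpose: its bit pattern contributes like any other word.
words = [bf16_bits(v) for v in (1.0, -2.5, 0.15625, float("nan"))]
clean = checksum(words)

corrupted = list(words)
corrupted[2] ^= 1 << 3  # flip one mantissa bit in element 2
assert checksum(corrupted) != clean
```

Because the words are compared as integers, NaN payloads and denormals need no special handling, which is the property the design relies on.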

Four checksums are recorded per layer, wrapping both kernels:

| Checkpoint | What it checksums |
| --- | --- |
| comm_input | param shard before all_gather |
| comm_output | full_param after all_gather |
| compute_input | reference_input before GEMM |
| compute_output | activation after GELU |

_verify_layer_checksums() compares all four across layers. The diagnostic pinpoints where corruption entered:

  • comm_output diverges, comm_input matches → corruption in the collective (RCCL/NIC path)
  • compute_output diverges, comm_output matches → corruption in GPU compute (ALU)
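The two diagnostic rules above can be written as a small decision function. This is a minimal sketch, assuming each layer's checksums are held in a dict keyed by the four checkpoint names; `diverging_keys`, `diagnose`, and the returned labels are hypothetical and are not taken from `_verify_layer_checksums()`.

```python
KEYS = ("comm_input", "comm_output", "compute_input", "compute_output")


def diverging_keys(layers: list[dict[str, int]]) -> set[str]:
    """Return the checkpoint names whose checksum is not identical across layers."""
    return {key for key in KEYS if len({layer[key] for layer in layers}) > 1}


def diagnose(layers: list[dict[str, int]]) -> str:
    """Apply the two rules from the PR description; anything else is left unclassified."""
    bad = diverging_keys(layers)
    if not bad:
        return "clean"
    if "comm_output" in bad and "comm_input" not in bad:
        return "collective (RCCL/NIC path)"
    if "compute_output" in bad and "comm_output" not in bad:
        return "GPU compute (ALU)"
    return "unclassified"


base = {"comm_input": 11, "comm_output": 22, "compute_input": 33, "compute_output": 44}
layers = [dict(base) for _ in range(4)]
layers[2]["compute_output"] = 45  # only the compute result differs on layer 2
print(diagnose(layers))
```

Combinations the description does not cover (for example a diverging `comm_input`) are deliberately reported as unclassified rather than guessed at.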

Changes

  • src/aorta/race/base.py: Add bf16, fp16, fp32 as dtype short-form aliases
  • src/aorta/race/config.py: Add shared_layer_weights: bool field to ReproducerConfig
  • src/aorta/race/modes/fsdp.py:
    • _checksum(): tensor.view(int16).to(int64).sum() — bitwise, lossless
    • setup_buffers(): shared weight matrix (seed=0) + fixed reference input (seed=1) when shared_layer_weights=True
    • _forward_layer(): records comm and compute input/output checksums per layer
    • _verify_layer_checksums(): cross-layer comparison of all four checksum keys
  • recipes/ainic-gdr-flush-sdc.yaml: Switch to compute_type: transformer, shared_layer_weights: true, dtype: bf16; drop dead gemm_size/gemm_layers fields
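To illustrate why the shared-weight setup makes layer outputs identical, here is a small pure-Python sketch. It only mirrors the idea in `setup_buffers()` and `_forward_layer()` (one weight matrix from seed 0, one reference input from seed 1, GEMM followed by GELU); `make_matrix` and `forward` are hypothetical helpers, and the tiny dimensions and `random.Random` generator are assumptions chosen for readability.

```python
import math
import random


def gelu(x: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def make_matrix(seed: int, dim: int) -> list[list[float]]:
    """Build a dim x dim matrix from a fixed seed so it is reproducible."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def forward(weight: list[list[float]], ref: list[list[float]]) -> list[list[float]]:
    """Compute gelu(weight @ ref) for square matrices."""
    dim = len(weight)
    return [
        [gelu(sum(weight[i][k] * ref[k][j] for k in range(dim))) for j in range(dim)]
        for i in range(dim)
    ]


dim, num_layers = 4, 3
shared_weight = make_matrix(seed=0, dim=dim)    # one weight matrix for every layer
reference_input = make_matrix(seed=1, dim=dim)  # same input fed to every layer

# No activation chaining: each layer sees the same operands, so results are bit-identical.
outputs = [forward(shared_weight, reference_input) for _ in range(num_layers)]
assert all(out == outputs[0] for out in outputs)
```

The key design point is that no layer consumes another layer's output, so a difference between layers cannot be explained by accumulated rounding.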

Test plan

  • aorta triage run --recipe recipes/ainic-gdr-flush-sdc.yaml --dry-run passes with no "ignoring unknown" warnings
  • Single-node smoke run confirms all layer checksums match (zero corruption on clean hardware)
  • Inject synthetic corruption (flip bits in one layer's all_gather output) and verify LAYER_CHECKSUM_MISMATCH (comm_output) is logged with correct layer index
  • Existing _verify_all_gather / _verify_reduce_scatter pattern checks still fire independently
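The synthetic-corruption step in the test plan could look roughly like the sketch below. It assumes 24 identical layers (the layer count named in the recipe commit), represents each layer's buffer as a list of 16-bit words, and picks the reference checksum by majority vote; the majority rule and the names `checksum` and `mismatched_layers` are assumptions, since the PR does not say how the reference layer is chosen.

```python
from collections import Counter


def checksum(words: list[int]) -> int:
    """Sum of the words reinterpreted as signed int16."""
    return sum(w - 0x10000 if w & 0x8000 else w for w in words)


def mismatched_layers(checksums: list[int]) -> list[int]:
    """Return indices whose checksum differs from the most common value."""
    majority, _count = Counter(checksums).most_common(1)[0]
    return [i for i, value in enumerate(checksums) if value != majority]


# 24 layers holding the same bf16 bit patterns (1.0, -2.5, 0.15625).
layers = [[0x3F80, 0xC020, 0x3E20] for _ in range(24)]
layers[7][1] ^= 1 << 9  # inject a single bit flip into layer 7

sums = [checksum(words) for words in layers]
assert mismatched_layers(sums) == [7]
```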

mycpuorg added 3 commits June 4, 2026 15:06
The dtype map in BaseReproducer only accepted the verbose PyTorch
names (bfloat16, float16, float32).  Add the common short-form
aliases so recipes and CLI args can use either form.

Add shared_layer_weights config field for the FSDP reproducer.  When
enabled with compute_type=transformer, all layers share a single
weight matrix (fixed seed) and receive the same fixed reference input
so every layer's forward output is analytically identical.

Per-kernel int16 checksums (reinterpret-cast bf16 -> int16, sum in
int64) wrap both the comm kernel (all_gather) and compute kernel
(GEMM + GELU) with input/output checksums per layer.  Cross-layer
checksum comparison in _verify_layer_checksums() pinpoints whether
corruption entered during communication (RCCL/NIC) or compute (GPU
ALU) -- a second independent signal beyond the collective-buffer
pattern check.

Replace compute_type=gemm with compute_type=transformer in the
ainic-gdr-flush-sdc recipe.  Use shared_layer_weights=true so all 24
layers share one weight matrix and a fixed reference input, enabling
cross-layer int16 checksum comparison as a second corruption signal.

Also switch dtype from bfloat16 to bf16 (short alias) and drop the
now-unused gemm_size/gemm_layers fields.
Copilot AI review requested due to automatic review settings June 4, 2026 22:10

Copilot AI (Contributor) left a comment

Pull request overview

This PR updates the race FSDP reproducer to make cross-layer divergence a stronger corruption signal by optionally running a synthetic “identical-layer” transformer (shared weights + fixed reference input) and recording per-layer, per-kernel bitwise checksums. It also updates configuration and the AINIC GDR-flush SDC recipe to use the transformer compute path.

Changes:

  • Add dtype short-form aliases (bf16, fp16, fp32) to the race reproducer base dtype resolver.
  • Add shared_layer_weights to ReproducerConfig and implement shared-weight + fixed-input transformer behavior in FSDP mode with per-layer checksum verification.
  • Update ainic-gdr-flush-sdc.yaml to switch from GEMM compute to transformer compute and enable shared_layer_weights.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| src/aorta/race/modes/fsdp.py | Adds shared-weight transformer buffers, checksum computation, and cross-layer checksum verification in FSDP mode. |
| src/aorta/race/config.py | Introduces shared_layer_weights config knob and documentation. |
| src/aorta/race/base.py | Adds dtype string aliases for bf16/fp16/fp32. |
| recipes/ainic-gdr-flush-sdc.yaml | Switches the recipe to transformer compute + shared weights and updates related documentation/comments. |

Comment thread recipes/ainic-gdr-flush-sdc.yaml Outdated
Comment on lines +99 to +106
# shared_layer_weights=true: all 24 layers share one weight matrix (fixed seed=0)
# and each layer independently receives the same fixed reference input (seed=1)
# rather than chaining activations layer-to-layer. This makes every layer's
# forward output analytically identical so _verify_layer_activations() can serve
# as a second independent corruption signal alongside the collective-buffer
# pattern check: a mismatch across layers means an all_gather corruption on that
# layer propagated through GEMM compute -- something the rank-fill pattern check
# on full_param alone cannot catch.
Comment thread src/aorta/race/config.py
Comment on lines +166 to +172
When True, all num_layers layers use the same weight matrix initialized with
a fixed seed, and each layer receives the same fixed reference input rather
than chaining activations. This makes per-layer forward outputs analytically
identical so that cross-layer activation comparison can serve as a secondary
corruption signal: any all_gather corruption that propagates through GEMM
compute will produce a mismatch between layer outputs.

Comment on lines +190 to +200
    @staticmethod
    def _checksum(tensor: torch.Tensor) -> int:
        """
        Bitwise checksum: reinterpret-cast to int16 and sum.

        bf16 (or any 16-bit dtype) is viewed as int16 so every bit pattern
        contributes to the checksum with zero information loss -- no float
        rounding, no abs(), and NaN / denorm bit patterns are included.
        Accumulation is done in int64 to avoid overflow.
        """
        return tensor.view(torch.int16).to(torch.int64).sum().item()
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings June 4, 2026 22:51

Copilot AI (Contributor) left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

Comment on lines +218 to +222
        use_shared = (
            self.config.shared_layer_weights
            and self.config.compute_type == "transformer"
            and self.reference_input is not None
        )
Comment on lines 236 to +240
        if self.config.simulate_compute and self.weight_matrices:
            # Use batch_gpu for data dependency on first layer (H2D race opportunity)
            if layer_idx == 0:
                dim = self._dim
                batch_slice = self.batch_gpu[:dim * dim]
                self.activation = batch_slice.view(dim, dim)

            self.activation = torch.mm(
                self.weight_matrices[layer_idx], self.activation
            )
            self.activation = torch.nn.functional.gelu(self.activation)
            if use_shared:
                compute_input_cksum = self._checksum(self.reference_input)

                out = torch.mm(self.weight_matrices[layer_idx], self.reference_input)
mycpuorg and others added 2 commits June 4, 2026 16:01
_VALID_DTYPES only listed the verbose names (bfloat16, float16,
float32).  The base.py dtype map already accepts bf16/fp16/fp32 but
the RaceWorkload adapter raised ValueError before reaching it,
causing every trial to exit during setup() and be classified as
did_not_run by the triage runner.
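The failure described in this commit comes from two lists of accepted dtype names drifting apart. One way to avoid that, sketched here under assumptions, is to derive validation from the same alias table used for resolution. `DTYPE_ALIASES`, `CANONICAL_DTYPES`, and `canonical_dtype` are hypothetical names and are not the actual contents of `base.py` or the RaceWorkload adapter.

```python
# Single source of truth: short-form alias -> verbose PyTorch-style name.
DTYPE_ALIASES = {"bf16": "bfloat16", "fp16": "float16", "fp32": "float32"}
CANONICAL_DTYPES = frozenset(DTYPE_ALIASES.values())


def canonical_dtype(name: str) -> str:
    """Resolve an alias or verbose name; reject anything else."""
    resolved = DTYPE_ALIASES.get(name, name)
    if resolved not in CANONICAL_DTYPES:
        raise ValueError(f"unsupported dtype: {name!r}")
    return resolved


print(canonical_dtype("bf16"))
print(canonical_dtype("float32"))
```

With this shape, an adapter that validates by calling `canonical_dtype()` accepts exactly the names the resolver accepts.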
…nsformer-checksums' into users/mycpuorg/shared-weight-transformer-checksums
        rounding, no abs(), and NaN / denorm bit patterns are included.
        Accumulation is done in int64 to avoid overflow.
        """
        return tensor.view(torch.int16).to(torch.int64).sum().item()

Just to point out that overflow is expected, and it is harmless here because this checksum only cares about the binary value, not integer semantics. Choosing int16 or int64 as the accumulator affects the "hash-collision" rate rather than "overflow".
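The reviewer's point can be illustrated with a wrapping accumulator of configurable width. This is a sketch under assumptions, not the PR's `_checksum()`: the `bits` parameter and the sample word lists are hypothetical, chosen to show that a narrower accumulator still detects a single flip but collides on more corruption patterns.

```python
def checksum(words: list[int], bits: int) -> int:
    """Sum signed int16 values, wrapping to a `bits`-wide accumulator."""
    total = sum(w - 0x10000 if w & 0x8000 else w for w in words)
    return total & ((1 << bits) - 1)


clean = [0x0000, 0x0000, 0x1235]
single = [0x0000, 0x0010, 0x1235]  # one bit flipped in element 1
double = [0x8000, 0x8000, 0x1235]  # sign bit flipped in elements 0 and 1
cancel = [0x0001, 0x0000, 0x1234]  # +1 in element 0, -1 in element 2

# A single flip is visible at either width.
assert checksum(single, 16) != checksum(clean, 16)
assert checksum(single, 64) != checksum(clean, 64)

# Two sign-bit flips shift the sum by -65536: invisible at 16 bits, visible at 64.
assert checksum(double, 16) == checksum(clean, 16)
assert checksum(double, 64) != checksum(clean, 64)

# Flips that cancel exactly collide at any width.
assert checksum(cancel, 64) == checksum(clean, 64)
```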

3 participants