race/fsdp: shared-weight transformer with per-kernel int16 checksum verification #210
mycpuorg wants to merge 6 commits into
The dtype map in BaseReproducer only accepted the verbose PyTorch names (bfloat16, float16, float32). Add the common short-form aliases so recipes and CLI args can use either form.
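A minimal sketch of the alias idea, assuming a simple string-to-string table. The names `_DTYPE_ALIASES` and `resolve_dtype_name` are hypothetical and are not taken from `BaseReproducer`; only the six accepted spellings come from this commit message.

```python
# Hypothetical sketch: the accepted spellings come from the commit message;
# the table and function names are illustrative, not the actual aorta code.
_DTYPE_ALIASES = {
    "bfloat16": "bfloat16",
    "float16": "float16",
    "float32": "float32",
    "bf16": "bfloat16",  # short-form aliases added by this change
    "fp16": "float16",
    "fp32": "float32",
}


def resolve_dtype_name(name: str) -> str:
    """Map a verbose or short-form dtype string to its verbose PyTorch name."""
    try:
        return _DTYPE_ALIASES[name]
    except KeyError:
        valid = ", ".join(sorted(_DTYPE_ALIASES))
        raise ValueError(f"unknown dtype {name!r}; expected one of: {valid}") from None
```

In a PyTorch setting the resolved name could then be turned into a dtype object with `getattr(torch, resolve_dtype_name("bf16"))`, since `torch.bfloat16`, `torch.float16`, and `torch.float32` are module attributes.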
Add shared_layer_weights config field for the FSDP reproducer. When enabled with compute_type=transformer, all layers share a single weight matrix (fixed seed) and receive the same fixed reference input so every layer's forward output is analytically identical. Per-kernel int16 checksums (reinterpret-cast bf16 -> int16, sum in int64) wrap both the comm kernel (all_gather) and compute kernel (GEMM + GELU) with input/output checksums per layer. Cross-layer checksum comparison in _verify_layer_checksums() pinpoints whether corruption entered during communication (RCCL/NIC) or compute (GPU ALU) -- a second independent signal beyond the collective-buffer pattern check.
Replace compute_type=gemm with compute_type=transformer in the ainic-gdr-flush-sdc recipe. Use shared_layer_weights=true so all 24 layers share one weight matrix and a fixed reference input, enabling cross-layer int16 checksum comparison as a second corruption signal. Also switch dtype from bfloat16 to bf16 (short alias) and drop the now-unused gemm_size/gemm_layers fields.
Pull request overview
This PR updates the race FSDP reproducer to make cross-layer divergence a stronger corruption signal by optionally running a synthetic “identical-layer” transformer (shared weights + fixed reference input) and recording per-layer, per-kernel bitwise checksums. It also updates configuration and the AINIC GDR-flush SDC recipe to use the transformer compute path.
Changes:
- Add dtype short-form aliases (`bf16`, `fp16`, `fp32`) to the race reproducer base dtype resolver.
- Add `shared_layer_weights` to `ReproducerConfig` and implement shared-weight + fixed-input transformer behavior in FSDP mode with per-layer checksum verification.
- Update `ainic-gdr-flush-sdc.yaml` to switch from GEMM compute to transformer compute and enable `shared_layer_weights`.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| `src/aorta/race/modes/fsdp.py` | Adds shared-weight transformer buffers, checksum computation, and cross-layer checksum verification in FSDP mode. |
| `src/aorta/race/config.py` | Introduces `shared_layer_weights` config knob and documentation. |
| `src/aorta/race/base.py` | Adds dtype string aliases for bf16/fp16/fp32. |
| `recipes/ainic-gdr-flush-sdc.yaml` | Switches the recipe to transformer compute + shared weights and updates related documentation/comments. |
    # shared_layer_weights=true: all 24 layers share one weight matrix (fixed seed=0)
    # and each layer independently receives the same fixed reference input (seed=1)
    # rather than chaining activations layer-to-layer. This makes every layer's
    # forward output analytically identical so _verify_layer_activations() can serve
    # as a second independent corruption signal alongside the collective-buffer
    # pattern check: a mismatch across layers means an all_gather corruption on that
    # layer propagated through GEMM compute -- something the rank-fill pattern check
    # on full_param alone cannot catch.
    When True, all num_layers layers use the same weight matrix initialized with
    a fixed seed, and each layer receives the same fixed reference input rather
    than chaining activations. This makes per-layer forward outputs analytically
    identical so that cross-layer activation comparison can serve as a secondary
    corruption signal: any all_gather corruption that propagates through GEMM
    compute will produce a mismatch between layer outputs.
    @staticmethod
    def _checksum(tensor: torch.Tensor) -> int:
        """
        Bitwise checksum: reinterpret-cast to int16 and sum.

        bf16 (or any 16-bit dtype) is viewed as int16 so every bit pattern
        contributes to the checksum with zero information loss -- no float
        rounding, no abs(), and NaN / denorm bit patterns are included.
        Accumulation is done in int64 to avoid overflow.
        """
        return tensor.view(torch.int16).to(torch.int64).sum().item()
Co-authored-by: Copilot Autofix powered by AI <175728472+Copilot@users.noreply.github.com>
    use_shared = (
        self.config.shared_layer_weights
        and self.config.compute_type == "transformer"
        and self.reference_input is not None
    )
    if self.config.simulate_compute and self.weight_matrices:
        # Use batch_gpu for data dependency on first layer (H2D race opportunity)
        if layer_idx == 0:
            dim = self._dim
            batch_slice = self.batch_gpu[:dim * dim]
            self.activation = batch_slice.view(dim, dim)

        self.activation = torch.mm(
            self.weight_matrices[layer_idx], self.activation
        )
        self.activation = torch.nn.functional.gelu(self.activation)
        if use_shared:
            compute_input_cksum = self._checksum(self.reference_input)

            out = torch.mm(self.weight_matrices[layer_idx], self.reference_input)
_VALID_DTYPES only listed the verbose names (bfloat16, float16, float32). The base.py dtype map already accepts bf16/fp16/fp32 but the RaceWorkload adapter raised ValueError before reaching it, causing every trial to exit during setup() and be classified as did_not_run by the triage runner.
…nsformer-checksums' into users/mycpuorg/shared-weight-transformer-checksums
        rounding, no abs(), and NaN / denorm bit patterns are included.
        Accumulation is done in int64 to avoid overflow.
        """
        return tensor.view(torch.int16).to(torch.int64).sum().item()
Just to point out that overflow is expected. Since this checksum only cares about the binary value, not integer semantics, the choice of int16 or int64 for the accumulator affects the hash-collision rate rather than "overflow".
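To make the accumulator-width point concrete, here is a small stdlib-only sketch. The helper names `checksum_i64` and `checksum_i16` are hypothetical; they model a non-wrapping sum and a sum wrapped to the int16 range, and are not the PR's `_checksum()`.

```python
def checksum_i64(values):
    """Sum signed 16-bit values without wraparound (Python ints are unbounded)."""
    return sum(values)


def checksum_i16(values):
    """Same sum, wrapped into the int16 range as a 16-bit accumulator would."""
    return (sum(values) + 0x8000) % 0x10000 - 0x8000


# Two buffers whose sums differ by exactly 2**16.
a = [32767, 32767, 2]
b = [0, 0, 0]

# Two buffers that differ by compensating changes (+1 on one element, -1 on another).
c = [1, 2, 3]
d = [2, 1, 3]
```

Here `a` and `b` collide under the wrapped 16-bit sum but not under the wide sum, while `c` and `d` collide under both. A wider accumulator removes the first kind of collision only; any additive checksum still misses changes that cancel in the sum.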
Summary
The AINIC GDR flush SDC recipe (`ainic-gdr-flush-sdc.yaml`) used `compute_type: gemm` with independently-random weight matrices per layer. This makes cross-layer comparison meaningless: gradient/activation differences across layers could be numerical divergence rather than corruption.

This PR switches to a synthetic identical-layer transformer model: all layers share one weight matrix (fixed seed) and receive the same fixed reference input, so every layer's forward output is analytically identical. Any divergence across layers is a corruption signal, not numerical noise.
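The identical-layer idea can be illustrated with a tiny pure-Python model. This is only a sketch under assumptions: `make_matrix`, `matmul`, `gelu`, and `forward` are hypothetical helpers, the 4x4 size and the seeds 100+ for the independent case are arbitrary, and the real reproducer uses PyTorch tensors on GPU. Seeds 0 and 1 mirror the values named in the recipe comments.

```python
import math
import random


def make_matrix(seed: int, dim: int) -> list[list[float]]:
    """Build a dim x dim matrix from a fixed seed so it is reproducible."""
    rng = random.Random(seed)
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(dim)]


def matmul(a, b):
    """Plain triple-loop matrix product of two square matrices."""
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def gelu(x: float) -> float:
    """Exact (erf-based) GELU."""
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))


def forward(weight, inp):
    """One synthetic layer: GEMM followed by GELU."""
    return [[gelu(v) for v in row] for row in matmul(weight, inp)]


DIM, NUM_LAYERS = 4, 3
shared_weight = make_matrix(seed=0, dim=DIM)
reference_input = make_matrix(seed=1, dim=DIM)

# Shared weight + fixed input: every layer performs the same computation.
shared_outputs = [forward(shared_weight, reference_input)
                  for _ in range(NUM_LAYERS)]

# Independent random weights per layer: outputs legitimately differ.
independent_outputs = [forward(make_matrix(seed=100 + i, dim=DIM), reference_input)
                       for i in range(NUM_LAYERS)]
```

With shared weights the per-layer outputs compare equal element for element, so any difference would have to come from outside the math. With independent weights the outputs differ by construction, which is why cross-layer comparison carried no signal before.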
Per-kernel int16 checksums
Following the bitwise checksum design from the onsite whiteboard (Jun 2 2026): each tensor is reinterpret-cast from bf16 to int16 and summed in int64. The cast loses no information and involves no float rounding, so any single bit flip changes the checksum.
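As a rough illustration of why the sum is taken over bit patterns rather than float values, the stdlib-only sketch below builds bf16-style patterns by hand. The helpers `bf16_bits`, `as_int16`, and `checksum` are hypothetical, and `bf16_bits` assumes plain truncation of the float32 encoding purely to obtain example patterns.

```python
import struct


def bf16_bits(x: float) -> int:
    """Return a 16-bit pattern for x: the upper half of its float32 encoding.

    Assumption: plain truncation, used only to get example bit patterns;
    real float32 -> bf16 conversion normally rounds to nearest even.
    """
    (u32,) = struct.unpack("<I", struct.pack("<f", x))
    return u32 >> 16


def as_int16(bits: int) -> int:
    """Reinterpret an unsigned 16-bit pattern as a signed int16 value."""
    return bits - 0x10000 if bits & 0x8000 else bits


def checksum(patterns: list[int]) -> int:
    """Sum of int16 reinterpretations; Python ints do not overflow."""
    return sum(as_int16(b) for b in patterns)


clean = [bf16_bits(v) for v in (1.0, 0.0, -2.5)]
flipped = list(clean)
flipped[1] ^= 0x8000  # flip the sign bit of the middle element: 0.0 becomes -0.0
```

A float sum cannot see this flip, because `1.0 + 0.0 - 2.5` and `1.0 + (-0.0) - 2.5` are both `-1.5`, whereas the integer checksum moves from `-96` to `-32864`.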
Four checksums are recorded per layer, wrapping both kernels:
- `comm_input`
- `comm_output`
- `compute_input`
- `compute_output`

`_verify_layer_checksums()` compares all four across layers. The diagnostic pinpoints where corruption entered:

- `comm_output` diverges, `comm_input` matches → corruption in the collective (RCCL/NIC path)
- `compute_output` diverges, `comm_output` matches → corruption in GPU compute (ALU)

Changes

- `src/aorta/race/base.py`: Add `bf16`, `fp16`, `fp32` as dtype short-form aliases
- `src/aorta/race/config.py`: Add `shared_layer_weights: bool` field to `ReproducerConfig`
- `src/aorta/race/modes/fsdp.py`:
  - `_checksum()`: `tensor.view(int16).to(int64).sum()` (bitwise, lossless)
  - `setup_buffers()`: shared weight matrix (seed=0) + fixed reference input (seed=1) when `shared_layer_weights=True`
  - `_forward_layer()`: records comm and compute input/output checksums per layer
  - `_verify_layer_checksums()`: cross-layer comparison of all four checksum keys
- `recipes/ainic-gdr-flush-sdc.yaml`: Switch to `compute_type: transformer`, `shared_layer_weights: true`, `dtype: bf16`; drop dead `gemm_size`/`gemm_layers` fields

Test plan

- `aorta triage run --recipe recipes/ainic-gdr-flush-sdc.yaml --dry-run` passes with no "ignoring unknown" warnings
- `LAYER_CHECKSUM_MISMATCH (comm_output)` is logged with correct layer index
- `_verify_all_gather` / `_verify_reduce_scatter` pattern checks still fire independently
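The localization rule from the "Per-kernel int16 checksums" section can be sketched as a small classifier. This is not the PR's `_verify_layer_checksums()`: the function name `diagnose`, the per-layer dict layout, the majority-vote reference, and the `"unclassified"` fallback are all assumptions made for illustration. Only the four key names and the two diagnostic rules come from the description.

```python
from collections import Counter

CHECKSUM_KEYS = ("comm_input", "comm_output", "compute_input", "compute_output")


def diagnose(per_layer):
    """Classify layers whose checksums diverge from the cross-layer majority.

    per_layer: list of dicts mapping each checksum key to an int.
    Returns a list of (layer_index, diagnosis) tuples.
    Assumption: the most common value per key is treated as the reference.
    """
    reference = {
        key: Counter(layer[key] for layer in per_layer).most_common(1)[0][0]
        for key in CHECKSUM_KEYS
    }
    findings = []
    for idx, layer in enumerate(per_layer):
        bad = {key for key in CHECKSUM_KEYS if layer[key] != reference[key]}
        if "comm_output" in bad and "comm_input" not in bad:
            findings.append((idx, "comm"))  # collective (RCCL/NIC path)
        elif "compute_output" in bad and "comm_output" not in bad:
            findings.append((idx, "compute"))  # GPU compute (ALU)
        elif bad:
            findings.append((idx, "unclassified"))
    return findings


good = {"comm_input": 10, "comm_output": 20, "compute_input": 30, "compute_output": 40}
layers = [
    dict(good),
    dict(good),
    {**good, "comm_output": 21, "compute_output": 41},  # corrupted in all_gather
    {**good, "compute_output": 99},                     # corrupted in GEMM/GELU
]
findings = diagnose(layers)
```

For this input `findings` is `[(2, "comm"), (3, "compute")]`. Majority voting is one possible choice of reference; comparing every layer against layer 0 would be simpler but would misattribute the fault if layer 0 itself were the corrupted one.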