race/fsdp: real transformer compute + observable per-layer checksum detector#211

Merged
oyazdanb merged 6 commits into main from users/oyazdanb/race-transformer-compute
Jun 5, 2026

oyazdanb commented Jun 5, 2026

Collaborator

Summary

Builds on #210 (shared-weight transformer + int16 per-layer checksums) and makes its compute_type: transformer path run a real transformer, report observable checksum metrics, and validate its configuration. This branch was cut from #210's branch, so its diff includes #210's commits; it is intended to supersede or land on top of #210 (coordinate with @mycpuorg on merge order).

Three changes:

  1. Real transformer compute. compute_type: transformer previously did a single gelu(weight_matrix @ reference_input) matmul, which is not a transformer and ran at ~0.0 ms/step (no real memory pressure). It now runs a real RepeatedTransformerBlock (the model llm_determinism uses: real MHA + GLU FFN + LayerNorm), borrowed from aorta.models (no new model file). One shared block (fixed seed, rank-identical via fork_rng) plus a fixed 3D reference input keeps every layer's output byte-identical, so #210's per-layer checksum invariant still holds while the block supplies real-model L2/HBM pressure. The explicit all_gather/reduce_scatter calls (the AINIC data path under test) are untouched.

  2. Observability. A clean run previously emitted no checksum signal, so a green result was indistinguishable from a no-op. Now layers_verified + layer_checksum_mismatches (+ compute_type) are surfaced in WorkloadResult.metrics, and the FSDP startup log names the active compute path (transformer_block=active layer_checksum_verify=ON) so a silent GEMM fallback is greppable. layers_verified > 0 proves the detector ran.

  3. Validation. Unknown compute_type values now raise (a typo like transfomer used to silently fall back to GEMM = false green); shared_layer_weights without compute_type=transformer warns.

New ReproducerConfig fields: num_heads, ffn_size, seq_len, batch_size.

Not real FSDP (by design)

This harness simulates the FSDP comm pattern with explicit collectives; it does not use torch.distributed FSDP1/FSDP2. RepeatedTransformerBlock is a compute kernel only — not under fully_shard. Explicit collectives are what give the clean rank-fill + shared-input checksum invariant. Real-FSDP coverage is a separate deferred workload (PR D).
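
The rank-fill invariant can be illustrated without torch.distributed. The sketch below is a pure-Python stand-in, assuming each rank's shard is filled with float(rank) and that all_gather concatenates shards in rank order. `make_shard`, `simulate_all_gather`, and `find_bad_positions` are hypothetical helpers for illustration, not functions in the harness.

```python
from typing import List


def make_shard(rank: int, shard_len: int) -> List[float]:
    """Every element of a rank's shard holds that rank's id as a float."""
    return [float(rank)] * shard_len


def simulate_all_gather(world_size: int, shard_len: int) -> List[float]:
    """Pure-Python stand-in for all_gather: concatenate shards in rank order."""
    gathered: List[float] = []
    for rank in range(world_size):
        gathered.extend(make_shard(rank, shard_len))
    return gathered


def find_bad_positions(gathered: List[float], shard_len: int) -> List[int]:
    """Return indices whose value differs from the id of the rank owning that slot."""
    return [
        i for i, value in enumerate(gathered)
        if value != float(i // shard_len)
    ]


clean = simulate_all_gather(world_size=4, shard_len=3)
corrupted = list(clean)
corrupted[7] = 0.0  # slot 7 belongs to rank 2; simulate a stale or dropped write
```

Because the expected value of every slot is known from its position alone, any deviation in the gathered buffer points at the communication step rather than at compute.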

Test plan

CPU-only, no GPU / no torch.distributed — runs in ~2 s:

  • pytest tests/workloads/test_race_transformer_smoke.py -v — drives the real block + checksum verifier end to end: real forward runs, shared weights make layers byte-identical, injected corruption is caught and localized to compute.
  • pytest tests/workloads/test_race_checksums.py -v — verifier unit tests (clean pass, comm-corruption localized, compute-corruption localized, edge cases) incl. the new layers_verified/mismatches counters.
  • pytest tests/workloads/test_race.py -v — adds compute_type validation cases. (15 race tests pass together.)
  • Cluster: install branch (pip install -e .) on the 5-node MI355X AINIC, run recipes/ainic-gdr-flush-sdc.yaml; confirm metrics show compute_type: transformer, layers_verified > 0, non-zero avg_step_time_ms. (in progress)
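
The cluster check in the last bullet could be automated roughly as follows. This is a sketch under assumptions: it treats the metrics as a flat dict keyed by the names listed above, and `detector_problems` is a hypothetical helper, not part of the repository. The sample values are illustrative, not output from a real run.

```python
from typing import Any, Dict, List


def detector_problems(metrics: Dict[str, Any]) -> List[str]:
    """Return reasons a 'green' result should not be trusted (empty list means OK)."""
    problems: List[str] = []
    if metrics.get("compute_type") != "transformer":
        problems.append(
            f"compute_type is {metrics.get('compute_type')!r}, expected 'transformer'"
        )
    if metrics.get("layers_verified", 0) <= 0:
        problems.append("layers_verified is 0: the checksum detector never ran")
    if metrics.get("layer_checksum_mismatches", 0) != 0:
        problems.append("layer_checksum_mismatches is non-zero")
    if metrics.get("avg_step_time_ms", 0.0) <= 0.0:
        problems.append("avg_step_time_ms is 0: no real compute was timed")
    return problems


# Illustrative values only.
good = {
    "compute_type": "transformer",
    "layers_verified": 64,
    "layer_checksum_mismatches": 0,
    "avg_step_time_ms": 12.5,
}
silent_fallback = {
    "compute_type": "gemm",
    "layers_verified": 0,
    "layer_checksum_mismatches": 0,
    "avg_step_time_ms": 0.0,
}
```

The second dict shows the case this PR is meant to expose: zero mismatches alone looks green, but the other three fields reveal that the transformer path never ran.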

🤖 Generated with Claude Code

oyazdanb and others added 2 commits June 5, 2026 11:32
…etector

Builds on PR #210 (shared-weight transformer + int16 checksums). Three changes:

1. compute_type=transformer now runs a REAL transformer block borrowed from
   aorta.models.RepeatedTransformerBlock (the model llm_determinism uses: real
   MHA + GLU FFN + LayerNorm) instead of a single gelu(W @ x) matmul. One
   shared block (fixed seed, rank-identical via fork_rng) + a fixed 3D
   reference input keeps every layer's output byte-identical, preserving the
   per-layer checksum invariant while supplying real-model L2/HBM pressure.
   Explicit all_gather/reduce_scatter (the AINIC data path) are untouched.

2. Observability: a clean run previously emitted no checksum signal, so green
   was indistinguishable from a no-op. Track layers_verified and
   layer_checksum_mismatches and surface them (plus compute_type) in
   WorkloadResult.metrics; the FSDP startup log now names the active compute
   path so a silent GEMM fallback is greppable.

3. Validation: reject unknown compute_type values (a typo like "transfomer"
   used to silently fall back to GEMM = false green) and warn when
   shared_layer_weights is set without compute_type=transformer.

New config fields: num_heads, ffn_size, seq_len, batch_size.

Tests (CPU-only, no GPU/dist): test_race_transformer_smoke.py drives the real
block + checksum verifier end to end (clean pass + injected-corruption catch);
test_race_checksums.py covers the verifier; test_race.py adds compute_type
validation cases. 15 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Module docstring note: mode=fsdp simulates the FSDP comm pattern with explicit
all_gather/reduce_scatter; it does not use FSDP1/FSDP2. RepeatedTransformerBlock
is reused as a compute kernel only, not under fully_shard. Pre-empts the
"why not FSDP2 like llm_determinism?" question; real-FSDP is the deferred PR D.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 16:07

Copilot AI left a comment

Contributor

Pull request overview

Updates the race workload’s simulated FSDP mode so compute_type: transformer can run a real transformer block (via aorta.models.RepeatedTransformerBlock), and makes the per-layer checksum detector observable and safer to configure (validation + metrics). This strengthens the AINIC reproducer by ensuring “green” runs prove the detector executed and by increasing compute realism/pressure without switching to real FSDP.

Changes:

  • Implement real transformer-block compute for the shared-weight transformer path in FSDPModeReproducer, plus checksum observability counters.
  • Validate compute_type (reject typos) and surface compute_type / checksum verification counters into WorkloadResult.metrics.
  • Add CPU-only unit + smoke tests covering config validation and per-layer checksum verification behavior.

Reviewed changes

Copilot reviewed 9 out of 9 changed files in this pull request and generated 5 comments.

Show a summary per file
| File | Description |
| --- | --- |
| src/aorta/race/modes/fsdp.py | Runs a real RepeatedTransformerBlock on the shared transformer path; records and verifies per-layer checksums; logs active compute path. |
| src/aorta/race/config.py | Adds transformer-shape knobs (num_heads, ffn_size, seq_len, batch_size) and result counters for detector observability. |
| src/aorta/race/base.py | Adds dtype aliases and plumbs checksum observability counters into ReproducerResult. |
| src/aorta/workloads/race.py | Validates compute_type, warns on inert config combos, and exports detector metrics via WorkloadResult.metrics. |
| tests/workloads/test_race.py | Adds tests for compute_type validation and warning behavior. |
| tests/workloads/test_race_checksums.py | Unit tests for _verify_layer_checksums (clean pass + corruption localization + edge cases). |
| tests/workloads/test_race_transformer_smoke.py | CPU smoke test that executes a real transformer block and exercises the checksum verifier end-to-end. |
| recipes/ainic-gdr-flush-sdc.yaml | Switches the AINIC recipe to compute_type: transformer + shared weights for stronger checksum invariants. |
| docs/design/race-transformer-compute-support.md | Adds a design/plan document describing the intended transformer compute + observability approach. |

Comment on lines +187 to +191
with torch.random.fork_rng(devices=["cuda"]):
    torch.cuda.manual_seed(0)
    self.shared_block = (
        RepeatedTransformerBlock(block_cfg).to("cuda").to(self.dtype)
    )
Comment thread src/aorta/race/config.py Outdated
Comment on lines +184 to +190
Use a single shared weight matrix for all layers (transformer compute type).

When True, all num_layers layers use the same weight matrix initialized with
a fixed seed, and each layer receives the same fixed reference input rather
than chaining activations. This makes per-layer forward outputs analytically
identical so that cross-layer activation comparison can serve as a secondary
corruption signal: any all_gather corruption that propagates through GEMM
Comment thread src/aorta/race/modes/fsdp.py Outdated
Comment on lines +459 to +464
Four checksums per layer:
    comm_input     -- param shard before all_gather (should be identical:
                      every shard is filled with float(rank))
    comm_output    -- full_param after all_gather
    compute_input  -- reference_input fed to GEMM (constant across layers)
    compute_output -- activation after GELU
Comment thread recipes/ainic-gdr-flush-sdc.yaml Outdated
Comment on lines +100 to +102
# and each layer independently receives the same fixed reference input (seed=1)
# rather than chaining activations layer-to-layer. This makes every layer's
# forward output analytically identical so _verify_layer_activations() can serve
Comment on lines +1 to +5
# Plan — Fully support `compute_type: transformer` + shared-weight checksums (PR #210)

**Branch:** `users/oyazdanb/race-transformer-compute` (off `main`)
**Status:** DRAFT — for review before implementation
**Related:** PR #210 (`users/mycpuorg/shared-weight-transformer-checksums`), task `race__cross-rank-and-iter0-detection.md`
oyazdanb and others added 2 commits June 5, 2026 12:17
num_heads and ffn_size auto-derive when 0 (model_dim//128, model_dim*4), so the
config value alone doesn't record what actually ran. Capture the resolved values
and surface them in the FSDP startup log (resolved_shape=...) and in
WorkloadResult.metrics (eff_num_heads/eff_ffn_size/eff_seq_len/eff_batch_size),
so the transformer shape a run used is provable from the result JSON. getattr in
base.run keeps it safe for non-FSDP modes. Adds an auto-derive unit test.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
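
The auto-derive rule in this commit message can be sketched as below. `ShapeConfig` and `resolve_shape` are hypothetical names, and the seq_len and batch_size defaults are placeholders rather than values from the PR. Only the two derivation rules (model_dim // 128 and model_dim * 4) and the eff_* metric names come from the text above.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class ShapeConfig:
    """Hypothetical subset of the transformer-shape knobs; 0 means auto-derive."""
    model_dim: int
    num_heads: int = 0
    ffn_size: int = 0
    seq_len: int = 128    # placeholder default, not from the PR
    batch_size: int = 1   # placeholder default, not from the PR


def resolve_shape(cfg: ShapeConfig) -> Dict[str, int]:
    """Resolve auto-derived fields so the effective shape can be reported."""
    num_heads = cfg.num_heads or cfg.model_dim // 128
    ffn_size = cfg.ffn_size or cfg.model_dim * 4
    return {
        "eff_num_heads": num_heads,
        "eff_ffn_size": ffn_size,
        "eff_seq_len": cfg.seq_len,
        "eff_batch_size": cfg.batch_size,
    }


auto = resolve_shape(ShapeConfig(model_dim=4096))
explicit = resolve_shape(ShapeConfig(model_dim=4096, num_heads=16, ffn_size=8192))
```

Reporting the resolved values matters because a config that stores 0 for num_heads says nothing about what ran. Note that the stated rule yields 0 heads for model_dim below 128; how the harness treats that case is not covered here.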
- fsdp.py: also torch.manual_seed(0) inside fork_rng. RepeatedTransformerBlock
  inits params on CPU before .to("cuda"), so seeding only the CUDA RNG left the
  shared block's weights dependent on each rank's CPU RNG state -- ranks could
  diverge while the intra-rank per-layer checksum still passed (false green).
  Seeding CPU RNG too restores the rank-identical invariant the comment claims.
- config.py / fsdp.py / recipe: update shared_layer_weights + _verify_layer_checksums
  docstrings and the recipe comment that still described the old GEMM (gelu(W@x))
  path and the wrong function name (_verify_layer_activations); the shared path
  now runs a real RepeatedTransformerBlock.
- remove docs/design/race-transformer-compute-support.md from the PR (it was a
  pre-implementation draft; the PR description carries the summary).

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
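
The seeding fix in the first bullet can be reproduced on CPU. The sketch below is a minimal illustration, assuming nn.Linear as a stand-in for RepeatedTransformerBlock (both initialize parameters on CPU); `build_shared_block` is a hypothetical helper. It forks only the CPU generator (devices=[]) so it runs without a GPU, whereas the PR code lists CUDA devices as well.

```python
import torch
from torch import nn


def build_shared_block(dim: int, seed: int = 0) -> nn.Module:
    """Build a block whose weights depend only on `seed`, not on ambient RNG state."""
    # devices=[] forks only the CPU generator, which keeps this sketch CPU-only.
    with torch.random.fork_rng(devices=[]):
        torch.manual_seed(seed)      # parameters are initialized on CPU
        block = nn.Linear(dim, dim)  # stand-in for RepeatedTransformerBlock
    return block


# Two "ranks" whose CPU RNGs are in different states still get identical weights.
torch.manual_seed(123)
block_a = build_shared_block(8)
torch.manual_seed(456)
block_b = build_shared_block(8)
weights_match = torch.equal(block_a.weight, block_b.weight)

# fork_rng restores the outer RNG state on exit, so the caller's stream is undisturbed.
torch.manual_seed(999)
expected = torch.rand(3)
torch.manual_seed(999)
build_shared_block(8)
after = torch.rand(3)
```

Without the torch.manual_seed call inside the fork, block_a and block_b would inherit seeds 123 and 456 and diverge, which is the cross-rank false green described above.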
Copilot AI review requested due to automatic review settings June 5, 2026 16:22

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comment on lines +505 to +525
# Count every cross-layer comparison so a clean (green) run still
# proves the detector ran: layers_verified > 0.
self.layers_verified += 1
for key in ("comm_input", "comm_output", "compute_input", "compute_output"):
    if cmp[key] != ref[key]:
        log.error(
            f"LAYER_CHECKSUM_MISMATCH ({key}): "
            f"rank={self.rank} iter={iteration} "
            f"layer_0={ref[key]} layer_{i}={cmp[key]}"
        )
        self.corruption_details.append({
            "type": f"layer_checksum_mismatch_{key}",
            "rank": self.rank,
            "iteration": iteration,
            "layer_ref": 0,
            "layer_cmp": i,
            "ref_checksum": ref[key],
            "cmp_checksum": cmp[key],
        })
        self.layer_checksum_mismatches += 1
        all_correct = False
Comment thread src/aorta/race/modes/fsdp.py Outdated
Comment on lines +283 to +287
Shared-weight path: every layer receives the same fixed reference_input
so that outputs are analytically identical. Input/output checksums are
recorded for both the comm kernel (all_gather) and the compute kernel
(GEMM + GELU) so _verify_layer_checksums() can pinpoint whether
corruption entered during communication or compute.
Comment thread src/aorta/race/modes/fsdp.py Outdated
Comment on lines 223 to 234
@@ -150,11 +233,23 @@ def setup_buffers(self) -> None:
dtype=self.dtype, device="cuda",
)
Comment on lines 164 to +168
if cfg.simulate_compute:
dim = self._dim
self.weight_matrices = [
torch.randn(
dim, dim,
dtype=self.dtype, device="cuda",
use_shared = (
cfg.shared_layer_weights and cfg.compute_type == "transformer"
)
… doc)

- layer_checksum_mismatches now counts once per CORRUPTED LAYER, not once per
  checksum key. A single bad layer could previously inflate it up to 4x
  (comm_in/comm_out/compute_in/compute_out), contradicting the per-layer
  docstring. corruption_details still records every key for localization.
- Validate compute_type in ReproducerConfig.__post_init__ so EVERY entry point
  is covered (the aorta.race CLI / direct construction), not just the
  RaceWorkload adapter -- a typo like "transfomer" now raises instead of a
  silent GEMM false-green.
- Skip the unused dim x dim activation/grad_buffer allocations on the
  shared-weight transformer path (forward sets activation to the block output;
  backward re-runs the block) -- avoids needless GPU memory / OOM risk at large
  model_dim. GEMM/chained path keeps them.
- Fix the _forward_layer docstring that still said "GEMM + GELU" on the shared
  path (now runs RepeatedTransformerBlock).

Tests: per-layer-count (4 keys bad -> counter==1), direct-construction
compute_type rejection. 18 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
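
The per-layer counting rule from the first bullet can be sketched in pure Python. `verify_layers` is a hypothetical standalone function, not the harness's _verify_layer_checksums, and its behavior for fewer than two layers is an assumption. The four key names and the detail-record fields follow the review excerpt earlier in this thread.

```python
from typing import Dict, List, Tuple

KEYS = ("comm_input", "comm_output", "compute_input", "compute_output")


def verify_layers(
    layer_checksums: List[Dict[str, int]],
) -> Tuple[int, int, List[dict]]:
    """Compare every layer against layer 0.

    Returns (layers_verified, layer_checksum_mismatches, corruption_details).
    """
    layers_verified = 0
    mismatches = 0
    details: List[dict] = []
    if len(layer_checksums) < 2:
        return layers_verified, mismatches, details  # nothing to compare against

    ref = layer_checksums[0]
    for i, cur in enumerate(layer_checksums[1:], start=1):
        layers_verified += 1
        bad_keys = [k for k in KEYS if cur[k] != ref[k]]
        for key in bad_keys:
            # One detail record per key keeps comm-vs-compute localization.
            details.append({
                "type": f"layer_checksum_mismatch_{key}",
                "layer_ref": 0,
                "layer_cmp": i,
                "ref_checksum": ref[key],
                "cmp_checksum": cur[key],
            })
        if bad_keys:
            mismatches += 1  # once per corrupted layer, not once per key
    return layers_verified, mismatches, details


clean = {k: 100 for k in KEYS}
all_bad = {k: 999 for k in KEYS}
result = verify_layers([clean, dict(clean), all_bad])
```

With three layers and one fully corrupted layer, the counter is 1 while four detail records remain available for localization.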

Copilot AI left a comment

Pull request overview

Copilot reviewed 8 out of 8 changed files in this pull request and generated 4 comments.

Comment thread src/aorta/race/config.py
Comment on lines +294 to +304
def __post_init__(self) -> None:
    # Validate here (not only in the RaceWorkload adapter) so EVERY entry
    # point is covered -- the aorta.race CLI and any direct reproducer
    # construction. A typo like "transfomer" must error, not silently fall
    # back to the GEMM path (false green).
    valid_compute = {"gemm", "transformer"}
    if self.compute_type not in valid_compute:
        raise ValueError(
            f"compute_type must be one of {sorted(valid_compute)}, "
            f"got {self.compute_type!r}"
        )
Comment thread src/aorta/race/modes/fsdp.py Outdated
Comment on lines +268 to +277
def _checksum(tensor: torch.Tensor) -> int:
    """
    Bitwise checksum: reinterpret-cast to int16 and sum.

    bf16 (or any 16-bit dtype) is viewed as int16 so every bit pattern
    contributes to the checksum with zero information loss -- no float
    rounding, no abs(), and NaN / denorm bit patterns are included.
    Accumulation is done in int64 to avoid overflow.
    """
    return tensor.view(torch.int16).to(torch.int64).sum().item()
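
The excerpt above hard-codes int16; the final commit in this PR makes the checksum element-size aware. A minimal sketch of that idea follows, assuming a simple width-to-dtype table. `bitwise_checksum` and `_INT_VIEW` are hypothetical names, and the 1-byte and 8-byte entries are assumptions beyond the 2-byte and 4-byte cases the commit message names.

```python
import torch

# Signed integer dtype with the same width as each supported element size.
_INT_VIEW = {1: torch.int8, 2: torch.int16, 4: torch.int32, 8: torch.int64}


def bitwise_checksum(tensor: torch.Tensor) -> int:
    """Reinterpret the raw bits as same-width integers and sum them in int64."""
    int_dtype = _INT_VIEW[tensor.element_size()]
    return int(tensor.contiguous().view(int_dtype).to(torch.int64).sum().item())


# +0.0 and -0.0 compare equal as floats but differ in the sign bit,
# so a bitwise checksum tells them apart.
pos_zero_bf16 = bitwise_checksum(torch.tensor([0.0], dtype=torch.bfloat16))
neg_zero_bf16 = bitwise_checksum(torch.tensor([-0.0], dtype=torch.bfloat16))
neg_zero_fp32 = bitwise_checksum(torch.tensor([-0.0], dtype=torch.float32))
```

A plain sum is insensitive to element order, so this style of checksum detects bit flips but would not notice two elements being swapped.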
Comment on lines +166 to +168
use_shared = (
    cfg.shared_layer_weights and cfg.compute_type == "transformer"
)
Comment on lines +54 to +60
"""Per-layer forward + 4 checksums, exactly like _forward_layer's shared path."""
layer_checksums = []
for _ in range(NUM_LAYERS):
    comm_input = FSDPModeReproducer._checksum(reference_input)
    comm_output = comm_input  # no real all_gather on CPU; identical by construction
    compute_input = FSDPModeReproducer._checksum(reference_input)
    with torch.no_grad():
…istry, test)

- _checksum is now element-size aware: 2-byte dtypes -> int16, 4-byte (fp32) ->
  int32, etc. Previously hard-coded torch.int16, which would crash/miscount on
  float32 (an allowed dtype). bf16 path unchanged.
- compute_type='transformer' with shared_layer_weights=False now warns loudly
  that it runs the GEMM path (only the shared transformer path is implemented) --
  no more silent transformer->GEMM fallback.
- ReproducerConfig.__post_init__ validates compute_type against the pluggable
  COMPUTE_REGISTRY (single source of truth; "transformer" is already registered)
  instead of a hard-coded set, so custom register_compute() backends stay valid.
  Lazy import avoids a circular import (compute.py only TYPE_CHECKING-imports config).
- Smoke test: comm_input/comm_output checksums now use distinct rank-fill shard /
  all_gather-result buffers (mirroring real _forward_layer) instead of
  reference_input, so the test actually exercises the comm path. Added a
  multi-dtype _checksum test (bf16/fp16/fp32).

19 race tests pass.

Co-Authored-By: Claude Opus 4 <noreply@anthropic.com>
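
Registry-backed validation, as described in the third bullet, might look like the sketch below. COMPUTE_REGISTRY and register_compute are named in the commit message, but their real signatures are not shown in this thread: the decorator form, the placeholder backends, and `ReproducerConfigSketch` are all assumptions for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical stand-in for the pluggable registry; real entries live in the harness.
COMPUTE_REGISTRY: Dict[str, Callable[..., object]] = {}


def register_compute(name: str):
    """Decorator that registers a compute backend under `name` (assumed form)."""
    def decorator(fn):
        COMPUTE_REGISTRY[name] = fn
        return fn
    return decorator


@register_compute("gemm")
def _gemm(*args, **kwargs):  # placeholder body
    return None


@register_compute("transformer")
def _transformer(*args, **kwargs):  # placeholder body
    return None


@dataclass
class ReproducerConfigSketch:
    compute_type: str = "gemm"

    def __post_init__(self) -> None:
        # Consult the registry at validation time so backends registered
        # after this class was defined are still accepted.
        if self.compute_type not in COMPUTE_REGISTRY:
            raise ValueError(
                f"compute_type must be one of {sorted(COMPUTE_REGISTRY)}, "
                f"got {self.compute_type!r}"
            )


def rejects(name: str) -> bool:
    """Return True if constructing a config with this compute_type raises."""
    try:
        ReproducerConfigSketch(compute_type=name)
    except ValueError:
        return True
    return False
```

Validating against the registry rather than a literal set means a custom backend added through register_compute becomes valid without touching the config class, while a typo such as "transfomer" still raises.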
oyazdanb merged commit d5e3340 into main Jun 5, 2026
1 check passed
oyazdanb deleted the users/oyazdanb/race-transformer-compute branch June 5, 2026 18:47