Add simple_transformer workload for pipelined forward pass evaluation by oyazdanb · Pull Request #118 · ROCm/aorta

oyazdanb · 2026-02-27T23:37:51Z

Adds a GPT-2-style decoder-only transformer training workload that pipelines the forward pass across CUDA streams. Each layer group runs on its own stream with explicit inter-stream dependencies, exercising hardware queue switching under a realistic training loop. Classified as P1 priority. Includes documentation with SLURM + Docker usage.


Copilot

Pull request overview

Adds a new simple_transformer workload to the HW queue evaluation suite, intended to emulate a GPT-2-like training loop while pipelining the forward pass across multiple CUDA streams to stress hardware queue switching.

Changes:

- Introduces SimpleTransformerWorkload (decoder-style transformer training workload with multi-stream forward pipelining).
- Registers/exports the new workload via the pipeline package and surfaces it as a P1 workload in the CLI.
- Documents how to run the new workload locally and on SLURM+Docker.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| `src/aorta/hw_queue_eval/workloads/pipeline/simple_transformer.py` | Implements the transformer model + pipelined multi-stream training iteration logic and config reporting. |
| `src/aorta/hw_queue_eval/workloads/pipeline/__init__.py` | Exports the new workload and mentions it in the pipeline module docs. |
| `src/aorta/hw_queue_eval/cli.py` | Adds the workload to the P1 set and prints its purpose text. |
| `docs/hw-queue-eval.md` | Adds the workload to the documentation table and provides detailed run instructions. |


Copilot · 2026-02-27T23:44:46Z

+    multi_gpu_capable = True
+
+    def __init__(
+        self,
+        hidden_size: int = 512,
+        num_layers: int = 6,
+        num_heads: int = 8,
+        batch_size: int = 8,
+        seq_length: int = 128,
+        vocab_size: int = 32000,
+        learning_rate: float = 1e-3,
+        use_multi_gpu: bool = True,
+    ):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.num_layers = num_layers
+        self.num_heads = num_heads
+        self._batch_size = batch_size
+        self.seq_length = seq_length
+        self.vocab_size = vocab_size
+        self.learning_rate = learning_rate
+        self.use_multi_gpu = use_multi_gpu
+
+        self._optimizer: Optional[torch.optim.Optimizer] = None
+        self._input_ids: Optional[torch.Tensor] = None
+        self._labels: Optional[torch.Tensor] = None
+        self._loss_fn: Optional[nn.Module] = None
+
+        self._layer_groups: List[tuple] = []
+        self._fwd_streams: List[int] = []
+        self._data_stream: int = 0
+        self._loss_stream: int = 0
+        self._devices: List[str] = []
+        self._stream_to_device: Dict[int, str] = {}
+
+    def setup(self, stream_count: int, device: str = "cuda:0") -> None:
+        self._stream_count = stream_count
+        self._is_setup = True
+
+        self._setup_multi_gpu(stream_count, device, self.use_multi_gpu)
+
+        primary_device = self._get_device_for_stream(0)
+
+        self._model = SimpleTransformerModel(
+            vocab_size=self.vocab_size,
+            hidden_size=self.hidden_size,
+            num_layers=self.num_layers,
+            num_heads=self.num_heads,
+            max_seq_len=self.seq_length,
+        ).to(primary_device)
+        self._model.train()
+


The workload sets multi_gpu_capable = True and defaults to use_multi_gpu=True, but the implementation keeps the model and all tensors on primary_device = _get_device_for_stream(0) and then runs work on arbitrary streams[...]. When the harness is configured for multi-GPU stream distribution (the default in HarnessConfig), some of those streams will be created on other devices, which causes device-mismatch errors and makes the wait_stream calls invalid across devices. Either mark this workload as single-GPU (drop MultiGPUMixin, set multi_gpu_capable=False, default use_multi_gpu=False) or explicitly move/replicate tensors per stream device and handle inter-device transfers/events like the moe workload does.

Copilot · 2026-02-27T23:44:47Z

+        seq_len = self._input_ids.size(1)
+        positions = torch.arange(seq_len, device=self._input_ids.device)
+
+        prev_stream = data_stream
+        hidden = None
+
+        for group_idx, (layer_start, layer_end) in enumerate(self._layer_groups):
+            fwd_idx = self._fwd_streams[group_idx % len(self._fwd_streams)]
+            fwd_stream = streams[fwd_idx]
+
+            fwd_stream.wait_stream(prev_stream)
+
+            with torch.cuda.stream(fwd_stream):
+                if group_idx == 0:
+                    hidden = (
+                        model.embed_tokens(self._input_ids)
+                        + model.embed_pos(positions)
+                    )
+                hidden = model.forward_layers(hidden, layer_start, layer_end)


positions = torch.arange(...) is created outside any torch.cuda.stream(...) context, so it is enqueued on the default stream for the device. It’s then consumed inside with torch.cuda.stream(fwd_stream): without any dependency on the default stream, which can introduce a cross-stream race. Consider precomputing/caching a positions tensor in setup() (or creating it inside the first forward stream after wait_stream) so its producer stream is properly ordered before use, and to avoid per-iteration allocations.
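
One way to follow this suggestion is to build the position indices once during setup and synchronize before any other stream reads them. The sketch below is a minimal illustration, not code from the PR: `PositionCache` and its methods are hypothetical names, and it falls back to CPU so it can run without a GPU.

```python
import torch


class PositionCache:
    """Hypothetical helper: allocate position indices once, outside the training loop."""

    def __init__(self, max_seq_len: int, device: str = "cpu") -> None:
        # The arange kernel is enqueued on the current (usually default) stream.
        self.positions = torch.arange(max_seq_len, device=device)
        if self.positions.is_cuda:
            # Wait for that kernel to finish, so later reads from any
            # stream cannot race with the tensor's producer.
            torch.cuda.synchronize(self.positions.device)

    def get(self, seq_len: int) -> torch.Tensor:
        # Slicing returns a view: no new allocation and no kernel launch.
        return self.positions[:seq_len]


device = "cuda:0" if torch.cuda.is_available() else "cpu"
cache = PositionCache(max_seq_len=128, device=device)
positions = cache.get(16)
```

Because the synchronization happens once in setup, the per-iteration path only takes a slice, which also removes the per-iteration allocation mentioned above.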

Copilot · 2026-02-27T23:44:47Z

+class SimpleTransformerModel(nn.Module):
+    """Small GPT-2-style decoder-only transformer for single-GPU training."""
+
+    def __init__(
+        self,
+        vocab_size: int = 32000,
+        hidden_size: int = 512,
+        num_layers: int = 6,
+        num_heads: int = 8,
+        max_seq_len: int = 256,
+        dropout: float = 0.0,
+    ):
+        super().__init__()
+        self.hidden_size = hidden_size
+        self.num_layers = num_layers
+
+        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
+        self.embed_pos = nn.Embedding(max_seq_len, hidden_size)
+
+        self.layers = nn.ModuleList([
+            nn.TransformerEncoderLayer(
+                d_model=hidden_size,
+                nhead=num_heads,
+                dim_feedforward=hidden_size * 4,
+                dropout=dropout,
+                batch_first=True,
+            )
+            for _ in range(num_layers)
+        ])


The model is described as “GPT-2-style decoder-only”, but it uses nn.TransformerEncoderLayer with no causal mask (layer(x)), which allows attending to future tokens and is not decoder-only behavior. Either apply a causal mask / is_causal in the layer forward (or switch to a decoder-style block) or adjust the module/workload/docs wording so it doesn’t claim GPT-2/decoder-only semantics.
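
A minimal sketch of the first option, assuming the layers stay `nn.TransformerEncoderLayer` and a boolean mask is passed as `src_mask`. The helper names `causal_mask` and `forward_causal` are hypothetical and the sizes are reduced for illustration; this is not the PR's code.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

hidden_size, num_heads, seq_len = 32, 4, 6

layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=num_heads,
    dim_feedforward=hidden_size * 4,
    dropout=0.0,
    batch_first=True,
)


def causal_mask(length: int, device=None) -> torch.Tensor:
    # True marks positions that must NOT be attended to:
    # the strict upper triangle, i.e. every future token.
    ones = torch.ones(length, length, dtype=torch.bool, device=device)
    return torch.triu(ones, diagonal=1)


def forward_causal(block: nn.TransformerEncoderLayer, x: torch.Tensor) -> torch.Tensor:
    # x has shape (batch, seq, hidden) because batch_first=True.
    return block(x, src_mask=causal_mask(x.size(1), x.device))


x = torch.randn(2, seq_len, hidden_size)
with torch.no_grad():
    out = forward_causal(layer, x)
```

With the mask applied, changing the last token leaves the outputs at earlier positions unchanged, which is the property that distinguishes decoder-only attention from the unmasked encoder behavior described in the comment.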

Copilot · 2026-02-27T23:44:47Z

+        # --- Optimizer step on stream 0, overlapping with next data prep ---
+        data_stream.wait_stream(loss_stream)
+
+        with torch.cuda.stream(data_stream):
+            self._optimizer.step()
+            self._optimizer.zero_grad()


The comments claim the optimizer step overlaps with the next iteration’s data prep, but both are issued onto data_stream and the next iteration’s forward explicitly waits on data_stream, so there’s no opportunity for overlap—everything on stream 0 is serialized and gates the forward. Either update the description to match the actual scheduling, or move the optimizer step to a separate stream / reorder dependencies to create real overlap if that’s a goal of this workload.

Copilot · 2026-02-27T23:44:47Z

+            hidden = model.ln_f(hidden)
+            logits = model.lm_head(hidden)
+            loss = self._loss_fn(
+                logits.view(-1, self.vocab_size), self._labels.view(-1)


logits.view(-1, self.vocab_size) assumes logits is contiguous; if it isn’t (e.g., due to upstream layout changes), .view will raise at runtime. Using .reshape(-1, self.vocab_size) (or .flatten(0, 1) for the first two dims) avoids this fragility without changing semantics.

Suggested change

-                logits.view(-1, self.vocab_size), self._labels.view(-1)
+                logits.reshape(-1, self.vocab_size), self._labels.reshape(-1)
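
The difference can be reproduced on CPU with a deliberately non-contiguous tensor. This is a small illustration with made-up shapes, not the workload's actual tensors.

```python
import torch

vocab_size = 10

# Transposing the first two dims yields a (batch, seq, vocab) tensor
# whose memory layout is no longer contiguous.
logits = torch.randn(4, 2, vocab_size).transpose(0, 1)

try:
    logits.view(-1, vocab_size)
    view_failed = False
except RuntimeError:
    # .view cannot merge the first two dims without copying.
    view_failed = True

# .reshape copies when it has to, so it succeeds here.
flat = logits.reshape(-1, vocab_size)

# On a contiguous tensor, .reshape returns a view and does not copy.
contiguous = torch.randn(2, 4, vocab_size)
shares_memory = contiguous.reshape(-1, vocab_size).data_ptr() == contiguous.data_ptr()
```

Since `.reshape` only copies when the layout requires it, the suggested change costs nothing in the common contiguous case.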

Copilot · 2026-02-27T23:44:47Z

+@WorkloadRegistry.register
+class SimpleTransformerWorkload(MultiGPUMixin, ModelWorkload):
+    """
+    Single-GPU transformer training with multi-stream pipelining.
+
+    Splits the model's layers into groups and pipelines the forward pass
+    across CUDA streams.  After the final layer group, loss and backward
+    run on a dedicated stream, then the optimizer step is issued on the
+    first stream so it can overlap with the next iteration's data prep.
+
+    Stream assignment (for stream_count=4 as example):
+      - Stream 0: data prep / optimizer step
+      - Stream 1: forward layers 0-1
+      - Stream 2: forward layers 2-3
+      - Stream 3: forward layers 4-5, loss, backward
+    """
+
+    name = "simple_transformer"
+    description = "Simple transformer training with pipelined forward pass"
+    category = "pipeline"
+    min_streams = 2
+    max_streams = 16
+    recommended_streams = 4
+    switch_latency_sensitivity = "medium"
+    memory_requirements_gb = 2.0
+    multi_gpu_capable = True
+


This new workload isn’t covered by the existing workload smoke tests (see tests/hw_queue_eval/test_workloads.py, which runs several registered workloads through StreamHarness). Please add a small test that instantiates simple_transformer with reduced sizes (e.g., fewer layers and a shorter sequence length) and verifies that it runs at a couple of stream counts and returns positive throughput, so regressions in the stream dependency logic are caught early.
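
Part of the stream dependency logic can be checked without a GPU if the layer-to-stream assignment is a pure function. The sketch below mirrors the assignment described in the docstring above (stream 0 for data prep and the optimizer, remaining streams for layer groups, loss on the last group's stream). The PR's own grouping code is not shown in this excerpt, so `split_layers` and `assign_streams` are hypothetical names, and half-open `(start, end)` ranges are an assumption.

```python
from typing import Dict, List, Tuple


def split_layers(num_layers: int, num_groups: int) -> List[Tuple[int, int]]:
    """Split layer indices into contiguous half-open (start, end) ranges."""
    num_groups = max(1, min(num_groups, num_layers))
    base, extra = divmod(num_layers, num_groups)
    groups = []
    start = 0
    for i in range(num_groups):
        # The first `extra` groups take one additional layer.
        size = base + (1 if i < extra else 0)
        groups.append((start, start + size))
        start += size
    return groups


def assign_streams(num_layers: int, stream_count: int) -> Dict[str, object]:
    """Hypothetical stream assignment: stream 0 for data, the rest for forward."""
    if stream_count < 2:
        raise ValueError("need at least one data stream and one forward stream")
    fwd_streams = list(range(1, stream_count))
    groups = split_layers(num_layers, len(fwd_streams))
    # Loss and backward run on the stream that handled the last layer group.
    loss_stream = fwd_streams[(len(groups) - 1) % len(fwd_streams)]
    return {
        "data_stream": 0,
        "fwd_streams": fwd_streams,
        "layer_groups": groups,
        "loss_stream": loss_stream,
    }


plan = assign_streams(num_layers=6, stream_count=4)
```

For `num_layers=6` and `stream_count=4` this gives the groups `(0, 2)`, `(2, 4)`, `(4, 6)` on streams 1 to 3, matching the docstring's example. A GPU smoke test through the harness would still be needed to cover the actual `wait_stream` ordering.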

Copilot · 2026-02-27T23:44:48Z

+## Running the Simple Transformer Workload
+
+The `simple_transformer` workload implements a GPT-2-style decoder-only transformer with multi-stream pipelined training. Layers are split into groups, each assigned to a different CUDA stream. The forward pass is pipelined (stream K+1 waits on stream K), loss and backward run on a dedicated stream, and the optimizer step overlaps with the next iteration's data prep.
+


The docs describe this workload as a “GPT-2-style decoder-only transformer” with pipelined training, but the current implementation uses nn.TransformerEncoderLayer without a causal mask (so it isn’t decoder-only). Please either update the documentation wording to match the actual behavior, or update the implementation to enforce causal attention so the GPT-2 comparison is accurate.

Copilot · 2026-02-27T23:44:48Z

            "  with computation for memory-constrained training."
        ),
+        "simple_transformer": (
+            "  GPT-2-style transformer training with pipelined forward pass across\n"


CLI purpose text calls this “GPT-2-style” training, but the implementation currently uses nn.TransformerEncoderLayer without causal masking (not decoder-only / GPT-2-like). Consider adjusting this description (and/or the implementation) so CLI help output is accurate and consistent with the workload’s actual attention semantics.

Suggested change

-            " GPT-2-style transformer training with pipelined forward pass across\n"
+            " Transformer encoder training workload with pipelined forward pass across\n"

Copilot AI review requested due to automatic review settings February 27, 2026 23:37

oyazdanb wants to merge 1 commit into main from add-simple-transformer-workload
