Add simple_transformer workload for pipelined forward pass evaluation #118

Open
oyazdanb wants to merge 1 commit into main from add-simple-transformer-workload

@oyazdanb

Collaborator

Adds a GPT-2-style decoder-only transformer training workload that pipelines the forward pass across CUDA streams. Each layer group runs on its own stream with explicit inter-stream dependencies, exercising hardware queue switching under a realistic training loop. Classified as P1 priority. Includes documentation with SLURM + Docker usage.

Made-with: Cursor
Copilot AI review requested due to automatic review settings February 27, 2026 23:37

Copilot AI left a comment

Contributor

Pull request overview

Adds a new simple_transformer workload to the HW queue evaluation suite, intended to emulate a GPT-2-like training loop while pipelining the forward pass across multiple CUDA streams to stress hardware queue switching.

Changes:

  • Introduces SimpleTransformerWorkload (decoder-style transformer training workload with multi-stream forward pipelining).
  • Registers/exports the new workload via the pipeline package and surfaces it as a P1 workload in the CLI.
  • Documents how to run the new workload locally and on SLURM+Docker.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.

| File | Description |
| --- | --- |
| src/aorta/hw_queue_eval/workloads/pipeline/simple_transformer.py | Implements the transformer model + pipelined multi-stream training iteration logic and config reporting. |
| src/aorta/hw_queue_eval/workloads/pipeline/__init__.py | Exports the new workload and mentions it in the pipeline module docs. |
| src/aorta/hw_queue_eval/cli.py | Adds the workload to the P1 set and prints its purpose text. |
| docs/hw-queue-eval.md | Adds the workload to the documentation table and provides detailed run instructions. |

Comment on lines +109 to +160
    multi_gpu_capable = True

    def __init__(
        self,
        hidden_size: int = 512,
        num_layers: int = 6,
        num_heads: int = 8,
        batch_size: int = 8,
        seq_length: int = 128,
        vocab_size: int = 32000,
        learning_rate: float = 1e-3,
        use_multi_gpu: bool = True,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers
        self.num_heads = num_heads
        self._batch_size = batch_size
        self.seq_length = seq_length
        self.vocab_size = vocab_size
        self.learning_rate = learning_rate
        self.use_multi_gpu = use_multi_gpu

        self._optimizer: Optional[torch.optim.Optimizer] = None
        self._input_ids: Optional[torch.Tensor] = None
        self._labels: Optional[torch.Tensor] = None
        self._loss_fn: Optional[nn.Module] = None

        self._layer_groups: List[tuple] = []
        self._fwd_streams: List[int] = []
        self._data_stream: int = 0
        self._loss_stream: int = 0
        self._devices: List[str] = []
        self._stream_to_device: Dict[int, str] = {}

    def setup(self, stream_count: int, device: str = "cuda:0") -> None:
        self._stream_count = stream_count
        self._is_setup = True

        self._setup_multi_gpu(stream_count, device, self.use_multi_gpu)

        primary_device = self._get_device_for_stream(0)

        self._model = SimpleTransformerModel(
            vocab_size=self.vocab_size,
            hidden_size=self.hidden_size,
            num_layers=self.num_layers,
            num_heads=self.num_heads,
            max_seq_len=self.seq_length,
        ).to(primary_device)
        self._model.train()

Copilot AI Feb 27, 2026

The workload sets multi_gpu_capable = True and defaults to use_multi_gpu=True, but the implementation keeps the model and all tensors on primary_device = _get_device_for_stream(0) and then runs work on arbitrary streams[...]. When the harness is configured for multi-GPU stream distribution (the default in HarnessConfig), some of those streams will be created on other devices, which causes device-mismatch errors and makes the wait_stream dependencies invalid across devices. Either mark this workload as single-GPU (drop MultiGPUMixin, set multi_gpu_capable=False, default use_multi_gpu=False) or explicitly move/replicate tensors per stream device and handle inter-device transfers/events as the moe workload does.
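
If the single-GPU route is taken, the assumption could be made explicit with a guard over the stream-to-device mapping. This is only a sketch: `check_single_device` is a hypothetical helper, not part of the harness, and the mapping shape is borrowed from the `_stream_to_device: Dict[int, str]` attribute in the diff.

```python
from typing import Dict, Iterable


def check_single_device(
    stream_to_device: Dict[int, str], used_streams: Iterable[int]
) -> str:
    """Return the one device shared by `used_streams`, or raise ValueError.

    Hypothetical helper: the mapping mirrors `_stream_to_device` in the diff.
    """
    devices = {stream_to_device[idx] for idx in used_streams}
    if len(devices) != 1:
        raise ValueError(f"streams span multiple devices: {sorted(devices)}")
    return devices.pop()


# All streams on one GPU: the guard passes and reports that device.
same = {0: "cuda:0", 1: "cuda:0", 2: "cuda:0"}
print(check_single_device(same, [0, 1, 2]))

# Streams distributed over two GPUs: the guard rejects the layout.
mixed = {0: "cuda:0", 1: "cuda:1"}
try:
    check_single_device(mixed, [0, 1])
except ValueError as exc:
    print("rejected:", exc)
```

Failing early in setup() would surface the mismatch as a clear configuration error rather than as a device-mismatch error in the middle of an iteration.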

Copilot uses AI. Check for mistakes.
Comment on lines +208 to +226
        seq_len = self._input_ids.size(1)
        positions = torch.arange(seq_len, device=self._input_ids.device)

        prev_stream = data_stream
        hidden = None

        for group_idx, (layer_start, layer_end) in enumerate(self._layer_groups):
            fwd_idx = self._fwd_streams[group_idx % len(self._fwd_streams)]
            fwd_stream = streams[fwd_idx]

            fwd_stream.wait_stream(prev_stream)

            with torch.cuda.stream(fwd_stream):
                if group_idx == 0:
                    hidden = (
                        model.embed_tokens(self._input_ids)
                        + model.embed_pos(positions)
                    )
                hidden = model.forward_layers(hidden, layer_start, layer_end)

Copilot AI Feb 27, 2026

positions = torch.arange(...) is created outside any torch.cuda.stream(...) context, so it is enqueued on the default stream for the device. It’s then consumed inside with torch.cuda.stream(fwd_stream): without any dependency on the default stream, which can introduce a cross-stream race. Consider precomputing/caching a positions tensor in setup() (or creating it inside the first forward stream after wait_stream) so its producer stream is properly ordered before use, and to avoid per-iteration allocations.
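
A minimal sketch of the caching option, assuming the tensor is built once (for example in setup()) on the stream that later consumes it. `make_positions` is a hypothetical helper name; the CPU branch exists only so the snippet runs on machines without a GPU.

```python
import contextlib
from typing import Optional

import torch


def make_positions(
    seq_len: int,
    device: torch.device,
    stream: Optional["torch.cuda.Stream"] = None,
) -> torch.Tensor:
    """Allocate position ids once, on the stream that will consume them."""
    # On CUDA, enqueue the allocation on `stream`; otherwise run inline.
    if stream is not None:
        ctx = torch.cuda.stream(stream)
    else:
        ctx = contextlib.nullcontext()
    with ctx:
        return torch.arange(seq_len, device=device)


if torch.cuda.is_available():
    dev = torch.device("cuda:0")
    first_fwd_stream = torch.cuda.Stream(device=dev)
else:
    # CPU fallback for illustration only; there are no streams to order.
    dev, first_fwd_stream = torch.device("cpu"), None

# Done once, so later iterations reuse the tensor instead of reallocating.
positions = make_positions(128, dev, first_fwd_stream)
print(positions.shape)
```

Because producer and consumer share a stream, in-order execution on that stream provides the ordering; a consumer on a different stream would still need a wait_stream on the producer.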

Comment on lines +26 to +54
class SimpleTransformerModel(nn.Module):
    """Small GPT-2-style decoder-only transformer for single-GPU training."""

    def __init__(
        self,
        vocab_size: int = 32000,
        hidden_size: int = 512,
        num_layers: int = 6,
        num_heads: int = 8,
        max_seq_len: int = 256,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.embed_pos = nn.Embedding(max_seq_len, hidden_size)

        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_size,
                nhead=num_heads,
                dim_feedforward=hidden_size * 4,
                dropout=dropout,
                batch_first=True,
            )
            for _ in range(num_layers)
        ])

Copilot AI Feb 27, 2026

The model is described as “GPT-2-style decoder-only”, but it uses nn.TransformerEncoderLayer with no causal mask (layer(x)), which allows attending to future tokens and is not decoder-only behavior. Either apply a causal mask / is_causal in the layer forward (or switch to a decoder-style block) or adjust the module/workload/docs wording so it doesn’t claim GPT-2/decoder-only semantics.
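
To illustrate the causal-mask option, the sketch below passes an additive upper-triangular mask as `src_mask` to a single `nn.TransformerEncoderLayer` and checks that earlier positions no longer depend on a later token. The sizes are arbitrary, and this is not the PR's `forward_layers` code, only a standalone demonstration that runs on CPU.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, hidden_size = 6, 32

layer = nn.TransformerEncoderLayer(
    d_model=hidden_size,
    nhead=4,
    dim_feedforward=hidden_size * 4,
    dropout=0.0,
    batch_first=True,
)

# Additive mask: -inf above the diagonal blocks attention to future positions.
causal_mask = torch.triu(
    torch.full((seq_len, seq_len), float("-inf")), diagonal=1
)

x = torch.randn(2, seq_len, hidden_size)
x_future_changed = x.clone()
x_future_changed[:, -1] = torch.randn(2, hidden_size)  # perturb only the last token

with torch.no_grad():
    masked_a = layer(x, src_mask=causal_mask)
    masked_b = layer(x_future_changed, src_mask=causal_mask)
    unmasked_a = layer(x)
    unmasked_b = layer(x_future_changed)

# With the mask, earlier positions cannot see the perturbed last token.
causal_ok = torch.allclose(masked_a[:, :-1], masked_b[:, :-1], atol=1e-5)
# Without it, the change leaks into earlier positions.
leaks = not torch.allclose(unmasked_a[:, :-1], unmasked_b[:, :-1], atol=1e-5)
print(causal_ok, leaks)
```

In the workload, the same mask would have to be passed through every layer call; building it once and reusing it avoids a per-iteration allocation.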

Comment on lines +242 to +247
        # --- Optimizer step on stream 0, overlapping with next data prep ---
        data_stream.wait_stream(loss_stream)

        with torch.cuda.stream(data_stream):
            self._optimizer.step()
            self._optimizer.zero_grad()

Copilot AI Feb 27, 2026

The comments claim the optimizer step overlaps with the next iteration’s data prep, but both are issued onto data_stream and the next iteration’s forward explicitly waits on data_stream, so there’s no opportunity for overlap—everything on stream 0 is serialized and gates the forward. Either update the description to match the actual scheduling, or move the optimizer step to a separate stream / reorder dependencies to create real overlap if that’s a goal of this workload.
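
The serialization argument can be checked with a toy model of in-order streams. This is not CUDA code: durations are made-up units, `makespan` is a hypothetical helper, and the "separate" schedule is just one possible reordering in which forward still waits for the updated weights.

```python
from typing import Dict, List, Tuple

# (name, stream, duration, names of ops this op waits on)
Op = Tuple[str, int, int, List[str]]


def makespan(ops: List[Op]) -> int:
    """Finish time of the last op, with in-order execution per stream."""
    stream_free: Dict[int, int] = {}
    finish: Dict[str, int] = {}
    for name, stream, duration, deps in ops:
        ready = max([finish[d] for d in deps], default=0)
        start = max(stream_free.get(stream, 0), ready)
        finish[name] = start + duration
        stream_free[stream] = finish[name]
    return max(finish.values())


# As in the diff: optimizer step and next data prep share stream 0.
shared = [
    ("optimizer", 0, 3, []),
    ("data_prep", 0, 2, []),
    ("forward", 1, 4, ["data_prep"]),
]

# Alternative: optimizer on its own stream; forward waits on both.
separate = [
    ("optimizer", 2, 3, []),
    ("data_prep", 0, 2, []),
    ("forward", 1, 4, ["data_prep", "optimizer"]),
]

print(makespan(shared), makespan(separate))  # prints: 9 7
```

In the shared schedule the data prep cannot start until the optimizer finishes, so the two never overlap; in the separate schedule they run concurrently and forward starts as soon as the slower of the two completes.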

            hidden = model.ln_f(hidden)
            logits = model.lm_head(hidden)
            loss = self._loss_fn(
                logits.view(-1, self.vocab_size), self._labels.view(-1)

Copilot AI Feb 27, 2026

logits.view(-1, self.vocab_size) assumes logits is contiguous; if it isn’t (e.g., due to upstream layout changes), .view will raise at runtime. Using .reshape(-1, self.vocab_size) (or .flatten(0, 1) for the first two dims) avoids this fragility without changing semantics.

Suggested change
- logits.view(-1, self.vocab_size), self._labels.view(-1)
+ logits.reshape(-1, self.vocab_size), self._labels.reshape(-1)
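
The difference is easy to reproduce on CPU with a transposed tensor, whose strides cannot be merged into the requested shape. The tensor below is a stand-in, not the workload's actual logits.

```python
import torch

# transpose() returns a non-contiguous view over the same storage.
logits = torch.arange(24.0).reshape(2, 3, 4).transpose(0, 1)
print(logits.is_contiguous())

try:
    logits.view(-1, 4)
    view_failed = False
except RuntimeError:
    # view() cannot express this shape without copying, so it raises.
    view_failed = True

flat = logits.reshape(-1, 4)  # copies when a view is not possible
print(view_failed, tuple(flat.shape))
```

When the input is already contiguous, reshape returns a view just as view would, so the suggested change costs nothing in the common case.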

Comment on lines +84 to +110
@WorkloadRegistry.register
class SimpleTransformerWorkload(MultiGPUMixin, ModelWorkload):
    """
    Single-GPU transformer training with multi-stream pipelining.

    Splits the model's layers into groups and pipelines the forward pass
    across CUDA streams. After the final layer group, loss and backward
    run on a dedicated stream, then the optimizer step is issued on the
    first stream so it can overlap with the next iteration's data prep.

    Stream assignment (for stream_count=4 as example):
    - Stream 0: data prep / optimizer step
    - Stream 1: forward layers 0-1
    - Stream 2: forward layers 2-3
    - Stream 3: forward layers 4-5, loss, backward
    """

    name = "simple_transformer"
    description = "Simple transformer training with pipelined forward pass"
    category = "pipeline"
    min_streams = 2
    max_streams = 16
    recommended_streams = 4
    switch_latency_sensitivity = "medium"
    memory_requirements_gb = 2.0
    multi_gpu_capable = True

Copilot AI Feb 27, 2026

This new workload isn’t covered by the existing workload smoke tests (see tests/hw_queue_eval/test_workloads.py, which runs several registered workloads through StreamHarness). Please add a small test that instantiates simple_transformer with reduced sizes (e.g., fewer layers and a shorter sequence length) and verifies that it runs at a couple of stream counts and returns positive throughput, so regressions in stream dependency logic are caught early.

Comment thread docs/hw-queue-eval.md
Comment on lines +105 to +108
## Running the Simple Transformer Workload

The `simple_transformer` workload implements a GPT-2-style decoder-only transformer with multi-stream pipelined training. Layers are split into groups, each assigned to a different CUDA stream. The forward pass is pipelined (stream K+1 waits on stream K), loss and backward run on a dedicated stream, and the optimizer step overlaps with the next iteration's data prep.
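
The layer-to-stream mapping described above can be sketched in plain Python. The helper names `split_layers` and `assign_streams` are hypothetical, and the half-open `(start, end)` convention and the reservation of stream 0 for data prep are assumptions taken from the workload docstring's `stream_count=4` example, not from the actual implementation.

```python
from typing import List, Tuple


def split_layers(num_layers: int, num_groups: int) -> List[Tuple[int, int]]:
    """Split layer indices into contiguous half-open (start, end) groups."""
    num_groups = max(1, min(num_groups, num_layers))
    base, extra = divmod(num_layers, num_groups)
    groups, start = [], 0
    for g in range(num_groups):
        # The first `extra` groups take one additional layer each.
        end = start + base + (1 if g < extra else 0)
        groups.append((start, end))
        start = end
    return groups


def assign_streams(
    num_layers: int, stream_count: int
) -> List[Tuple[int, Tuple[int, int]]]:
    """Pair each layer group with a forward stream; stream 0 is reserved."""
    if stream_count < 2:
        raise ValueError("need at least 2 streams (min_streams = 2)")
    fwd_streams = list(range(1, stream_count))
    groups = split_layers(num_layers, len(fwd_streams))
    return [(fwd_streams[i % len(fwd_streams)], grp) for i, grp in enumerate(groups)]


print(assign_streams(num_layers=6, stream_count=4))
# prints: [(1, (0, 2)), (2, (2, 4)), (3, (4, 6))]
```

With 6 layers and 4 streams this reproduces the docstring's assignment: streams 1 to 3 each take two layers, and each group's stream waits on the stream of the group before it.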

Copilot AI Feb 27, 2026

The docs describe this workload as a “GPT-2-style decoder-only transformer” with pipelined training, but the current implementation uses nn.TransformerEncoderLayer without a causal mask (so it isn’t decoder-only). Please either update the documentation wording to match the actual behavior, or update the implementation to enforce causal attention so the GPT-2 comparison is accurate.

" with computation for memory-constrained training."
),
"simple_transformer": (
" GPT-2-style transformer training with pipelined forward pass across\n"

Copilot AI Feb 27, 2026

CLI purpose text calls this “GPT-2-style” training, but the implementation currently uses nn.TransformerEncoderLayer without causal masking (not decoder-only / GPT-2-like). Consider adjusting this description (and/or the implementation) so CLI help output is accurate and consistent with the workload’s actual attention semantics.

Suggested change
- " GPT-2-style transformer training with pipelined forward pass across\n"
+ " Transformer encoder training workload with pipelined forward pass across\n"
