Add simple_transformer workload for pipelined forward pass evaluation #118
oyazdanb wants to merge 1 commit into
Adds a GPT-2-style decoder-only transformer training workload that pipelines the forward pass across CUDA streams. Each layer group runs on its own stream with explicit inter-stream dependencies, exercising hardware queue switching under a realistic training loop. Classified as P1 priority. Includes documentation with SLURM + Docker usage. Made-with: Cursor
Pull request overview
Adds a new `simple_transformer` workload to the HW queue evaluation suite, intended to emulate a GPT-2-like training loop while pipelining the forward pass across multiple CUDA streams to stress hardware queue switching.
Changes:
- Introduces `SimpleTransformerWorkload` (decoder-style transformer training workload with multi-stream forward pipelining).
- Registers/exports the new workload via the pipeline package and surfaces it as a P1 workload in the CLI.
- Documents how to run the new workload locally and on SLURM+Docker.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 8 comments.
| File | Description |
|---|---|
| `src/aorta/hw_queue_eval/workloads/pipeline/simple_transformer.py` | Implements the transformer model + pipelined multi-stream training iteration logic and config reporting. |
| `src/aorta/hw_queue_eval/workloads/pipeline/__init__.py` | Exports the new workload and mentions it in the pipeline module docs. |
| `src/aorta/hw_queue_eval/cli.py` | Adds the workload to the P1 set and prints its purpose text. |
| `docs/hw-queue-eval.md` | Adds the workload to the documentation table and provides detailed run instructions. |
```python
multi_gpu_capable = True

def __init__(
    self,
    hidden_size: int = 512,
    num_layers: int = 6,
    num_heads: int = 8,
    batch_size: int = 8,
    seq_length: int = 128,
    vocab_size: int = 32000,
    learning_rate: float = 1e-3,
    use_multi_gpu: bool = True,
):
    super().__init__()
    self.hidden_size = hidden_size
    self.num_layers = num_layers
    self.num_heads = num_heads
    self._batch_size = batch_size
    self.seq_length = seq_length
    self.vocab_size = vocab_size
    self.learning_rate = learning_rate
    self.use_multi_gpu = use_multi_gpu

    self._optimizer: Optional[torch.optim.Optimizer] = None
    self._input_ids: Optional[torch.Tensor] = None
    self._labels: Optional[torch.Tensor] = None
    self._loss_fn: Optional[nn.Module] = None

    self._layer_groups: List[tuple] = []
    self._fwd_streams: List[int] = []
    self._data_stream: int = 0
    self._loss_stream: int = 0
    self._devices: List[str] = []
    self._stream_to_device: Dict[int, str] = {}

def setup(self, stream_count: int, device: str = "cuda:0") -> None:
    self._stream_count = stream_count
    self._is_setup = True

    self._setup_multi_gpu(stream_count, device, self.use_multi_gpu)

    primary_device = self._get_device_for_stream(0)

    self._model = SimpleTransformerModel(
        vocab_size=self.vocab_size,
        hidden_size=self.hidden_size,
        num_layers=self.num_layers,
        num_heads=self.num_heads,
        max_seq_len=self.seq_length,
    ).to(primary_device)
    self._model.train()
```
`multi_gpu_capable = True` and `use_multi_gpu=True` are set, but the implementation keeps the model and all tensors on `primary_device = _get_device_for_stream(0)` and then runs work on arbitrary `streams[...]`. When the harness is configured for multi-GPU stream distribution (the default in `HarnessConfig`), some of those streams will be created on other devices, causing device-mismatch errors and making `wait_stream` invalid across devices. Either mark this workload as single-GPU (drop `MultiGPUMixin`, set `multi_gpu_capable=False`, default `use_multi_gpu=False`) or explicitly move/replicate tensors per stream device and handle inter-device transfers/events as the `moe` workload does.
```python
seq_len = self._input_ids.size(1)
positions = torch.arange(seq_len, device=self._input_ids.device)

prev_stream = data_stream
hidden = None

for group_idx, (layer_start, layer_end) in enumerate(self._layer_groups):
    fwd_idx = self._fwd_streams[group_idx % len(self._fwd_streams)]
    fwd_stream = streams[fwd_idx]

    fwd_stream.wait_stream(prev_stream)

    with torch.cuda.stream(fwd_stream):
        if group_idx == 0:
            hidden = (
                model.embed_tokens(self._input_ids)
                + model.embed_pos(positions)
            )
        hidden = model.forward_layers(hidden, layer_start, layer_end)
```
`positions = torch.arange(...)` is created outside any `torch.cuda.stream(...)` context, so it is enqueued on the default stream for the device. It is then consumed inside `with torch.cuda.stream(fwd_stream):` without any dependency on the default stream, which can introduce a cross-stream race. Consider precomputing/caching a positions tensor in `setup()` (or creating it inside the first forward stream after `wait_stream`) so that its producer stream is properly ordered before use, and to avoid per-iteration allocations.
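One way to follow the caching suggestion, sketched under assumptions: `CachedPositions` is a hypothetical helper and not part of the PR. It builds the index tensor once at construction and hands out slices, so no `torch.arange` is enqueued per iteration. On CUDA the buffer would still need to be moved to the target device and ordered before the first forward stream reads it (for example by synchronizing at the end of `setup()`).

```python
import torch
from torch import nn


class CachedPositions(nn.Module):
    """Hypothetical helper: precomputes position indices once."""

    def __init__(self, max_seq_len: int):
        super().__init__()
        # Non-persistent buffer: follows .to(device) but stays out of state_dict.
        self.register_buffer(
            "positions", torch.arange(max_seq_len), persistent=False
        )

    def forward(self, seq_len: int) -> torch.Tensor:
        if seq_len > self.positions.numel():
            raise ValueError("seq_len exceeds the cached maximum")
        # Slicing returns a view of the cached buffer; nothing is allocated.
        return self.positions[:seq_len]


cache = CachedPositions(max_seq_len=128)
pos = cache(16)
```

Because the slice is a view, every iteration reuses the same storage, which also removes the per-iteration allocation the comment mentions.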
```python
class SimpleTransformerModel(nn.Module):
    """Small GPT-2-style decoder-only transformer for single-GPU training."""

    def __init__(
        self,
        vocab_size: int = 32000,
        hidden_size: int = 512,
        num_layers: int = 6,
        num_heads: int = 8,
        max_seq_len: int = 256,
        dropout: float = 0.0,
    ):
        super().__init__()
        self.hidden_size = hidden_size
        self.num_layers = num_layers

        self.embed_tokens = nn.Embedding(vocab_size, hidden_size)
        self.embed_pos = nn.Embedding(max_seq_len, hidden_size)

        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(
                d_model=hidden_size,
                nhead=num_heads,
                dim_feedforward=hidden_size * 4,
                dropout=dropout,
                batch_first=True,
            )
            for _ in range(num_layers)
        ])
```
The model is described as “GPT-2-style decoder-only”, but it uses `nn.TransformerEncoderLayer` with no causal mask (`layer(x)`), which allows attending to future tokens and is not decoder-only behavior. Either apply a causal mask / `is_causal` in the layer forward (or switch to a decoder-style block) or adjust the module/workload/docs wording so it doesn’t claim GPT-2/decoder-only semantics.
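To make the difference concrete, here is a small CPU-only sketch, not taken from the PR, that passes an explicit upper-triangular mask through `src_mask`. The `causal_mask` helper and the tiny layer sizes are assumptions for illustration. It perturbs only the last three tokens and checks whether earlier positions change.

```python
import torch
from torch import nn


def causal_mask(seq_len: int) -> torch.Tensor:
    """Float mask with -inf above the diagonal: token i sees only tokens <= i."""
    return torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)


torch.manual_seed(0)
layer = nn.TransformerEncoderLayer(
    d_model=16, nhead=2, dim_feedforward=64, dropout=0.0, batch_first=True
)
layer.train()  # training mode as in the workload; dropout=0.0 keeps it deterministic

x = torch.randn(2, 8, 16)
x_future_changed = x.clone()
x_future_changed[:, 5:] = torch.randn(2, 3, 16)  # perturb only positions 5..7

mask = causal_mask(8)
out = layer(x, src_mask=mask)
out_changed = layer(x_future_changed, src_mask=mask)

# With the mask, positions 0..4 cannot see the perturbed future tokens.
prefix_unchanged = torch.allclose(out[:, :5], out_changed[:, :5], atol=1e-5)

# Without the mask, the same perturbation leaks into earlier positions.
leaks_without_mask = not torch.allclose(
    layer(x)[:, :5], layer(x_future_changed)[:, :5], atol=1e-5
)
```

Whether the workload should adopt a mask or simply drop the GPT-2 wording depends on what the benchmark is meant to represent; the kernel mix on the streams is similar either way.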
```python
# --- Optimizer step on stream 0, overlapping with next data prep ---
data_stream.wait_stream(loss_stream)

with torch.cuda.stream(data_stream):
    self._optimizer.step()
    self._optimizer.zero_grad()
```
The comments claim the optimizer step overlaps with the next iteration’s data prep, but both are issued onto `data_stream` and the next iteration’s forward explicitly waits on `data_stream`, so there’s no opportunity for overlap: everything on stream 0 is serialized and gates the forward. Either update the description to match the actual scheduling, or move the optimizer step to a separate stream / reorder dependencies to create real overlap if that’s a goal of this workload.
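The serialization argument can be checked with a toy timeline model. This is a plain-Python sketch, not CUDA code: `schedule` is a hypothetical helper that treats each stream as a FIFO queue, the durations are made-up units, and `wait_stream` is simplified to a dependency on a named earlier op.

```python
from typing import Dict, List, Tuple

# (name, stream id, duration, names of ops it waits on)
Op = Tuple[str, int, float, List[str]]


def schedule(ops: List[Op]) -> Dict[str, Tuple[float, float]]:
    """Return {name: (start, end)} for ops issued in list order."""
    stream_free: Dict[int, float] = {}
    times: Dict[str, Tuple[float, float]] = {}
    for name, stream, duration, deps in ops:
        # An op starts once its stream is idle and all dependencies have finished.
        start = max([stream_free.get(stream, 0.0)] + [times[d][1] for d in deps])
        times[name] = (start, start + duration)
        stream_free[stream] = start + duration
    return times


# As written in the PR: optimizer and data prep share stream 0.
as_written = schedule([
    ("optimizer", 0, 3.0, []),
    ("data_prep", 0, 2.0, []),           # queued behind the optimizer
    ("forward", 1, 5.0, ["data_prep"]),  # fwd_stream.wait_stream(data_stream)
])

# Hypothetical alternative: optimizer on its own stream (id 2 here).
separate_stream = schedule([
    ("optimizer", 2, 3.0, []),
    ("data_prep", 0, 2.0, []),
    ("forward", 1, 5.0, ["data_prep", "optimizer"]),  # still needs updated weights
])

print(as_written["forward"])       # (5.0, 10.0)
print(separate_stream["forward"])  # (3.0, 8.0)
```

In the first schedule data prep cannot start until the optimizer finishes, so the forward pass begins at the sum of both durations. In the second it begins at the longer of the two, which is the overlap the docstring describes.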
```python
hidden = model.ln_f(hidden)
logits = model.lm_head(hidden)
loss = self._loss_fn(
    logits.view(-1, self.vocab_size), self._labels.view(-1)
```
`logits.view(-1, self.vocab_size)` assumes `logits` has a view-compatible (in practice, contiguous) layout; if it doesn’t (e.g., due to upstream layout changes), `.view` will raise at runtime. Using `.reshape(-1, self.vocab_size)` (or `.flatten(0, 1)` for the first two dims) avoids this fragility without changing semantics.
Suggested change:

```diff
- logits.view(-1, self.vocab_size), self._labels.view(-1)
+ logits.reshape(-1, self.vocab_size), self._labels.reshape(-1)
```
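A minimal reproduction of the failure mode, assuming a layout change upstream. The transpose below is only a stand-in for such a change; in the PR as shown, `logits` comes straight out of `lm_head`, so the existing `.view` call probably works today.

```python
import torch

# A transpose makes the tensor non-contiguous without copying data.
logits = torch.randn(4, 3, 5).transpose(0, 1)  # shape (3, 4, 5)

try:
    logits.view(-1, 5)
    view_failed = False
except RuntimeError:
    # view cannot merge the first two dims with these strides
    view_failed = True

# reshape falls back to a copy when a view is impossible.
flat = logits.reshape(-1, 5)

# flatten(0, 1) merges the same two dims and gives an identical result.
same = torch.equal(flat, logits.flatten(0, 1))
```

For contiguous inputs `reshape` returns a view just as `view` does, so the suggested change costs nothing in the common case.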
```python
@WorkloadRegistry.register
class SimpleTransformerWorkload(MultiGPUMixin, ModelWorkload):
    """
    Single-GPU transformer training with multi-stream pipelining.

    Splits the model's layers into groups and pipelines the forward pass
    across CUDA streams. After the final layer group, loss and backward
    run on a dedicated stream, then the optimizer step is issued on the
    first stream so it can overlap with the next iteration's data prep.

    Stream assignment (for stream_count=4 as example):
    - Stream 0: data prep / optimizer step
    - Stream 1: forward layers 0-1
    - Stream 2: forward layers 2-3
    - Stream 3: forward layers 4-5, loss, backward
    """

    name = "simple_transformer"
    description = "Simple transformer training with pipelined forward pass"
    category = "pipeline"
    min_streams = 2
    max_streams = 16
    recommended_streams = 4
    switch_latency_sensitivity = "medium"
    memory_requirements_gb = 2.0
    multi_gpu_capable = True
```
This new workload isn’t covered by the existing workload smoke tests (see `tests/hw_queue_eval/test_workloads.py`, which runs several registered workloads through `StreamHarness`). Please add a small test that instantiates `simple_transformer` with reduced sizes (e.g., fewer layers/shorter seq) and verifies that it runs at a couple of stream counts and returns positive throughput, so regressions in stream dependency logic are caught early.
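Part of the stream dependency logic, the layer grouping, can be pinned down without a GPU. The PR's actual grouping code is not shown in this review, so `split_layers` below is a hypothetical stand-in that assumes contiguous, half-open `(start, end)` ranges. It reproduces the docstring's example of 6 layers over 3 forward streams.

```python
from typing import List, Tuple


def split_layers(num_layers: int, num_groups: int) -> List[Tuple[int, int]]:
    """Split layer indices into contiguous half-open (start, end) ranges."""
    if num_layers < 1 or num_groups < 1:
        raise ValueError("num_layers and num_groups must be at least 1")
    num_groups = min(num_groups, num_layers)  # never create empty groups
    base, extra = divmod(num_layers, num_groups)
    groups: List[Tuple[int, int]] = []
    start = 0
    for i in range(num_groups):
        # The first `extra` groups take one additional layer each.
        end = start + base + (1 if i < extra else 0)
        groups.append((start, end))
        start = end
    return groups


# stream_count=4 leaves 3 forward streams for 6 layers, as in the docstring.
print(split_layers(6, 3))  # [(0, 2), (2, 4), (4, 6)]
```

A smoke test along the lines requested could assert properties like these (ranges are contiguous, cover every layer, and are never empty) alongside the end-to-end throughput check.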
```markdown
## Running the Simple Transformer Workload

The `simple_transformer` workload implements a GPT-2-style decoder-only transformer with multi-stream pipelined training. Layers are split into groups, each assigned to a different CUDA stream. The forward pass is pipelined (stream K+1 waits on stream K), loss and backward run on a dedicated stream, and the optimizer step overlaps with the next iteration's data prep.
```
The docs describe this workload as a “GPT-2-style decoder-only transformer” with pipelined training, but the current implementation uses `nn.TransformerEncoderLayer` without a causal mask (so it isn’t decoder-only). Please either update the documentation wording to match the actual behavior, or update the implementation to enforce causal attention so the GPT-2 comparison is accurate.
```python
        " with computation for memory-constrained training."
    ),
    "simple_transformer": (
        " GPT-2-style transformer training with pipelined forward pass across\n"
```
CLI purpose text calls this “GPT-2-style” training, but the implementation currently uses `nn.TransformerEncoderLayer` without causal masking (not decoder-only / GPT-2-like). Consider adjusting this description (and/or the implementation) so CLI help output is accurate and consistent with the workload’s actual attention semantics.
Suggested change:

```diff
- " GPT-2-style transformer training with pipelined forward pass across\n"
+ " Transformer encoder training workload with pipelined forward pass across\n"
```