Tiny GPU v0.1.0-alpha: End-to-End Execution Platform by cp024s · Pull Request #60 · adam-maj/tiny-gpu

cp024s · 2026-06-04T07:33:46Z

Tiny GPU v0.1.0-alpha: End-to-End Execution Platform

Overview

This PR establishes the first complete executable Tiny GPU platform.

The repository originally contained many of the fundamental RTL building blocks required for a GPU implementation, but lacked a fully integrated execution flow demonstrating how kernels are loaded, dispatched, executed, and verified end-to-end.

This work focuses on turning the project into a coherent and executable GPU system while preserving the project's educational goals and architectural simplicity.

The result is a functional GPU prototype capable of executing assembly programs through a complete fetch → decode → execute → memory → retire flow.

Objectives

The primary objectives of this milestone were:

Integrate the major RTL subsystems into a complete GPU platform
Establish an executable assembly workflow
Validate instruction execution through automated regression testing
Improve project documentation and onboarding
Create a foundation for future ISA and architectural expansion

Major Additions

GPU Integration

Integrated the major architectural components into a unified GPU top-level design:

Device Control Register (DCR)
Block Dispatcher
Program Memory Controller
Data Memory Controller
Multi-Core Infrastructure
Compute Core Integration

The GPU can now launch kernels, distribute work across cores, and coordinate memory transactions through centralized controllers.
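
As a rough illustration of the dispatch step described above, the sketch below splits a kernel's threads into blocks and assigns the blocks to cores. The function name `dispatch_blocks`, its parameters, and the static round-robin assignment are assumptions made for illustration; the RTL dispatcher may instead hand each block to whichever core becomes free first.

```python
def dispatch_blocks(thread_count, threads_per_block, num_cores):
    """Return (core_id, block_id, threads_in_block) tuples for one kernel launch.

    Hypothetical model: blocks are handed out round-robin, which matches a
    "next free core" policy only if every block takes the same time to run.
    """
    if threads_per_block <= 0 or num_cores <= 0:
        raise ValueError("threads_per_block and num_cores must be positive")

    # Ceiling division: a partially filled last block still needs a slot.
    total_blocks = -(-thread_count // threads_per_block)

    assignments = []
    for block_id in range(total_blocks):
        first_thread = block_id * threads_per_block
        threads = min(threads_per_block, thread_count - first_thread)
        assignments.append((block_id % num_cores, block_id, threads))
    return assignments


if __name__ == "__main__":
    for core_id, block_id, threads in dispatch_blocks(10, 4, 2):
        print(f"core {core_id} <- block {block_id} ({threads} threads)")
```

With 10 threads, 4 threads per block, and 2 cores, this produces three blocks, and the last block carries only the 2 leftover threads.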

Assembly Toolchain

Expanded the software tooling required to exercise the RTL.

Added:

Assembly workflow
Example assembly programs
Program execution infrastructure

Example programs now cover:

CONST
CMP
ADD
SUB
MUL
DIV
LOAD
STORE
Branching (BRN, BRZ, BRP)

These programs provide executable examples of the supported instruction set and serve as regression workloads.

Verification Infrastructure

Added and expanded Cocotb-based verification.

Regression coverage now includes:

Core-Level Verification

Arithmetic operations
Branch execution
Memory operations
Program execution

GPU-Level Verification

Kernel launch flow
Dispatch infrastructure
Program memory interfaces
Data memory interfaces
End-to-end execution

The goal of this phase was functional correctness rather than exhaustive architectural verification.
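
The memory interface checks listed above depend on a software model standing in for external memory. A minimal sketch of such a model, written in plain Python so it runs without a simulator, might look like the following. The class name, the per-channel request lists, the 8-bit cell width, and the read-before-write ordering are all assumptions for illustration, not the testbench's actual API.

```python
class DataMemoryModel:
    """Hypothetical multi-channel data memory: one request per channel per cycle."""

    def __init__(self, size=256, channels=4):
        self.data = [0] * size
        self.channels = channels

    def load(self, values, base=0):
        """Preload memory, for example with kernel input data."""
        for offset, value in enumerate(values):
            self.data[base + offset] = value & 0xFF  # assumed 8-bit cells

    def step(self, reads, writes):
        """Service one cycle of requests.

        reads:  per-channel address, or None when the channel is idle
        writes: per-channel (address, value) tuple, or None when idle
        Returns per-channel read data (None for idle channels).
        """
        if len(reads) != self.channels or len(writes) != self.channels:
            raise ValueError("expected one entry per channel")

        # Assumption: reads observe memory as it was before this cycle's writes.
        read_data = [None if addr is None else self.data[addr] for addr in reads]

        for request in writes:
            if request is not None:
                addr, value = request
                self.data[addr] = value & 0xFF
        return read_data


if __name__ == "__main__":
    memory = DataMemoryModel(size=16, channels=2)
    memory.load([1, 2, 3])
    print(memory.step([0, 2], [None, (0, 9)]))
    print(memory.data[:4])
```

In a Cocotb test, a model like this would typically be stepped once per clock edge, with the request lists sampled from the DUT's memory interface signals.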

Documentation

Substantially expanded project documentation.

Added:

Architecture Guide
ISA Reference
Verification Guide
Project Status
Development Roadmap
Release Notes

The documentation now describes both the architecture and the rationale behind major design decisions.

Architectural Status

Current implementation supports:

Execution

Multi-core execution
Thread-level execution
Scheduler-driven execution model
Fetch / Decode / Execute pipeline
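
The scheduler-driven flow can be pictured as a small state machine. The sketch below is an assumption-laden model, not the RTL: the state names follow the fetch → decode → execute → memory → retire wording of this PR, and the input flags (`start`, `fetch_done`, `mem_done`, `is_ret`) are hypothetical names chosen for illustration.

```python
from enum import Enum, auto


class CoreState(Enum):
    IDLE = auto()
    FETCH = auto()
    DECODE = auto()
    EXECUTE = auto()
    MEMORY = auto()
    RETIRE = auto()
    DONE = auto()


def next_state(state, start=False, fetch_done=False, mem_done=True, is_ret=False):
    """Return the state that follows `state` for the given (hypothetical) inputs."""
    if state is CoreState.IDLE:
        return CoreState.FETCH if start else CoreState.IDLE
    if state is CoreState.FETCH:
        # Stay in FETCH until program memory returns the instruction.
        return CoreState.DECODE if fetch_done else CoreState.FETCH
    if state is CoreState.DECODE:
        return CoreState.EXECUTE
    if state is CoreState.EXECUTE:
        return CoreState.MEMORY
    if state is CoreState.MEMORY:
        # Stall while a load or store is still outstanding.
        return CoreState.RETIRE if mem_done else CoreState.MEMORY
    if state is CoreState.RETIRE:
        # RET ends the program; anything else fetches the next instruction.
        return CoreState.DONE if is_ret else CoreState.FETCH
    return CoreState.DONE


if __name__ == "__main__":
    state = next_state(CoreState.IDLE, start=True)
    while state is not CoreState.DONE:
        print(state.name)
        state = next_state(state, fetch_done=True, is_ret=True)
```

Modelling the transitions this way makes the stall points explicit: the core only waits in FETCH and MEMORY, which is where the scheduler regression checks wait-state behavior.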

Memory

Program memory subsystem
Data memory subsystem
Memory arbitration controllers
External memory interfaces

Instruction Set

Implemented instructions:

CONST
CMP
ADD
SUB
MUL
DIV
LOAD
STORE
BRN
BRZ
BRP
RET
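
To make the instruction list concrete, here is a small reference interpreter for a single thread. It is a sketch under stated assumptions rather than the project's ISA definition: the tuple-based instruction format, operand order, 8-bit registers, absolute branch targets, and a DIV-by-zero result of 0 are all hypothetical choices. The ISA Reference added in this PR is the authority on real encodings and semantics.

```python
def run(program, memory=None, num_regs=16, max_steps=1000):
    """Execute `program` (a list of tuples) and return (registers, memory)."""
    regs = [0] * num_regs
    mem = dict(memory or {})
    nzp = None  # result of the last CMP: "N", "Z", or "P"
    pc = 0

    for _ in range(max_steps):
        op, *args = program[pc]
        next_pc = pc + 1

        if op == "CONST":
            rd, imm = args
            regs[rd] = imm & 0xFF  # assumed 8-bit registers
        elif op in ("ADD", "SUB", "MUL", "DIV"):
            rd, rs, rt = args
            a, b = regs[rs], regs[rt]
            if op == "ADD":
                result = a + b
            elif op == "SUB":
                result = a - b
            elif op == "MUL":
                result = a * b
            else:
                result = a // b if b else 0  # assumed divide-by-zero result
            regs[rd] = result & 0xFF
        elif op == "CMP":
            rs, rt = args
            diff = regs[rs] - regs[rt]
            nzp = "N" if diff < 0 else "Z" if diff == 0 else "P"
        elif op in ("BRN", "BRZ", "BRP"):
            (target,) = args
            if nzp == op[-1]:  # branch only when the matching flag is set
                next_pc = target
        elif op == "LOAD":
            rd, rs = args
            regs[rd] = mem.get(regs[rs], 0)
        elif op == "STORE":
            rs, rt = args
            mem[regs[rs]] = regs[rt]
        elif op == "RET":
            return regs, mem
        else:
            raise ValueError(f"unknown opcode: {op}")
        pc = next_pc

    raise RuntimeError("step limit reached without RET")


# Sum 1 + 2 + 3 in a loop, then store the result at address 10.
PROGRAM = [
    ("CONST", 0, 0),   # r0 = accumulator
    ("CONST", 1, 1),   # r1 = step
    ("CONST", 2, 3),   # r2 = limit
    ("CONST", 3, 0),   # r3 = counter
    ("ADD", 3, 3, 1),  # counter += 1
    ("ADD", 0, 0, 3),  # accumulator += counter
    ("CMP", 3, 2),
    ("BRN", 4),        # loop while counter < limit
    ("CONST", 4, 10),
    ("STORE", 4, 0),   # mem[10] = accumulator
    ("RET",),
]

if __name__ == "__main__":
    registers, data = run(PROGRAM)
    print(registers[0], data)
```

A model like this is also a convenient oracle for regression tests: run the same program through the RTL and the interpreter, then compare the final register and memory state.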

Verification Results

Primary regression suites:

pytest -s tb/run_core_programs.py
pytest -s tb/run_gpu_top.py

Current status:

PASS

Design Philosophy

This project intentionally prioritizes understandability over performance.

Several advanced GPU concepts are intentionally omitted from this release:

Warp switching
Branch divergence handling
SIMD execution
Cache hierarchy
Memory coalescing
Advanced scheduling
Pipeline optimization

The objective is to provide a compact and approachable architecture that demonstrates core GPU execution concepts before introducing production-grade complexity.

Known Limitations

Current limitations include:

Limited ISA functionality (logic and shift instructions are planned for v0.2.0)
No cache implementation
No architectural scoreboarding
No FPGA deployment flow
Simplified scheduling model
Simplified memory model

These limitations are expected and are tracked in the project roadmap.

Future Work

Planned future milestones include:

v0.2.0

Logic instructions
Shift instructions
ISA expansion

v0.3.0

Architectural scoreboarding
Improved verification

v0.4.0

Warp scheduling enhancements
Occupancy management

Long-Term

Cache hierarchy
SIMD execution
FPGA deployment
Performance analysis

Summary

This PR transitions Tiny GPU from a collection of RTL components into a complete executable GPU platform.

Key outcomes:

Complete GPU integration
Executable assembly workflow
Automated regression infrastructure
Expanded documentation
Stable baseline for future development

This milestone is intended to serve as the foundation for all future ISA, scheduling, memory-system, and verification enhancements.

- Add centralized storage for active warp contexts
- Support warp allocation and context updates
- Integrate warp_context instances into table structure
- Provide scheduler-visible warp context array
- Establish foundation for warp-based execution

- Replace magic state values with architecture enums
- Add strong typing for scheduler interfaces
- Move LSU wait detection into combinational logic
- Convert to modern SystemVerilog style
- Improve readability and maintainability
- Preserve existing execution behavior

- Replace local fetch states with architecture enums
- Add strong typing for scheduler interfaces
- Convert to modern SystemVerilog style
- Improve readability and maintainability
- Preserve existing functionality
- Prepare fetch path for instruction cache integration

- Restore original Tiny GPU ISA decode behavior
- Convert decoder to modern SystemVerilog style
- Replace raw state values with architecture enums
- Add opcode enum definitions
- Improve readability and maintainability
- Preserve architectural functionality

- Add fetch unit verification environment
- Verify reset behavior
- Verify fetch request generation
- Verify instruction fetch completion
- Verify fetch-to-decode transition
- Verify wait state behavior
- Verify multiple sequential fetches
- Verify non-fetch state handling

- Verify instruction field extraction
- Verify arithmetic instruction decode
- Verify compare instruction decode
- Verify branch instruction decode
- Verify load/store instruction decode
- Verify constant instruction decode
- Verify return instruction decode

- Add reusable core program execution framework
- Verify CONST instruction execution
- Verify ADD instruction execution
- Verify SUB instruction execution
- Verify MUL instruction execution
- Verify DIV instruction execution
- Verify architectural register state

- Verify CONST instruction
- Verify ADD instruction
- Verify SUB instruction
- Verify MUL instruction
- Verify DIV instruction
- Verify CMP and NZP updates
- Verify branch control flow
- Verify load transactions
- Verify store transactions
- Add multi-thread memory model
- Complete core architectural regression

cp024s added 30 commits June 1, 2026 14:11

repo: establish Tiny GPU Next architecture foundation

ac1dbdf

arch: define global architectural parameters

1b70d21

- Add central architectural parameter definitions
- Define warp, register, memory and cache constants
- Establish single source of truth for configuration
- Prepare repository for warp-based execution model

arch: introduce common GPU architecture types

274794a

- Add warp state enumeration
- Add thread mask abstraction
- Add warp context structure
- Establish common architecture-level types
- Prepare infrastructure for warp scheduling

arch: introduce central GPU package

61af769

- Add ISA opcode definitions
- Add scheduler utility functions
- Add architectural constants
- Centralize shared architecture definitions
- Establish foundation for warp execution model

docs: define Tiny GPU Next execution model

705ce48

- Define thread and warp abstractions
- Document warp lifecycle and scheduler states
- Specify active mask behavior
- Establish round-robin scheduling policy
- Define architectural foundation for SIMT execution

arch: add warp context management module

0ce0273

- Add architectural warp context storage
- Implement allocation and update interfaces
- Track PC, active mask and warp state
- Establish foundation for warp scheduling
- Prepare for SIMT execution model
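
For readers unfamiliar with warp contexts, the sketch below shows one way the allocation and update interfaces named in this commit could behave. It is a Python illustration only: `WarpState`, `WarpContext`, `WarpTable`, and `active_lanes` are hypothetical names, and the three-state lifecycle is an assumption rather than the enumeration used in the RTL.

```python
from dataclasses import dataclass
from enum import Enum, auto


class WarpState(Enum):
    IDLE = auto()   # slot is free
    READY = auto()  # warp can be scheduled
    DONE = auto()   # warp has finished


@dataclass
class WarpContext:
    pc: int = 0
    active_mask: int = 0  # bit i set means thread i in the warp is active
    state: WarpState = WarpState.IDLE


class WarpTable:
    def __init__(self, num_warps):
        self.entries = [WarpContext() for _ in range(num_warps)]

    def allocate(self, start_pc, num_threads):
        """Claim the first idle slot; return its warp ID, or None if full."""
        for warp_id, entry in enumerate(self.entries):
            if entry.state is WarpState.IDLE:
                entry.pc = start_pc
                entry.active_mask = (1 << num_threads) - 1
                entry.state = WarpState.READY
                return warp_id
        return None

    def update(self, warp_id, pc=None, active_mask=None, state=None):
        """Overwrite only the fields that are supplied."""
        entry = self.entries[warp_id]
        if pc is not None:
            entry.pc = pc
        if active_mask is not None:
            entry.active_mask = active_mask
        if state is not None:
            entry.state = state


def active_lanes(mask):
    """List the thread indices whose bit is set in `mask`."""
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]


if __name__ == "__main__":
    table = WarpTable(num_warps=2)
    warp_id = table.allocate(start_pc=0, num_threads=4)
    table.update(warp_id, pc=5, active_mask=0b1011)
    print(warp_id, table.entries[warp_id], active_lanes(0b1011))
```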

arch: add round robin warp scheduler

7ef1980

- Add warp-based scheduling infrastructure
- Implement round robin warp selection
- Integrate warp context abstraction
- Establish foundation for SIMT execution
- Prepare for dispatch integration
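
Round-robin warp selection, as named in this commit, can be sketched in a few lines. The function below is an illustrative model, not the RTL: `select_next_warp` and its arguments are hypothetical, and it assumes the search starts at the warp after the one most recently selected.

```python
def select_next_warp(ready, last_selected):
    """Pick the next ready warp after `last_selected`, wrapping around.

    ready:         list of booleans, one per warp slot
    last_selected: index of the warp chosen on the previous pick
    Returns the chosen warp index, or None when no warp is ready.
    """
    count = len(ready)
    for offset in range(1, count + 1):
        candidate = (last_selected + offset) % count
        if ready[candidate]:
            return candidate
    return None


if __name__ == "__main__":
    ready = [True, False, True, True]
    last = 0
    for _ in range(4):
        last = select_next_warp(ready, last)
        print(last)
```

Starting the search one slot past the previous winner is what gives each ready warp a turn before any warp is picked twice. The final offset wraps back to the previous winner, so a lone ready warp is still selected.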

refactor: rename dispatch to block_dispatch

2d6a62f

docs: define Tiny GPU Next system architecture

34ba54a

- Define system hierarchy
- Establish component ownership
- Document execution flow
- Clarify architectural responsibilities
- Define long-term evolution path

arch: add warp allocation path to dispatcher

7059e8a

- Add warp allocation interface
- Generate warp allocation events during block dispatch
- Introduce warp ID tracking
- Connect dispatcher to warp table infrastructure
- Preserve existing block scheduling behavior

refactor: modernize pc update logic

e45fbde

- Replace magic state values with architecture enums
- Convert to modern SystemVerilog style
- Simplify NZP register updates
- Improve readability and structure
- Preserve existing branch behavior
- Prepare PC path for future divergence support

refactor: add common lsu state type

b991a26

- Add lsu_state_t enum
- Standardize architectural state definitions
- Improve type safety across scheduler and LSU
- Preserve existing functionality

refactor: modernize alu implementation

1fc0f3a

- Replace reg/wire with logic
- Introduce core_state_t usage
- Convert to always_ff
- Add divide-by-zero protection
- Improve readability and maintainability

refactor: simplify lsu state machine

2b36864

RTL: RTL architecture cleanup

3541729

- Modernize core subsystem
- Convert modules to SystemVerilog style
- Integrate scheduler/fetch/lsu updates
- Clean Verilator lint across core hierarchy
- Preserve Adam Maj Tiny GPU ISA behavior

rtl: complete sprint1 architecture cleanup

23c94a1

dv: add ALU cocotb regression

85c5ba3

- Add verilator+cocotb infrastructure
- Add ALU directed tests
- Verify arithmetic operations
- Verify compare operations
- Verify divide-by-zero behavior
- Verify enable gating

dv: add register file cocotb regression

8de6349

- Add register file verification environment
- Verify reset behavior
- Verify special registers
- Verify arithmetic writeback
- Verify memory writeback
- Verify constant writeback
- Verify write protection
- Verify enable gating
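
The special-register, write-protection, and enable-gating checks in this commit can be pictured against a simple behavioral model. The sketch below is hypothetical: it assumes 16 registers of 8 bits each, with the top three (13 to 15) reserved as read-only block index, block dimension, and thread index. The register file RTL and the ISA Reference are the authority on the real layout.

```python
class RegisterFile:
    """Hypothetical per-thread register file with read-only special registers."""

    NUM_REGS = 16
    FIRST_SPECIAL = 13  # assumed: r13 = block index, r14 = block dim, r15 = thread index

    def __init__(self, block_idx, block_dim, thread_idx):
        self.regs = [0] * self.NUM_REGS
        self.regs[self.FIRST_SPECIAL:] = [block_idx, block_dim, thread_idx]

    def read(self, index):
        return self.regs[index]

    def write(self, index, value, enable=True):
        """Write `value` if allowed; return True when the write took effect."""
        if not enable or index >= self.FIRST_SPECIAL:
            return False  # gated off, or the target is a protected register
        self.regs[index] = value & 0xFF  # assumed 8-bit registers
        return True


if __name__ == "__main__":
    rf = RegisterFile(block_idx=2, block_dim=4, thread_idx=3)
    print(rf.write(0, 300), rf.read(0))   # write succeeds, value wraps to 8 bits
    print(rf.write(15, 9), rf.read(15))   # protected register is left unchanged
```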

dv: add scheduler cocotb regression

93b9fe9

- Verify reset behavior
- Verify idle to fetch transition
- Verify fetch to decode transition
- Verify pipeline progression without memory access
- Verify return handling and done transition

DV: Add core subsystem regression

c702fd4

- Add core-level cocotb environment
- Verify fetch/decode/scheduler integration
- Verify RET instruction execution
- Verify program memory interface
- Verify done-state progression
- Establish subsystem verification framework

DV: add cmp and nzp verification

2aed6eb

- Add CMP instruction regression
- Verify NZP generation
- Verify PC NZP register update
- Extend core architectural verification

DV: Add branch control flow verification

b504de5

- Verify CMP instruction
- Verify NZP generation
- Verify branch decode
- Verify PC redirection
- Verify control flow execution

cp024s added 4 commits June 2, 2026 22:13

DV: Add GPU top integration regression

8c33659

- Add gpu_top cocotb regression
- Add program memory model
- Add data memory model
- Verify DCR configuration
- Verify block dispatch
- Verify core startup
- Verify end-to-end program execution
- Verify GPU completion path

toolchain: add tiny gpu assembler

50d4add

docs: add architecture specification

8845671

- Document GPU top-level architecture
- Describe execution pipeline
- Document memory subsystem
- Document compute core structure
- Establish v0.1.0-alpha architecture baseline

feat: deliver tiny-gpu v0.1.0-alpha

fca73fa

cp024s wants to merge 34 commits into adam-maj:master from cp024s:master