Tiny GPU v0.1.0-alpha: End-to-End Execution Platform #60

Open: cp024s wants to merge 34 commits into adam-maj:master from cp024s:master

cp024s commented Jun 4, 2026

Tiny GPU v0.1.0-alpha: End-to-End Execution Platform

Overview

This PR establishes the first complete executable Tiny GPU platform.

The repository originally contained many of the fundamental RTL building blocks required for a GPU implementation, but lacked a fully integrated execution flow demonstrating how kernels are loaded, dispatched, executed, and verified end-to-end.

This work focuses on turning the project into a coherent and executable GPU system while preserving the project's educational goals and architectural simplicity.

The result is a functional GPU prototype capable of executing assembly programs through a complete fetch → decode → execute → memory → retire flow.


Objectives

The primary objectives of this milestone were:

  • Integrate the major RTL subsystems into a complete GPU platform

  • Establish an executable assembly workflow

  • Validate instruction execution through automated regression testing

  • Improve project documentation and onboarding

  • Create a foundation for future ISA and architectural expansion


Major Additions

GPU Integration

Integrated the major architectural components into a unified GPU top-level design:

  • Device Control Register (DCR)

  • Block Dispatcher

  • Program Memory Controller

  • Data Memory Controller

  • Multi-Core Infrastructure

  • Compute Core Integration

The GPU can now launch kernels, distribute work across cores, and coordinate memory transactions through centralized controllers.
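
As a rough illustration of how work might be distributed across cores, the sketch below splits a kernel's threads into blocks and hands the blocks to cores in order. It is a behavioral model only, not the RTL dispatcher: the names `Block`, `split_into_blocks`, and `dispatch`, and the round-robin assignment (which assumes every block takes the same time), are assumptions made for this example.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Block:
    block_id: int
    thread_count: int  # number of active threads in this block


def split_into_blocks(total_threads: int, threads_per_block: int) -> List[Block]:
    """Split a kernel launch into blocks; the last block may be partially filled."""
    num_blocks = (total_threads + threads_per_block - 1) // threads_per_block
    blocks = []
    for block_id in range(num_blocks):
        remaining = total_threads - block_id * threads_per_block
        blocks.append(Block(block_id, min(threads_per_block, remaining)))
    return blocks


def dispatch(blocks: List[Block], num_cores: int) -> Dict[int, List[int]]:
    """Assign block IDs to cores round-robin (assumes equal block run time)."""
    schedule: Dict[int, List[int]] = {core: [] for core in range(num_cores)}
    for index, block in enumerate(blocks):
        schedule[index % num_cores].append(block.block_id)
    return schedule


# Example launch: 10 threads, 4 threads per block, 2 cores.
blocks = split_into_blocks(total_threads=10, threads_per_block=4)
schedule = dispatch(blocks, num_cores=2)
```

With these example values the launch produces three blocks (4, 4, and 2 threads); core 0 receives blocks 0 and 2, and core 1 receives block 1.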


Assembly Toolchain

Expanded the software tooling required to exercise the RTL.

Added:

  • Assembly workflow

  • Example assembly programs

  • Program execution infrastructure

Example programs now cover:

  • CONST

  • CMP

  • ADD

  • SUB

  • MUL

  • DIV

  • LOAD

  • STORE

  • Branching

These programs provide executable examples of the supported instruction set and serve as regression workloads.


Verification Infrastructure

Added and expanded cocotb-based verification.

Regression coverage now includes:

Core-Level Verification

  • Arithmetic operations

  • Branch execution

  • Memory operations

  • Program execution

GPU-Level Verification

  • Kernel launch flow

  • Dispatch infrastructure

  • Program memory interfaces

  • Data memory interfaces

  • End-to-end execution

The goal of this phase was functional correctness rather than exhaustive architectural verification.
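
To give a feel for the kind of memory model a testbench can use to service load and store requests, here is a minimal pure-Python sketch. It does not use cocotb and does not mirror the models in this PR: the class name `DataMemoryModel`, the fixed `latency`, and the 8-bit data width are all assumptions for illustration.

```python
class DataMemoryModel:
    """Hypothetical testbench-side memory with a fixed response latency."""

    def __init__(self, size=256, latency=2):
        self.data = [0] * size
        self.latency = latency
        self.in_flight = []  # each entry: [cycles_remaining, kind, addr, value]

    def request_read(self, addr):
        self.in_flight.append([self.latency, "read", addr, None])

    def request_write(self, addr, value):
        # Assumed 8-bit data path: keep only the low byte.
        self.in_flight.append([self.latency, "write", addr, value & 0xFF])

    def tick(self):
        """Advance one cycle and return the requests that completed."""
        completed = []
        still_waiting = []
        for request in self.in_flight:
            request[0] -= 1
            if request[0] > 0:
                still_waiting.append(request)
                continue
            _, kind, addr, value = request
            if kind == "write":
                self.data[addr] = value
            else:
                value = self.data[addr]
            completed.append((kind, addr, value))
        self.in_flight = still_waiting
        return completed
```

In a real cocotb test the `request_*` calls and `tick` would be driven from the DUT's valid/ready signals on each clock edge; here they are called directly so the model can be exercised on its own.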


Documentation

Substantially expanded project documentation.

Added:

  • Architecture Guide

  • ISA Reference

  • Verification Guide

  • Project Status

  • Development Roadmap

  • Release Notes

The documentation now describes both the architecture and the rationale behind major design decisions.


Architectural Status

Current implementation supports:

Execution

  • Multi-core execution

  • Thread-level execution

  • Scheduler-driven execution model

  • Fetch / Decode / Execute pipeline

Memory

  • Program memory subsystem

  • Data memory subsystem

  • Memory arbitration controllers

  • External memory interfaces
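
The idea behind a memory arbitration controller can be sketched as follows: several consumers (for example, per-thread load/store units) compete for a smaller number of external memory channels, and each free channel is granted to one waiting consumer. The function name `arbitrate` and the fixed lowest-index-first priority are assumptions for this example; the RTL controllers may use a different policy.

```python
def arbitrate(pending, num_channels):
    """Grant free channels to waiting consumers, lowest consumer index first.

    pending: list of booleans, one per consumer (True = has a request).
    Returns a dict mapping channel index -> granted consumer index.
    """
    waiting = [consumer for consumer, has_request in enumerate(pending) if has_request]
    grants = {}
    for channel in range(num_channels):
        if not waiting:
            break
        grants[channel] = waiting.pop(0)
    return grants


# Four consumers, two channels: consumers 0, 2, and 3 are requesting.
grants = arbitrate([True, False, True, True], num_channels=2)
```

In this example consumers 0 and 2 are granted channels 0 and 1, and consumer 3 has to wait for a later cycle.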

Instruction Set

Implemented instructions:

  • CONST

  • CMP

  • ADD

  • SUB

  • MUL

  • DIV

  • LOAD

  • STORE

  • BRN

  • BRZ

  • BRP

  • RET
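
The semantics of the instructions listed above can be approximated with a small single-thread reference interpreter. This is a sketch under stated assumptions, not the ISA Reference: the tuple-based program format, 8-bit registers, a divide-by-zero result of 0, and the reading of BRN/BRZ/BRP as "branch if the last CMP was negative/zero/positive" are all assumptions made for illustration.

```python
ARITH = {
    "ADD": lambda a, b: a + b,
    "SUB": lambda a, b: a - b,
    "MUL": lambda a, b: a * b,
    "DIV": lambda a, b: a // b if b != 0 else 0,  # assumed divide-by-zero result
}
BRANCH_FLAG = {"BRN": 0, "BRZ": 1, "BRP": 2}  # index into (n, z, p)


def run(program, memory=None, num_regs=16, max_steps=10_000):
    """Execute a list of (opcode, *operands) tuples until RET."""
    regs = [0] * num_regs
    mem = dict(memory or {})
    flags = (False, False, False)  # (negative, zero, positive) from last CMP
    pc = 0
    for _ in range(max_steps):
        op, *args = program[pc]
        next_pc = pc + 1
        if op == "CONST":
            rd, imm = args
            regs[rd] = imm & 0xFF
        elif op in ARITH:
            rd, rs, rt = args
            regs[rd] = ARITH[op](regs[rs], regs[rt]) & 0xFF
        elif op == "CMP":
            rs, rt = args
            diff = regs[rs] - regs[rt]
            flags = (diff < 0, diff == 0, diff > 0)
        elif op == "LOAD":
            rd, rs = args
            regs[rd] = mem.get(regs[rs], 0)
        elif op == "STORE":
            rs, rt = args  # rs holds the address, rt the value
            mem[regs[rs]] = regs[rt]
        elif op in BRANCH_FLAG:
            (target,) = args
            if flags[BRANCH_FLAG[op]]:
                next_pc = target
        elif op == "RET":
            return regs, mem
        else:
            raise ValueError(f"unknown opcode {op!r}")
        pc = next_pc
    raise RuntimeError("program did not reach RET")


# Sum 3 + 2 + 1 with a countdown loop, then store the result at address 0.
LOOP = [
    ("CONST", 0, 3),   # r0 = counter
    ("CONST", 1, 1),   # r1 = step
    ("CONST", 2, 0),   # r2 = accumulator
    ("CONST", 3, 0),   # r3 = zero (also the store address)
    ("ADD", 2, 2, 0),  # acc += counter
    ("SUB", 0, 0, 1),  # counter -= 1
    ("CMP", 0, 3),
    ("BRP", 4),        # loop while counter > 0
    ("STORE", 3, 2),
    ("RET",),
]
regs, mem = run(LOOP)
```

A model like this can serve as a golden reference when checking register and memory state after a regression program finishes.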


Verification Results

Primary regression suites:

pytest -s tb/run_core_programs.py
pytest -s tb/run_gpu_top.py

Current status:

PASS

Verified functionality includes:

| Feature | Status |
| --- | --- |
| Arithmetic Execution | PASS |
| Branch Execution | PASS |
| Load Operations | PASS |
| Store Operations | PASS |
| Program Execution | PASS |
| GPU Integration | PASS |


Design Philosophy

This project intentionally prioritizes understandability over performance.

Several advanced GPU concepts are intentionally omitted from this release:

  • Warp switching

  • Branch divergence handling

  • SIMD execution

  • Cache hierarchy

  • Memory coalescing

  • Advanced scheduling

  • Pipeline optimization

The objective is to provide a compact and approachable architecture that demonstrates core GPU execution concepts before introducing production-grade complexity.


Known Limitations

Current limitations include:

  • Limited ISA functionality

  • No cache implementation

  • No architectural scoreboarding

  • No FPGA deployment flow

  • Simplified scheduling model

  • Simplified memory model

These limitations are expected and are tracked in the project roadmap.


Future Work

Planned future milestones include:

v0.2.0

  • Logic instructions

  • Shift instructions

  • ISA expansion

v0.3.0

  • Architectural scoreboarding

  • Improved verification

v0.4.0

  • Warp scheduling enhancements

  • Occupancy management

Long-Term

  • Cache hierarchy

  • SIMD execution

  • FPGA deployment

  • Performance analysis


Summary

This PR transitions Tiny GPU from a collection of RTL components into a complete executable GPU platform.

Key outcomes:

  • Complete GPU integration

  • Executable assembly workflow

  • Automated regression infrastructure

  • Expanded documentation

  • Stable baseline for future development

This milestone is intended to serve as the foundation for all future ISA, scheduling, memory-system, and verification enhancements.

cp024s added 30 commits June 1, 2026 14:11
- Add central architectural parameter definitions
- Define warp, register, memory and cache constants
- Establish single source of truth for configuration
- Prepare repository for warp-based execution model
- Add warp state enumeration
- Add thread mask abstraction
- Add warp context structure
- Establish common architecture-level types
- Prepare infrastructure for warp scheduling
- Add ISA opcode definitions
- Add scheduler utility functions
- Add architectural constants
- Centralize shared architecture definitions
- Establish foundation for warp execution model
- Define thread and warp abstractions
- Document warp lifecycle and scheduler states
- Specify active mask behavior
- Establish round-robin scheduling policy
- Define architectural foundation for SIMT execution
- Add architectural warp context storage
- Implement allocation and update interfaces
- Track PC, active mask and warp state
- Establish foundation for warp scheduling
- Prepare for SIMT execution model
- Add warp-based scheduling infrastructure
- Implement round robin warp selection
- Integrate warp context abstraction
- Establish foundation for SIMT execution
- Prepare for dispatch integration
- Add centralized storage for active warp contexts
- Support warp allocation and context updates
- Integrate warp_context instances into table structure
- Provide scheduler-visible warp context array
- Establish foundation for warp-based execution
- Define system hierarchy
- Establish component ownership
- Document execution flow
- Clarify architectural responsibilities
- Define long-term evolution path
- Add warp allocation interface
- Generate warp allocation events during block dispatch
- Introduce warp ID tracking
- Connect dispatcher to warp table infrastructure
- Preserve existing block scheduling behavior
- Replace magic state values with architecture enums
- Add strong typing for scheduler interfaces
- Move LSU wait detection into combinational logic
- Convert to modern SystemVerilog style
- Improve readability and maintainability
- Preserve existing execution behavior
- Replace local fetch states with architecture enums
- Add strong typing for scheduler interfaces
- Convert to modern SystemVerilog style
- Improve readability and maintainability
- Preserve existing functionality
- Prepare fetch path for instruction cache integration
- Replace magic state values with architecture enums
- Convert to modern SystemVerilog style
- Simplify NZP register updates
- Improve readability and structure
- Preserve existing branch behavior
- Prepare PC path for future divergence support
- Add lsu_state_t enum
- Standardize architectural state definitions
- Improve type safety across scheduler and LSU
- Preserve existing functionality
- Replace reg/wire with logic
- Introduce core_state_t usage
- Convert to always_ff
- Add divide-by-zero protection
- Improve readability and maintainability
- Restore original Tiny GPU ISA decode behavior
- Convert decoder to modern SystemVerilog style
- Replace raw state values with architecture enums
- Add opcode enum definitions
- Improve readability and maintainability
- Preserve architectural functionality
- Modernize core subsystem
- Convert modules to SystemVerilog style
- Integrate scheduler/fetch/lsu updates
- Clean Verilator lint across core hierarchy
- Preserve Adam Maj Tiny GPU ISA behavior
- Add verilator+cocotb infrastructure
- Add ALU directed tests
- Verify arithmetic operations
- Verify compare operations
- Verify divide-by-zero behavior
- Verify enable gating
- Add register file verification environment
- Verify reset behavior
- Verify special registers
- Verify arithmetic writeback
- Verify memory writeback
- Verify constant writeback
- Verify write protection
- Verify enable gating
- Add fetch unit verification environment
- Verify reset behavior
- Verify fetch request generation
- Verify instruction fetch completion
- Verify fetch-to-decode transition
- Verify wait state behavior
- Verify multiple sequential fetches
- Verify non-fetch state handling
- Verify reset behavior
- Verify idle to fetch transition
- Verify fetch to decode transition
- Verify pipeline progression without memory access
- Verify return handling and done transition
- Verify instruction field extraction
- Verify arithmetic instruction decode
- Verify compare instruction decode
- Verify branch instruction decode
- Verify load/store instruction decode
- Verify constant instruction decode
- Verify return instruction decode
- Add core-level cocotb environment
- Verify fetch/decode/scheduler integration
- Verify RET instruction execution
- Verify program memory interface
- Verify done-state progression
- Establish subsystem verification framework
- Add reusable core program execution framework
- Verify CONST instruction execution
- Verify ADD instruction execution
- Verify SUB instruction execution
- Verify MUL instruction execution
- Verify DIV instruction execution
- Verify architectural register state
- Add CMP instruction regression
- Verify NZP generation
- Verify PC NZP register update
- Extend core architectural verification
- Verify CMP instruction
- Verify NZP generation
- Verify branch decode
- Verify PC redirection
- Verify control flow execution
- Verify CONST instruction
- Verify ADD instruction
- Verify SUB instruction
- Verify MUL instruction
- Verify DIV instruction
- Verify CMP and NZP updates
- Verify branch control flow
- Verify load transactions
- Verify store transactions
- Add multi-thread memory model
- Complete core architectural regression
cp024s added 4 commits June 2, 2026 22:13
- Add gpu_top cocotb regression
- Add program memory model
- Add data memory model
- Verify DCR configuration
- Verify block dispatch
- Verify core startup
- Verify end-to-end program execution
- Verify GPU completion path
- Document GPU top-level architecture
- Describe execution pipeline
- Document memory subsystem
- Document compute core structure
- Establish v0.1.0-alpha architecture baseline