Tiny GPU v0.1.0-alpha: End-to-End Execution Platform #60
Open
cp024s wants to merge 34 commits into
- Add central architectural parameter definitions - Define warp, register, memory and cache constants - Establish single source of truth for configuration - Prepare repository for warp-based execution model
- Add warp state enumeration - Add thread mask abstraction - Add warp context structure - Establish common architecture-level types - Prepare infrastructure for warp scheduling
- Add ISA opcode definitions - Add scheduler utility functions - Add architectural constants - Centralize shared architecture definitions - Establish foundation for warp execution model
- Define thread and warp abstractions - Document warp lifecycle and scheduler states - Specify active mask behavior - Establish round-robin scheduling policy - Define architectural foundation for SIMT execution
- Add architectural warp context storage - Implement allocation and update interfaces - Track PC, active mask and warp state - Establish foundation for warp scheduling - Prepare for SIMT execution model
- Add warp-based scheduling infrastructure - Implement round robin warp selection - Integrate warp context abstraction - Establish foundation for SIMT execution - Prepare for dispatch integration
- Add centralized storage for active warp contexts - Support warp allocation and context updates - Integrate warp_context instances into table structure - Provide scheduler-visible warp context array - Establish foundation for warp-based execution
- Define system hierarchy - Establish component ownership - Document execution flow - Clarify architectural responsibilities - Define long-term evolution path
- Add warp allocation interface - Generate warp allocation events during block dispatch - Introduce warp ID tracking - Connect dispatcher to warp table infrastructure - Preserve existing block scheduling behavior
- Replace magic state values with architecture enums - Add strong typing for scheduler interfaces - Move LSU wait detection into combinational logic - Convert to modern SystemVerilog style - Improve readability and maintainability - Preserve existing execution behavior
- Replace local fetch states with architecture enums - Add strong typing for scheduler interfaces - Convert to modern SystemVerilog style - Improve readability and maintainability - Preserve existing functionality - Prepare fetch path for instruction cache integration
- Replace magic state values with architecture enums - Convert to modern SystemVerilog style - Simplify NZP register updates - Improve readability and structure - Preserve existing branch behavior - Prepare PC path for future divergence support
- Add lsu_state_t enum - Standardize architectural state definitions - Improve type safety across scheduler and LSU - Preserve existing functionality
- Replace reg/wire with logic - Introduce core_state_t usage - Convert to always_ff - Add divide-by-zero protection - Improve readability and maintainability
- Restore original Tiny GPU ISA decode behavior - Convert decoder to modern SystemVerilog style - Replace raw state values with architecture enums - Add opcode enum definitions - Improve readability and maintainability - Preserve architectural functionality
- Modernize core subsystem - Convert modules to SystemVerilog style - Integrate scheduler/fetch/lsu updates - Clean Verilator lint across core hierarchy - Preserve Adam Maj Tiny GPU ISA behavior
- Add verilator+cocotb infrastructure - Add ALU directed tests - Verify arithmetic operations - Verify compare operations - Verify divide-by-zero behavior - Verify enable gating
- Add register file verification environment - Verify reset behavior - Verify special registers - Verify arithmetic writeback - Verify memory writeback - Verify constant writeback - Verify write protection - Verify enable gating
- Add fetch unit verification environment - Verify reset behavior - Verify fetch request generation - Verify instruction fetch completion - Verify fetch-to-decode transition - Verify wait state behavior - Verify multiple sequential fetches - Verify non-fetch state handling
- Verify reset behavior - Verify idle to fetch transition - Verify fetch to decode transition - Verify pipeline progression without memory access - Verify return handling and done transition
- Verify instruction field extraction - Verify arithmetic instruction decode - Verify compare instruction decode - Verify branch instruction decode - Verify load/store instruction decode - Verify constant instruction decode - Verify return instruction decode
- Add core-level cocotb environment - Verify fetch/decode/scheduler integration - Verify RET instruction execution - Verify program memory interface - Verify done-state progression - Establish subsystem verification framework
- Add reusable core program execution framework - Verify CONST instruction execution - Verify ADD instruction execution - Verify SUB instruction execution - Verify MUL instruction execution - Verify DIV instruction execution - Verify architectural register state
- Add CMP instruction regression - Verify NZP generation - Verify PC NZP register update - Extend core architectural verification
- Verify CMP instruction - Verify NZP generation - Verify branch decode - Verify PC redirection - Verify control flow execution
- Verify CONST instruction - Verify ADD instruction - Verify SUB instruction - Verify MUL instruction - Verify DIV instruction - Verify CMP and NZP updates - Verify branch control flow - Verify load transactions - Verify store transactions - Add multi-thread memory model - Complete core architectural regression
- Add gpu_top cocotb regression - Add program memory model - Add data memory model - Verify DCR configuration - Verify block dispatch - Verify core startup - Verify end-to-end program execution - Verify GPU completion path
- Document GPU top-level architecture - Describe execution pipeline - Document memory subsystem - Document compute core structure - Establish v0.1.0-alpha architecture baseline
Tiny GPU v0.1.0-alpha: End-to-End Execution Platform
Overview
This PR establishes the first complete executable Tiny GPU platform.
The repository originally contained many of the fundamental RTL building blocks required for a GPU implementation, but lacked a fully integrated execution flow demonstrating how kernels are loaded, dispatched, executed, and verified end-to-end.
This work focuses on turning the project into a coherent and executable GPU system while preserving the project's educational goals and architectural simplicity.
The result is a functional GPU prototype capable of executing assembly programs through a complete fetch → decode → execute → memory → retire flow.
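As an illustration of that flow, the following Python sketch models the stage sequence for a single instruction. It is a behavioural model, not the RTL: the stage names follow the flow named above, while `IDLE`, `DONE`, `next_stage`, `trace_instruction`, and the rule that the memory stage stalls while a request is outstanding are assumptions made for this example.

```python
from enum import Enum, auto


class Stage(Enum):
    IDLE = auto()
    FETCH = auto()
    DECODE = auto()
    EXECUTE = auto()
    MEMORY = auto()
    RETIRE = auto()
    DONE = auto()


def next_stage(stage, *, start=False, mem_pending=False, is_ret=False):
    """Return the stage that follows `stage` (hypothetical model)."""
    if stage is Stage.IDLE:
        return Stage.FETCH if start else Stage.IDLE
    if stage is Stage.FETCH:
        return Stage.DECODE
    if stage is Stage.DECODE:
        return Stage.EXECUTE
    if stage is Stage.EXECUTE:
        return Stage.MEMORY
    if stage is Stage.MEMORY:
        # Assumption: stall here while a load/store is still outstanding.
        return Stage.MEMORY if mem_pending else Stage.RETIRE
    if stage is Stage.RETIRE:
        # A RET ends the program; anything else fetches the next instruction.
        return Stage.DONE if is_ret else Stage.FETCH
    return Stage.DONE


def trace_instruction(is_ret, mem_wait_cycles=0):
    """List the stages one instruction passes through, starting at FETCH."""
    stage, stages = Stage.FETCH, []
    while True:
        stages.append(stage)
        if stage is Stage.RETIRE:
            stages.append(next_stage(stage, is_ret=is_ret))
            return stages
        pending = stage is Stage.MEMORY and mem_wait_cycles > 0
        if pending:
            mem_wait_cycles -= 1
        stage = next_stage(stage, mem_pending=pending)


if __name__ == "__main__":
    print([s.name for s in trace_instruction(is_ret=True)])
```

With `mem_wait_cycles=1` the trace repeats `MEMORY` once before `RETIRE`, which is how a stalled load or store would look in this model.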
Objectives
The primary objectives of this milestone were:
Integrate the major RTL subsystems into a complete GPU platform
Establish an executable assembly workflow
Validate instruction execution through automated regression testing
Improve project documentation and onboarding
Create a foundation for future ISA and architectural expansion
Major Additions
GPU Integration
Integrated the major architectural components into a unified GPU top-level design:
Device Control Register (DCR)
Block Dispatcher
Program Memory Controller
Data Memory Controller
Multi-Core Infrastructure
Compute Core Integration
The GPU can now launch kernels, distribute work across cores, and coordinate memory transactions through centralized controllers.
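One way to picture how work is distributed across cores is the sketch below. It is not derived from the dispatcher RTL: it assumes every block takes the same time, so handing blocks to whichever core is free reduces to round-robin order, and it assumes the last block may be only partially filled. The function name and its parameters (`thread_count`, `threads_per_block`, `num_cores`) are hypothetical.

```python
def dispatch_blocks(thread_count, threads_per_block, num_cores):
    """Assign thread blocks to cores.

    Returns {core_id: [(block_id, threads_in_block), ...]}.
    """
    if thread_count < 0 or threads_per_block <= 0 or num_cores <= 0:
        raise ValueError("counts must be positive (thread_count may be 0)")

    # Ceiling division: a partially filled final block still needs a slot.
    total_blocks = -(-thread_count // threads_per_block)

    schedule = {core: [] for core in range(num_cores)}
    for block_id in range(total_blocks):
        first_thread = block_id * threads_per_block
        threads_in_block = min(threads_per_block, thread_count - first_thread)
        # Assumption: equal-length blocks, so free cores rotate in order.
        schedule[block_id % num_cores].append((block_id, threads_in_block))
    return schedule


if __name__ == "__main__":
    # 10 threads, 4 threads per block, 2 cores.
    print(dispatch_blocks(10, 4, 2))
```

For 10 threads, 4 threads per block, and 2 cores, this model places blocks 0 and 2 on core 0 and block 1 on core 1, with block 2 carrying only the 2 leftover threads.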
Assembly Toolchain
Expanded the software tooling required to exercise the RTL.
Added:
Assembly workflow
Example assembly programs
Program execution infrastructure
Example programs now cover:
CONST
CMP
ADD
SUB
MUL
DIV
LOAD
STORE
Branching
These programs provide executable examples of the supported instruction set and serve as regression workloads.
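To make the idea of an assembly workflow concrete, here is a minimal parser sketch that turns assembly text into instruction tuples. The source syntax shown (`R<n>` registers, `#` immediates, `;` comments, numeric branch targets) and the names `OPERANDS`, `parse_operand`, and `assemble` are assumptions for illustration; the repository's actual assembler and syntax may differ.

```python
import re

# Operand shape per mnemonic: "r" = register, "i" = immediate or target.
# The mnemonics come from the PR description; the shapes are assumptions.
OPERANDS = {
    "CONST": "ri", "CMP": "rr",
    "ADD": "rrr", "SUB": "rrr", "MUL": "rrr", "DIV": "rrr",
    "LOAD": "rr", "STORE": "rr",
    "BRN": "i", "BRZ": "i", "BRP": "i",
    "RET": "",
}


def parse_operand(kind, token):
    """Convert one operand token to an integer."""
    if kind == "r":
        if not re.fullmatch(r"R\d+", token):
            raise ValueError(f"expected a register, got {token!r}")
        return int(token[1:])
    return int(token.lstrip("#"), 0)  # accepts 12, #12, 0x0C


def assemble(source):
    """Parse assembly text into a list of (mnemonic, *operands) tuples."""
    program = []
    for line_no, raw in enumerate(source.splitlines(), start=1):
        line = raw.split(";", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mnemonic, _, rest = line.partition(" ")
        mnemonic = mnemonic.upper()
        if mnemonic not in OPERANDS:
            raise ValueError(f"line {line_no}: unknown mnemonic {mnemonic!r}")
        tokens = [t.strip().upper() for t in rest.split(",") if t.strip()]
        kinds = OPERANDS[mnemonic]
        if len(tokens) != len(kinds):
            raise ValueError(f"line {line_no}: {mnemonic} takes {len(kinds)} operand(s)")
        operands = (parse_operand(k, t) for k, t in zip(kinds, tokens))
        program.append((mnemonic, *operands))
    return program


if __name__ == "__main__":
    print(assemble("CONST R0, #3 ; counter\nADD R1, R1, R0\nRET"))
```

A tuple form like this keeps the example independent of any binary encoding, which the PR description does not specify.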
Verification Infrastructure
Added and expanded cocotb-based verification.
Regression coverage now includes:
Core-Level Verification
Arithmetic operations
Branch execution
Memory operations
Program execution
GPU-Level Verification
Kernel launch flow
Dispatch infrastructure
Program memory interfaces
Data memory interfaces
End-to-end execution
The goal of this phase was functional correctness rather than exhaustive architectural verification.
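For readers unfamiliar with how a testbench stands in for external memory, the sketch below shows a plain-Python behavioural data memory of the kind a regression could drive each cycle. It deliberately avoids the cocotb API and is not the model added in this PR: the class name, the per-channel request dictionaries, the 8-bit cell width, and the read-before-write ordering within a cycle are all assumptions.

```python
class DataMemoryModel:
    """Behavioural data memory serving several request channels per cycle."""

    def __init__(self, size=256, channels=4):
        self.cells = [0] * size
        self.channels = channels

    def load(self, values, base=0):
        """Preload consecutive cells, e.g. with a kernel's input data."""
        for offset, value in enumerate(values):
            self.cells[base + offset] = value & 0xFF

    def cycle(self, reads, writes):
        """Service one cycle of requests.

        reads:  {channel: address}
        writes: {channel: (address, data)}
        Returns {channel: data} for every read serviced this cycle.
        """
        for channel in list(reads) + list(writes):
            if not 0 <= channel < self.channels:
                raise ValueError(f"no such channel: {channel}")

        # Assumption: reads observe memory as it was before this cycle's writes.
        read_data = {ch: self.cells[addr] for ch, addr in reads.items()}

        # Assumption: on an address clash the highest channel number wins.
        for _, (addr, data) in sorted(writes.items()):
            self.cells[addr] = data & 0xFF
        return read_data


if __name__ == "__main__":
    memory = DataMemoryModel()
    memory.load([1, 2, 3])
    print(memory.cycle(reads={0: 1}, writes={1: (1, 9)}))
    print(memory.cells[:3])
```

A test can then compare `memory.cells` against the values a kernel is expected to have stored once the design reports completion.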
Documentation
Substantially expanded project documentation.
Added:
Architecture Guide
ISA Reference
Verification Guide
Project Status
Development Roadmap
Release Notes
The documentation now describes both the architecture and the rationale behind major design decisions.
Architectural Status
Current implementation supports:
Execution
Multi-core execution
Thread-level execution
Scheduler-driven execution model
Fetch / Decode / Execute pipeline
Memory
Program memory subsystem
Data memory subsystem
Memory arbitration controllers
External memory interfaces
Instruction Set
Implemented instructions:
CONST
CMP
ADD
SUB
MUL
DIV
LOAD
STORE
BRN
BRZ
BRP
RET
Verification Results
Primary regression suites:
Current status:
Verified functionality includes:
Feature | Status
-- | --
Arithmetic Execution | PASS
Branch Execution | PASS
Load Operations | PASS
Store Operations | PASS
Program Execution | PASS
GPU Integration | PASS
Design Philosophy
This project intentionally prioritizes understandability over performance.
Several advanced GPU concepts are deliberately left out of this release:
Warp switching
Branch divergence handling
SIMD execution
Cache hierarchy
Memory coalescing
Advanced scheduling
Pipeline optimization
The objective is to provide a compact and approachable architecture that demonstrates core GPU execution concepts before introducing production-grade complexity.
Known Limitations
Current limitations include:
Limited ISA functionality (no logic or shift instructions yet)
No cache implementation
No architectural scoreboarding
No FPGA deployment flow
Simplified scheduling model
Simplified memory model
These limitations are expected and are tracked in the project roadmap.
Future Work
Planned future milestones include:
v0.2.0
Logic instructions
Shift instructions
ISA expansion
v0.3.0
Architectural scoreboarding
Improved verification
v0.4.0
Warp scheduling enhancements
Occupancy management
Long-Term
Cache hierarchy
SIMD execution
FPGA deployment
Performance analysis
Summary
This PR transitions Tiny GPU from a collection of RTL components into a complete executable GPU platform.
Key outcomes:
Complete GPU integration
Executable assembly workflow
Automated regression infrastructure
Expanded documentation
Stable baseline for future development
This milestone is intended to serve as the foundation for all future ISA, scheduling, memory-system, and verification enhancements.