vramfs-rs

A filesystem that stores data in GPU VRAM instead of system RAM.

Rust + CUDA. Error correction runs entirely on the GPU as custom kernels -- no CPU fallback, no host-side Reed-Solomon.

Why

GPUs have 8-24GB of memory that mostly goes unused outside of rendering and ML. VRAM bandwidth is 900+ GB/s on Ada Lovelace vs ~50 GB/s for DDR5. vramfs exposes that memory as a filesystem with Reed-Solomon error correction -- scratch space, temp storage, anything where you want throughput and don't need persistence.

How it works

Host (CPU)                          Device (GPU VRAM)
┌──────────┐   PCIe DMA    ┌──────────────────────────────┐
│  FUSE    │◄─────────────►│  [block 0][block 1]...[N]    │
│  layer   │   async H2D   │  [ecc 0  ][ecc 1  ]...[N]    │
│  (WIP)   │   async D2H   │                              │
└──────────┘               │  Stream Pool (per-thread)    │
                           │  RS Encode/Decode Kernels    │
                           └──────────────────────────────┘

Memory layout. A single cuMemAlloc call grabs the entire region upfront. Data blocks (128KB each) occupy the front; ECC parity (8KB per block, 6.25% overhead) occupies the tail. No fragmentation, no runtime allocation.
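
The offset math is straightforward; here is a minimal Rust sketch (constant and type names are illustrative, only the sizes come from the text above):

// Layout of a single contiguous VRAM allocation: data blocks at the front,
// parity at the tail. Names are hypothetical, sizes are from the README.
const BLOCK_SIZE: usize = 128 * 1024; // 128KB data block
const ECC_SIZE: usize = 8 * 1024;     // 8KB parity per block (6.25% overhead)

struct RegionLayout {
    num_blocks: usize,
    data_bytes: usize,   // front of the allocation
    parity_bytes: usize, // tail of the allocation
}

fn layout(total_bytes: usize) -> RegionLayout {
    // Every block consumes its data plus its parity out of the one allocation.
    let num_blocks = total_bytes / (BLOCK_SIZE + ECC_SIZE);
    RegionLayout {
        num_blocks,
        data_bytes: num_blocks * BLOCK_SIZE,
        parity_bytes: num_blocks * ECC_SIZE,
    }
}

// Data for block i starts at i * BLOCK_SIZE; its parity starts at
// data_bytes + i * ECC_SIZE. No fragmentation, no runtime allocation.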

Block pool. Lock-free concurrent queue (crossbeam::SegQueue). Allocate = pop, free = push. ~100ns per op. RAII handles (BlockHandle) free blocks on drop.
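
A minimal sketch of the pop/push pattern with a drop-based handle (simplified; per the project layout, the real BlockHandle also drives H2D/D2H transfers):

use crossbeam::queue::SegQueue;
use std::sync::Arc;

// Simplified lock-free pool: the queue holds indices of free blocks.
struct BlockPool {
    free: SegQueue<usize>,
}

// RAII handle: owning one means the block is yours until the handle drops.
struct BlockHandle {
    index: usize,
    pool: Arc<BlockPool>,
}

impl BlockPool {
    fn allocate(pool: &Arc<BlockPool>) -> Option<BlockHandle> {
        // Allocate = pop a free index off the queue (lock-free, ~100ns).
        pool.free.pop().map(|index| BlockHandle { index, pool: Arc::clone(pool) })
    }
}

impl Drop for BlockHandle {
    fn drop(&mut self) {
        // Free = push the index back; drops can never leak a block.
        self.pool.free.push(self.index);
    }
}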

Stream pool. Queries SM count and DMA engine count at init, pre-allocates CUDA streams accordingly. Per-thread caching via thread_local! (~5ns after first hit). Stream pairs let you overlap compute and DMA.
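
The caching side looks roughly like this (the StreamPair/CudaStream types below are stand-ins for whatever stream wrapper the crate uses; only the thread_local pattern is the point):

use std::cell::OnceCell;

// Stand-in types: a compute stream plus a DMA stream per thread.
struct CudaStream(usize /* raw stream handle */);
struct StreamPair { compute: CudaStream, dma: CudaStream }

thread_local! {
    // First access acquires a pair from the global pool; later accesses hit
    // the cached copy (~5ns) with no synchronization.
    static STREAMS: OnceCell<StreamPair> = OnceCell::new();
}

fn with_streams<R>(f: impl FnOnce(&StreamPair) -> R) -> R {
    STREAMS.with(|cell| f(cell.get_or_init(acquire_pair_from_global_pool)))
}

fn acquire_pair_from_global_pool() -> StreamPair {
    // Placeholder: the real pool sizes itself from SM and DMA engine counts.
    StreamPair { compute: CudaStream(0), dma: CudaStream(1) }
}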

Reed-Solomon ECC

RS(128, 120) over GF(256). Each 128-byte cache line carries 120 data bytes and 8 parity bytes. Corrects up to 4 byte errors, detects up to 8.
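
In standard Reed-Solomon terms (constant names here are illustrative, not the crate's):

const RS_N: usize = 128;              // codeword length: one 128-byte cache line
const RS_K: usize = 120;              // data symbols per codeword
const RS_PARITY: usize = RS_N - RS_K; // 8 parity symbols
const RS_T: usize = RS_PARITY / 2;    // corrects up to 4 byte errors
const RS_DETECT: usize = RS_PARITY;   // detects up to 8 byte errors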

Split between build-time Rust (table generation) and runtime CUDA (kernels):

Build time (build.rs):

  • Generates GF(256) exp/log/inverse tables from the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (see the sketch after this list)
  • Computes the RS generator polynomial g(x) = (x - α^1)(x - α^2)...(x - α^8)
  • Inverts the 8x8 Vandermonde matrix for the parallel encoder
  • Precomputes alpha powers for Estrin's polynomial evaluation scheme
  • Writes it all as __constant__ CUDA headers, compiles to PTX with nvcc
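
A minimal Rust sketch of the table-generation step (function and constant names are illustrative; the project's version lives in gf256_gen.rs):

// Build GF(256) exp/log tables from the primitive polynomial
// x^8 + x^4 + x^3 + x^2 + 1, i.e. 0x11D with the implicit x^8 bit.
const PRIM_POLY: u16 = 0x11D;

fn generate_tables() -> ([u8; 256], [u8; 256]) {
    let (mut exp, mut log) = ([0u8; 256], [0u8; 256]);
    let mut x: u16 = 1; // alpha^0
    for i in 0..255 {
        exp[i] = x as u8;
        log[x as usize] = i as u8;
        // Multiply by alpha (= 2) and reduce modulo the primitive polynomial.
        x <<= 1;
        if x & 0x100 != 0 {
            x ^= PRIM_POLY;
        }
    }
    exp[255] = exp[0]; // alpha^255 == 1, so indexing never needs a mod 255
    // The inverse table follows: inv[v] = exp[255 - log[v] as usize] for v != 0.
    (exp, log)
}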

Runtime (CUDA kernels):

  • rs_encode_batch_parallel -- Warp-parallel encoder. 32 threads per cache line. Vectorized async loads, Horner + Estrin evaluation, warp shuffle matrix multiply. 86-95% ALU utilization.
  • rs_decode_kernel_parallel -- Warp-parallel decoder. Vectorized syndrome computation, Berlekamp-Massey, parallel Chien search with __ballot_sync, parallel Forney. Corrects both data and parity.
  • rs_verify_and_refresh_batch_fused -- Single-kernel verify + re-encode. Loads data once, corrects in shared memory, emits fresh parity. 30-40% faster than separate decode + encode.

GF(256) arithmetic is computed inline (Russian Peasant multiplication, Fermat's little theorem for inversion) -- no lookup tables, no shared memory loads. This killed 96% of MIO stalls that the table-based approach had (measured with NCU).
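
For orientation, computed GF(256) arithmetic on the device looks roughly like this (a sketch, not the code in galois_field_computed.cuh; function names are illustrative):

// Russian Peasant multiplication in GF(256), reduced by the primitive
// polynomial 0x11D -- pure ALU work, no table lookups, no shared memory.
__device__ __forceinline__ unsigned int gf_mul(unsigned int a, unsigned int b) {
    unsigned int p = 0;
    #pragma unroll
    for (int i = 0; i < 8; ++i) {
        p ^= (b & 1u) ? a : 0u;          // add a if the low bit of b is set
        unsigned int carry = a & 0x80u;  // would the shift produce an x^8 term?
        a = (a << 1) & 0xFFu;
        if (carry) a ^= 0x1Du;           // reduce: x^8 = x^4 + x^3 + x^2 + 1
        b >>= 1;
    }
    return p;
}

// Inversion via Fermat's little theorem: a^(2^8 - 2) = a^254 = a^-1 for a != 0.
__device__ __forceinline__ unsigned int gf_inv(unsigned int a) {
    unsigned int result = 1, base = a, e = 254;
    while (e) {
        if (e & 1u) result = gf_mul(result, base);
        base = gf_mul(base, base);
        e >>= 1;
    }
    return result;
}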

Building

You need:

  • Rust 1.70+
  • NVIDIA CUDA Toolkit 12.0+ (nvcc on PATH)
  • NVIDIA GPU (kernels default to sm_89 / Ada Lovelace; other architectures require changing the arch flag in build.rs, within what your toolkit supports)
cargo build --release

build.rs generates GF(256) tables in Rust, writes them as CUDA headers, compiles ecc_kernels.cu to PTX with nvcc, and embeds the PTX in the binary via include_str!. To target a different GPU arch, change -arch=sm_89 in build.rs.
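
For orientation, the nvcc step looks roughly like this (a sketch; the source path matches the project tree, but the exact flags and helper name are illustrative):

// Sketch of build.rs's PTX compilation; the real script also writes the
// generated GF(256) tables out as __constant__ headers before this step.
use std::{env, path::PathBuf, process::Command};

fn compile_kernels() {
    let out = PathBuf::from(env::var("OUT_DIR").unwrap());
    let status = Command::new("nvcc")
        .args(["-ptx", "-O3", "-arch=sm_89"]) // change -arch here for other GPUs
        .arg("src/gpu/ecc_kernels.cu")
        .arg("-o")
        .arg(out.join("ecc_kernels.ptx"))
        .status()
        .expect("nvcc not found on PATH");
    assert!(status.success(), "nvcc failed");
    println!("cargo:rerun-if-changed=src/gpu/ecc_kernels.cu");
}

// The crate then embeds the output with
// include_str!(concat!(env!("OUT_DIR"), "/ecc_kernels.ptx")).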

Usage

# Test GPU memory allocation and block pool
vramfs test --size 1G --ecc

# Test with specific GPU
vramfs test --size 512M --ecc --device 1

FUSE mounting (vramfs mount) is not yet implemented.

Running tests

# Unit and integration tests (requires a CUDA-capable GPU)
cargo test --release

# Extended fuzzing tests (longer runs, stress tests)
cargo test --release --features extended-tests

# Benchmarks
cargo bench

Profiling

# Build with NVTX instrumentation (zero overhead when disabled)
cargo build --release --features profile

# Profile with Nsight Systems
nsys profile --trace=cuda,nvtx ./target/release/profile_level4_pipeline

# Profile with Nsight Compute (kernel-level)
ncu --set full ./target/release/profile_level4_encoders

Dedicated profiling binaries: profile_level3_stream_pool, profile_level4_decoder, profile_level4_encoders, profile_level4_pipeline, profile_level4_gpu_parallelism.

Project structure

src/
├── main.rs                    CLI entry point
├── lib.rs                     Public API re-exports
├── gpu/
│   ├── region.rs              GPU memory region (single contiguous alloc)
│   ├── pool.rs                Lock-free block allocator
│   ├── handle.rs              RAII block handle with H2D/D2H transfers
│   ├── stream_pool.rs         CUDA stream pool
│   ├── ecc.rs                 GPU ECC engine (kernel dispatch)
│   ├── ecc_engine.rs          EccEngine trait
│   ├── ecc_kernels.cu         CUDA kernel entry point
│   ├── galois_field_computed.cuh   GF(256) computed arithmetic
│   ├── rs_encoder_universal.cuh    Warp-parallel RS encoder
│   ├── rs_decoder_universal.cuh    Warp-parallel RS decoder
│   ├── rs_fused_verify_refresh.cuh Fused verify + re-encode
│   ├── gf256_gen.rs           Build-time table generation
│   ├── constants.rs           Block sizes, RS parameters
│   └── profiling.rs           NVTX instrumentation
└── bin/                       Profiling binaries
build.rs                       GF(256) codegen + nvcc compilation
tests/                         ECC correctness, fuzzing, edge cases
benches/                       Criterion benchmarks

Status

GPU memory management and ECC work. FUSE filesystem layer is next.

  • Done: memory allocation, block pool, stream pool, RS encode/decode/verify, fused verify+refresh, NVTX profiling, tests
  • Next: FUSE integration, file/directory metadata, read/write syscall path

License

MIT
