A filesystem that stores data in GPU VRAM instead of system RAM.
Rust + CUDA. Error correction runs entirely on the GPU as custom kernels -- no CPU fallback, no host-side Reed-Solomon.
GPUs have 8-24GB of memory that mostly goes unused outside of rendering and ML. VRAM bandwidth is 900+ GB/s on Ada Lovelace vs ~50 GB/s for DDR5. vramfs exposes that memory as a block device with Reed-Solomon error correction -- scratch space, temp storage, anything where you want throughput and don't need persistence.
 Host (CPU)                           Device (GPU VRAM)
┌──────────┐      PCIe DMA     ┌──────────────────────────────┐
│   FUSE   │◄─────────────────►│  [block 0][block 1]...[N]    │
│  layer   │     async H2D     │  [ecc 0  ][ecc 1  ]...[N]    │
│  (WIP)   │     async D2H     │                              │
└──────────┘                   │  Stream Pool (per-thread)    │
                               │  RS Encode/Decode Kernels    │
                               └──────────────────────────────┘
Memory layout. A single cuMemAlloc call grabs the entire region upfront. Data blocks (128KB each) occupy the front; ECC parity (8KB per block, 6.25% overhead) occupies the tail. No fragmentation, no runtime allocation.
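The offset arithmetic this layout implies, as a sketch (the numbers come from the paragraph above; names are illustrative, the real definitions live in src/gpu/constants.rs):

```rust
// Illustrative constants; the real definitions live in src/gpu/constants.rs.
const BLOCK_SIZE: usize = 128 * 1024; // 128 KB data block
const ECC_SIZE: usize = 8 * 1024;     // 8 KB parity per block (6.25% overhead)

/// Offsets of block `i` inside the single contiguous allocation:
/// data blocks packed at the front, parity packed at the tail.
fn block_offsets(i: usize, num_blocks: usize) -> (usize, usize) {
    assert!(i < num_blocks);
    let data_offset = i * BLOCK_SIZE;
    let ecc_offset = num_blocks * BLOCK_SIZE + i * ECC_SIZE;
    (data_offset, ecc_offset)
}
```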
Block pool. Lock-free concurrent queue (crossbeam::SegQueue). Allocate = pop, free = push. ~100ns per op. RAII handles (BlockHandle) free blocks on drop.
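A minimal sketch of the allocate/free path described above (type and field names are illustrative; the real code is in src/gpu/pool.rs and src/gpu/handle.rs):

```rust
use std::sync::Arc;
use crossbeam::queue::SegQueue;

/// Sketch of the lock-free block allocator: free blocks are just indices in a queue.
struct BlockPool {
    free: SegQueue<usize>, // indices of free blocks in the GPU region
}

impl BlockPool {
    fn new(num_blocks: usize) -> Arc<Self> {
        let free = SegQueue::new();
        for i in 0..num_blocks {
            free.push(i);
        }
        Arc::new(Self { free })
    }

    /// Allocate = pop off the lock-free queue; None when the pool is exhausted.
    fn alloc(self: &Arc<Self>) -> Option<BlockHandle> {
        self.free
            .pop()
            .map(|index| BlockHandle { index, pool: Arc::clone(self) })
    }
}

/// RAII handle: dropping it returns the block to the pool.
struct BlockHandle {
    index: usize,
    pool: Arc<BlockPool>,
}

impl Drop for BlockHandle {
    fn drop(&mut self) {
        // Free = push: a single lock-free enqueue.
        self.pool.free.push(self.index);
    }
}
```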
Stream pool. Queries SM count and DMA engine count at init, pre-allocates CUDA streams accordingly. Per-thread caching via thread_local! (~5ns after first hit). Stream pairs let you overlap compute and DMA.
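A sketch of the per-thread caching; the round-robin slot assignment here is an illustrative assumption, and the real pool (which hands out actual CUDA stream pairs) lives in src/gpu/stream_pool.rs:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Global counter used only for the illustrative round-robin assignment.
static NEXT_SLOT: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Each thread remembers which stream slot it was assigned.
    static CACHED_SLOT: Cell<Option<usize>> = Cell::new(None);
}

/// Returns this thread's stream-pair index, assigning one on first use.
fn stream_slot(pool_len: usize) -> usize {
    CACHED_SLOT.with(|slot| {
        if let Some(i) = slot.get() {
            return i; // fast path after the first hit: no atomics, no locks
        }
        // First call on this thread: claim a slot from the pre-allocated pool.
        let i = NEXT_SLOT.fetch_add(1, Ordering::Relaxed) % pool_len;
        slot.set(Some(i));
        i
    })
}
```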
RS(128, 120) over GF(256). Each 128-byte cache line carries 120 data bytes and 8 parity bytes. Corrects up to 4 byte errors, detects up to 8.
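The correction/detection limits follow directly from the 8 parity symbols, since an RS code corrects up to (n - k) / 2 symbol errors; as constants (names illustrative):

```rust
// RS(n, k) over GF(256): 128 total symbols, 120 data symbols per codeword.
const RS_N: usize = 128;
const RS_K: usize = 120;
const RS_PARITY: usize = RS_N - RS_K;        // 8 parity symbols
const RS_MAX_CORRECT: usize = RS_PARITY / 2; // corrects up to 4 byte errors
const RS_MAX_DETECT: usize = RS_PARITY;      // detects up to 8
```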
Split between build-time Rust (table generation) and runtime CUDA (kernels):
Build time (build.rs):
- Generates GF(256) exp/log/inverse tables from the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (sketched, together with the generator polynomial, after this list)
- Computes the RS generator polynomial g(x) = (x - α^1)(x - α^2)...(x - α^8)
- Inverts the 8x8 Vandermonde matrix for the parallel encoder
- Precomputes alpha powers for Estrin's polynomial evaluation scheme
- Writes it all as __constant__ CUDA headers, compiles to PTX with nvcc
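A minimal sketch of the first two steps (table and generator-polynomial construction), assuming alpha = 0x02 as the primitive element; the real code lives in src/gpu/gf256_gen.rs and build.rs:

```rust
// Illustrative build-time sketch. The primitive polynomial x^8 + x^4 + x^3 + x^2 + 1
// reduces to the constant 0x1D once x^8 is shifted out.

/// exp/log tables: exp[i] = alpha^i, log[alpha^i] = i, with alpha = 0x02 (assumed).
fn build_tables() -> ([u8; 256], [u8; 256]) {
    let (mut exp, mut log) = ([0u8; 256], [0u8; 256]);
    let mut x: u8 = 1;
    for i in 0..255usize {
        exp[i] = x;
        log[x as usize] = i as u8;
        // Multiply by alpha = 0x02: shift left, reduce by 0x1D on overflow.
        let carry = x & 0x80;
        x <<= 1;
        if carry != 0 {
            x ^= 0x1D;
        }
    }
    exp[255] = exp[0]; // convenience wraparound
    (exp, log)
}

/// Table-based multiply, used only at build time (the kernels compute GF math inline).
fn gf_mul(a: u8, b: u8, exp: &[u8; 256], log: &[u8; 256]) -> u8 {
    if a == 0 || b == 0 {
        return 0;
    }
    exp[(log[a as usize] as usize + log[b as usize] as usize) % 255]
}

/// g(x) = (x - alpha^1)(x - alpha^2)...(x - alpha^8), coefficients low-to-high.
/// In GF(2^8) subtraction is XOR, so each factor behaves like (x + alpha^i).
fn generator_poly(exp: &[u8; 256], log: &[u8; 256]) -> Vec<u8> {
    let mut g = vec![1u8];
    for i in 1..=8usize {
        let root = exp[i];
        let mut next = vec![0u8; g.len() + 1];
        for (j, &c) in g.iter().enumerate() {
            next[j] ^= gf_mul(c, root, exp, log); // constant-term contribution
            next[j + 1] ^= c;                     // x * c contribution
        }
        g = next;
    }
    g
}
```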
Runtime (CUDA kernels):
| Kernel | What it does |
|---|---|
| rs_encode_batch_parallel | Warp-parallel encoder. 32 threads per cache line. Vectorized async loads, Horner + Estrin evaluation, warp shuffle matrix multiply. 86-95% ALU utilization. |
| rs_decode_kernel_parallel | Warp-parallel decoder. Vectorized syndrome computation, Berlekamp-Massey, parallel Chien search with __ballot_sync, parallel Forney. Corrects both data and parity. |
| rs_verify_and_refresh_batch_fused | Single-kernel verify + re-encode. Loads data once, corrects in shared memory, emits fresh parity. 30-40% faster than separate decode + encode. |
GF(256) arithmetic is computed inline (Russian Peasant multiplication, Fermat's little theorem for inversion) -- no lookup tables, no shared memory loads. This killed 96% of MIO stalls that the table-based approach had (measured with NCU).
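The same arithmetic rendered in Rust for illustration (the actual device code is galois_field_computed.cuh; this shows the algorithm, not the CUDA source):

```rust
/// Russian Peasant (shift-and-reduce) multiply in GF(256) under
/// x^8 + x^4 + x^3 + x^2 + 1 -- no tables, so no shared-memory or constant loads.
fn gf_mul_computed(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    while b != 0 {
        if b & 1 != 0 {
            p ^= a;        // "add" (XOR) the current shifted copy of a
        }
        let carry = a & 0x80;
        a <<= 1;
        if carry != 0 {
            a ^= 0x1D;     // reduce modulo the primitive polynomial
        }
        b >>= 1;
    }
    p
}

/// Inversion via Fermat's little theorem: a^(2^8 - 2) = a^254 = a^(-1) for a != 0.
fn gf_inv_computed(a: u8) -> u8 {
    let (mut result, mut base, mut exp) = (1u8, a, 254u32);
    while exp != 0 {
        if exp & 1 != 0 {
            result = gf_mul_computed(result, base);
        }
        base = gf_mul_computed(base, base);
        exp >>= 1;
    }
    result // returns 0 for a == 0, which has no inverse
}
```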
You need:
- Rust 1.70+
- NVIDIA CUDA Toolkit 12.0+ (nvcc on PATH)
- NVIDIA GPU (defaults to sm_89 / Ada Lovelace, works down to CC 3.0)
cargo build --release

build.rs generates GF(256) tables in Rust, writes them as CUDA headers, compiles ecc_kernels.cu to PTX with nvcc, and embeds the PTX in the binary via include_str!. To target a different GPU arch, change -arch=sm_89 in build.rs.
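The nvcc step looks roughly like the sketch below; flags other than -arch=sm_89 and the output path are illustrative assumptions, so check build.rs for the exact command:

```rust
use std::process::Command;

// Illustrative only -- the exact flags live in build.rs. -arch=sm_89 is the knob
// mentioned above for targeting a different GPU architecture.
fn compile_kernels(out_dir: &str) {
    let status = Command::new("nvcc")
        .args([
            "-ptx",                 // emit PTX instead of a device binary
            "-arch=sm_89",          // default target: Ada Lovelace
            "src/gpu/ecc_kernels.cu",
            "-o",
        ])
        .arg(format!("{out_dir}/ecc_kernels.ptx"))
        .status()
        .expect("failed to run nvcc (is the CUDA Toolkit on PATH?)");
    assert!(status.success(), "nvcc failed");
    println!("cargo:rerun-if-changed=src/gpu/ecc_kernels.cu");
}
```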
# Test GPU memory allocation and block pool
vramfs test --size 1G --ecc
# Test with specific GPU
vramfs test --size 512M --ecc --device 1

FUSE mounting (vramfs mount) is not yet implemented.
# Unit and integration tests (requires a CUDA-capable GPU)
cargo test --release
# Extended fuzzing tests (longer runs, stress tests)
cargo test --release --features extended-tests
# Benchmarks
cargo bench

# Build with NVTX instrumentation (zero overhead when disabled)
cargo build --release --features profile
# Profile with Nsight Systems
nsys profile --trace=cuda,nvtx ./target/release/profile_level4_pipeline
# Profile with Nsight Compute (kernel-level)
ncu --set full ./target/release/profile_level4_encoders

Dedicated profiling binaries: profile_level3_stream_pool, profile_level4_decoder, profile_level4_encoders, profile_level4_pipeline, profile_level4_gpu_parallelism.
src/
├── main.rs CLI entry point
├── lib.rs Public API re-exports
├── gpu/
│ ├── region.rs GPU memory region (single contiguous alloc)
│ ├── pool.rs Lock-free block allocator
│ ├── handle.rs RAII block handle with H2D/D2H transfers
│ ├── stream_pool.rs CUDA stream pool
│ ├── ecc.rs GPU ECC engine (kernel dispatch)
│ ├── ecc_engine.rs EccEngine trait
│ ├── ecc_kernels.cu CUDA kernel entry point
│ ├── galois_field_computed.cuh GF(256) computed arithmetic
│ ├── rs_encoder_universal.cuh Warp-parallel RS encoder
│ ├── rs_decoder_universal.cuh Warp-parallel RS decoder
│ ├── rs_fused_verify_refresh.cuh Fused verify + re-encode
│ ├── gf256_gen.rs Build-time table generation
│ ├── constants.rs Block sizes, RS parameters
│ └── profiling.rs NVTX instrumentation
├── └── bin/ Profiling binaries
build.rs GF(256) codegen + nvcc compilation
tests/ ECC correctness, fuzzing, edge cases
benches/ Criterion benchmarks
GPU memory management and the ECC pipeline work today; the FUSE filesystem layer is next.
- Done: memory allocation, block pool, stream pool, RS encode/decode/verify, fused verify+refresh, NVTX profiling, tests
- Next: FUSE integration, file/directory metadata, read/write syscall path
MIT