A filesystem that stores data in GPU VRAM instead of system RAM.
Rust + CUDA. Error correction runs entirely on the GPU as custom kernels -- no CPU fallback, no host-side Reed-Solomon.
GPUs have 8-24GB of memory that mostly goes unused outside of rendering and ML. VRAM bandwidth is 900+ GB/s on Ada Lovelace vs ~50 GB/s for DDR5. vramfs exposes that memory as a block device with Reed-Solomon error correction -- scratch space, temp storage, anything where you want throughput and don't need persistence.
 Host (CPU)                           Device (GPU VRAM)
┌──────────┐      PCIe DMA     ┌──────────────────────────────┐
│   FUSE   │◄─────────────────►│  [block 0][block 1]...[N]    │
│  layer   │     async H2D     │  [ecc 0  ][ecc 1  ]...[N]    │
│  (WIP)   │     async D2H     │                              │
└──────────┘                   │  Stream Pool (per-thread)    │
                               │  RS Encode/Decode Kernels    │
                               └──────────────────────────────┘
Memory layout. A single cuMemAlloc call grabs the entire region upfront. Data blocks (128KB each) occupy the front; ECC parity (8KB per block, 6.25% overhead) occupies the tail. No fragmentation, no runtime allocation.
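The offset arithmetic this layout implies, as a sketch (the numbers come from the paragraph above; names are illustrative, the real definitions live in src/gpu/constants.rs):

```rust
// Illustrative constants; the real definitions live in src/gpu/constants.rs.
const BLOCK_SIZE: usize = 128 * 1024; // 128 KB data block
const ECC_SIZE: usize = 8 * 1024;     // 8 KB parity per block (6.25% overhead)

/// Offsets of block `i` inside the single contiguous allocation:
/// data blocks packed at the front, parity packed at the tail.
fn block_offsets(i: usize, num_blocks: usize) -> (usize, usize) {
    assert!(i < num_blocks);
    let data_offset = i * BLOCK_SIZE;
    let ecc_offset = num_blocks * BLOCK_SIZE + i * ECC_SIZE;
    (data_offset, ecc_offset)
}
```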
Block pool. Lock-free concurrent queue (crossbeam::SegQueue). Allocate = pop, free = push. ~100ns per op. RAII handles (BlockHandle) free blocks on drop.
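A minimal sketch of the allocate/free path described above (type and field names are illustrative; the real code is in src/gpu/pool.rs and src/gpu/handle.rs):

```rust
use std::sync::Arc;
use crossbeam::queue::SegQueue;

/// Sketch of the lock-free block allocator: free blocks are just indices in a queue.
struct BlockPool {
    free: SegQueue<usize>, // indices of free blocks in the GPU region
}

impl BlockPool {
    fn new(num_blocks: usize) -> Arc<Self> {
        let free = SegQueue::new();
        for i in 0..num_blocks {
            free.push(i);
        }
        Arc::new(Self { free })
    }

    /// Allocate = pop off the lock-free queue; None when the pool is exhausted.
    fn alloc(self: &Arc<Self>) -> Option<BlockHandle> {
        self.free
            .pop()
            .map(|index| BlockHandle { index, pool: Arc::clone(self) })
    }
}

/// RAII handle: dropping it returns the block to the pool.
struct BlockHandle {
    index: usize,
    pool: Arc<BlockPool>,
}

impl Drop for BlockHandle {
    fn drop(&mut self) {
        // Free = push: a single lock-free enqueue.
        self.pool.free.push(self.index);
    }
}
```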
Stream pool. Queries SM count and DMA engine count at init, pre-allocates CUDA streams accordingly. Per-thread caching via thread_local! (~5ns after first hit). Stream pairs let you overlap compute and DMA.
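A sketch of the per-thread caching; the round-robin slot assignment here is an illustrative assumption, and the real pool (which hands out actual CUDA stream pairs) lives in src/gpu/stream_pool.rs:

```rust
use std::cell::Cell;
use std::sync::atomic::{AtomicUsize, Ordering};

// Global counter used only for the illustrative round-robin assignment.
static NEXT_SLOT: AtomicUsize = AtomicUsize::new(0);

thread_local! {
    // Each thread remembers which stream slot it was assigned.
    static CACHED_SLOT: Cell<Option<usize>> = Cell::new(None);
}

/// Returns this thread's stream-pair index, assigning one on first use.
fn stream_slot(pool_len: usize) -> usize {
    CACHED_SLOT.with(|slot| {
        if let Some(i) = slot.get() {
            return i; // fast path after the first hit: no atomics, no locks
        }
        // First call on this thread: claim a slot from the pre-allocated pool.
        let i = NEXT_SLOT.fetch_add(1, Ordering::Relaxed) % pool_len;
        slot.set(Some(i));
        i
    })
}
```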
RS(128, 120) over GF(256). Each 128-byte cache line carries 120 data bytes and 8 parity bytes. Corrects up to 4 byte errors, detects up to 8.
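The correction/detection limits follow directly from the 8 parity symbols, since an RS code corrects up to (n - k) / 2 symbol errors; as constants (names illustrative):

```rust
// RS(n, k) over GF(256): 128 total symbols, 120 data symbols per codeword.
const RS_N: usize = 128;
const RS_K: usize = 120;
const RS_PARITY: usize = RS_N - RS_K;        // 8 parity symbols
const RS_MAX_CORRECT: usize = RS_PARITY / 2; // corrects up to 4 byte errors
const RS_MAX_DETECT: usize = RS_PARITY;      // detects up to 8
```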
Split between build-time Rust (table generation) and runtime CUDA (kernels):
Build time (build.rs):
- Generates GF(256) exp/log/inverse tables from the primitive polynomial x^8 + x^4 + x^3 + x^2 + 1 (sketched, together with the generator polynomial, after this list)
- Computes the RS generator polynomial g(x) = (x - α^1)(x - α^2)...(x - α^8)
- Inverts the 8x8 Vandermonde matrix for the parallel encoder
- Precomputes alpha powers for Estrin's polynomial evaluation scheme
- Writes it all as __constant__ CUDA headers, compiles to PTX with nvcc
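A minimal sketch of the first two steps (table and generator-polynomial construction), assuming alpha = 0x02 as the primitive element; the real code lives in src/gpu/gf256_gen.rs and build.rs:

```rust
// Illustrative build-time sketch. The primitive polynomial x^8 + x^4 + x^3 + x^2 + 1
// reduces to the constant 0x1D once x^8 is shifted out.

/// exp/log tables: exp[i] = alpha^i, log[alpha^i] = i, with alpha = 0x02 (assumed).
fn build_tables() -> ([u8; 256], [u8; 256]) {
    let (mut exp, mut log) = ([0u8; 256], [0u8; 256]);
    let mut x: u8 = 1;
    for i in 0..255usize {
        exp[i] = x;
        log[x as usize] = i as u8;
        // Multiply by alpha = 0x02: shift left, reduce by 0x1D on overflow.
        let carry = x & 0x80;
        x <<= 1;
        if carry != 0 {
            x ^= 0x1D;
        }
    }
    exp[255] = exp[0]; // convenience wraparound
    (exp, log)
}

/// Table-based multiply, used only at build time (the kernels compute GF math inline).
fn gf_mul(a: u8, b: u8, exp: &[u8; 256], log: &[u8; 256]) -> u8 {
    if a == 0 || b == 0 {
        return 0;
    }
    exp[(log[a as usize] as usize + log[b as usize] as usize) % 255]
}

/// g(x) = (x - alpha^1)(x - alpha^2)...(x - alpha^8), coefficients low-to-high.
/// In GF(2^8) subtraction is XOR, so each factor behaves like (x + alpha^i).
fn generator_poly(exp: &[u8; 256], log: &[u8; 256]) -> Vec<u8> {
    let mut g = vec![1u8];
    for i in 1..=8usize {
        let root = exp[i];
        let mut next = vec![0u8; g.len() + 1];
        for (j, &c) in g.iter().enumerate() {
            next[j] ^= gf_mul(c, root, exp, log); // constant-term contribution
            next[j + 1] ^= c;                     // x * c contribution
        }
        g = next;
    }
    g
}
```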
Runtime (CUDA kernels):
| Kernel | What it does |
|---|---|
| rs_encode_batch_parallel | Warp-parallel encoder. 32 threads per cache line. Vectorized async loads, Horner + Estrin evaluation, warp shuffle matrix multiply. 86-95% ALU utilization. |
| rs_decode_kernel_parallel | Warp-parallel decoder. Vectorized syndrome computation, Berlekamp-Massey, parallel Chien search with __ballot_sync, parallel Forney. Corrects both data and parity. |
| rs_verify_and_refresh_batch_fused | Single-kernel verify + re-encode. Loads data once, corrects in shared memory, emits fresh parity. 30-40% faster than separate decode + encode. |
GF(256) arithmetic is computed inline (Russian Peasant multiplication, Fermat's little theorem for inversion) -- no lookup tables, no shared memory loads. This killed 96% of MIO stalls that the table-based approach had (measured with NCU).
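The same arithmetic rendered in Rust for illustration (the actual device code is galois_field_computed.cuh; this shows the algorithm, not the CUDA source):

```rust
/// Russian Peasant (shift-and-reduce) multiply in GF(256) under
/// x^8 + x^4 + x^3 + x^2 + 1 -- no tables, so no shared-memory or constant loads.
fn gf_mul_computed(mut a: u8, mut b: u8) -> u8 {
    let mut p = 0u8;
    while b != 0 {
        if b & 1 != 0 {
            p ^= a;        // "add" (XOR) the current shifted copy of a
        }
        let carry = a & 0x80;
        a <<= 1;
        if carry != 0 {
            a ^= 0x1D;     // reduce modulo the primitive polynomial
        }
        b >>= 1;
    }
    p
}

/// Inversion via Fermat's little theorem: a^(2^8 - 2) = a^254 = a^(-1) for a != 0.
fn gf_inv_computed(a: u8) -> u8 {
    let (mut result, mut base, mut exp) = (1u8, a, 254u32);
    while exp != 0 {
        if exp & 1 != 0 {
            result = gf_mul_computed(result, base);
        }
        base = gf_mul_computed(base, base);
        exp >>= 1;
    }
    result // returns 0 for a == 0, which has no inverse
}
```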
You need:
- Rust 1.70+
- NVIDIA CUDA Toolkit 12.0+ (nvcc on PATH)
- NVIDIA GPU (defaults to sm_89 / Ada Lovelace, works down to CC 3.0)
cargo build --release

build.rs generates GF(256) tables in Rust, writes them as CUDA headers, compiles ecc_kernels.cu to PTX with nvcc, and embeds the PTX in the binary via include_str!. To target a different GPU arch, change -arch=sm_89 in build.rs.
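The nvcc step looks roughly like the sketch below; flags other than -arch=sm_89 and the output path are illustrative assumptions, so check build.rs for the exact command:

```rust
use std::process::Command;

// Illustrative only -- the exact flags live in build.rs. -arch=sm_89 is the knob
// mentioned above for targeting a different GPU architecture.
fn compile_kernels(out_dir: &str) {
    let status = Command::new("nvcc")
        .args([
            "-ptx",                 // emit PTX instead of a device binary
            "-arch=sm_89",          // default target: Ada Lovelace
            "src/gpu/ecc_kernels.cu",
            "-o",
        ])
        .arg(format!("{out_dir}/ecc_kernels.ptx"))
        .status()
        .expect("failed to run nvcc (is the CUDA Toolkit on PATH?)");
    assert!(status.success(), "nvcc failed");
    println!("cargo:rerun-if-changed=src/gpu/ecc_kernels.cu");
}
```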
# Test GPU memory allocation and block pool
vramfs test --size 1G --ecc
# Test with specific GPU
vramfs test --size 512M --ecc --device 1

FUSE mounting (vramfs mount) is not yet implemented.
# Unit and integration tests (requires a CUDA-capable GPU)
cargo test --release
# Extended fuzzing tests (longer runs, stress tests)
cargo test --release --features extended-tests
# Benchmarks
cargo bench

# Build with NVTX instrumentation (zero overhead when disabled)
cargo build --release --features profile
# Profile with Nsight Systems
nsys profile --trace=cuda,nvtx ./target/release/profile_level4_pipeline
# Profile with Nsight Compute (kernel-level)
ncu --set full ./target/release/profile_level4_encoders

Dedicated profiling binaries: profile_level3_stream_pool, profile_level4_decoder, profile_level4_encoders, profile_level4_pipeline, profile_level4_gpu_parallelism.
src/
├── main.rs CLI entry point
├── lib.rs Public API re-exports
├── gpu/
│ ├── region.rs GPU memory region (single contiguous alloc)
│ ├── pool.rs Lock-free block allocator
│ ├── handle.rs RAII block handle with H2D/D2H transfers
│ ├── stream_pool.rs CUDA stream pool
│ ├── ecc.rs GPU ECC engine (kernel dispatch)
│ ├── ecc_engine.rs EccEngine trait
│ ├── ecc_kernels.cu CUDA kernel entry point
│ ├── galois_field_computed.cuh GF(256) computed arithmetic
│ ├── rs_encoder_universal.cuh Warp-parallel RS encoder
│ ├── rs_decoder_universal.cuh Warp-parallel RS decoder
│ ├── rs_fused_verify_refresh.cuh Fused verify + re-encode
│ ├── gf256_gen.rs Build-time table generation
│ ├── constants.rs Block sizes, RS parameters
│ └── profiling.rs NVTX instrumentation
├── └── bin/ Profiling binaries
build.rs GF(256) codegen + nvcc compilation
tests/ ECC correctness, fuzzing, edge cases
benches/ Criterion benchmarks
GPU memory management and the ECC pipeline work today; the FUSE filesystem layer is next.
- Done: memory allocation, block pool, stream pool, RS encode/decode/verify, fused verify+refresh, NVTX profiling, tests
- Next: FUSE integration, file/directory metadata, read/write syscall path
MIT