Version: 1.1 Status: Draft for review (Updated: 2026-01-16)
Note: For detailed TDD test plans for each implementation task, see tasks.md which contains 58 tasks with 400+ test cases.
Primary goals:
- Correctness across ops and backends
- Stable latency and throughput in production-like workloads
- Safe fallback and clear diagnostics on failures
- Portability across hardware/software matrices
Release gates:
- Zero correctness regressions vs reference
- Performance within thresholds or explicit waivers
- No crash, leak, or health breaker regressions in soak tests
- Capabilities and dispatch tables validate against schema
In scope:
- Tokenization, positional ops, attention (prefill/decode/paged), MLP/GEMM, norms, sampling, quantization, and dtype conversions
- Backends: FlashAttention, FlashInfer, xFormers, Triton/Liger, oneDNN, ZenDNN, HF kernels, and PyTorch fallbacks
- CUDA graphs and torch.compile integration paths
Out of scope:
- End-user application correctness beyond kernel orchestration
- Non-deterministic training correctness (forward-only is primary)
Hardware (detailed):
- NVIDIA:
- SM80 (A100), SM86 (A10/RTX 30), SM89 (L4/RTX 40), SM90 (H100), SM100+ (when available)
- Verify BF16/FP16/TF32/FP8 support per device and driver
- AMD ROCm:
- MI200 (gfx90a), MI300 (gfx940/gfx942)
- CPU:
- Intel Ice Lake (AVX-512), Sapphire Rapids (AMX BF16/INT8)
- AMD EPYC Zen3 (AVX2), Zen4 (AVX-512)
- Habana:
- Gaudi2/Gaudi3 where supported
Software (detailed):
- OS: Ubuntu 20.04/22.04, Rocky 8/9 (container and bare metal)
- Python: 3.9, 3.10, 3.11, 3.12
- PyTorch: 2.1-2.4 (stable) plus a nightly lane for early warnings
- CUDA: 11.8, 12.1, 12.4; cuDNN 8/9 where applicable
- ROCm: 5.7, 6.0
- Triton: 2.1, 2.2, 2.3
- Backend versions pinned to the compatibility matrix
Backend isolation modes:
- in_process (single CUDA/ROCm policy)
- subprocess (conflicting stacks)
Precision:
- FP32, TF32, FP16, BF16, INT8, FP8 (E4M3/E5M2), FP4 variants
Correctness baseline:
- PyTorch reference ops for attention, norms, GEMM, and sampling
- Tokenizers: HF tokenizers, tiktoken, sentencepiece for their own outputs
All correctness tests compare against reference outputs with tolerances below.
| Op Category | FP16 | BF16 | FP32 | INT8/FP8/FP4 |
|---|---|---|---|---|
| Attention | rtol=1e-3, atol=1e-3 | rtol=1e-2, atol=1e-2 | rtol=1e-5, atol=1e-5 | exact or rtol=1e-2 where defined |
| Norms | rtol=1e-3, atol=1e-3 | rtol=1e-2, atol=1e-2 | rtol=1e-6, atol=1e-6 | exact for int8 where applicable |
| GEMM/MLP | rtol=1e-3, atol=1e-3 | rtol=1e-2, atol=1e-2 | rtol=1e-6, atol=1e-6 | backend-defined error bounds |
| RoPE/ALiBi | rtol=1e-4, atol=1e-4 | rtol=1e-3, atol=1e-3 | rtol=1e-6, atol=1e-6 | not applicable |
| Sampling | deterministic with fixed seed | deterministic with fixed seed | deterministic with fixed seed | deterministic with fixed seed |
| Tokenizers | exact match | exact match | exact match | exact match |
Notes:
- For quantized backends, specify exact bounds per backend in capabilities.json.
- Sampling is validated by deterministic seed and distribution invariants.
Inputs:
- Vary seq_len: [128, 512, 2048, 8192, 32768]
- Heads: H=32, Hkv=8 (GQA)
- Dtypes: fp16, bf16
Pass criteria:
- Output matches reference within tolerance
- Kernel selection uses valid backends for each shape
- No fallback for supported shapes when preferred backends available
Inputs:
- Paged KV cache with variable context_lens
- block_tables dtype int32 and int64
Pass criteria:
- Output matches reference within tolerance
- KV metadata dtype mismatches are rejected with clear reason codes
Steps:
- Tokenize input texts with tokenizer_id + hashes
- Run attention + MLP + norm blocks
- Sample with top-k/top-p
Pass criteria:
- Tokenization ids and offsets match reference exactly
- Sampling deterministic with fixed seed
- All caches and dispatch tables validated
Inputs:
- Cross-attention masks, long context
Pass criteria:
- Output matches reference within tolerance
- Mask + is_causal is rejected where invalid
Inputs:
- INT8/FP8/FP4 quantized weights and activations
- Per-channel and per-block scales
Pass criteria:
- Outputs match quantized reference within defined bounds
- Scale layout mismatches produce clear reason codes
Inputs:
- RoPE interleaved vs non-interleaved layout
- ALiBi bias with long context
Pass criteria:
- Fused kernels match reference within tolerance
- Non-fused fallback matches reference when fusion unavailable
Inputs:
- Tokenization + attention + MLP + norms on CPU only
Pass criteria:
- Correct fallback selection across all ops
- No GPU-specific code paths executed
Inputs:
- Texts with unicode normalization edge cases
- Added special tokens and custom vocab merges
Pass criteria:
- Exact token ids and offsets vs reference tokenizer
- Hash mismatch invalidates cache entries
Inputs:
- ALiBi or relative bias masks
- Block-sparse masks (if not supported, expect fallback)
Pass criteria:
- Unsupported bias/mask paths fall back safely
- Outputs match reference within tolerance
Inputs:
- Two models with different tokenizer_id and vocab/merges hashes
- Interleaved requests across models
Pass criteria:
- No cache contamination across models
- Selection and tokenization results remain model-specific
Inputs:
- Ragged batch with variable sequence lengths
- Nested tensors where supported (or explicit fallback)
Pass criteria:
- Correct outputs or explicit fallback where nested inputs are unsupported
- Clear error messages for unsupported ragged formats
Inputs:
- kv_cache dtype mismatch (fp16 vs bf16)
- kv_cache layout differences (NHD/HND vs backend expectation)
Pass criteria:
- Explicit dequant/cast when allowed
- KV cache mismatch rejected with reason codes
Inputs:
- Encoder seq_len: [128, 512, 2048]
- Decoder with kv-cache enabled and disabled
- Padding masks with variable lengths
Pass criteria:
- Output matches reference within tolerance
- Cross-attention kernels selected with correct mask handling
- Encoder/decoder cache separation is preserved
Inputs:
- Patch embedding + attention + MLP
- Channels-last and contiguous inputs
Pass criteria:
- Correct outputs for vision encoder blocks
- Layout-sensitive kernels selected only when constraints are met
Inputs:
- Embedding lookup + RMSNorm/LayerNorm + MLP (GELU/SiLU)
- Mixed dtype activations
Pass criteria:
- Kernel selection covers non-attention ops
- Fused kernels used when available, fallback otherwise
Inputs:
- Top-k gating with varying expert counts
- Grouped GEMM paths where available
Pass criteria:
- Deterministic expert routing with fixed seed
- Dispatch kernels selected within tolerance and memory bounds
Inputs:
- Draft + target model pair
- Acceptance rate stress (low/high)
Pass criteria:
- Per-model kernel selection remains isolated
- Acceptance logic yields correct token sequences
Inputs:
- Window sizes [256, 1024]
- Long context with sliding window masks
Pass criteria:
- Correct outputs or explicit fallback for unsupported windows
- No illegal mask/bias combinations accepted
Inputs:
- RoPE scaling (NTK, YaRN), XPos
- Interleaved vs non-interleaved layouts
Pass criteria:
- Correct outputs within tolerance for each variant
- Unsupported variants trigger clear fallback
Pass criteria:
- Fused ops selected when available and beneficial
- Transform penalties steer selection to lower total cost
- SelectionReport includes transform_cost_total and fused_ops
Pass criteria:
- lz.solve emits decision tree for bucketed seq_len/batch
- Bucket miss triggers runtime selection with PLAN_BUCKET_MISS reason
Pass criteria:
- strict_mode rejects implicit layout/dtype transforms
- allow_adaptation performs transforms with explicit cost accounting
Pass criteria:
- prefer/avoid sources honored deterministically
- explicit locks override scoring
- policy_hash changes invalidate cached selections
Pass criteria:
- capabilities_hash changes invalidate cached selections
- backend version changes invalidate PerfDB entries
Pass criteria:
- PLAN_BUCKET_MISS triggers runtime selection
- Runtime selection honors strict_mode and policy rules
Pass criteria:
- Corrupted or partial PerfDB entries are ignored and rebuilt
- High-variance measurements are down-weighted or excluded
Pass criteria:
- Hardware signature or driver change invalidates baked plans
- Policy hash mismatch triggers re-solve
Pass criteria:
- Kernels that allocate or synchronize are rejected
- Dummy capture validation passes for graph-safe kernels
Pass criteria:
- No graph breaks in LayerZero ops
- Meta kernels produce correct shapes and dtypes
Pass criteria:
- Graph capture succeeds on non-default streams
- Graph-safe selections are cached for capture lifetime
Pass criteria:
- Kernels that allocate or synchronize are rejected in graph_strict_mode
- Clear error when no graph-safe kernel exists
Pass criteria:
- Export graphs include correct meta kernels and shapes
- Runtime selection honors exported constraints or fails clearly
Pass criteria:
- In-process mode rejects incompatible CUDA versions
- Clear diagnostics for ABI conflicts
Pass criteria:
- Conflicting backends can run concurrently in subprocess mode
- IPC preserves tensor metadata and error reporting
Pass criteria:
- Reported compatibility matches actual backend availability
- Mismatches are flagged before deployment
Pass criteria:
- Device-specific backends never cross devices
- Clear diagnostics when incompatible backends are loaded together
Pass criteria:
- lz.warmup compiles all required kernels for specified shapes
- No JIT compilation on first request after warmup
Pass criteria:
- JIT cache reused across restarts
- Cache invalidated on backend version changes
Pass criteria:
- Multiple worker processes reuse shared cache without recompiling
- Cache contention does not corrupt artifacts
Pass criteria:
- lz.solve triggers JIT for all bucketed shapes
- No compilation on first request within a bucket
Pass criteria:
- Failed compilation disables the kernel cleanly
- Fallback path executes without repeated failures
Pass criteria:
- Selection avoids kernels exceeding headroom
- MEMORY_HEADROOM_EXCEEDED reason recorded
Pass criteria:
- No memory leaks or unbounded cache growth
- Backend health circuit breaker behaves as expected
Pass criteria:
- Repeated allocations do not degrade performance or cause unexpected OOM
- Selection avoids kernels with large temporary workspaces
Pass criteria:
- On OOM, selection falls back to a lower-memory kernel when possible
- Clear error when no safe fallback exists
Pass criteria:
- Selection adapts when other processes consume device memory
- No unbounded retry loops under pressure
Pass criteria:
- Selection logs include kernel_id, reasons, transform_cost_total
- OTel spans for selection and dispatch (if enabled)
- lz.readiness_check reports schema validity and cache status
Additional checks:
- lz.which/lz.explain output consistent with selection results
- lz.list_kernels and lz.doctor report correct backend status
- Logs are redacted for sensitive tokens when configured
Scenarios:
- Backend import failure
- CUDA illegal memory access
- Invalid capabilities.json or dispatch_table.json
- attn_mask + is_causal invalid combination
- Graph unsafe kernel selected under graph_strict_mode
- PLAN_BUCKET_MISS on unseen shape
- KV cache metadata dtype/layout mismatch
- Tokenizer hash mismatch (vocab/merges/special tokens)
- PerfDB corruption or missing entries
- Backend hang/timeout and recovery
- Device lost/reset during execution
Pass criteria:
- Backend disabled safely
- Fallback triggered with clear reason codes
- Readiness check fails fast for invalid schemas
- Invalid inputs rejected with explicit reason codes
Artifacts:
- capabilities.json and hash
- dispatch_table.json and schema validation report
- perfdb snapshots
- selection logs and telemetry traces
Scenarios:
- Kernel hub allowlist enforcement
- Lockfile pinning validation
- Optional signature verification (when supported)
- Path traversal attempts in hub downloads
- Artifact tamper detection (checksums)
Pass criteria:
- Unauthorized kernels are rejected
- Lockfile violations block loading
Scenarios:
- Multi-threaded selection cache access under high QPS
- Multi-stream CUDA execution with concurrent ops
- Forked workers after backend initialization
Pass criteria:
- No data races or corrupted cache entries
- Selection remains deterministic under identical inputs
Scenarios:
- Rank0 broadcast selection for homogeneous hardware
- Capability-group selection for heterogeneous ranks
- Mixed-device inputs to single op (must be rejected)
- Tensor-parallel attention with per-rank shapes
Pass criteria:
- Consistent selection across ranks or capability groups
- Mixed-device inputs fail with clear diagnostics
Scenarios:
- Selection overhead micro-bench under load
- Baked plan vs runtime selection comparison
- PerfDB timing influence under varying variance
- Cold-start latency with and without warmup
Pass criteria:
- Hot-path selection stays within budget
- Baked plans outperform runtime selection on steady workloads
- PerfDB influence reduced when variance is high
Scenarios:
- Environment variable overrides
- Runtime config changes (prefer/avoid/locks)
- strict_mode and plan_mode toggles
- Per-request overrides and timeboxed policies
Pass criteria:
- Policy changes invalidate caches
- Selection follows updated policy deterministically
- strict_mode rejects implicit transforms when enabled
Scenarios:
- capabilities.json schema mismatch
- dispatch_table.json schema mismatch
- hardware signature mismatch for dispatch tables
- schema fuzzing with random missing/extra fields
- capabilities schema version migration
Pass criteria:
- Fail closed with clear errors
- Readiness check detects invalid schemas
Scenarios:
- entry_points discovery for custom backends
- HF kernel hub download + namespace uniqueness
- Plugin upgrade/downgrade compatibility
Pass criteria:
- Plugins register without user code changes
- Multiple versions load without namespace clashes
Scenarios:
- Mixed dtype Q/K/V and kv-cache dtype mismatch
- Per-channel and per-block scale layouts
- Packed weights cache reuse
- FP4/NF4/INT4 mixed precision paths
Pass criteria:
- Explicit casts or dequant steps applied as required
- Scale layout mismatches rejected with clear reasons
- Packed weights reused without re-pack per call
Scenarios:
- stride(-1) != 1 inputs
- BHSD/BSHD ambiguity cases (S == H)
- attn_mask types (bool vs float) and broadcast rules
- View/transpose inputs with non-contiguous storage
Pass criteria:
- Fused kernels rejected when stride constraints fail
- Ambiguous layouts require explicit hints or fallback
- Mask broadcast violations produce clear errors
Scenarios:
- Missing JIT caches with production config
- Invalid capabilities or dispatch table
- Missing required kernels for configured policy
Pass criteria:
- lz.readiness_check fails fast with actionable report
Release candidate is accepted when:
- All mandatory scenarios pass on at least one target per hardware class
- No correctness regressions beyond tolerance
- Performance within specified thresholds or waived with justification
- Readiness check is green on production configuration
- No open P0 or P1 issues in tracking system