Reproducible recipe: serve abliterated Gemma-4-12B (gemma4_unified) at 50-118 tok/s on no-NVLink Blackwell (SM120) via vLLM nightly + ModelOpt FP8/NVFP4 + MTP spec-decode.
Updated Jun 7, 2026 · Python
Optimized vLLM setup for Gemma 4 31B NVFP4 with MTP on dual RTX PRO 6000 Blackwell, deployed with Docker: native FP4 Tensor Cores, Multi-Token Prediction (96.5% acceptance rate), and prefix caching. Includes benchmark results and replication scripts.
Patches + recipe to deploy festr2/MiMo-V2.5-Pro-NVFP4-MXFP8-attn-TP8 on 8-node DGX Spark sm_121 (Ray + vLLM, TP=8). Fixes the fused-qkv loader bug that mis-slotted Q values as K/V on 7 of 8 ranks.
Field-tested guide: multi-GPU vLLM tensor-parallel (TP=2/TP=4) on Intel Arc Pro B70 (Battlemage BMG-G31, Xe2) on Linux. Driver setup (xe force_probe=e223), bare-metal vLLM + oneAPI 2025.3, the compute-runtime multi-root USM + triton-xpu init_devices fixes, FP8/int4-AutoRound quant, root-cause error reports. AI-agent readable (AGENTS.md).