Docker-Compose template to self-host Google's DiffusionGemma 26B A4B-it (released 2026-06-10) on an NVIDIA GPU host as an OpenAI-compatible HTTP endpoint, using llama.cpp.
DiffusionGemma is a 26B-parameter Mixture-of-Experts (3.8B active per token) that uses block diffusion instead of token-by-token autoregression — it generates 256-token blocks in parallel via iterative denoising, trading some quality for throughput.
vLLM has day-zero kernel claims for DiffusionGemma but the public vllm/vllm-openai:v0.12.0 image fails on this model (the transformers fallback in the image hits an assert top_k is not None in its MoE handler against the new diffusion_gemma config schema). Until vLLM ships a release with native diffusion_gemma support, llama.cpp via PR #24427 is the stable serving path.
- Linux host with an NVIDIA GPU (≥17 GB VRAM for Q4_K_M, ≥27 GB for Q8_0)
- NVIDIA driver + CUDA-capable hardware
- Docker Engine ≥20.10 and Docker Compose v2
- NVIDIA Container Toolkit installed and registered with Docker:

      sudo nvidia-ctk runtime configure --runtime=docker
      sudo systemctl restart docker
      docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi   # smoke test

- A Hugging Face account + access token (https://huggingface.co/settings/tokens). The GGUF this template pulls (unsloth/diffusiongemma-26B-A4B-it-GGUF) is public, so a read-scoped token is enough.
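To check the VRAM requirement before building, the GPU memory reported by `nvidia-smi` can be compared against the two minimums listed in the prerequisites. The following is a minimal Python sketch, not part of the template: the function names are hypothetical, only the two quantizations with stated figures are covered, and 1 GB is treated as 1024 MiB as an approximation.

```python
import subprocess

# Minimum VRAM in GB per quantization, taken from the prerequisites list.
# Other quantization tags are left out because no figures are given for them.
MIN_VRAM_GB = {"Q4_K_M": 17, "Q8_0": 27}


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one memory.total value (in MiB) per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def quants_that_fit(vram_mib: int) -> list[str]:
    """Return the listed quantizations whose stated minimum fits in vram_mib."""
    vram_gb = vram_mib / 1024  # approximation: treats 1 GB as 1024 MiB
    return [quant for quant, need_gb in MIN_VRAM_GB.items() if vram_gb >= need_gb]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_vram_mib(result.stdout)
```

Calling `quants_that_fit(mib)` for each value returned by `query_vram_mib()` gives a rough per-GPU answer; it does not account for memory already in use or for the KV cache at larger context lengths.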
git clone https://github.com/psaboia/diffusiongemma-server.git
cd diffusiongemma-server
cp .env.example .env
# edit .env and set HF_TOKEN (and optionally HF_CACHE_HOST to reuse an
# existing $HOME/.cache/huggingface so the GGUF isn't re-downloaded).
docker compose up -d --build
docker compose logs -f

The first build takes 15–25 min (it clones llama.cpp and builds it with CUDA). The first run downloads the GGUF (~17 GB for Q4_K_M) into the mounted cache.
Once the logs show HTTP server listening on 0.0.0.0:8080 (the container-internal port; the host port defaults to 8000), test:
curl http://localhost:8000/v1/models

The API is OpenAI-compatible. The most useful paths:
| Method | Path | Purpose |
|---|---|---|
| GET | /v1/models | List served models |
| POST | /v1/chat/completions | Chat completion |
| POST | /v1/completions | Raw completion |
| GET | /health | Liveness |
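
Because the first start can spend a long time downloading the GGUF, a client script may want to wait for the server before sending requests. Below is a minimal Python sketch that polls /health. It assumes the endpoint answers HTTP 200 only once the server is ready; the function names, timeout, and interval are illustrative choices, not part of the template.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def probe_health(url: str) -> bool:
    """Return True if GET url answers HTTP 200, False on connection or HTTP errors."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Covers refused connections, timeouts, and non-2xx responses.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",
    timeout_s: float = 1800.0,
    interval_s: float = 5.0,
    probe: Callable[[str], bool] = probe_health,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the health endpoint until it succeeds or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

The `probe` and `sleep` parameters exist only so the loop can be exercised without a running server; in normal use `wait_until_healthy()` with its defaults is enough.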
Chat example:
curl http://localhost:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "unsloth/diffusiongemma-26B-A4B-it-GGUF:Q4_K_M",
"messages": [{"role": "user", "content": "Explain block diffusion in one paragraph."}],
"max_tokens": 512
}'

The OpenAI Python SDK works against this endpoint by setting base_url="http://<host>:8000/v1" and any non-empty api_key.
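
A minimal Python sketch of that SDK setup, assuming the `openai` package is installed and the server runs with the default port and quantization. The helper names `endpoint_base_url` and `ask` are hypothetical, and the `api_key` value is an arbitrary placeholder.

```python
def endpoint_base_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the base_url the OpenAI SDK expects for this server."""
    return f"http://{host}:{port}/v1"


def ask(prompt: str, host: str = "localhost", port: int = 8000, max_tokens: int = 512) -> str:
    """Send one chat completion request and return the reply text."""
    # Imported here so endpoint_base_url stays usable without the SDK installed.
    from openai import OpenAI  # pip install openai

    # Any non-empty api_key is accepted; the value below is a placeholder.
    client = OpenAI(base_url=endpoint_base_url(host, port), api_key="not-needed")
    response = client.chat.completions.create(
        model="unsloth/diffusiongemma-26B-A4B-it-GGUF:Q4_K_M",
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content
```

For example, `ask("Explain block diffusion in one paragraph.")` mirrors the curl request shown earlier.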
All knobs live in .env. Defaults serve Q4_K_M on a single GPU at port 8000 with a 32 K context. The full list with descriptions is in .env.example.
| Variable | Default | What it controls |
|---|---|---|
| HF_TOKEN | (required) | Hugging Face token for the GGUF download |
| HF_CACHE_HOST | ./hf-cache | Host path mounted at /root/.cache/huggingface. Point at an existing cache to skip re-downloads. |
| MODEL_REPO | unsloth/diffusiongemma-26B-A4B-it-GGUF | Hugging Face repo for the GGUF |
| MODEL_QUANT | Q4_K_M | Quantization tag (Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16) |
| CONTEXT_LENGTH | 32768 | Maximum total tokens per request (prompt + completion). Model supports up to 262144. |
| GPU_LAYERS | 99 | Layers offloaded to GPU (use a lower value if the model doesn't fit) |
| GPU_COUNT | 1 | How many GPUs to expose to the container |
| PORT | 8000 | Host port mapped to the container's 8080 |
To apply a change, edit .env, then restart the stack with docker compose down && docker compose up -d.
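
Before restarting, the values in .env can be sanity-checked against the constraints in the table above. This is a hedged Python sketch rather than anything the template ships: `parse_env` and `check_env` are hypothetical helpers, and the parser handles only simple KEY=VALUE lines, which is narrower than what Docker Compose itself accepts.

```python
ALLOWED_QUANTS = {"Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "BF16"}  # from the table above
MAX_CONTEXT = 262144  # upper bound stated for CONTEXT_LENGTH


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def _int_in_range(text: str, low: int, high: int) -> bool:
    """True if text is an integer between low and high inclusive."""
    try:
        return low <= int(text) <= high
    except ValueError:
        return False


def check_env(values: dict[str, str]) -> list[str]:
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    if not values.get("HF_TOKEN"):
        problems.append("HF_TOKEN is required")
    quant = values.get("MODEL_QUANT", "Q4_K_M")
    if quant not in ALLOWED_QUANTS:
        problems.append(f"MODEL_QUANT {quant!r} is not a listed quantization")
    if not _int_in_range(values.get("CONTEXT_LENGTH", "32768"), 1, MAX_CONTEXT):
        problems.append(f"CONTEXT_LENGTH must be an integer from 1 to {MAX_CONTEXT}")
    if not _int_in_range(values.get("PORT", "8000"), 1, 65535):
        problems.append("PORT must be an integer from 1 to 65535")
    return problems
```

Running `check_env(parse_env(open(".env").read()))` returns a list of messages to fix before bringing the stack back up. Missing optional variables fall back to the defaults from the table.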
Multimodal (image input) is not available yet. DiffusionGemma is multimodal (image-text-to-text) but the community-converted GGUFs only contain the language-model weights — the vision encoder (the mmproj file llama.cpp needs) hasn't been published. Image requests return:
500 Internal Server Error: image input is not supported - hint: if this is unexpected, you may need to provide the mmproj
Watch the upstream repos for an mmproj export:
- unsloth/diffusiongemma-26B-A4B-it-GGUF
- ggml-org/llama.cpp PR #24423: Daniel Han's umbrella DiffusionGemma PR
- ggml-org/llama.cpp PR #24427: the block-diffusion support PR this template is built on
The CUDA dev image ships libcudart but the driver library libcuda.so normally comes from the host at runtime. The Dockerfile already symlinks the SDK's stub copy into a standard ld search path; if you rebuild and still hit this, make sure /usr/local/cuda/lib64/stubs/libcuda.so exists inside the nvidia/cuda:*-devel-* base image you're using.
llama.cpp's -hf flag needs HTTPS to reach Hugging Face. The Dockerfile sets -DLLAMA_OPENSSL=ON and installs libssl-dev; if you forked the Dockerfile and dropped either, the -hf download will fail with this message.
First inference loads the model into VRAM (~5–10 s). Subsequent requests reuse the loaded weights.
Apache-2.0. See LICENSE.