diffusiongemma-server

Docker Compose template for self-hosting Google's DiffusionGemma 26B A4B-it (released 2026-06-10) on an NVIDIA GPU host as an OpenAI-compatible HTTP endpoint, using llama.cpp.

DiffusionGemma is a 26B-parameter Mixture-of-Experts model (3.8B parameters active per token) that uses block diffusion instead of token-by-token autoregression: it generates 256-token blocks in parallel via iterative denoising, trading some quality for throughput.
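To make the throughput trade-off concrete, the sketch below counts sequential forward passes for the two decoding styles. The 256-token block size comes from the description above; the number of denoising passes per block (`denoise_steps`) is an illustrative assumption, not a published figure for DiffusionGemma.

```python
import math

BLOCK_SIZE = 256  # tokens generated in parallel per block (from the description above)


def autoregressive_steps(num_tokens: int) -> int:
    """Token-by-token decoding needs one forward pass per generated token."""
    return num_tokens


def block_diffusion_steps(num_tokens: int, denoise_steps: int) -> int:
    """Count sequential passes when blocks are denoised one after another.

    `denoise_steps` is a hypothetical per-block value chosen for illustration.
    """
    num_blocks = math.ceil(num_tokens / BLOCK_SIZE)
    return num_blocks * denoise_steps
```

For a 512-token reply, `block_diffusion_steps(512, 32)` gives 64 sequential passes against 512 for `autoregressive_steps(512)`. Each diffusion pass works on a whole block, so the passes are not equal in cost; the count only shows where the parallelism comes from.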

Why llama.cpp

vLLM claims day-zero kernel support for DiffusionGemma, but the public vllm/vllm-openai:v0.12.0 image fails on this model: the transformers fallback in the image hits `assert top_k is not None` in its MoE handler when it meets the new diffusion_gemma config schema. Until vLLM ships a release with native diffusion_gemma support, llama.cpp via PR #24427 is the stable serving path.

Prerequisites

Linux host with an NVIDIA GPU (≥17 GB VRAM for Q4_K_M, ≥27 GB for Q8_0)
NVIDIA driver + CUDA-capable hardware
Docker Engine ≥20.10 and Docker Compose v2

NVIDIA Container Toolkit installed and registered with Docker:

sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi  # smoke test

A Hugging Face account + access token (https://huggingface.co/settings/tokens). The GGUF this template pulls (unsloth/diffusiongemma-26B-A4B-it-GGUF) is public, so a read-scoped token is enough.
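The VRAM prerequisite can be checked before building anything. The following is a minimal sketch, assuming `nvidia-smi` is on the host's PATH; the helper names are hypothetical, and only the two quantizations with a stated minimum above are covered.

```python
import subprocess

# Minimum VRAM per quantization, in GB, as stated in the prerequisites.
# Other quantizations are left out because no figure is given for them.
MIN_VRAM_GB = {"Q4_K_M": 17, "Q8_0": 27}


def parse_vram_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one integer (MiB) per line, as printed by the query in query_vram_mib."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_vram_mib() -> list[int]:
    """Ask nvidia-smi for the total memory of each GPU, in MiB."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True,
        capture_output=True,
        text=True,
    )
    return parse_vram_mib(result.stdout)


def quants_that_fit(vram_mib: int) -> list[str]:
    """Return the listed quantizations whose stated minimum fits in vram_mib.

    Treats 1 GB as 1024 MiB, which is close enough for a rough pre-flight check.
    """
    vram_gb = vram_mib / 1024
    return [quant for quant, need in MIN_VRAM_GB.items() if vram_gb >= need]
```

On the GPU host, `quants_that_fit(query_vram_mib()[0])` reports which of the two quantizations the first GPU can hold. The check ignores context length, so treat a borderline result as a warning rather than a guarantee.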

Quick start

git clone https://github.com/psaboia/diffusiongemma-server.git
cd diffusiongemma-server

cp .env.example .env
# edit .env and set HF_TOKEN (and optionally HF_CACHE_HOST to reuse an
# existing $HOME/.cache/huggingface so the GGUF isn't re-downloaded).

docker compose up -d --build
docker compose logs -f

The first build takes 15–25 min (clones llama.cpp, builds with CUDA). The first run downloads the GGUF (~17 GB for Q4_K_M) into the mounted cache.

Once the logs show HTTP server listening on 0.0.0.0:8080 (the container port, published on host port 8000 by default), test from the host:

curl http://localhost:8000/v1/models
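For scripted deployments, watching the logs can be replaced by polling the liveness path. This is a minimal sketch using only the Python standard library; the function name and default timings are assumptions, and any non-200 answer is simply treated as "not ready yet".

```python
import time
import urllib.error
import urllib.request


def wait_until_ready(base_url: str, timeout_s: float = 600.0, interval_s: float = 2.0) -> bool:
    """Poll GET {base_url}/health until it answers 200 or timeout_s elapses.

    Returns True once the server is ready, False if the deadline passes first.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as response:
                if response.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            # Connection refused, timeout, or an HTTP error status: keep waiting.
            pass
        if time.monotonic() >= deadline:
            return False
        time.sleep(interval_s)
```

A deploy script could call `wait_until_ready("http://localhost:8000")` right after `docker compose up -d` and only send traffic once it returns True. The generous default timeout leaves room for the first-run GGUF download.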

Using the endpoint

OpenAI-compatible. Most useful paths:

| Method | Path | Purpose |
| --- | --- | --- |
| `GET` | `/v1/models` | List served models |
| `POST` | `/v1/chat/completions` | Chat completion |
| `POST` | `/v1/completions` | Raw completion |
| `GET` | `/health` | Liveness |

Chat example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/diffusiongemma-26B-A4B-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Explain block diffusion in one paragraph."}],
    "max_tokens": 512
  }'

The OpenAI Python SDK works against this endpoint by setting base_url="http://<host>:8000/v1" and any non-empty api_key.
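As a sketch of that SDK setup, assuming the third-party `openai` package (v1 or later) is installed and the server runs with the default port and quantization. The helper names `endpoint_url` and `ask` are illustrative, not part of this template.

```python
# Requires the third-party SDK: pip install openai
MODEL = "unsloth/diffusiongemma-26B-A4B-it-GGUF:Q4_K_M"  # same name as the curl example


def endpoint_url(host: str = "localhost", port: int = 8000) -> str:
    """Build the base_url for the server; 8000 is the default PORT from .env."""
    return f"http://{host}:{port}/v1"


def ask(prompt: str, host: str = "localhost", port: int = 8000, max_tokens: int = 512) -> str:
    """Send one user message and return the assistant's reply text."""
    from openai import OpenAI  # imported lazily so endpoint_url works without the SDK

    # Any non-empty string works as the key; the SDK refuses an empty one.
    client = OpenAI(base_url=endpoint_url(host, port), api_key="not-needed")
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return response.choices[0].message.content or ""
```

With the container running, `print(ask("Explain block diffusion in one paragraph."))` sends the same request as the curl example. If `MODEL_QUANT` was changed in .env, adjust `MODEL` to match the name reported by `/v1/models`.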

Configuration

All knobs live in .env. The defaults serve Q4_K_M on a single GPU on port 8000 with a 32K-token context. The full list with descriptions is in .env.example.

| Variable | Default | What it controls |
| --- | --- | --- |
| `HF_TOKEN` | (required) | Hugging Face token for the GGUF download |
| `HF_CACHE_HOST` | `./hf-cache` | Host path mounted at `/root/.cache/huggingface`. Point at an existing cache to skip re-downloads. |
| `MODEL_REPO` | `unsloth/diffusiongemma-26B-A4B-it-GGUF` | Hugging Face repo for the GGUF |
| `MODEL_QUANT` | `Q4_K_M` | Quantization tag (`Q4_K_M`, `Q5_K_M`, `Q6_K`, `Q8_0`, `BF16`) |
| `CONTEXT_LENGTH` | `32768` | Maximum total tokens per request (prompt + completion). Model supports up to 262144. |
| `GPU_LAYERS` | `99` | Layers offloaded to GPU (use lower if model doesn't fit) |
| `GPU_COUNT` | `1` | How many GPUs to expose to the container |
| `PORT` | `8000` | Host port mapped to the container's 8080 |

After editing .env, recreate the container so the new values take effect: docker compose down && docker compose up -d.
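A typo in .env usually surfaces only after a slow container restart, so a quick pre-flight check can save time. The sketch below validates a few values against the table above; the parser handles only plain KEY=VALUE lines, and both function names are hypothetical.

```python
ALLOWED_QUANTS = {"Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "BF16"}  # tags listed in the table
MAX_CONTEXT = 262144  # upper bound stated for CONTEXT_LENGTH


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_env(values: dict[str, str]) -> list[str]:
    """Return human-readable problems; an empty list means nothing was flagged."""
    problems = []
    if not values.get("HF_TOKEN"):
        problems.append("HF_TOKEN is required")
    quant = values.get("MODEL_QUANT", "Q4_K_M")
    if quant not in ALLOWED_QUANTS:
        problems.append(f"MODEL_QUANT {quant!r} is not one of {sorted(ALLOWED_QUANTS)}")
    context = values.get("CONTEXT_LENGTH", "32768")
    if not context.isdecimal() or not 0 < int(context) <= MAX_CONTEXT:
        problems.append(f"CONTEXT_LENGTH must be an integer between 1 and {MAX_CONTEXT}")
    port = values.get("PORT", "8000")
    if not port.isdecimal() or not 0 < int(port) < 65536:
        problems.append("PORT must be an integer between 1 and 65535")
    return problems
```

Running `check_env(parse_env(open(".env").read()))` before `docker compose up -d` prints nothing to fix when the list is empty. Variables not covered here (for example `GPU_LAYERS`) pass through unchecked.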

Known limitations

Multimodal (image input) is not available yet. DiffusionGemma is multimodal (image-text-to-text) but the community-converted GGUFs only contain the language-model weights — the vision encoder (the mmproj file llama.cpp needs) hasn't been published. Image requests return:

500 Internal Server Error: image input is not supported - hint: if this is unexpected, you may need to provide the mmproj

Watch the upstream repos for an mmproj export:

unsloth/diffusiongemma-26B-A4B-it-GGUF
ggml-org/llama.cpp PR #24423 — Daniel Han's umbrella DiffusionGemma PR
ggml-org/llama.cpp PR #24427 — the block-diffusion support PR this template is built on

Troubleshooting

`undefined reference to 'cuGetErrorString'` during `docker compose build`

The CUDA dev image ships libcudart but the driver library libcuda.so normally comes from the host at runtime. The Dockerfile already symlinks the SDK's stub copy into a standard ld search path; if you rebuild and still hit this, make sure /usr/local/cuda/lib64/stubs/libcuda.so exists inside the nvidia/cuda:*-devel-* base image you're using.

`HTTPS is not supported. Please rebuild with one of: -DLLAMA_OPENSSL=ON ...`

llama.cpp's -hf flag needs HTTPS to reach Hugging Face. The Dockerfile sets -DLLAMA_OPENSSL=ON and installs libssl-dev; if you forked the Dockerfile and dropped either, the -hf download will fail with this message.

Slow first request after the container starts

First inference loads the model into VRAM (~5–10 s). Subsequent requests reuse the loaded weights.

License

Apache-2.0. See LICENSE.
