diffusiongemma-server

Docker-Compose template to self-host Google's DiffusionGemma 26B A4B-it (released 2026-06-10) on an NVIDIA GPU host as an OpenAI-compatible HTTP endpoint, using llama.cpp.

DiffusionGemma is a 26B-parameter Mixture-of-Experts model (3.8B parameters active per token) that uses block diffusion instead of token-by-token autoregression: it generates 256-token blocks in parallel via iterative denoising, trading some quality for throughput.
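As a rough illustration of what the 256-token block size means for a request budget, the sketch below rounds a max_tokens value up to whole blocks. It assumes output is produced one block at a time; how the server handles a partially filled final block is not stated here, and the helper names are hypothetical.

```python
import math

BLOCK_SIZE = 256  # tokens per diffusion block, as described above


def blocks_needed(max_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Number of whole blocks required to cover max_tokens output tokens."""
    if max_tokens < 0:
        raise ValueError("max_tokens must be non-negative")
    return math.ceil(max_tokens / block_size)


def padded_budget(max_tokens: int, block_size: int = BLOCK_SIZE) -> int:
    """Round a token budget up to the next block boundary (assumption:
    generation proceeds in whole blocks)."""
    return blocks_needed(max_tokens, block_size) * block_size
```

Under that assumption, a max_tokens of 512 corresponds to exactly two blocks, while 513 would spill into a third.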

Why llama.cpp

vLLM claims day-zero kernel support for DiffusionGemma, but the public vllm/vllm-openai:v0.12.0 image fails on this model: the transformers fallback in the image fails the assertion top_k is not None in its MoE handler when it meets the new diffusion_gemma config schema. Until vLLM ships a release with native diffusion_gemma support, llama.cpp via PR #24427 is the stable serving path.

Prerequisites

  • Linux host with an NVIDIA GPU (≥17 GB VRAM for Q4_K_M, ≥27 GB for Q8_0)

  • NVIDIA driver + CUDA-capable hardware

  • Docker Engine ≥20.10 and Docker Compose v2

  • NVIDIA Container Toolkit installed and registered with Docker:

    sudo nvidia-ctk runtime configure --runtime=docker
    sudo systemctl restart docker
    docker run --rm --gpus all nvidia/cuda:12.4.1-base-ubuntu22.04 nvidia-smi  # smoke test
  • A Hugging Face account + access token (https://huggingface.co/settings/tokens). The GGUF this template pulls (unsloth/diffusiongemma-26B-A4B-it-GGUF) is public, so a read-scoped token is enough.
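To check the VRAM requirement before building, a small script can compare what nvidia-smi reports against the minimums listed above. This is a minimal sketch: the function names are hypothetical, only the two quantizations with a stated minimum are covered, and 1 GB is treated as 1024 MiB, which is an approximation.

```python
import subprocess

# Minimum VRAM per quantization in GB, taken from the prerequisites above.
MIN_VRAM_GB = {"Q4_K_M": 17, "Q8_0": 27}


def parse_total_mib(nvidia_smi_output: str) -> list[int]:
    """Parse one MiB value per GPU from nvidia-smi CSV output."""
    return [int(line.strip()) for line in nvidia_smi_output.splitlines() if line.strip()]


def quants_that_fit(total_mib: int) -> list[str]:
    """Return the quantization tags whose stated minimum fits in total_mib."""
    total_gb = total_mib / 1024  # assumption: 1 GB ~ 1024 MiB
    return [quant for quant, need in MIN_VRAM_GB.items() if total_gb >= need]


def query_gpus() -> list[int]:
    """Ask nvidia-smi for total memory per GPU (requires the NVIDIA driver)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        check=True, capture_output=True, text=True,
    ).stdout
    return parse_total_mib(out)
```

Calling quants_that_fit(mib) for each value returned by query_gpus() gives a quick per-GPU answer; it says nothing about the extra memory a large context may need.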

Quick start

git clone https://github.com/psaboia/diffusiongemma-server.git
cd diffusiongemma-server

cp .env.example .env
# edit .env and set HF_TOKEN (and optionally HF_CACHE_HOST to reuse an
# existing $HOME/.cache/huggingface so the GGUF isn't re-downloaded).

docker compose up -d --build
docker compose logs -f

The first build takes 15–25 min (clones llama.cpp, builds with CUDA). The first run downloads the GGUF (~17 GB for Q4_K_M) into the mounted cache.

Once the logs show HTTP server listening on 0.0.0.0:8080 (the port inside the container; the host port defaults to 8000), test:

curl http://localhost:8000/v1/models
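Because the first build and download can take a while, a script that depends on the server may prefer to poll rather than watch the logs. The sketch below polls the /health path until it answers with HTTP 200. The function name, the default timeout, and the injectable opener, sleep, and clock parameters are illustrative choices, not part of this template.

```python
import time
import urllib.error
import urllib.request


def wait_for_health(base_url, timeout_s=1800.0, interval_s=5.0,
                    opener=urllib.request.urlopen, sleep=time.sleep,
                    clock=time.monotonic):
    """Poll GET {base_url}/health until it returns HTTP 200 or time runs out."""
    deadline = clock() + timeout_s
    url = base_url.rstrip("/") + "/health"
    while True:
        try:
            with opener(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or the server answered with an error status
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A call such as wait_for_health("http://localhost:8000") returns True once the server responds, or False after the timeout.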

Using the endpoint

The server is OpenAI-compatible. The most useful paths:

| Method | Path | Purpose |
| --- | --- | --- |
| GET | /v1/models | List served models |
| POST | /v1/chat/completions | Chat completion |
| POST | /v1/completions | Raw completion |
| GET | /health | Liveness |
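Rather than hard-coding the model name in every request, a client can read it from /v1/models. The sketch below assumes the response follows the usual OpenAI list shape (a data array of objects with an id field); the function names are hypothetical.

```python
import json
import urllib.request


def model_ids(payload: dict) -> list[str]:
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", []) if "id" in entry]


def fetch_model_ids(base_url: str = "http://localhost:8000") -> list[str]:
    """GET /v1/models and return the ids the server reports."""
    url = base_url.rstrip("/") + "/v1/models"
    with urllib.request.urlopen(url, timeout=10) as resp:
        return model_ids(json.load(resp))
```

The first id returned by fetch_model_ids() can then be passed as the model field of a chat or completion request.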

Chat example:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "unsloth/diffusiongemma-26B-A4B-it-GGUF:Q4_K_M",
    "messages": [{"role": "user", "content": "Explain block diffusion in one paragraph."}],
    "max_tokens": 512
  }'

The OpenAI Python SDK works against this endpoint by setting base_url="http://<host>:8000/v1" and any non-empty api_key.
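A minimal sketch of that SDK setup follows. It assumes the openai package is installed (pip install openai) and that the server runs on the default port; the helper names and the placeholder API key are illustrative.

```python
def make_base_url(host: str, port: int = 8000) -> str:
    """Build the base_url the SDK should point at."""
    return f"http://{host}:{port}/v1"


def ask(prompt: str, host: str = "localhost", port: int = 8000,
        model: str = "unsloth/diffusiongemma-26B-A4B-it-GGUF:Q4_K_M",
        max_tokens: int = 512) -> str:
    """Send one user message and return the assistant's reply text."""
    from openai import OpenAI  # third-party dependency, imported lazily

    # Any non-empty api_key works; the value here is a placeholder.
    client = OpenAI(base_url=make_base_url(host, port), api_key="not-needed")
    resp = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        max_tokens=max_tokens,
    )
    return resp.choices[0].message.content
```

For example, ask("Explain block diffusion in one paragraph.") mirrors the curl request above.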

Configuration

All knobs live in .env. The defaults serve Q4_K_M on a single GPU on port 8000 with a 32K-token context. The full list with descriptions is in .env.example.

| Variable | Default | What it controls |
| --- | --- | --- |
| HF_TOKEN | (required) | Hugging Face token for the GGUF download |
| HF_CACHE_HOST | ./hf-cache | Host path mounted at /root/.cache/huggingface. Point at an existing cache to skip re-downloads. |
| MODEL_REPO | unsloth/diffusiongemma-26B-A4B-it-GGUF | Hugging Face repo for the GGUF |
| MODEL_QUANT | Q4_K_M | Quantization tag (Q4_K_M, Q5_K_M, Q6_K, Q8_0, BF16) |
| CONTEXT_LENGTH | 32768 | Maximum total tokens per request (prompt + completion). Model supports up to 262144. |
| GPU_LAYERS | 99 | Layers offloaded to GPU (use lower if model doesn't fit) |
| GPU_COUNT | 1 | How many GPUs to expose to the container |
| PORT | 8000 | Host port mapped to the container's 8080 |

Edit .env, then docker compose down && docker compose up -d.
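Before restarting, it can help to sanity-check .env against the constraints in the table. The sketch below is a simplified checker, not part of this template: the parser handles only plain KEY=VALUE lines (Docker Compose's own .env parsing has more rules), the quantization set is just the tags listed above, and the function names are hypothetical.

```python
ALLOWED_QUANTS = {"Q4_K_M", "Q5_K_M", "Q6_K", "Q8_0", "BF16"}  # tags listed above
MAX_CONTEXT = 262144  # maximum context the model supports, per the table


def parse_env(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def check_env(values: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; empty means no issue found."""
    problems = []
    if not values.get("HF_TOKEN"):
        problems.append("HF_TOKEN is required")
    quant = values.get("MODEL_QUANT", "Q4_K_M")
    if quant not in ALLOWED_QUANTS:
        problems.append(f"MODEL_QUANT {quant!r} is not one of the listed tags")
    ctx = values.get("CONTEXT_LENGTH", "32768")
    if not ctx.isdecimal() or not 0 < int(ctx) <= MAX_CONTEXT:
        problems.append(f"CONTEXT_LENGTH must be an integer in 1..{MAX_CONTEXT}")
    return problems
```

Running check_env(parse_env(open(".env").read())) and printing the result catches a missing token or an out-of-range context length before the container is restarted.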

Known limitations

Multimodal (image input) is not available yet. DiffusionGemma is multimodal (image-text-to-text), but the community-converted GGUFs contain only the language-model weights; the vision encoder (the mmproj file llama.cpp needs) hasn't been published. Image requests return:

500 Internal Server Error: image input is not supported - hint: if this is unexpected, you may need to provide the mmproj
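Until then, a client that sometimes sends images can recognize this failure and retry with text only. The sketch below is one hedged way to do that: it matches on the error text quoted above, assumes messages use the OpenAI content-parts format (a list of parts with a type field), and uses hypothetical helper names.

```python
def is_missing_mmproj_error(status: int, body: str) -> bool:
    """True if the response looks like the 'image input is not supported' error."""
    return status == 500 and "image input is not supported" in body


def strip_images(messages: list[dict]) -> list[dict]:
    """Return a copy of messages with every non-text content part removed.

    String content is left untouched; list content keeps only parts whose
    type is "text" (assumption: OpenAI-style content parts).
    """
    cleaned = []
    for msg in messages:
        content = msg.get("content")
        if isinstance(content, list):
            content = [part for part in content if part.get("type") == "text"]
        cleaned.append({**msg, "content": content})
    return cleaned
```

A caller would check is_missing_mmproj_error on a failed response and, if it matches, resend the request with strip_images(messages).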

Watch the upstream repos for an mmproj export.

Troubleshooting

undefined reference to 'cuGetErrorString' during docker compose build

The CUDA dev image ships libcudart but the driver library libcuda.so normally comes from the host at runtime. The Dockerfile already symlinks the SDK's stub copy into a standard ld search path; if you rebuild and still hit this, make sure /usr/local/cuda/lib64/stubs/libcuda.so exists inside the nvidia/cuda:*-devel-* base image you're using.

HTTPS is not supported. Please rebuild with one of: -DLLAMA_OPENSSL=ON ...

llama.cpp's -hf flag needs HTTPS to reach Hugging Face. The Dockerfile sets -DLLAMA_OPENSSL=ON and installs libssl-dev; if you forked the Dockerfile and dropped either, the -hf download will fail with this message.

Slow first request after the container starts

First inference loads the model into VRAM (~5–10 s). Subsequent requests reuse the loaded weights.
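To keep that delay away from real traffic, a deployment script can send one tiny request right after startup. This is a minimal sketch under the default port and model name; the function names, the "ping" prompt, and the timeout are illustrative.

```python
import json
import urllib.request


def build_warmup_request(base_url: str, model: str) -> urllib.request.Request:
    """Build a one-token chat completion request used only to trigger model load."""
    body = {
        "model": model,
        "messages": [{"role": "user", "content": "ping"}],
        "max_tokens": 1,
    }
    return urllib.request.Request(
        base_url.rstrip("/") + "/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def warm_up(base_url: str = "http://localhost:8000",
            model: str = "unsloth/diffusiongemma-26B-A4B-it-GGUF:Q4_K_M",
            timeout_s: float = 120.0) -> int:
    """Send the warm-up request and return the HTTP status code."""
    request = build_warmup_request(base_url, model)
    with urllib.request.urlopen(request, timeout=timeout_s) as resp:
        return resp.status
```

Calling warm_up() once after the server reports healthy should absorb the initial load time, so later requests hit already-loaded weights.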

License

Apache-2.0. See LICENSE.
