Replies: 4 comments 5 replies
I'm the sole user of these models (at most one embedding model and one LLM loaded at the same time), usually just chatting back and forth with basic tool calls via MCP and the llama-server web UI, or Pi for the CLI. Request: MoE models that are larger than VRAM (let's focus on
I know file sizes and active parameters all impact the numbers, and you can't expect apples-to-apples comparisons between the gpt-oss model architecture and others; I just want to make sure I'm not missing any dials to tweak. Thanks everyone for this amazing project!

Additional Questions
System Setup: RTX 3090 (24 GB VRAM) | 128 GB DDR5 6400 MT/s RAM | Intel Core Ultra 265K

I use

Portions of my

For embedding models I add this at each model level

Llama-Bench Command
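On the MoE-larger-than-VRAM question: the usual dial in llama.cpp is to offload all layers to the GPU and then override the MoE expert tensors back to system RAM, so only the dense/attention weights occupy VRAM. A minimal sketch — the model name, context size, and layer count below are placeholders, not taken from this post:

```shell
# Hypothetical example. -ngl 99 offloads every layer to the GPU first; the -ot
# regex then overrides the MoE expert tensors (ffn_*_exps) back to the CPU
# buffer, i.e. system RAM. Large -ub helps prompt processing on the CPU experts.
llama-server -m Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 32768 -ub 2048

# Recent builds expose the same idea as a dedicated flag:
#   --n-cpu-moe 24   # keep the expert tensors of the first 24 layers on the CPU
```

Tuning how many expert layers stay on the CPU versus the GPU is the main knob: push as many as fit after the KV cache is accounted for.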
Great thread! I'm running Qwen 35B A3B on my RTX 2060 laptop (32 GB RAM, 6 GB VRAM, i7-9750H, Windows 11), and this is the first model I've been able to run with ridiculous amounts of context at great speed.
This is the best configuration I have come up with. With it, the experts run on the CPU while the other layers run on the GPU. ubatch 2048 gives a huge speedup to prompt processing, which is sorely needed. This way I'm getting around 350-400 tokens/s prefill at 102K context and text generation of around 15 tokens/s, so I'm very pleased with how well it runs.

Pure text generation is great. However, as you may have noticed, I have offloaded the mmproj entirely to the CPU. Why? Because it needs around 600 MB of VRAM, and that would greatly reduce the effective context I'm able to run. On the CPU it can be very slow: it takes up to 300 seconds on decently sized images like browser snapshots. I wonder if there is any way to either make the mmproj more efficient on the CPU, or to swap layers from GPU to RAM (automatically or via a command) to free up VRAM for the vision encoder just before vision processing, and then load them back onto the GPU once the vision processing has completed.
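For anyone wanting to try the same split, here is a sketch of the kind of configuration described above (the commenter's actual command was collapsed in the page; the file names and exact values here are my assumptions):

```shell
# Assumed reconstruction, not the commenter's literal command.
# -ngl 99 + the -ot override keeps experts in system RAM while dense layers
# stay on the 6 GB GPU; --no-mmproj-offload runs the vision projector on the
# CPU so its ~600 MB does not eat into VRAM available for KV cache.
llama-server -m model.gguf \
  --mmproj mmproj.gguf --no-mmproj-offload \
  -ngl 99 -ot ".ffn_.*_exps.=CPU" \
  -c 102400 -ub 2048
```

There is currently no built-in way to temporarily evict GPU layers for the vision-encoding phase and restore them afterwards, which is essentially what the question above is asking for.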
I might be using one of the more exotic setups :) but speeds aren't really that good despite the connectivity.
I typically execute like this:
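The command itself was collapsed in the page above. If the "exotic" part involves spreading a model across several machines, llama.cpp's RPC backend is typically driven like this — hosts, ports, and the model path below are placeholders of mine, not the commenter's actual setup:

```shell
# On each worker machine (llama.cpp built with GGML_RPC=ON), expose a backend:
rpc-server -H 0.0.0.0 -p 50052

# On the head node, split the model across the remote backends:
llama-server -m model.gguf -ngl 99 --rpc 192.168.1.10:50052,192.168.1.11:50052
```

With RPC, per-token latency is usually dominated by the network round-trips rather than raw link bandwidth, which may explain speeds staying low despite good connectivity.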
I have an old crypto miner; it's basically e-waste, but it does run smaller models okay-ish. I'm building a garbage multi-agent chat interface, and it sort of works as long as you only trigger 1-2 models at the same time.

HW: I currently run it like this: it can idle multiple models at the same time and run active inference on two models without too much slowdown, but as soon as you hit the third model everything slows down significantly. It's running different 4B models in Q4, or whatever fits in a single GPU's memory. I was wondering if there is a way to reduce CPU load to be able to run multiple models concurrently.
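One common way to stop concurrent instances from fighting over cores is to cap and pin each server's threads. A sketch (assuming Linux, a 12-core host, and one GPU per model; model names, core ranges, and ports are placeholders):

```shell
# Hypothetical example: one llama-server per model/GPU, each limited to 4
# threads and pinned to its own cores with taskset. With -ngl 99 the actual
# generation runs mostly on the GPU, so a small -t is usually enough.
taskset -c 0-3  llama-server -m agent-a-4b-q4.gguf -ngl 99 -t 4 --port 8081 &
taskset -c 4-7  llama-server -m agent-b-4b-q4.gguf -ngl 99 -t 4 --port 8082 &
taskset -c 8-11 llama-server -m agent-c-4b-q4.gguf -ngl 99 -t 4 --port 8083 &
```

Without pinning, three fully-offloaded instances can still saturate the CPU through sampling, tokenization, and busy-wait polling, which matches the "third model" cliff described above.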
Overview
Are you using llama.cpp and wondering if you are getting the most out of your hardware? Post your parameters below and get some help from the community to improve the performance. Sometimes, adjusting a few parameters can make a big difference in terms of speed and/or quality.
Information needed:
- The llama-server command that you are currently using
- The performance you are getting (typically from llama-bench, but could be something else depending on the use case)
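For the performance numbers, a typical llama-bench invocation looks like this (the model path is a placeholder):

```shell
# Measures prompt-processing (pp512) and token-generation (tg128) throughput
# with all layers offloaded to the GPU; drop -ngl for a CPU-only baseline.
llama-bench -m model.gguf -ngl 99 -p 512 -n 128
```

Posting the resulting table alongside your llama-server command makes it much easier for others to spot which parameter to tweak.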