
mtmd: fit_params now take into account mmproj #21489

Open

ngxson wants to merge 1 commit into ggml-org:master from ngxson:xsn/mmproj_fit_params

Conversation

@ngxson
Contributor

@ngxson ngxson commented Apr 5, 2026

Overview

Fix #19980 ; Fix #18181

--fit now takes into account the memory usage of mmproj and adjusts --fit-target accordingly. This is effective for llama-server and llama-cli.

llama-mtmd-cli doesn't support this yet, since it is mostly used for testing; support can be added in the future.

Additional information

A new API is added to libmtmd: mtmd_get_memory_usage, which returns the memory usage in bytes. The returned value is the sum of:

  • Memory needed by the mmproj weights, ref: clip_model_loader::load_tensors
  • Compute buffer, ref: clip_model_loader::alloc_compute_meta

If the model has both a vision and an audio encoder, both are taken into account.

Demo

llama-server -hf ggml-org/Qwen2.5-Omni-3B-GGUF
...
srv    load_model: [mtmd] estimated memory usage of mmproj is 2447.80 MiB
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 3287 MiB of device memory vs. 110100 MiB of free device memory
llama_params_fit_impl: will leave 106812 >= 3471 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.21 seconds
llama_model_load_from_file_impl: using device MTL0 (Apple M5 Max) (unknown id) - 110100 MiB free
...


@ngxson ngxson requested a review from JohannesGaessler April 5, 2026 21:25
@ngxson ngxson requested review from a team as code owners April 5, 2026 21:25
@erazortt

erazortt commented Apr 5, 2026

What if I want to purposely have the mmproj overflow into RAM when using the unified flag?

PS: and yes, this is useful! In this case mmproj will mostly sit in RAM, except when it is actually needed. Yes, swapping between RAM and VRAM takes some time, but it is faster than keeping it only in RAM. This can be useful if vision is used only irregularly.

@ngxson
Contributor Author

ngxson commented Apr 5, 2026

What if I want to purposely have the mmproj overflow into RAM when using the unified flag?

I don't get what you mean. Can you provide your CLI arguments?

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Apr 5, 2026

What if I want to purposely have the mmproj overflow into RAM when using the unified flag?

Could you just adjust the fit target to something lower than the default 1024 MiB?

Depending on the size of the mmproj, --fitt 512 or even --fitt 0 comes close, though not exact.
(Qwen3 35B at 82K without an mmproj vs 78K at fitt 0, 105 tk/s)

@ngxson, some mmproj models seem to perform really well even when letting fit overshoot the VRAM a bit due to the mmproj. I think that's what @erazortt is referring to, since I still get 105 tk/s @ fitt 0 or on master.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Apr 5, 2026

Should --no-mmproj-offload remove it from the memory considerations?

With 24 GB VRAM (3090) and Q4 Qwen 35B, I was getting around 75K context on master; with the new mmproj considerations I get around 30K. I thought I'd throw in "no offload" since I don't use it often (and was still getting 90 tk/s even with overflow on master), but it seems the calculations are unchanged.

Perhaps it works differently than I imagined?

With mmproj and --no-mmproj-offload (29K context)

srv          load:   --no-mmproj-offload
...
[60955] srv    load_model: [mtmd] estimated memory usage of mmproj is 1109.08 MiB
[60955] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[60955] llama_params_fit_impl: projected to use 25934 MiB of device memory vs. 23239 MiB of free device memory
[60955] llama_params_fit_impl: cannot meet free memory target of 2133 MiB, need to reduce device memory by 4828 MiB
[60955] llama_params_fit_impl: context size reduced from 262144 to 28928 -> need 4832 MiB less memory in total
[60955] llama_params_fit_impl: entire model can be fit by reducing context
[60955] llama_params_fit: successfully fit params to free device memory

Without a mmproj file: (82K context)

[60745] llama_params_fit_impl: projected to use 25934 MiB of device memory vs. 23239 MiB of free device memory
[60745] llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 3719 MiB
[60745] llama_params_fit_impl: context size reduced from 262144 to 82432 -> need 3723 MiB less memory in total
[60745] llama_params_fit_impl: entire model can be fit by reducing context
[60745] llama_params_fit: successfully fit params to free device memory

@erazortt

erazortt commented Apr 6, 2026

What if I want to purposely have the mmproj overflow into RAM when using the unified flag?

I don't get what you mean. Can you provide your CLI arguments?

What I mean is that I currently like to set the fit target lower than the mmproj size. This has the effect of forcing the GPU to use unified memory and move the unused parts to RAM. This means mmproj gets pushed to RAM until it is used. While mmproj is in use (during image encoding) it is pushed back to VRAM, and when the LLM takes over it is pushed back to RAM. The question was whether that would still be possible once this is merged.

@ngxson
Contributor Author

ngxson commented Apr 6, 2026

@erazortt I still don't get what you want, but I will add a new flag --no-fit-mmproj that skips this PR's logic altogether.

@strawberrymelonpanda it seems --no-mmproj-offload is not taken into account by this PR. I'll look into implementing it.

Contributor

@JohannesGaessler JohannesGaessler left a comment


In principle, if you could just load the mtmd stuff before the main model you would need to make no further changes since llama_params_fit will recognize that the memory usage has increased. However, I assume this won't just work due to the vocab issues.

Comment thread on tools/mtmd/clip.cpp:
static support_info_graph alloc_compute_meta(clip_ctx & ctx_clip, const clip_image_f32_batch & batch) {
ctx_clip.buf_compute_meta.resize(ctx_clip.max_nodes * ggml_tensor_overhead() + ggml_graph_overhead());

// TODO @ngxson : prevent alloc if no_alloc is set
Contributor


I assume the intent is to do this in a future PR? And that this isn't a forgotten TODO?

Contributor Author


Not a forgotten TODO, but I'm attempting to implement this (no idea if I'm getting it right). I'll push a new commit so you can have a look.

Contributor Author


Never mind, we do not allocate anything here; I was confused between ggml_backend_sched_reserve and ggml_backend_sched_alloc_graph.

SRV_ERR("%s", "[mtmd] failed to get memory usage of mmproj\n");
}
GGML_ASSERT(!params_base.fit_params_target.empty());
params_base.fit_params_target[0] += mmproj_mem;
Contributor


I assume you know this better than me, but does the mtmd library actually always use the first device?

Contributor Author


It either uses the CPU or the first non-CPU device. I'm extending this logic to support --no-mmproj-offload.

@erazortt

erazortt commented Apr 6, 2026

Could you just adjust the fit target to lower than the default 1024MB?

What I am currently doing, e.g. for Qwen 27B:
.\llama-cpp\llama-server.exe -m models/Qwen3.5-27B-Q6_K_L.gguf --mmproj models/Qwen3.5-27B-mmproj-bf16.gguf --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0 --presence-penalty 1.5 --repeat-penalty 1.0 -ngl 99 --fit-target 448M -ctk q8_0 -ctv q8_0 --jinja --reasoning off --flash-attn on --no-mmap --host 0.0.0.0 --port 10000

I am setting fit-target to 448M. If this gets merged, then I would probably be able to go to 0M.

@erazortt still don't get what you want. But I will add a new flag --no-fit-mmproj that skip this PR's logic altogether

I don't think an additional parameter is necessary. If you want to take my remark into consideration, you could perhaps allow negative fit-target settings. If you do not want to take this into account (which would be understandable, since this is a fringe use case), or the implementation is too complicated, I don't think it would be an issue for most users: whoever still wants to do this can disable fitting.

@erazortt

I have done more tests, and it appears that going too far over the limit by making the fit size too small makes things very slow for big contexts. So please disregard what I suggested and merge this as it is.

