
mtmd: fit_params now take into account mmproj #21489

Open

ngxson wants to merge 1 commit into ggml-org:master from ngxson:xsn/mmproj_fit_params

Conversation

@ngxson
Contributor

@ngxson ngxson commented Apr 5, 2026

Overview

Fix #19980 ; Fix #18181

--fit now takes into account the memory usage of mmproj and adjusts --fit-target accordingly. This is effective for llama-server and llama-cli.

llama-mtmd-cli doesn't support this yet, since it is mostly used for testing; support can be added in the future.

Additional information

A new API is added to libmtmd: mtmd_get_memory_usage, which returns the memory usage in bytes. The returned value is the sum of:

  • Memory needed by the mmproj weights, ref: clip_model_loader::load_tensors
  • Compute buffer, ref: clip_model_loader::alloc_compute_meta

If the model has both a vision and an audio encoder, both are taken into account.

Demo

llama-server -hf ggml-org/Qwen2.5-Omni-3B-GGUF
...
srv    load_model: [mtmd] estimated memory usage of mmproj is 2447.80 MiB
common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
llama_params_fit_impl: projected to use 3287 MiB of device memory vs. 110100 MiB of free device memory
llama_params_fit_impl: will leave 106812 >= 3471 MiB of free device memory, no changes needed
llama_params_fit: successfully fit params to free device memory
llama_params_fit: fitting params to free memory took 0.21 seconds
llama_model_load_from_file_impl: using device MTL0 (Apple M5 Max) (unknown id) - 110100 MiB free
...


@ngxson ngxson requested a review from JohannesGaessler April 5, 2026 21:25
@ngxson ngxson requested review from a team as code owners April 5, 2026 21:25
@erazortt

erazortt commented Apr 5, 2026

What if I want to purposely have the mmproj overflow into RAM when using the unified flag?

PS: and yes, this is useful! In this case mmproj will mostly sit in RAM, except when it is actually needed. Yes, swapping between RAM and VRAM takes some time, but it is faster than keeping it only in RAM. This can be useful if vision is used only irregularly.

@ngxson
Contributor Author

ngxson commented Apr 5, 2026

What if I want to purposely have the mmproj overflow into RAM when using the unified flag?

I don't get what you mean. Can you provide your CLI arguments?

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Apr 5, 2026

What if I want to purposely have the mmproj overflow into RAM when using the unified flag?

Could you just adjust the fit target to something lower than the default 1024 MiB?

Depending on the size of the mmproj, --fitt 512 or even --fitt 0 comes close, though not exact.
(Qwen3 35B at 82K without an mmproj vs 78K at fitt 0, 105 tk/s)

@ngxson, some mmproj models seem to perform really well even when letting fit overshoot the VRAM a bit due to the mmproj. I think that's what @erazortt is referring to, since I still get 105 tk/s @ fitt 0 or on master.

@strawberrymelonpanda
Contributor

strawberrymelonpanda commented Apr 5, 2026

Should --no-mmproj-offload remove it from the memory considerations?

With 24 GB VRAM (3090) and Q4 Qwen 35B, I was getting around 75K context on master; with the new mmproj considerations I get around 30K. I thought I'd throw in "no offload" since I don't use it often (and was still getting 90 tk/s even with overflow on master), but it seems the calculations are unchanged.

Perhaps it works differently than I imagined?

With mmproj and --no-mmproj-offload (29K context)

srv          load:   --no-mmproj-offload
...
[60955] srv    load_model: [mtmd] estimated memory usage of mmproj is 1109.08 MiB
[60955] common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on
[60955] llama_params_fit_impl: projected to use 25934 MiB of device memory vs. 23239 MiB of free device memory
[60955] llama_params_fit_impl: cannot meet free memory target of 2133 MiB, need to reduce device memory by 4828 MiB
[60955] llama_params_fit_impl: context size reduced from 262144 to 28928 -> need 4832 MiB less memory in total
[60955] llama_params_fit_impl: entire model can be fit by reducing context
[60955] llama_params_fit: successfully fit params to free device memory

Without a mmproj file: (82K context)

[60745] llama_params_fit_impl: projected to use 25934 MiB of device memory vs. 23239 MiB of free device memory
[60745] llama_params_fit_impl: cannot meet free memory target of 1024 MiB, need to reduce device memory by 3719 MiB
[60745] llama_params_fit_impl: context size reduced from 262144 to 82432 -> need 3723 MiB less memory in total
[60745] llama_params_fit_impl: entire model can be fit by reducing context
[60745] llama_params_fit: successfully fit params to free device memory

@erazortt

erazortt commented Apr 6, 2026

What if I want to purposely have the mmproj overflow into RAM when using the unified flag?

I don't get what you mean. Can you provide your CLI arguments?

What I mean is that I currently like to set the fit target lower than the mmproj size. This has the effect of forcing the GPU to use unified memory and move the unused parts to RAM. This means mmproj gets pushed to RAM until it is used. While mmproj is in use (during image encoding) it is pushed back to VRAM, and when the LLM takes over it is pushed back to RAM. The question was whether that would still be possible once this is merged.

@ngxson
Contributor Author

ngxson commented Apr 6, 2026

@erazortt I still don't get what you want, but I will add a new flag --no-fit-mmproj that skips this PR's logic altogether.

@strawberrymelonpanda it seems --no-mmproj-offload is not taken into account by this PR. I'll look into implementing it.

Contributor

@JohannesGaessler JohannesGaessler left a comment


In principle, if you could just load the mtmd stuff before the main model you would need to make no further changes since llama_params_fit will recognize that the memory usage has increased. However, I assume this won't just work due to the vocab issues.

Comment thread on tools/mtmd/clip.cpp:
static support_info_graph alloc_compute_meta(clip_ctx & ctx_clip, const clip_image_f32_batch & batch) {
ctx_clip.buf_compute_meta.resize(ctx_clip.max_nodes * ggml_tensor_overhead() + ggml_graph_overhead());

// TODO @ngxson : prevent alloc if no_alloc is set
Contributor


I assume the intent is to do this in a future PR? And that this isn't a forgotten TODO?

Contributor Author


Not a forgotten TODO, but I'm attempting to implement this (no idea if I'm getting it right). I'll push a new commit so you can have a look.

Contributor Author


Never mind, we do not allocate anything here; I was confused between ggml_backend_sched_reserve and ggml_backend_sched_alloc_graph.

SRV_ERR("%s", "[mtmd] failed to get memory usage of mmproj\n");
}
GGML_ASSERT(!params_base.fit_params_target.empty());
params_base.fit_params_target[0] += mmproj_mem;
Contributor


I assume you know this better than me, but does the mtmd library actually always use the first device?

Contributor Author


It either uses the CPU or the first non-CPU device. I'm extending this logic to support --no-mmproj-offload.

@erazortt

erazortt commented Apr 6, 2026

Could you just adjust the fit target to lower than the default 1024MB?

What I am currently doing, e.g. for Qwen 27B:
.\llama-cpp\llama-server.exe -m models/Qwen3.5-27B-Q6_K_L.gguf --mmproj models/Qwen3.5-27B-mmproj-bf16.gguf --temp 0.7 --top-k 20 --top-p 0.8 --min-p 0 --presence-penalty 1.5 --repeat-penalty 1.0 -ngl 99 --fit-target 448M -ctk q8_0 -ctv q8_0 --jinja --reasoning off --flash-attn on --no-mmap --host 0.0.0.0 --port 10000

I am setting fit-target to 448M. If this gets merged, then I would probably be able to go to 0M.

@erazortt still don't get what you want. But I will add a new flag --no-fit-mmproj that skip this PR's logic altogether

I don't think an additional parameter is necessary. If you want to take my remark into consideration, you could perhaps allow negative fit-target settings. If you do not want to take this into account (which would be understandable, since this is a fringe use case), or the implementation is too complicated, I don't think it would be an issue for most users: whoever still wants to do this can disable fitting.

@erazortt

I have done more tests, and it appears that going too far over the limit by making the fit size too small makes things very slow for big contexts. So please disregard what I suggested and merge this as it is.

