mtmd: fit_params now take into account mmproj #21489
ngxson wants to merge 1 commit into ggml-org:master from
Conversation
What if I want to purposely have the mmproj overflow into RAM when using the unified flag? PS: and yes, this is useful! In this case mmproj will be mostly in RAM, apart from when it is actually needed. Yes, it takes some time for the swapping from RAM to VRAM and back to take place, but it is faster than having it only in RAM. This can be useful if vision is used only irregularly.
I don't get what you mean. Can you provide your CLI arguments?
> Could you just adjust the fit target to lower than the default 1024MB? Depending on the size of the mmproj

@ngxson, some mmproj models seem to perform really well even when letting fit overshoot the VRAM a bit due to the mmproj. I think that's what @erazortt is referring to, since I still get 105 tk/s at fit 0 on master.
Should --no-mmproj-offload remove it from the memory considerations? With 24GB VRAM (3090) and Q4 Qwen 35B, I was getting around 75K context on master; with the new mmproj considerations I get around 30K. I thought I'd throw in "no offload" since I don't use it often (and was still getting 90 tk/s even with overflow on master), but the calculations seem unchanged. Perhaps it works differently than I imagined?

With mmproj, and without a mmproj file (82K context):
What I mean is that I currently like to set the fit target lower than the mmproj size. This forces the GPU to use unified memory and move the unused parts to RAM, which means the mmproj gets pushed to RAM until it is used. While the mmproj is being used (during image encoding) it is pulled back into VRAM, then pushed back to RAM when the LLM takes over. My question was whether that will still be possible once this is merged.
@erazortt I still don't get what you want, but I will add a new flag. @strawberrymelonpanda seems like
JohannesGaessler
left a comment
In principle, if you could just load the mtmd stuff before the main model you would need to make no further changes since llama_params_fit will recognize that the memory usage has increased. However, I assume this won't just work due to the vocab issues.
```cpp
static support_info_graph alloc_compute_meta(clip_ctx & ctx_clip, const clip_image_f32_batch & batch) {
    ctx_clip.buf_compute_meta.resize(ctx_clip.max_nodes * ggml_tensor_overhead() + ggml_graph_overhead());

    // TODO @ngxson : prevent alloc if no_alloc is set
```
I assume the intent is to do this in a future PR? And that this isn't a forgotten TODO?
Not a forgotten TODO, but I'm attempting to implement this (no idea if I'm getting it right). I'll push a new commit so you can have a look.
Nevermind, we do not allocate anything here; I was confused between ggml_backend_sched_reserve and ggml_backend_sched_alloc_graph.
```cpp
        SRV_ERR("%s", "[mtmd] failed to get memory usage of mmproj\n");
    }
    GGML_ASSERT(!params_base.fit_params_target.empty());
    params_base.fit_params_target[0] += mmproj_mem;
```
I assume you know this better than me but does the mtmd library actually always use the first device?
It either uses the CPU or the first non-CPU device; I'm extending this logic to support --no-mmproj-offload
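To make the device question above concrete, here is a minimal, self-contained sketch of the accounting logic being discussed: the projector lives either in host RAM (`--no-mmproj-offload`) or on the first non-CPU device, so only that device's fit target needs to grow. The `fit_targets` struct and the function name are illustrative placeholders, not the actual llama.cpp types.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sketch: which fit target should absorb the mmproj memory.
// mtmd loads the projector either on the CPU (--no-mmproj-offload) or on
// the first non-CPU device, so only that device's target is adjusted.
struct fit_targets {
    std::vector<size_t> per_device; // bytes to keep reserved, one entry per device
};

void account_for_mmproj(fit_targets & targets, size_t mmproj_mem, bool offload) {
    if (!offload) {
        return; // projector stays in host RAM; device targets are unchanged
    }
    assert(!targets.per_device.empty());
    targets.per_device[0] += mmproj_mem; // first (non-CPU) device hosts the projector
}
```

This mirrors the `fit_params_target[0] += mmproj_mem` line in the diff, extended with the offload check the author mentions adding.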
What I am currently doing, e.g. for Qwen 27B: I am setting fit-target to 448M. If this gets merged, then I would probably be able to go to 0M.
I don't think an additional parameter is necessary. If you want to take my remark into consideration, you could perhaps allow negative fit-target settings. If you do not want to take this into account (which would be understandable, since this is a fringe use case), or the implementation is too complicated, I do not think it would be an issue for most users. Whoever still wants to do this can disable fitting.
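The negative-target idea above can be sketched in a few lines. This is purely hypothetical (it is the commenter's suggestion, not part of this PR), and the function name and semantics are illustrative: since the PR adds the mmproj size on top of the fit target, a negative target would cancel part of that addition and restore the deliberate VRAM overflow.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical: a signed fit target lets the user cancel part of the
// mmproj adjustment, e.g. target = -mmproj_bytes reproduces the old
// overflow-into-unified-memory behaviour. Illustrative only.
int64_t effective_target(int64_t fit_target_bytes, int64_t mmproj_bytes) {
    return fit_target_bytes + mmproj_bytes; // may be driven to zero by a negative target
}
```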
I have done more testing, and it appears that going too far over the limit with too small a fit-size makes things very slow for big contexts. So please disregard what I suggested and merge this as it is.
Overview
Fix #19980 ; Fix #18181
`--fit` now takes into account the memory usage of mmproj and adjusts `--fit-target` accordingly. This is effective on `llama-server` and `llama-cli`. `llama-mtmd-cli` doesn't yet support it because it's mostly used for testing, but it can be added in the future.

Additional information
A new API is added to libmtmd: `mtmd_get_memory_usage`, which returns the memory usage in bytes. The returned value is the sum of the buffers allocated in:
- `clip_model_loader::load_tensors`
- `clip_model_loader::alloc_compute_meta`

If the model has both a vision and an audio encoder, they will all be taken into account.
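A minimal sketch of what that sum amounts to, per the description above: weight buffers from `load_tensors` plus compute buffers from `alloc_compute_meta`, across every encoder present. The struct and function here are placeholders for illustration, not the real libmtmd types or the actual `mtmd_get_memory_usage` signature.

```cpp
#include <cassert>
#include <cstddef>

// Illustrative stand-in for the per-encoder buffers the PR description says
// mtmd_get_memory_usage sums. Field names are assumptions, not libmtmd code.
struct clip_encoder_mem {
    size_t tensor_buf_bytes;  // weights loaded by clip_model_loader::load_tensors
    size_t compute_buf_bytes; // graph memory reserved by alloc_compute_meta
};

size_t total_memory_usage(const clip_encoder_mem * encoders, int n) {
    size_t total = 0;
    for (int i = 0; i < n; i++) { // vision encoder and, if present, audio encoder
        total += encoders[i].tensor_buf_bytes + encoders[i].compute_buf_bytes;
    }
    return total;
}
```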
Demo
Requirements