Replies: 4 comments 5 replies
I'm the sole user of these models (at most one embedding model and one LLM loaded at the same time), usually just chatting back and forth with basic tool calls via MCP and the llama-server web UI, or Pi for the CLI. Request: MoE models that are larger than VRAM (let's focus on
I know file sizes and active parameters all impact the numbers, and you can't expect apples-to-apples comparisons between the gpt-oss model architecture and others; I just want to make sure I'm not missing any dials to tweak. Thanks everyone for this amazing project!

Additional Questions
System Setup: RTX 3090 (24 GB VRAM) | 128 GB DDR5 6400 MT/s RAM | Intel Core Ultra 265K

I use

Portions of my

For embedding models I add this at each model level

Llama-Bench Command
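On the MoE-larger-than-VRAM question: the usual dial in llama.cpp is to offload all layers to the GPU and then override the MoE expert tensors back to system RAM, so only the dense/attention weights occupy VRAM. A minimal sketch — the model name, context size, and layer count below are placeholders, not taken from this post:

```shell
# Hypothetical example. -ngl 99 offloads every layer to the GPU first; the -ot
# regex then overrides the MoE expert tensors (ffn_*_exps) back to the CPU
# buffer, i.e. system RAM. Large -ub helps prompt processing on the CPU experts.
llama-server -m Qwen3-30B-A3B-Q4_K_M.gguf -ngl 99 -ot ".ffn_.*_exps.=CPU" -c 32768 -ub 2048

# Recent builds expose the same idea as a dedicated flag:
#   --n-cpu-moe 24   # keep the expert tensors of the first 24 layers on the CPU
```

Tuning how many expert layers stay on the CPU versus the GPU is the main knob: push as many as fit after the KV cache is accounted for.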
Great thread! I'm running Qwen 35B A3B on my RTX 2060 laptop (32 GB RAM, 6 GB VRAM, i7-9750H, Windows 11), and this is the first model I've been able to run with ridiculous amounts of context at great speed.
This is the best configuration I have come up with. With it, the experts run on the CPU while the other layers run on the GPU. ubatch 2048 gives a huge speedup to prompt processing, which is sorely needed. This way I'm getting around 350-400 tokens/s prefill at 102K context and text generation of around 15 tokens/s, so I'm very pleased with how well it runs.

Pure text generation is great. However, as you may have noticed, I have offloaded the mmproj entirely to the CPU. Why? Because it needs around 600 MB of VRAM, and that would greatly reduce the effective context I'm able to run. On the CPU it can be very slow: it takes up to 300 seconds on decently sized images like browser snapshots. I wonder if there is any way to either make the mmproj more efficient on the CPU, or to swap layers from GPU to RAM (automatically or via a command) to free up VRAM for the vision encoder just before vision processing, and then load them back onto the GPU once the vision processing has completed.
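For anyone wanting to try the same split, here is a sketch of the kind of configuration described above (the commenter's actual command was collapsed in the page; the file names and exact values here are my assumptions):

```shell
# Assumed reconstruction, not the commenter's literal command.
# -ngl 99 + the -ot override keeps experts in system RAM while dense layers
# stay on the 6 GB GPU; --no-mmproj-offload runs the vision projector on the
# CPU so its ~600 MB does not eat into VRAM available for KV cache.
llama-server -m model.gguf \
  --mmproj mmproj.gguf --no-mmproj-offload \
  -ngl 99 -ot ".ffn_.*_exps.=CPU" \
  -c 102400 -ub 2048
```

There is currently no built-in way to temporarily evict GPU layers for the vision-encoding phase and restore them afterwards, which is essentially what the question above is asking for.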
I might be using one of the more exotic setups :) but speeds aren't really that good despite the connectivity.
I typically execute like this:
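The command itself was collapsed in the page above. If the "exotic" part involves spreading a model across several machines, llama.cpp's RPC backend is typically driven like this — hosts, ports, and the model path below are placeholders of mine, not the commenter's actual setup:

```shell
# On each worker machine (llama.cpp built with GGML_RPC=ON), expose a backend:
rpc-server -H 0.0.0.0 -p 50052

# On the head node, split the model across the remote backends:
llama-server -m model.gguf -ngl 99 --rpc 192.168.1.10:50052,192.168.1.11:50052
```

With RPC, per-token latency is usually dominated by the network round-trips rather than raw link bandwidth, which may explain speeds staying low despite good connectivity.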
I have an old crypto miner; it's basically e-waste, but it does run smaller models okay-ish. I'm building a garbage multi-agent chat interface, and it sort of works as long as you only trigger 1-2 models at the same time.

HW: I currently run it like this: it can idle multiple models at the same time and run active inference on two models without too much slowdown, but as soon as you hit the third model everything slows down significantly. It's running different 4B models in Q4, or whatever fits in a single GPU's memory. I was wondering if there is a way to reduce CPU load to be able to run multiple models concurrently.
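One common way to stop concurrent instances from fighting over cores is to cap and pin each server's threads. A sketch (assuming Linux, a 12-core host, and one GPU per model; model names, core ranges, and ports are placeholders):

```shell
# Hypothetical example: one llama-server per model/GPU, each limited to 4
# threads and pinned to its own cores with taskset. With -ngl 99 the actual
# generation runs mostly on the GPU, so a small -t is usually enough.
taskset -c 0-3  llama-server -m agent-a-4b-q4.gguf -ngl 99 -t 4 --port 8081 &
taskset -c 4-7  llama-server -m agent-b-4b-q4.gguf -ngl 99 -t 4 --port 8082 &
taskset -c 8-11 llama-server -m agent-c-4b-q4.gguf -ngl 99 -t 4 --port 8083 &
```

Without pinning, three fully-offloaded instances can still saturate the CPU through sampling, tokenization, and busy-wait polling, which matches the "third model" cliff described above.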
Overview
Are you using llama.cpp and wondering if you are getting the most out of your hardware? Post your parameters below and get some help from the community to improve the performance. Sometimes, adjusting a few parameters can make a big difference in terms of speed and/or quality.
Information needed:
- The llama-server command that you are currently using
- The performance you are getting (typically from llama-bench, but could be something else depending on the use case)
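For the performance numbers, a typical llama-bench invocation looks like this (the model path is a placeholder):

```shell
# Measures prompt-processing (pp512) and token-generation (tg128) throughput
# with all layers offloaded to the GPU; drop -ngl for a CPU-only baseline.
llama-bench -m model.gguf -ngl 99 -p 512 -n 128
```

Posting the resulting table alongside your llama-server command makes it much easier for others to spot which parameter to tweak.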