feat: Use xllamacpp to allow batching tasks and return reasoning content #258
Signed-off-by: Marcel Klehr <mklehr@gmx.net>
b7253d7 to 975a562
I checked out the branch, loaded the venv and ran poetry install. When processing a task, the app throws this:
RuntimeError: llama-server exited with code 1 before becoming ready. Last output:
Traceback (most recent call last):
File "<string>", line 2, in <module>
import xllamacpp as xlc
File "/home/julien/vcs/git/llm2/.venv/lib/python3.14/site-packages/xllamacpp/__init__.py", line 15, in <module>
from .xllamacpp import *
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory
Is CUDA a strong requirement of xllamacpp?
I'm using COMPUTE_DEVICE=CPU btw.
Also, it would be nice to return the reasoning in the non-chat providers as well. Wdyt?
It's not, but I haven't tested without CUDA yet. Good point!
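The ImportError above names a specific shared library, so one quick way to check a CPU-only machine is to ask the dynamic loader for it directly before importing xllamacpp. The following is a minimal sketch using only the standard library; the helper name `shared_library_loadable` is hypothetical, and whether llm2 should run such a check at all is an assumption, not something decided in this thread.

```python
import ctypes


def shared_library_loadable(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library cannot be found or opened.
        return False
    return True


# Library name copied from the ImportError in the report above.
CUDA_RUNTIME = "libcudart.so.12"

if __name__ == "__main__":
    if shared_library_loadable(CUDA_RUNTIME):
        print(f"{CUDA_RUNTIME} found")
    else:
        print(f"{CUDA_RUNTIME} missing: a CUDA-linked xllamacpp build will fail to import")
```

If the check fails with COMPUTE_DEVICE=CPU, the installed xllamacpp wheel is probably linked against CUDA, which would match the traceback.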
e385a3e to c7c3043
- With Olmo, on CPU, I get this error with all task types:
0.02.736.601 E srv init: init: chat template parsing error: Unable to generate parser for this template. Automatic parser generation failed:
------------
While executing FilterExpression at line 6, column 86 in source:
... none -%}{{- '<functions>' -}}{{- tools | tojson -}}{{- '</functions>' -}}{%- el...
^
Error: Unknown (built-in) filter 'tojson' for type Undefined (hint: 'tools')
0.02.736.603 E srv init: init: please consider disabling jinja via --no-jinja, or use a custom chat template via --chat-template
0.02.736.604 E srv init: init: for example: --no-jinja --chat-template chatml
0.02.736.617 I srv operator(): operator(): cleaning up before exit...
0.02.737.377 E srv init: exiting due to model loading error
Traceback (most recent call last):
File "<string>", line 15, in <module>
server = xlc.Server(p) # starts the C++ server in a background thread
File "src/xllamacpp/xllamacpp.pyx", line 3070, in xllamacpp.xllamacpp.Server.__cinit__
RuntimeError: Failed to init server, please check the input params.
- Because we now stream the reasoning content, the message generation placeholder disappears while there is still no content to display. This will be fixed by adding the reasoning support in the assistant UI but right now it feels weird.
Other than that: works well!
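To illustrate the placeholder issue from the second point: a consumer of the stream could keep the placeholder visible until the first non-empty content chunk arrives, instead of hiding it on the first reasoning chunk. This is a rough sketch, assuming each streamed delta is a dict with optional `reasoning_content` and `content` keys; those key names and the `consume_stream` helper are assumptions for illustration, not the actual llm2 or assistant UI code.

```python
from typing import Iterable, List, Tuple


def consume_stream(deltas: Iterable[dict]) -> Tuple[str, str, List[bool]]:
    """Accumulate reasoning and content separately.

    Also records, after each delta, whether a "generating..." placeholder
    should still be shown (True while no visible content has arrived).
    """
    reasoning: List[str] = []
    content: List[str] = []
    placeholder_states: List[bool] = []
    for delta in deltas:
        # Hypothetical field names; a missing or None field counts as empty.
        reasoning.append(delta.get("reasoning_content") or "")
        content.append(delta.get("content") or "")
        placeholder_states.append(not any(content))
    return "".join(reasoning), "".join(content), placeholder_states


if __name__ == "__main__":
    stream = [
        {"reasoning_content": "think "},
        {"reasoning_content": "more"},
        {"content": "Hi"},
        {"content": "!"},
    ]
    print(consume_stream(stream))
```

With this approach the placeholder stays up during the reasoning phase and disappears only once there is something to display.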
Mh, good catch! That seems like a model file incompatibility issue :/
Should also fix the prompt template issue with the old Olmo version.
No more […] But with Olmo-Think, the reasoning content is reported as content. I tried the same task type with the same prompt with Qwen, and the reasoning was reported correctly. Also, I tried canceling a task, and it seems llm2 is not stopping the process after reporting some intermediate output (the response from the […]
Damn, confirmed. Mmmh, so either we can't use Olmo at all or only with Reasoning spilling out. :/
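One possible stopgap, sketched here under a loud assumption: if the model wraps its reasoning in `<think>…</think>` tags inside the content (this thread does not confirm that Olmo-Think does), the reasoning could be separated after generation. The `split_reasoning` helper is hypothetical and not part of this PR.

```python
import re
from typing import Tuple

# Assumed tag format; adjust if the model uses different markers.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (reasoning, content) by pulling out <think>...</think> spans."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    content = THINK_RE.sub("", text).strip()
    return reasoning, content


if __name__ == "__main__":
    print(split_reasoning("<think>plan the answer</think>Here is the answer."))
```

This only handles complete tag pairs; an unclosed tag in a partially streamed response would be left in the content.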
Ah, yes, that was not implemented. I can create a new PR once this is through.
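For the follow-up PR, cancellation could be sketched as a flag that is checked between streamed chunks. The example below is generic and uses only the standard library; `stream_until_cancelled` is a hypothetical name, and how llm2 would actually learn that a task was cancelled is not covered in this thread.

```python
import threading
from typing import Iterable, Iterator


def stream_until_cancelled(chunks: Iterable[str], cancel: threading.Event) -> Iterator[str]:
    """Yield chunks until the cancel event is set, then stop consuming."""
    for chunk in chunks:
        if cancel.is_set():
            # Stop before forwarding further output for a cancelled task.
            break
        yield chunk


if __name__ == "__main__":
    cancel = threading.Event()
    received = []
    for piece in stream_until_cancelled(["a", "b", "c"], cancel):
        received.append(piece)
        if piece == "b":
            cancel.set()  # simulate the user cancelling mid-generation
    print(received)
```

Stopping the consumer loop does not by itself stop the generation on the server side; that part would still need to be handled separately.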
Olmo 3 Instruct should work now.