feat: Use xllamacpp to allow batching tasks and return reasoning content#258

Merged
marcelklehr merged 20 commits into
mainfrom
feat/llama-cpp-server
Jun 25, 2026

@marcelklehr marcelklehr commented Jun 18, 2026

Member
  • Replaces llama-cpp-python with xllamacpp, a thinner wrapper
  • Makes processing async so that tasks can be processed in parallel
  • Returns reasoning content for all task types
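
As a rough illustration of the second point, the sketch below shows how async processing lets several tasks run concurrently while capping how many are in flight. It is not the code from this PR: `process_task`, `run_batch` and `max_parallel` are hypothetical names, and the inference call is replaced by a short sleep.

```python
import asyncio


async def process_task(task_id: int, delay: float) -> str:
    # Hypothetical stand-in for an async inference call; it only sleeps.
    await asyncio.sleep(delay)
    return f"task-{task_id}-done"


async def run_batch(delays: list[float], max_parallel: int = 2) -> list[str]:
    # The semaphore caps how many tasks are processed at the same time.
    semaphore = asyncio.Semaphore(max_parallel)

    async def guarded(task_id: int, delay: float) -> str:
        async with semaphore:
            return await process_task(task_id, delay)

    # gather() returns results in the order the awaitables were passed in,
    # regardless of which task finishes first.
    return await asyncio.gather(*(guarded(i, d) for i, d in enumerate(delays)))


results = asyncio.run(run_batch([0.02, 0.01, 0.03]))
```

The semaphore is one simple way to bound parallelism; a real server-side batching setup may handle this inside the inference backend instead.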

🤖 AI (if applicable)

  • The content of this PR was partly or fully generated using AI

@marcelklehr marcelklehr changed the title from "feat: Use llama-cpp-server to allow batching tasks" to "feat: Use llama-cpp-server to allow batching tasks and return reasoning content" Jun 22, 2026
@marcelklehr marcelklehr changed the title from "feat: Use llama-cpp-server to allow batching tasks and return reasoning content" to "feat: Use xllamacpp to allow batching tasks and return reasoning content" Jun 22, 2026
@marcelklehr marcelklehr marked this pull request as ready for review June 22, 2026 12:06
Signed-off-by: Marcel Klehr <mklehr@gmx.net>
@marcelklehr marcelklehr force-pushed the feat/llama-cpp-server branch from b7253d7 to 975a562 Compare June 22, 2026 12:08
Signed-off-by: Marcel Klehr <mklehr@gmx.net>

@julien-nc julien-nc left a comment

Member

I checked out the branch, loaded the venv and ran poetry install. When processing a task, the app throws this:

RuntimeError: llama-server exited with code 1 before becoming ready. Last output:
Traceback (most recent call last):
  File "<string>", line 2, in <module>
    import xllamacpp as xlc
  File "/home/julien/vcs/git/llm2/.venv/lib/python3.14/site-packages/xllamacpp/__init__.py", line 15, in <module>
    from .xllamacpp import *
ImportError: libcudart.so.12: cannot open shared object file: No such file or directory

Is CUDA a strong requirement of xllamacpp?
I'm using COMPUTE_DEVICE=CPU btw.

Also, it would be nice to return the reasoning in the non-chat providers as well. Wdyt?

@marcelklehr

Member Author

Is CUDA a strong requirement of xllamacpp?

It's not, but I haven't tested without CUDA yet. Good point!
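
For illustration, a pre-flight check along the following lines could detect the missing CUDA runtime before the server process is started. This is only a sketch under assumptions: `shared_library_available` and `pick_compute_device` are hypothetical helpers, and the check does not change which libraries the installed xllamacpp build is linked against.

```python
import ctypes


def shared_library_available(name: str) -> bool:
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(name)
    except OSError:
        # Raised when the library cannot be found or loaded.
        return False
    return True


def pick_compute_device(requested: str, cuda_runtime: str = "libcudart.so.12") -> str:
    # Hypothetical policy: fall back to CPU when CUDA is requested but the
    # CUDA runtime named in the traceback above cannot be loaded.
    if requested.upper() == "CUDA" and not shared_library_available(cuda_runtime):
        return "CPU"
    return requested.upper()
```

A check like this only turns the late ImportError into an early, explicit decision; a CPU-only setup would still need a build of the wrapper that does not require the CUDA runtime.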

Signed-off-by: Marcel Klehr <mklehr@gmx.net>
@marcelklehr marcelklehr force-pushed the feat/llama-cpp-server branch from e385a3e to c7c3043 Compare June 24, 2026 07:48
Signed-off-by: Marcel Klehr <mklehr@gmx.net>

@julien-nc julien-nc left a comment

Member

  • With Olmo, on CPU, I get this error with all task types:
0.02.736.601 E srv          init: init: chat template parsing error: Unable to generate parser for this template. Automatic parser generation failed:
------------
While executing FilterExpression at line 6, column 86 in source:
... none -%}{{- '<functions>' -}}{{- tools | tojson -}}{{- '</functions>' -}}{%- el...
                                           ^
Error: Unknown (built-in) filter 'tojson' for type Undefined (hint: 'tools')
0.02.736.603 E srv          init: init: please consider disabling jinja via --no-jinja, or use a custom chat template via --chat-template
0.02.736.604 E srv          init: init: for example: --no-jinja --chat-template chatml
0.02.736.617 I srv    operator(): operator(): cleaning up before exit...
0.02.737.377 E srv          init: exiting due to model loading error
Traceback (most recent call last):
  File "<string>", line 15, in <module>
    server = xlc.Server(p)  # starts the C++ server in a background thread
  File "src/xllamacpp/xllamacpp.pyx", line 3070, in xllamacpp.xllamacpp.Server.__cinit__
RuntimeError: Failed to init server, please check the input params.
  • Because we now stream the reasoning content, the message generation placeholder disappears while there is still no content to display. This will be fixed by adding reasoning support to the assistant UI, but right now it feels weird.

Other than that: works well!

@marcelklehr

Member Author

Error: Unknown (built-in) filter 'tojson' for type Undefined (hint: 'tools')

Mh, good catch! That seems like a model file incompatibility issue :/

should also fix the prompt template issue with the old olmo version

Signed-off-by: Marcel Klehr <mklehr@gmx.net>
@julien-nc

julien-nc commented Jun 24, 2026

Member

No more chat template parsing error with Olmo-Think!

But with Olmo-Think, the reasoning content is reported as content. I tried the same task type with the same prompt with Qwen, and the reasoning was reported correctly.

Also, I tried canceling a task, and it seems llm2 does not stop the process after reporting some intermediate output (the response from the /stream-result endpoint contains the task with its status). I don't think I tested that in the PR that added the streaming support. Can you have a look? Is it a bug, or just something missing to support cancelling?

@marcelklehr

Member Author

But with Olmo-Think, the reasoning content is reported as content.

Damn, confirmed. Mmmh, so either we can't use Olmo at all, or only with the reasoning spilling out into the content. :/
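
To make the problem concrete: when reasoning arrives inside the regular content, it would have to be separated afterwards. The sketch below assumes the model wraps its reasoning in `<think>...</think>` tags, which is an assumption and not something confirmed in this thread for Olmo-Think; `split_reasoning` is a hypothetical helper, not part of llm2 or xllamacpp.

```python
import re

# Assumption: reasoning is delimited by <think>...</think> tags.
THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple[str, str]:
    """Split raw model output into (reasoning, content)."""
    # Collect the text of every tagged block as reasoning.
    reasoning = "\n".join(part.strip() for part in THINK_BLOCK.findall(text))
    # Whatever remains after removing the tagged blocks is the content.
    content = THINK_BLOCK.sub("", text).strip()
    return reasoning, content
```

For example, `split_reasoning("<think>2 + 2</think>The answer is 4.")` separates the tagged part from the answer, while text without tags is returned unchanged as content with empty reasoning.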

canceling a task and it seems llm2 is not stopping the process after reporting some intermediate output

Ah, yes, that was not implemented. I can create a new PR once this is through.
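
One possible shape for that follow-up, sketched under assumptions: since the response to each intermediate report contains the task's status, the streaming loop could stop as soon as that status indicates cancellation. `stream_with_cancel` and `report` are hypothetical names, and the status value `"cancelled"` is assumed for illustration.

```python
from typing import Callable, Iterable


def stream_with_cancel(chunks: Iterable[str], report: Callable[[str], str]) -> str:
    """Report each partial output and stop once the task is reported cancelled.

    `report` is a hypothetical stand-in for the call that sends intermediate
    output and returns the task status from the response.
    """
    output = ""
    for chunk in chunks:
        output += chunk
        status = report(output)
        if status == "cancelled":  # assumed status value
            break
    return output
```

In a real implementation, breaking out of the loop would also need to stop the underlying generation so that the model does not keep running in the background.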

Signed-off-by: Marcel Klehr <mklehr@gmx.net>
@marcelklehr

Member Author

Olmo 3 Instruct should work now

Signed-off-by: Marcel Klehr <mklehr@gmx.net>
@julien-nc julien-nc self-requested a review June 25, 2026 12:19
@marcelklehr marcelklehr merged commit 1a391f3 into main Jun 25, 2026
8 checks passed
@marcelklehr marcelklehr deleted the feat/llama-cpp-server branch June 25, 2026 12:19
2 participants