
Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results#335

Closed
MegaStood wants to merge 66 commits into intel:main from MegaStood:main

Conversation

@MegaStood

Summary

Adds full Lunar Lake (Intel Core Ultra 7 258V / Arc 140V iGPU / 32GB LPDDR5x) support for vLLM SYCL, tested on MSI Claw 8 AI+ running Nobara 43.

Changes

  • install_lunar_lake.sh: Fix Python 3.14 compat, PyTorch XPU pinning, MKL library loading, Level-Zero ordering, xpu_worker.py CCL patch for single-GPU, MAX_JOBS=6 to prevent OOM during kernel compilation
  • lunar_lake_serve.sh: Add CCL single-GPU workaround (WiFi-only devices have no wired Ethernet), MKL preload fix, memory detection
  • Dockerfile.lunar-lake: Add CCL env vars and xpu_worker patch
  • LUNAR_LAKE_COMPATIBILITY.md: Document Xe2-specific blockers (Triton XPU broken, Marlin CUDA-only), model compatibility matrix, and benchmark results

Benchmark Results (Qwen3-8B INT4 AutoRound on Arc 140V)

  • Single-request: 17.6 tok/s generation
  • Batched peak: 90 tok/s (5 concurrent short requests)
  • Long context (4K in + 2K out, 10 reqs): 50.2 tok/s output, 150.5 tok/s total throughput
  • Model loads in 7s using only 5.69 GiB

Key Findings

  • Qwen3 (standard attention) works on Xe2; Qwen3.5 (fla/linear attention) does not (Triton XPU broken)
  • AWQ/GPTQ compressed-tensors fail due to CUDA-only Marlin kernels
  • AutoRound INT4 works via IPEX quantization fallback for models ≤8B
  • Online quantization is the most reliable path on shared-memory systems

Test plan

  • Verified install_lunar_lake.sh completes on Nobara 43
  • Verified vLLM serves Qwen3-8B INT4 AutoRound on XPU
  • Benchmarked with vllm bench serve (3 workloads)
  • Confirmed CCL single-GPU workaround on WiFi-only device
  • Pending: test on other Lunar Lake devices (different OEMs)
  • Pending: test the Docker path on Lunar Lake
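
The long-context workload above could be driven with something like the following; the flag names follow vLLM's `vllm bench serve` CLI, but the exact invocation used for the PR is not shown here, so treat this as a sketch.

```shell
# Sketch of the long-context benchmark: 4K input / 2K output, 10 requests.
# Assumes the server is already running on port 8000 with the model loaded.
vllm bench serve \
  --base-url http://127.0.0.1:8000 \
  --model Intel/Qwen3-8B-int4-AutoRound \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 2048 \
  --num-prompts 10
```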

claude and others added 30 commits March 20, 2026 09:57
Documents incompatibility — this project targets discrete Arc Pro B60
GPUs with SYCL/oneAPI, not integrated Xe2 iGPUs on Lunar Lake.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
SYCL ESIMD kernels are fully portable to Xe2 (no hardcoded device IDs
or BMG-specific constants). This adds the infrastructure adaptations:

- Dockerfile.lunar-lake: lightweight single-GPU image for iGPU
- lunar_lake_serve.sh: memory-aware vLLM launch with shared memory config
- lunar_lake_evaluation.sh: iGPU platform evaluation (skip P2P/CCL)
- platform_basic_evaluation.sh: detect Lunar Lake iGPU alongside B60
- LUNAR_LAKE_COMPATIBILITY.md: full docs with quick start and model recs

Key settings for Lunar Lake: TP=1, CCL_TOPO_P2P_ACCESS=0,
gpu-memory-utilization=0.7, enforce-eager, INT4 quantization recommended.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
One-command installer that handles:
- oneAPI Base Toolkit via DNF
- Level-Zero for Xe2 iGPU
- PyTorch XPU in a venv
- Patched vLLM build from source
- vllm-xpu-kernels + triton-xpu
- Bash aliases for quick launch (vllm-serve)

No Docker required. Targets Nobara 43 / Fedora 42+.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Level-Zero packages aren't in Nobara/Fedora default repos. Now:
- Adds Intel oneAPI + compute-runtime repos first
- Tries multiple Level-Zero package names (varies across distros)
- Uses --skip-unavailable to handle missing packages gracefully
- Falls back to checking if libze_loader exists from xe driver

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Two issues on Nobara 43:
1. oneAPI setvars.sh hangs when MPI probes network interfaces over SSH.
   Fix: set ONEAPI_SETVARS_MPI_INSTALL=0 before sourcing.
2. Nobara 43 ships Python 3.14 but PyTorch XPU only has wheels for <=3.12.
   Fix: detect version, auto-install python3.12 via dnf if needed.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
setvars.sh has unbound variables and non-zero exits internally that
trigger our set -euo pipefail, silently killing the script at Phase 2.
Fix: temporarily set +euo pipefail around the source call, pipe output
through grep to show only component init lines, then restore strict mode.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The pipe operator causes `source` to run in a subshell, so all oneAPI
environment variables were lost in the parent shell. Redirect to a temp
file instead and grep it separately.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
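The subshell pitfall behind these two fixes is easy to reproduce; a minimal bash demonstration (file names are placeholders, not the real setvars.sh):

```shell
#!/usr/bin/env bash
# A pipe runs `source` in a subshell, so exported variables never
# reach the parent shell; redirecting to a file keeps them.
echo 'export DEMO_ONEAPI_VAR=ready' > /tmp/demo_env.sh

source /tmp/demo_env.sh | cat                    # subshell: export lost
echo "piped:      ${DEMO_ONEAPI_VAR:-unset}"     # -> piped:      unset

source /tmp/demo_env.sh > /tmp/setvars.log 2>&1  # parent shell keeps it
echo "redirected: ${DEMO_ONEAPI_VAR:-unset}"     # -> redirected: ready
```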
The +xpu version suffix doesn't exist on PyPI. XPU wheels are served
from a separate index URL. Also remove tail pipe so download progress
is visible.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Use exact version pins from official Intel XPU install docs to avoid
pulling incompatible versions.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
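A minimal sketch of the pinned install, assuming PyTorch's public XPU wheel index; check Intel's current install docs for the exact version pins (none are shown here, and the venv path is illustrative):

```shell
# Create a Python 3.12 venv and install XPU wheels from the dedicated index
# (the +xpu suffix is not on PyPI; the index URL selects the XPU builds).
python3.12 -m venv ~/vllm-xpu
source ~/vllm-xpu/bin/activate
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/xpu
```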
PyTorch XPU needs libze_intel_gpu.so (the GPU userspace driver) to
talk to the xe kernel driver via Level-Zero. This is provided by the
intel-compute-runtime package, which was missing from the install list.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
All the | tail -N pipes were hiding pip install and build output,
making the script appear frozen during long operations.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Running from ~/llm-scaler which has a vllm/ subdirectory causes
'import vllm' to succeed even when vLLM is not installed, skipping
the entire build phase. Use 'pip show vllm' instead.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
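The false-positive import check can be demonstrated with a placeholder package name (fakepkg_xyz is made up for the demo):

```shell
# Why `python -c "import vllm"` is a bad install check: a same-named
# directory in the current working directory satisfies the import even
# when nothing is installed, but `pip show` is not fooled.
mkdir -p /tmp/shadow_demo/fakepkg_xyz
touch /tmp/shadow_demo/fakepkg_xyz/__init__.py
cd /tmp/shadow_demo

python3 -c "import fakepkg_xyz" \
  && echo "import check: false positive"

python3 -m pip show fakepkg_xyz >/dev/null 2>&1 \
  || echo "pip show: correctly reports not installed"
```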
Show full build output so errors are visible. Also clean up stale
vllm-xpu-kernels directory if previous build failed.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Preload oneAPI MKL libraries (LD_PRELOAD) in both serve and install scripts
to fix "Cannot load libmkl_core.so.2" caused by PyTorch's broken relative
RPATH in venvs. Set MAX_JOBS=2 for xpu-kernels build to prevent OOM kills
on 32GB shared-memory systems.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
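A sketch of the preload workaround, assuming a default oneAPI layout under /opt/intel/oneapi; the exact library set and paths may differ by MKL version:

```shell
# Preload MKL so PyTorch finds libmkl_core.so.2 despite its broken
# relative RPATH inside the venv. Path and library list are assumptions.
MKL_LIB=/opt/intel/oneapi/mkl/latest/lib
export LD_PRELOAD="$MKL_LIB/libmkl_core.so.2:$MKL_LIB/libmkl_intel_lp64.so.2:$MKL_LIB/libmkl_sequential.so.2${LD_PRELOAD:+:$LD_PRELOAD}"
export MAX_JOBS="${MAX_JOBS:-2}"   # cap parallel kernel compiles to avoid OOM
```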
- Change MAX_JOBS from 2 to 6 (75% CPU, ~24GB peak RAM, safe for 32GB)
- Make MAX_JOBS overridable via environment variable
- Add log message warning that 933 SYCL files take 1.5-2 hours on Lunar Lake
- Add comment reminding users to plug in before building

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…unar Lake

Real-world testing on MSI Claw 8 AI+ (Arc 140V, 32GB shared) revealed:
- oneCCL KVS init fails on WiFi-only devices ("can't find non-loopback interface")
  Added CCL_ZE_ENABLE=0, CCL_ATL_TRANSPORT=ofi, FI_PROVIDER=tcp workaround
- --device xpu is not a valid vLLM CLI flag; device must be set via
  VLLM_TARGET_DEVICE=xpu environment variable
- xpu_worker.py all_reduce warmup must be patched out for single-GPU (TP=1)
- Pre-quantized AutoRound/GPTQ 35B models OOM during weight loading on 32GB
  shared memory (peak memory doubles due to INT4→FP16 unpacking)
- vllm-xpu-kernels build takes 1.5-2 hours on Lunar Lake (933 SYCL files)
- GPU crash (DEVICE_LOST) after OOM requires full system reboot

Updated: install_lunar_lake.sh, lunar_lake_serve.sh, Dockerfile.lunar-lake,
LUNAR_LAKE_COMPATIBILITY.md

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
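Collected as environment settings, the workaround from this commit looks roughly like this (values are taken from the message above; the comments are interpretation):

```shell
# oneCCL single-GPU workaround for WiFi-only hosts
export CCL_ZE_ENABLE=0          # disable the Level-Zero CCL path
export CCL_ATL_TRANSPORT=ofi    # use the OFI transport...
export FI_PROVIDER=tcp          # ...over plain TCP, avoiding the interface probe
# Device selection: --device xpu is not a valid vLLM CLI flag
export VLLM_TARGET_DEVICE=xpu
```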
…g results

Add detailed model compatibility matrix based on real testing on MSI Claw 8
(Core Ultra 7 258V, Arc 140V). Key findings:
- Triton XPU backend broken on Xe2 (blocks Qwen3.5 fla/linear attention)
- Marlin kernels CUDA-only (blocks AWQ/GPTQ compressed-tensors)
- Pre-quantized INT4 doubles peak memory (blocks >14B models on 32GB)
- Only FP16 base + online quantization works reliably

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…timeouts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tested Intel/Qwen3-8B-int4-AutoRound on MSI Claw 8 (32GB LPDDR5x):
- Single-request: 17.6 tok/s generation
- Batched peak: 90 tok/s (5 concurrent short requests)
- Long context (4K+2K): 50.2 tok/s output, 150.5 tok/s total
- Model loads in 7s using only 5.69 GiB

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ility-CB5w6

Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results
…y 5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documented working setup for Qwen3-TTS-12Hz-1.7B-Base on Arc 140V:
- Venv creation with shared XPU PyTorch from vLLM install
- transformers==4.57.3 pinning (newer versions break qwen-tts)
- Voice cloning example with generate_voice_clone() API
- Tested and confirmed working on MSI Claw 8 AI+

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Add consolidated "Running Recipes" section with LLM/ASR/TTS commands and memory budget table
- Add Qwen3-ASR-1.7B setup, serve, and test documentation
- Use 127.0.0.1 (localhost) since OpenClaw accesses services locally
- ASR uses 0.25 GPU utilization (~7.2GB) for efficient memory usage

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Qwen3-30B-A3B GPTQ INT4 routes to IPEX (bypasses CUDA Marlin) but
OOMs during MoE expert weight shuffle at 15.7 GiB — GPU enters
DEVICE_LOST state. Confirms 30B MoE models don't fit on 32GB shared memory.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- New install script install_meteor_arrow_lake.sh supporting:
  - Meteor Lake Xe-LPG (Core Ultra 155H/135H): PCI 7d55, 7dd5, 7d40, 7d45
  - Arrow Lake Xe-LPG+ (Core Ultra 255H/245H): PCI 7d51, 7dd1, 7d41, 7d67
- Auto-detects platform and GPU via PCI device IDs
- Warns Meteor Lake users to switch from i915 to xe driver
- Adjusts memory recommendations based on system RAM (16-96GB)
- Updated evaluation script detect_gpu() for all three platforms
- Added platform compatibility table to LUNAR_LAKE_COMPATIBILITY.md

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… transformers 5.x catch-22

- Add server-side engine log observations: generation throughput decay with
  context length, prefix cache hit rates, KV cache usage per concurrent request
- Add Qwen3.5-4B AutoRound failure: multimodal model hits transformers 5.x
  Qwen2VLImageProcessor.max_pixels API break
- Add LFM2-24B-A2B AWQ failure: custom Liquid AI tokenizer unsupported
- Document transformers version catch-22: 4.x can't recognize new architectures,
  5.x breaks vLLM multimodal code
- Add critical blocker: transformers 5.x vs vLLM mismatch for multimodal models

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…rc.patch

Intel's llm-scaler Docker image uses transformers from git HEAD (5.x) with
vllm_for_multi_arc.patch applied, which adds full Qwen3.5 architecture support
(min_transformers_version="5.1.0"). The max_pixels AttributeError is caused by
transformers 5.x renaming image_processor.max_pixels to size["longest_edge"].
Added one-line getattr() fix for native installs without the full patch.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Confirmed that Qwen3.5-4B AutoRound INT4 successfully loads (3.68 GiB —
well within Lunar Lake's 28.6 GB budget) after applying the max_pixels
getattr() fix, but crashes during warmup on the Triton kernel in
fla/ops/layernorm_guard.py. Even Intel's forward_xpu code path for
Qwen3.5 routes through fla/ops layernorm which requires @triton.jit.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
claude and others added 19 commits March 30, 2026 13:15
Highlights the sweet spot tradeoff: 0.35 has best batched TTFT (664ms)
but only 5% KV headroom over 32K context. 0.42 provides 57% headroom
with identical single-user performance. 0.8 has 22s batched TTFT due
to massive KV cache management overhead.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
AWQ: compressed-tensors MoE path hits Marlin (dense XPU works fine).
AutoRound: IPEX routing works correctly but OOM DEVICE_LOST during
weight init — 27B model too large even at INT4 with 32GB swap.
Updated Qwen3-Coder-30B-A3B AWQ with same root cause.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ctive)

17.66 GiB at BF16, ~5 tok/s single-user (205ms TPOT) — 4-5x slower than
Qwen3.5-4B. Not viable for interactive chat on Lunar Lake. Added benchmark
section and model table entry.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ded)

Re-ran all Qwen3-8B INT4 benchmarks to verify XPU acceleration after
triton-xpu patch. Results match previous run within noise. Added 2048/2048
context data points. Updated comparison table with verified numbers.

Key results: 18.7 tok/s (128), 13.7 tok/s (1K), 13.3 tok/s (2K) single-user.
Batched: 90/80/75 peak tok/s. 116,672 token KV cache at 0.8 util.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ster than BF16)

FP8 online quantization: 11.22 GiB (vs 17.66 GiB BF16), 117ms TPOT batched.
Still too slow for interactive chat (~8.5 tok/s vs 4B's 23 tok/s).
Documents FP8 as only working online quantization on XPU native install.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
sym_int4 now works on native XPU install by building vllm_int4_for_multi_arc.so
from intel/BigDL-core source. Documents full build process for both CPU-side
quantizer and GPU-side fused INT4 GEMM kernel, plus required IPEX patch for
PyTorch 2.10 compatibility.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Key finding: VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 is the critical enabler
for sym_int4 on Lunar Lake, NOT the BigDL-core .so files. vLLM has built-in
Python INT4 quantization. sym_int4 achieves 14.7 tok/s single-user (2.9x
faster than BF16, 1.7x faster than FP8) at only 8.11 GiB (54% smaller).

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
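Put together, the sym_int4 launch described here might look like the following; the model path and serve flags are illustrative, not the verified invocation:

```shell
# Critical enabler: quantize layer by layer with weights held CPU-side,
# avoiding the GPU OOM during weight init described above.
export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1
vllm serve Qwen/Qwen3-8B --quantization sym_int4 --enforce-eager
```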
Found the actual source code in intel/llm-scaler vllm_for_multi_arc.patch:
sym_int4.py loads vllm_int4_for_multi_arc.so via ctypes.CDLL() from
/opt/lib/ (hardcoded default). There is no Python fallback - the C lib
is required. VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 controls CPU-side
weight loading to avoid GPU OOM during quantization.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The vllm_int4_for_multi_arc.so is just a renamed libquantize.so from
BigDL-core's GGML quantizer - a 12KB pure C library. Intel only ships
it inside 10+ GB Docker images. Building from source takes 10 seconds.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Single-user: 14.7 tok/s at 128, 10.6 at 1024, 10.5 at 2048 — remarkably
stable. Batched: 37.15 tok/s at 1024/1024, 34.21 at 2048/2048. sym_int4
is 2.3-3.2x faster than BF16 across all context lengths.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- gpt-oss-20b (MXFP4, 13.27 GiB) hangs during warmup on XPU
- Document ~10B practical limit for sym_int4 on 32GB shared memory
- Add sym_int4 9B entry to recommended models table
- Note IPEX marlin only covers MXFP4, not GPTQ/AWQ
- Link to vLLM #30359 QeRL RFC for AutoRound loading peak fix

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Documents real-world impact of INT4→FP16 unpack memory spike on 32GB
shared memory iGPU platforms. Includes sym_int4 workaround benchmarks
and explains why layerwise weight processing would solve the problem.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Both AutoRound INT4 and sym_int4 use layer-by-layer process_weights_after_loading,
so peak ≈ initial load + one layer overhead, not old+new format for entire model.
Updated vLLM #30359 comment with corrected numbers and real OOM root causes
(profile_run, GatedMLPMOE copies, HF deserialization overhead).

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… init

vLLM's XPU worker runs a dummy forward pass (profile_run()) to measure
peak memory for KV cache sizing. On Lunar Lake iGPU (Arc 140V / Xe2),
this forward pass hangs indefinitely at 100% CPU, making the server
unresponsive.

Add VLLM_SKIP_PROFILE_RUN=1 env var to bypass the dummy forward pass
and estimate peak memory from current allocation + 20% overhead instead.
Includes the xpu_worker.py patch and updates lunar_lake_serve.sh.
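
As environment configuration, the bypass is a single variable (the 20% figure is from the patch description above):

```shell
# Skip the hanging dummy forward pass on Lunar Lake iGPU; KV cache
# sizing then estimates peak memory as current allocation plus 20%.
export VLLM_SKIP_PROFILE_RUN=1
```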

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous patch was exported from vllm source repo which only had the
VLLM_FALLBACK_PROFILE refactoring but not the VLLM_SKIP_PROFILE_RUN
skip logic. Re-exported from the venv's installed copy which has the
complete fix including the profile_run() bypass on Lunar Lake.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reusable recipe: single-user (128/1024/2048) + batched (5 concurrent).
Usage: ./benchmark.sh <model_path> <port>

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Move gpt-oss-20b from "Does NOT Work" to "Recommended Models" (resolved)
- Add VLLM_SKIP_PROFILE_RUN=1 to environment variables section
- Add full running recipe with quick start, manual launch, expected
  startup log, and key notes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First MoE model successfully running on Lunar Lake iGPU via
VLLM_SKIP_PROFILE_RUN=1 patch. 13.27 GiB MXFP4 with ipex marlin
backend. 1.53-1.63x faster than 9B sym_int4 in single-user decode.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
@plumlis

plumlis commented Mar 31, 2026

I love this.
Is the Dockerfile LNL-only? No ARL-H or MTL-H support?

claude and others added 7 commits April 1, 2026 01:16
- Upgrade recipe from 8K to 32K context (88,576 KV cache tokens auto-allocated)
- Remove --num-gpu-blocks-override (not needed, vLLM auto-estimates correctly)
- Add tool calling (--tool-call-parser openai) and reasoning (--reasoning-parser openai_gptoss)
- Document 3 thinking levels (low/medium/high) via system prompt
- Fix GLM-4.7 entry: 30B-A3B MoE, not 27B dense
- Correct memory analysis: layer-by-layer processing (not 2x bulk), ~13 GiB practical ceiling
- Fix stale "2x memory spike" references throughout limitations section

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…s test

- Re-benchmarked with production config: --max-model-len 32768, tool calling,
  reasoning parser (openai_gptoss)
- Added 16K×5 concurrent stress test: 48.2% KV cache, no OOM
- Confirmed --num-gpu-blocks-override not needed (88,576 tokens auto-allocated)
- 32K config has no measurable perf impact vs 8K config
- Fixed comparison table with updated numbers

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- 16384/512 single-user: TPOT 63.5ms (15.7 tok/s), slight KV bandwidth slowdown
- Prefix cache effect documented: 398ms TTFT with warm cache vs ~9-15s cold
- KV cache usage: 9.6% single-user, 48.2% at 5-concurrent

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Cold start (0% cache): 20,337ms TTFT for 16K input (8 chunked prefill steps)
Warm cache (57.7% hit): 398ms TTFT — 51x faster with prefix caching
Decode speed identical: 64ms TPOT regardless of cache state

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Promote gpt-oss-20b as default OpenClaw agent model (replaces Qwen3.5-4B)
- Reorganize running recipes: gpt-oss-20b primary, Qwen3.5-4B fallback
- Update memory budget table with gpt-oss-20b configurations
- Fix stale "Qwen3.5-4B remains best" verdict
- Correct cold-start TTFT reference (confirmed 20.3s, not estimated 9-15s)
- Update date to 2026-04-01

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ility-CB5w6

MSI Claw 8 AI+: a feasible local device for running OpenClaw
@MegaStood
Author

Actually, I built it from the dev version rather than the released version; the build takes around 3 hours. I will try to figure out whether it can run on the 155H and 255H.

@MegaStood
Author

Closing this PR — split into focused PRs for easier review:

#334 — GLM-4.7-Flash MLA bug analysis and patches
#340 — VLLM_SKIP_PROFILE_RUN patch for iGPU profile_run() hang
#342 — Lunar Lake Xe2 compatibility report and benchmarks

@MegaStood MegaStood closed this Apr 2, 2026
