
Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results#335

Closed
MegaStood wants to merge 66 commits into intel:main from MegaStood:main

Conversation

@MegaStood

Summary

Adds full Lunar Lake (Intel Core Ultra 7 258V / Arc 140V iGPU / 32GB LPDDR5x) support for vLLM SYCL, tested on MSI Claw 8 AI+ running Nobara 43.

Changes

  • install_lunar_lake.sh: Fix Python 3.14 compat, PyTorch XPU pinning, MKL library loading, Level-Zero ordering, xpu_worker.py CCL patch for single-GPU, MAX_JOBS=6 to prevent OOM during kernel compilation
  • lunar_lake_serve.sh: Add CCL single-GPU workaround (WiFi-only devices have no wired Ethernet), MKL preload fix, memory detection
  • Dockerfile.lunar-lake: Add CCL env vars and xpu_worker patch
  • LUNAR_LAKE_COMPATIBILITY.md: Document Xe2-specific blockers (Triton XPU broken, Marlin CUDA-only), model compatibility matrix, and benchmark results

Benchmark Results (Qwen3-8B INT4 AutoRound on Arc 140V)

  • Single-request: 17.6 tok/s generation
  • Batched peak: 90 tok/s (5 concurrent short requests)
  • Long context (4K in + 2K out, 10 reqs): 50.2 tok/s output, 150.5 tok/s total throughput
  • Model loads in 7s using only 5.69 GiB

Key Findings

  • Qwen3 (standard attention) works on Xe2; Qwen3.5 (fla/linear attention) does not (Triton XPU broken)
  • AWQ/GPTQ compressed-tensors fail due to CUDA-only Marlin kernels
  • AutoRound INT4 works via IPEX quantization fallback for models ≤8B
  • Online quantization is the most reliable path on shared-memory systems

Test plan

  • Verified install_lunar_lake.sh completes on Nobara 43
  • Verified vLLM serves Qwen3-8B INT4 AutoRound on XPU
  • Benchmarked with vllm bench serve (3 workloads)
  • Confirmed CCL single-GPU workaround on WiFi-only device
  • Pending: test on other Lunar Lake devices (different OEMs)
  • Pending: test the Docker path on Lunar Lake
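
The long-context workload above could be driven with something like the following; the flag names follow vLLM's `vllm bench serve` CLI, but the exact invocation used for the PR is not shown here, so treat this as a sketch.

```shell
# Sketch of the long-context benchmark: 4K input / 2K output, 10 requests.
# Assumes the server is already running on port 8000 with the model loaded.
vllm bench serve \
  --base-url http://127.0.0.1:8000 \
  --model Intel/Qwen3-8B-int4-AutoRound \
  --dataset-name random \
  --random-input-len 4096 \
  --random-output-len 2048 \
  --num-prompts 10
```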

claude and others added 30 commits March 20, 2026 09:57
Documents incompatibility — this project targets discrete Arc Pro B60
GPUs with SYCL/oneAPI, not integrated Xe2 iGPUs on Lunar Lake.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
SYCL ESIMD kernels are fully portable to Xe2 (no hardcoded device IDs
or BMG-specific constants). This adds the infrastructure adaptations:

- Dockerfile.lunar-lake: lightweight single-GPU image for iGPU
- lunar_lake_serve.sh: memory-aware vLLM launch with shared memory config
- lunar_lake_evaluation.sh: iGPU platform evaluation (skip P2P/CCL)
- platform_basic_evaluation.sh: detect Lunar Lake iGPU alongside B60
- LUNAR_LAKE_COMPATIBILITY.md: full docs with quick start and model recs

Key settings for Lunar Lake: TP=1, CCL_TOPO_P2P_ACCESS=0,
gpu-memory-utilization=0.7, enforce-eager, INT4 quantization recommended.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
One-command installer that handles:
- oneAPI Base Toolkit via DNF
- Level-Zero for Xe2 iGPU
- PyTorch XPU in a venv
- Patched vLLM build from source
- vllm-xpu-kernels + triton-xpu
- Bash aliases for quick launch (vllm-serve)

No Docker required. Targets Nobara 43 / Fedora 42+.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Level-Zero packages aren't in Nobara/Fedora default repos. Now:
- Adds Intel oneAPI + compute-runtime repos first
- Tries multiple Level-Zero package names (varies across distros)
- Uses --skip-unavailable to handle missing packages gracefully
- Falls back to checking if libze_loader exists from xe driver

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Two issues on Nobara 43:
1. oneAPI setvars.sh hangs when MPI probes network interfaces over SSH.
   Fix: set ONEAPI_SETVARS_MPI_INSTALL=0 before sourcing.
2. Nobara 43 ships Python 3.14 but PyTorch XPU only has wheels for <=3.12.
   Fix: detect version, auto-install python3.12 via dnf if needed.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
setvars.sh has unbound variables and non-zero exits internally that
trigger our set -euo pipefail, silently killing the script at Phase 2.
Fix: temporarily set +euo pipefail around the source call, pipe output
through grep to show only component init lines, then restore strict mode.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The pipe operator causes `source` to run in a subshell, so all oneAPI
environment variables were lost in the parent shell. Redirect to a temp
file instead and grep it separately.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
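The subshell pitfall behind these two fixes is easy to reproduce; a minimal bash demonstration (file names are placeholders, not the real setvars.sh):

```shell
#!/usr/bin/env bash
# A pipe runs `source` in a subshell, so exported variables never
# reach the parent shell; redirecting to a file keeps them.
echo 'export DEMO_ONEAPI_VAR=ready' > /tmp/demo_env.sh

source /tmp/demo_env.sh | cat                    # subshell: export lost
echo "piped:      ${DEMO_ONEAPI_VAR:-unset}"     # -> piped:      unset

source /tmp/demo_env.sh > /tmp/setvars.log 2>&1  # parent shell keeps it
echo "redirected: ${DEMO_ONEAPI_VAR:-unset}"     # -> redirected: ready
```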
The +xpu version suffix doesn't exist on PyPI. XPU wheels are served
from a separate index URL. Also remove tail pipe so download progress
is visible.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Use exact version pins from official Intel XPU install docs to avoid
pulling incompatible versions.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
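A minimal sketch of the pinned install, assuming PyTorch's public XPU wheel index; check Intel's current install docs for the exact version pins (none are shown here, and the venv path is illustrative):

```shell
# Create a Python 3.12 venv and install XPU wheels from the dedicated index
# (the +xpu suffix is not on PyPI; the index URL selects the XPU builds).
python3.12 -m venv ~/vllm-xpu
source ~/vllm-xpu/bin/activate
pip install torch torchvision torchaudio \
  --index-url https://download.pytorch.org/whl/xpu
```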
PyTorch XPU needs libze_intel_gpu.so (the GPU userspace driver) to
talk to the xe kernel driver via Level-Zero. This is provided by the
intel-compute-runtime package, which was missing from the install list.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
All the | tail -N pipes were hiding pip install and build output,
making the script appear frozen during long operations.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Running from ~/llm-scaler which has a vllm/ subdirectory causes
'import vllm' to succeed even when vLLM is not installed, skipping
the entire build phase. Use 'pip show vllm' instead.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
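The false-positive import check can be demonstrated with a placeholder package name (fakepkg_xyz is made up for the demo):

```shell
# Why `python -c "import vllm"` is a bad install check: a same-named
# directory in the current working directory satisfies the import even
# when nothing is installed, but `pip show` is not fooled.
mkdir -p /tmp/shadow_demo/fakepkg_xyz
touch /tmp/shadow_demo/fakepkg_xyz/__init__.py
cd /tmp/shadow_demo

python3 -c "import fakepkg_xyz" \
  && echo "import check: false positive"

python3 -m pip show fakepkg_xyz >/dev/null 2>&1 \
  || echo "pip show: correctly reports not installed"
```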
Show full build output so errors are visible. Also clean up stale
vllm-xpu-kernels directory if previous build failed.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Preload oneAPI MKL libraries (LD_PRELOAD) in both serve and install scripts
to fix "Cannot load libmkl_core.so.2" caused by PyTorch's broken relative
RPATH in venvs. Set MAX_JOBS=2 for xpu-kernels build to prevent OOM kills
on 32GB shared-memory systems.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
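A sketch of the preload workaround, assuming a default oneAPI layout under /opt/intel/oneapi; the exact library set and paths may differ by MKL version:

```shell
# Preload MKL so PyTorch finds libmkl_core.so.2 despite its broken
# relative RPATH inside the venv. Path and library list are assumptions.
MKL_LIB=/opt/intel/oneapi/mkl/latest/lib
export LD_PRELOAD="$MKL_LIB/libmkl_core.so.2:$MKL_LIB/libmkl_intel_lp64.so.2:$MKL_LIB/libmkl_sequential.so.2${LD_PRELOAD:+:$LD_PRELOAD}"
export MAX_JOBS="${MAX_JOBS:-2}"   # cap parallel kernel compiles to avoid OOM
```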
- Change MAX_JOBS from 2 to 6 (75% CPU, ~24GB peak RAM, safe for 32GB)
- Make MAX_JOBS overridable via environment variable
- Add log message warning that 933 SYCL files take 1.5-2 hours on Lunar Lake
- Add comment reminding users to plug in before building

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…unar Lake

Real-world testing on MSI Claw 8 AI+ (Arc 140V, 32GB shared) revealed:
- oneCCL KVS init fails on WiFi-only devices ("can't find non-loopback interface")
  Added CCL_ZE_ENABLE=0, CCL_ATL_TRANSPORT=ofi, FI_PROVIDER=tcp workaround
- --device xpu is not a valid vLLM CLI flag; device must be set via
  VLLM_TARGET_DEVICE=xpu environment variable
- xpu_worker.py all_reduce warmup must be patched out for single-GPU (TP=1)
- Pre-quantized AutoRound/GPTQ 35B models OOM during weight loading on 32GB
  shared memory (peak memory doubles due to INT4→FP16 unpacking)
- vllm-xpu-kernels build takes 1.5-2 hours on Lunar Lake (933 SYCL files)
- GPU crash (DEVICE_LOST) after OOM requires full system reboot

Updated: install_lunar_lake.sh, lunar_lake_serve.sh, Dockerfile.lunar-lake,
LUNAR_LAKE_COMPATIBILITY.md

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
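Collected as environment settings, the workaround from this commit looks roughly like this (values are taken from the message above; the comments are interpretation):

```shell
# oneCCL single-GPU workaround for WiFi-only hosts
export CCL_ZE_ENABLE=0          # disable the Level-Zero CCL path
export CCL_ATL_TRANSPORT=ofi    # use the OFI transport...
export FI_PROVIDER=tcp          # ...over plain TCP, avoiding the interface probe
# Device selection: --device xpu is not a valid vLLM CLI flag
export VLLM_TARGET_DEVICE=xpu
```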
…g results

Add detailed model compatibility matrix based on real testing on MSI Claw 8
(Core Ultra 7 258V, Arc 140V). Key findings:
- Triton XPU backend broken on Xe2 (blocks Qwen3.5 fla/linear attention)
- Marlin kernels CUDA-only (blocks AWQ/GPTQ compressed-tensors)
- Pre-quantized INT4 doubles peak memory (blocks >14B models on 32GB)
- Only FP16 base + online quantization works reliably

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…timeouts)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tested Intel/Qwen3-8B-int4-AutoRound on MSI Claw 8 (32GB LPDDR5x):
- Single-request: 17.6 tok/s generation
- Batched peak: 90 tok/s (5 concurrent short requests)
- Long context (4K+2K): 50.2 tok/s output, 150.5 tok/s total
- Model loads in 7s using only 5.69 GiB

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ility-CB5w6

Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results
…y 5)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documented working setup for Qwen3-TTS-12Hz-1.7B-Base on Arc 140V:
- Venv creation with shared XPU PyTorch from vLLM install
- transformers==4.57.3 pinning (newer versions break qwen-tts)
- Voice cloning example with generate_voice_clone() API
- Tested and confirmed working on MSI Claw 8 AI+

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Add consolidated "Running Recipes" section with LLM/ASR/TTS commands and memory budget table
- Add Qwen3-ASR-1.7B setup, serve, and test documentation
- Use 127.0.0.1 (localhost) since OpenClaw accesses services locally
- ASR uses 0.25 GPU utilization (~7.2GB) for efficient memory usage

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Qwen3-30B-A3B GPTQ INT4 routes to IPEX (bypasses CUDA Marlin) but
OOMs during MoE expert weight shuffle at 15.7 GiB — GPU enters
DEVICE_LOST state. Confirms 30B MoE models don't fit on 32GB shared memory.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- New install script install_meteor_arrow_lake.sh supporting:
  - Meteor Lake Xe-LPG (Core Ultra 155H/135H): PCI 7d55, 7dd5, 7d40, 7d45
  - Arrow Lake Xe-LPG+ (Core Ultra 255H/245H): PCI 7d51, 7dd1, 7d41, 7d67
- Auto-detects platform and GPU via PCI device IDs
- Warns Meteor Lake users to switch from i915 to xe driver
- Adjusts memory recommendations based on system RAM (16-96GB)
- Updated evaluation script detect_gpu() for all three platforms
- Added platform compatibility table to LUNAR_LAKE_COMPATIBILITY.md

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… transformers 5.x catch-22

- Add server-side engine log observations: generation throughput decay with
  context length, prefix cache hit rates, KV cache usage per concurrent request
- Add Qwen3.5-4B AutoRound failure: multimodal model hits transformers 5.x
  Qwen2VLImageProcessor.max_pixels API break
- Add LFM2-24B-A2B AWQ failure: custom Liquid AI tokenizer unsupported
- Document transformers version catch-22: 4.x can't recognize new architectures,
  5.x breaks vLLM multimodal code
- Add critical blocker: transformers 5.x vs vLLM mismatch for multimodal models

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…rc.patch

Intel's llm-scaler Docker image uses transformers from git HEAD (5.x) with
vllm_for_multi_arc.patch applied, which adds full Qwen3.5 architecture support
(min_transformers_version="5.1.0"). The max_pixels AttributeError is caused by
transformers 5.x renaming image_processor.max_pixels to size["longest_edge"].
Added one-line getattr() fix for native installs without the full patch.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Confirmed that Qwen3.5-4B AutoRound INT4 successfully loads (3.68 GiB —
well within Lunar Lake's 28.6 GB budget) after applying the max_pixels
getattr() fix, but crashes during warmup on the Triton kernel in
fla/ops/layernorm_guard.py. Even Intel's forward_xpu code path for
Qwen3.5 routes through fla/ops layernorm which requires @triton.jit.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
claude and others added 19 commits March 30, 2026 13:15
Highlights the sweet spot tradeoff: 0.35 has best batched TTFT (664ms)
but only 5% KV headroom over 32K context. 0.42 provides 57% headroom
with identical single-user performance. 0.8 has 22s batched TTFT due
to massive KV cache management overhead.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
AWQ: compressed-tensors MoE path hits Marlin (dense XPU works fine).
AutoRound: IPEX routing works correctly but OOM DEVICE_LOST during
weight init — 27B model too large even at INT4 with 32GB swap.
Updated Qwen3-Coder-30B-A3B AWQ with same root cause.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ctive)

17.66 GiB at BF16, ~5 tok/s single-user (205ms TPOT) — 4-5x slower than
Qwen3.5-4B. Not viable for interactive chat on Lunar Lake. Added benchmark
section and model table entry.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ded)

Re-ran all Qwen3-8B INT4 benchmarks to verify XPU acceleration after
triton-xpu patch. Results match previous run within noise. Added 2048/2048
context data points. Updated comparison table with verified numbers.

Key results: 18.7 tok/s (128), 13.7 tok/s (1K), 13.3 tok/s (2K) single-user.
Batched: 90/80/75 peak tok/s. 116,672 token KV cache at 0.8 util.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ster than BF16)

FP8 online quantization: 11.22 GiB (vs 17.66 GiB BF16), 117ms TPOT batched.
Still too slow for interactive chat (~8.5 tok/s vs 4B's 23 tok/s).
Documents FP8 as only working online quantization on XPU native install.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
sym_int4 now works on native XPU install by building vllm_int4_for_multi_arc.so
from intel/BigDL-core source. Documents full build process for both CPU-side
quantizer and GPU-side fused INT4 GEMM kernel, plus required IPEX patch for
PyTorch 2.10 compatibility.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Key finding: VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 is the critical enabler
for sym_int4 on Lunar Lake, NOT the BigDL-core .so files. vLLM has built-in
Python INT4 quantization. sym_int4 achieves 14.7 tok/s single-user (2.9x
faster than BF16, 1.7x faster than FP8) at only 8.11 GiB (54% smaller).

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
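Put together, the sym_int4 launch described here might look like the following; the model path and serve flags are illustrative, not the verified invocation:

```shell
# Critical enabler: quantize layer by layer with weights held CPU-side,
# avoiding the GPU OOM during weight init described above.
export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1
vllm serve Qwen/Qwen3-8B --quantization sym_int4 --enforce-eager
```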
Found the actual source code in intel/llm-scaler vllm_for_multi_arc.patch:
sym_int4.py loads vllm_int4_for_multi_arc.so via ctypes.CDLL() from
/opt/lib/ (hardcoded default). There is no Python fallback - the C lib
is required. VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 controls CPU-side
weight loading to avoid GPU OOM during quantization.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The vllm_int4_for_multi_arc.so is just a renamed libquantize.so from
BigDL-core's GGML quantizer - a 12KB pure C library. Intel only ships
it inside 10+ GB Docker images. Building from source takes 10 seconds.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Single-user: 14.7 tok/s at 128, 10.6 at 1024, 10.5 at 2048 — remarkably
stable. Batched: 37.15 tok/s at 1024/1024, 34.21 at 2048/2048. sym_int4
is 2.3-3.2x faster than BF16 across all context lengths.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- gpt-oss-20b (MXFP4, 13.27 GiB) hangs during warmup on XPU
- Document ~10B practical limit for sym_int4 on 32GB shared memory
- Add sym_int4 9B entry to recommended models table
- Note IPEX marlin only covers MXFP4, not GPTQ/AWQ
- Link to vLLM #30359 QeRL RFC for AutoRound loading peak fix

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Documents real-world impact of INT4→FP16 unpack memory spike on 32GB
shared memory iGPU platforms. Includes sym_int4 workaround benchmarks
and explains why layerwise weight processing would solve the problem.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Both AutoRound INT4 and sym_int4 use layer-by-layer process_weights_after_loading,
so peak ≈ initial load + one layer overhead, not old+new format for entire model.
Updated vLLM #30359 comment with corrected numbers and real OOM root causes
(profile_run, GatedMLPMOE copies, HF deserialization overhead).

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… init

vLLM's XPU worker runs a dummy forward pass (profile_run()) to measure
peak memory for KV cache sizing. On Lunar Lake iGPU (Arc 140V / Xe2),
this forward pass hangs indefinitely at 100% CPU, making the server
unresponsive.

Add VLLM_SKIP_PROFILE_RUN=1 env var to bypass the dummy forward pass
and estimate peak memory from current allocation + 20% overhead instead.
Includes the xpu_worker.py patch and updates lunar_lake_serve.sh.
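
As environment configuration, the bypass is a single variable (the 20% figure is from the patch description above):

```shell
# Skip the hanging dummy forward pass on Lunar Lake iGPU; KV cache
# sizing then estimates peak memory as current allocation plus 20%.
export VLLM_SKIP_PROFILE_RUN=1
```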

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Previous patch was exported from vllm source repo which only had the
VLLM_FALLBACK_PROFILE refactoring but not the VLLM_SKIP_PROFILE_RUN
skip logic. Re-exported from the venv's installed copy which has the
complete fix including the profile_run() bypass on Lunar Lake.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reusable recipe: single-user (128/1024/2048) + batched (5 concurrent).
Usage: ./benchmark.sh <model_path> <port>

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Move gpt-oss-20b from "Does NOT Work" to "Recommended Models" (resolved)
- Add VLLM_SKIP_PROFILE_RUN=1 to environment variables section
- Add full running recipe with quick start, manual launch, expected
  startup log, and key notes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First MoE model successfully running on Lunar Lake iGPU via
VLLM_SKIP_PROFILE_RUN=1 patch. 13.27 GiB MXFP4 with ipex marlin
backend. 1.53-1.63x faster than 9B sym_int4 in single-user decode.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
@plumlis

plumlis commented Mar 31, 2026

I love this.
Is the Dockerfile LNL-only? No ARL-H or MTL-H support?

claude and others added 7 commits April 1, 2026 01:16
- Upgrade recipe from 8K to 32K context (88,576 KV cache tokens auto-allocated)
- Remove --num-gpu-blocks-override (not needed, vLLM auto-estimates correctly)
- Add tool calling (--tool-call-parser openai) and reasoning (--reasoning-parser openai_gptoss)
- Document 3 thinking levels (low/medium/high) via system prompt
- Fix GLM-4.7 entry: 30B-A3B MoE, not 27B dense
- Correct memory analysis: layer-by-layer processing (not 2x bulk), ~13 GiB practical ceiling
- Fix stale "2x memory spike" references throughout limitations section

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…s test

- Re-benchmarked with production config: --max-model-len 32768, tool calling,
  reasoning parser (openai_gptoss)
- Added 16K×5 concurrent stress test: 48.2% KV cache, no OOM
- Confirmed --num-gpu-blocks-override not needed (88,576 tokens auto-allocated)
- 32K config has no measurable perf impact vs 8K config
- Fixed comparison table with updated numbers

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- 16384/512 single-user: TPOT 63.5ms (15.7 tok/s), slight KV bandwidth slowdown
- Prefix cache effect documented: 398ms TTFT with warm cache vs ~9-15s cold
- KV cache usage: 9.6% single-user, 48.2% at 5-concurrent

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Cold start (0% cache): 20,337ms TTFT for 16K input (8 chunked prefill steps)
Warm cache (57.7% hit): 398ms TTFT — 51x faster with prefix caching
Decode speed identical: 64ms TPOT regardless of cache state

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Promote gpt-oss-20b as default OpenClaw agent model (replaces Qwen3.5-4B)
- Reorganize running recipes: gpt-oss-20b primary, Qwen3.5-4B fallback
- Update memory budget table with gpt-oss-20b configurations
- Fix stale "Qwen3.5-4B remains best" verdict
- Correct cold-start TTFT reference (confirmed 20.3s, not estimated 9-15s)
- Update date to 2026-04-01

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ility-CB5w6

MSI Claw 8 AI+: a feasible local device for running OpenClaw
@MegaStood
Author

Actually, I built it from the dev version rather than the released version; the build takes around 3 hours. I will try to figure out whether it can run on the 155H and 255H.

@MegaStood
Author

Closing this PR — split into focused PRs for easier review:

#334 — GLM-4.7-Flash MLA bug analysis and patches
#340 — VLLM_SKIP_PROFILE_RUN patch for iGPU profile_run() hang
#342 — Lunar Lake Xe2 compatibility report and benchmarks

@MegaStood MegaStood closed this Apr 2, 2026
