Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results#335
Closed
MegaStood wants to merge 66 commits into intel:main from
Conversation
Documents incompatibility — this project targets discrete Arc Pro B60 GPUs with SYCL/oneAPI, not integrated Xe2 iGPUs on Lunar Lake. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
SYCL ESIMD kernels are fully portable to Xe2 (no hardcoded device IDs or BMG-specific constants). This adds the infrastructure adaptations:
- Dockerfile.lunar-lake: lightweight single-GPU image for the iGPU
- lunar_lake_serve.sh: memory-aware vLLM launch with shared-memory config
- lunar_lake_evaluation.sh: iGPU platform evaluation (skips P2P/CCL)
- platform_basic_evaluation.sh: detect the Lunar Lake iGPU alongside the B60
- LUNAR_LAKE_COMPATIBILITY.md: full docs with quick start and model recommendations

Key settings for Lunar Lake: TP=1, CCL_TOPO_P2P_ACCESS=0, gpu-memory-utilization=0.7, enforce-eager; INT4 quantization recommended. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
One-command installer that handles:
- oneAPI Base Toolkit via DNF
- Level-Zero for the Xe2 iGPU
- PyTorch XPU in a venv
- Patched vLLM build from source
- vllm-xpu-kernels + triton-xpu
- Bash aliases for quick launch (vllm-serve)

No Docker required. Targets Nobara 43 / Fedora 42+. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Level-Zero packages aren't in the Nobara/Fedora default repos. Now:
- Adds the Intel oneAPI + compute-runtime repos first
- Tries multiple Level-Zero package names (they vary across distros)
- Uses --skip-unavailable to handle missing packages gracefully
- Falls back to checking whether libze_loader exists from the xe driver

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Two issues on Nobara 43:
1. oneAPI setvars.sh hangs when MPI probes network interfaces over SSH. Fix: set ONEAPI_SETVARS_MPI_INSTALL=0 before sourcing.
2. Nobara 43 ships Python 3.14, but PyTorch XPU only has wheels for <=3.12. Fix: detect the version and auto-install python3.12 via dnf if needed.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
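The interpreter fallback described in point 2 can be sketched as follows. This is a hedged sketch, not the script's actual code; the dnf install step for python3.12 is omitted here.

```shell
# Sketch: PyTorch XPU wheels exist only for Python <= 3.12, so fall back
# to python3.12 when the system interpreter is newer (e.g. 3.14 on Nobara 43).
PY=python3
ver="$($PY -c 'import sys; print("%d.%d" % sys.version_info[:2])')"
case "$ver" in
    3.1[3-9]) PY=python3.12 ;;  # the real script installs python3.12 via dnf first
esac
echo "selected interpreter: $PY (system python is $ver)"
```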
setvars.sh has unbound variables and non-zero exits internally that trigger our set -euo pipefail, silently killing the script at Phase 2. Fix: temporarily set +euo pipefail around the source call, pipe output through grep to show only component init lines, then restore strict mode. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The pipe operator causes `source` to run in a subshell, so all oneAPI environment variables were lost in the parent shell. Redirect to a temp file instead and grep it separately. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The +xpu version suffix doesn't exist on PyPI. XPU wheels are served from a separate index URL. Also remove tail pipe so download progress is visible. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Use exact version pins from official Intel XPU install docs to avoid pulling incompatible versions. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
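The two commits above boil down to: install from PyTorch's dedicated XPU wheel index rather than PyPI, with exact pins. A sketch; the index URL is PyTorch's publicly documented XPU index, and the version pins from the Intel docs are deliberately not reproduced here:

```shell
# XPU wheels are not on PyPI (there is no `+xpu` suffix there); they are
# served from a separate index URL.
XPU_INDEX_URL="https://download.pytorch.org/whl/xpu"

# The real script runs (with exact pins copied from Intel's XPU install docs):
#   pip install torch==<pin> torchvision==<pin> torchaudio==<pin> \
#       --index-url "$XPU_INDEX_URL"
echo "installing PyTorch XPU wheels from $XPU_INDEX_URL"
```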
PyTorch XPU needs libze_intel_gpu.so (the GPU userspace driver) to talk to the xe kernel driver via Level-Zero. This is provided by the intel-compute-runtime package, which was missing from the install list. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
All the | tail -N pipes were hiding pip install and build output, making the script appear frozen during long operations. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Running from ~/llm-scaler which has a vllm/ subdirectory causes 'import vllm' to succeed even when vLLM is not installed, skipping the entire build phase. Use 'pip show vllm' instead. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
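The check described above can be sketched like this. `pip show` queries installed package metadata, so a local `vllm/` source directory cannot produce the false positive that `python -c "import vllm"` does:

```shell
# Detect an installed vLLM without being fooled by a vllm/ subdirectory
# in the current working directory.
if pip show vllm > /dev/null 2>&1; then
    echo "vLLM already installed; skipping build phase"
else
    echo "vLLM not installed; proceeding with build"
fi
```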
Show full build output so errors are visible. Also clean up stale vllm-xpu-kernels directory if previous build failed. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Preload oneAPI MKL libraries (LD_PRELOAD) in both serve and install scripts to fix "Cannot load libmkl_core.so.2" caused by PyTorch's broken relative RPATH in venvs. Set MAX_JOBS=2 for xpu-kernels build to prevent OOM kills on 32GB shared-memory systems. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
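The preload fix above might look like the following sketch. The MKL path and the exact set of libraries are assumptions (oneAPI layouts vary by version); only libmkl_core.so.2 is named in the commit:

```shell
# Work around PyTorch's broken relative RPATH inside a venv by preloading
# the oneAPI MKL libraries. Adjust MKL_LIB for your oneAPI version; the
# library list here is an assumed minimal set.
MKL_LIB="${MKL_LIB:-/opt/intel/oneapi/mkl/latest/lib}"
export LD_PRELOAD="$MKL_LIB/libmkl_core.so.2:$MKL_LIB/libmkl_intel_lp64.so.2${LD_PRELOAD:+:$LD_PRELOAD}"
echo "LD_PRELOAD=$LD_PRELOAD"
```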
- Change MAX_JOBS from 2 to 6 (75% CPU, ~24GB peak RAM, safe for 32GB)
- Make MAX_JOBS overridable via environment variable
- Add a log message warning that 933 SYCL files take 1.5-2 hours on Lunar Lake
- Add a comment reminding users to plug in before building

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
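The overridable-default pattern above is a one-liner; a sketch:

```shell
# Build parallelism for vllm-xpu-kernels: default 6 jobs (~75% CPU,
# ~24GB peak RAM on a 32GB machine); export MAX_JOBS beforehand to override.
MAX_JOBS="${MAX_JOBS:-6}"
export MAX_JOBS
echo "building with MAX_JOBS=$MAX_JOBS (933 SYCL files; expect 1.5-2 hours on Lunar Lake)"
```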
…unar Lake
Real-world testing on the MSI Claw 8 AI+ (Arc 140V, 32GB shared) revealed:
- oneCCL KVS init fails on WiFi-only devices ("can't find non-loopback interface"); added the CCL_ZE_ENABLE=0, CCL_ATL_TRANSPORT=ofi, FI_PROVIDER=tcp workaround
- --device xpu is not a valid vLLM CLI flag; the device must be set via the VLLM_TARGET_DEVICE=xpu environment variable
- The xpu_worker.py all_reduce warmup must be patched out for single-GPU (TP=1)
- Pre-quantized AutoRound/GPTQ 35B models OOM during weight loading on 32GB shared memory (peak memory doubles due to INT4→FP16 unpacking)
- The vllm-xpu-kernels build takes 1.5-2 hours on Lunar Lake (933 SYCL files)
- A GPU crash (DEVICE_LOST) after OOM requires a full system reboot

Updated: install_lunar_lake.sh, lunar_lake_serve.sh, Dockerfile.lunar-lake, LUNAR_LAKE_COMPATIBILITY.md
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
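The environment workarounds collected above can be sketched as a single launch preamble. The `vllm serve` flags shown in the comment are the ones named in these notes; anything else would be an assumption:

```shell
# Single-iGPU serve environment for Lunar Lake (sketch).
export VLLM_TARGET_DEVICE=xpu     # device is selected via env var, not a --device flag
export CCL_ZE_ENABLE=0            # work around oneCCL KVS init on WiFi-only devices
export CCL_ATL_TRANSPORT=ofi
export FI_PROVIDER=tcp
export CCL_TOPO_P2P_ACCESS=0      # no P2P on a single iGPU

# The actual launch (not run here):
#   vllm serve "$MODEL" \
#       --tensor-parallel-size 1 \
#       --gpu-memory-utilization 0.7 \
#       --enforce-eager
echo "environment prepared for single-GPU (TP=1) launch"
```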
…g results Add a detailed model compatibility matrix based on real testing on the MSI Claw 8 (Core Ultra 7 258V, Arc 140V). Key findings:
- Triton XPU backend broken on Xe2 (blocks Qwen3.5 fla/linear attention)
- Marlin kernels are CUDA-only (blocks AWQ/GPTQ compressed-tensors)
- Pre-quantized INT4 doubles peak memory (blocks >14B models on 32GB)
- Only FP16 base + online quantization works reliably

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…timeouts) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Tested Intel/Qwen3-8B-int4-AutoRound on the MSI Claw 8 (32GB LPDDR5x):
- Single-request: 17.6 tok/s generation
- Batched peak: 90 tok/s (5 concurrent short requests)
- Long context (4K+2K): 50.2 tok/s output, 150.5 tok/s total
- Model loads in 7s using only 5.69 GiB

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ility-CB5w6 Add Lunar Lake (32GB) support: Xe2 compatibility fixes and benchmark results
…y 5) Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Documented a working setup for Qwen3-TTS-12Hz-1.7B-Base on the Arc 140V:
- Venv creation with the shared XPU PyTorch from the vLLM install
- transformers==4.57.3 pinning (newer versions break qwen-tts)
- Voice cloning example with the generate_voice_clone() API
- Tested and confirmed working on the MSI Claw 8 AI+

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- Add a consolidated "Running Recipes" section with LLM/ASR/TTS commands and a memory budget table
- Add Qwen3-ASR-1.7B setup, serve, and test documentation
- Use 127.0.0.1 (localhost) since OpenClaw accesses services locally
- ASR uses 0.25 GPU utilization (~7.2GB) for efficient memory usage

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Qwen3-30B-A3B GPTQ INT4 routes to IPEX (bypasses CUDA Marlin) but OOMs during MoE expert weight shuffle at 15.7 GiB — GPU enters DEVICE_LOST state. Confirms 30B MoE models don't fit on 32GB shared memory. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- New install script install_meteor_arrow_lake.sh supporting:
  - Meteor Lake Xe-LPG (Core Ultra 155H/135H): PCI 7d55, 7dd5, 7d40, 7d45
  - Arrow Lake Xe-LPG+ (Core Ultra 255H/245H): PCI 7d51, 7dd1, 7d41, 7d67
- Auto-detects platform and GPU via PCI device IDs
- Warns Meteor Lake users to switch from the i915 to the xe driver
- Adjusts memory recommendations based on system RAM (16-96GB)
- Updated the evaluation script's detect_gpu() for all three platforms
- Added a platform compatibility table to LUNAR_LAKE_COMPATIBILITY.md

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… transformers 5.x catch-22
- Add server-side engine log observations: generation throughput decay with context length, prefix cache hit rates, KV cache usage per concurrent request
- Add Qwen3.5-4B AutoRound failure: the multimodal model hits the transformers 5.x Qwen2VLImageProcessor.max_pixels API break
- Add LFM2-24B-A2B AWQ failure: the custom Liquid AI tokenizer is unsupported
- Document the transformers version catch-22: 4.x can't recognize new architectures, 5.x breaks vLLM multimodal code
- Add critical blocker: transformers 5.x vs. vLLM mismatch for multimodal models

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…rc.patch Intel's llm-scaler Docker image uses transformers from git HEAD (5.x) with vllm_for_multi_arc.patch applied, which adds full Qwen3.5 architecture support (min_transformers_version="5.1.0"). The max_pixels AttributeError is caused by transformers 5.x renaming image_processor.max_pixels to size["longest_edge"]. Added one-line getattr() fix for native installs without the full patch. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Confirmed that Qwen3.5-4B AutoRound INT4 successfully loads (3.68 GiB — well within Lunar Lake's 28.6 GB budget) after applying the max_pixels getattr() fix, but crashes during warmup on the Triton kernel in fla/ops/layernorm_guard.py. Even Intel's forward_xpu code path for Qwen3.5 routes through fla/ops layernorm which requires @triton.jit. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Highlights the sweet-spot tradeoff: 0.35 has the best batched TTFT (664ms) but only 5% KV headroom over a 32K context; 0.42 provides 57% headroom with identical single-user performance; 0.8 suffers a 22s batched TTFT due to massive KV cache management overhead. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
AWQ: compressed-tensors MoE path hits Marlin (dense XPU works fine). AutoRound: IPEX routing works correctly but OOM DEVICE_LOST during weight init — 27B model too large even at INT4 with 32GB swap. Updated Qwen3-Coder-30B-A3B AWQ with same root cause. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ctive) 17.66 GiB at BF16, ~5 tok/s single-user (205ms TPOT) — 4-5x slower than Qwen3.5-4B. Not viable for interactive chat on Lunar Lake. Added benchmark section and model table entry. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ded) Re-ran all Qwen3-8B INT4 benchmarks to verify XPU acceleration after triton-xpu patch. Results match previous run within noise. Added 2048/2048 context data points. Updated comparison table with verified numbers. Key results: 18.7 tok/s (128), 13.7 tok/s (1K), 13.3 tok/s (2K) single-user. Batched: 90/80/75 peak tok/s. 116,672 token KV cache at 0.8 util. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ster than BF16) FP8 online quantization: 11.22 GiB (vs 17.66 GiB BF16), 117ms TPOT batched. Still too slow for interactive chat (~8.5 tok/s vs 4B's 23 tok/s). Documents FP8 as only working online quantization on XPU native install. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
sym_int4 now works on native XPU install by building vllm_int4_for_multi_arc.so from intel/BigDL-core source. Documents full build process for both CPU-side quantizer and GPU-side fused INT4 GEMM kernel, plus required IPEX patch for PyTorch 2.10 compatibility. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Key finding: VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 is the critical enabler for sym_int4 on Lunar Lake, NOT the BigDL-core .so files. vLLM has built-in Python INT4 quantization. sym_int4 achieves 14.7 tok/s single-user (2.9x faster than BF16, 1.7x faster than FP8) at only 8.11 GiB (54% smaller). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
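The enabler named above is a single environment variable; a sketch, with the model path and remaining serve flags left as placeholders rather than the exact command used:

```shell
# Critical enabler for sym_int4 on Lunar Lake: quantize weights on the CPU
# side first to avoid the GPU OOM spike during quantization.
export VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1

# The actual launch (placeholders, not run here):
#   vllm serve <model> --quantization sym_int4 ...
echo "VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=$VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT"
```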
Found the actual source code in intel/llm-scaler vllm_for_multi_arc.patch: sym_int4.py loads vllm_int4_for_multi_arc.so via ctypes.CDLL() from /opt/lib/ (hardcoded default). There is no Python fallback - the C lib is required. VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1 controls CPU-side weight loading to avoid GPU OOM during quantization. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
The vllm_int4_for_multi_arc.so is just a renamed libquantize.so from BigDL-core's GGML quantizer - a 12KB pure C library. Intel only ships it inside 10+ GB Docker images. Building from source takes 10 seconds. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Single-user: 14.7 tok/s at 128, 10.6 at 1024, 10.5 at 2048 — remarkably stable. Batched: 37.15 tok/s at 1024/1024, 34.21 at 2048/2048. sym_int4 is 2.3-3.2x faster than BF16 across all context lengths. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- gpt-oss-20b (MXFP4, 13.27 GiB) hangs during warmup on XPU
- Document the ~10B practical limit for sym_int4 on 32GB shared memory
- Add a sym_int4 9B entry to the recommended models table
- Note IPEX marlin only covers MXFP4, not GPTQ/AWQ
- Link to the vLLM #30359 QeRL RFC for the AutoRound loading-peak fix

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Documents real-world impact of INT4→FP16 unpack memory spike on 32GB shared memory iGPU platforms. Includes sym_int4 workaround benchmarks and explains why layerwise weight processing would solve the problem. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Both AutoRound INT4 and sym_int4 use layer-by-layer process_weights_after_loading, so peak ≈ initial load + one layer overhead, not old+new format for entire model. Updated vLLM #30359 comment with corrected numbers and real OOM root causes (profile_run, GatedMLPMOE copies, HF deserialization overhead). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
… init vLLM's XPU worker runs a dummy forward pass (profile_run()) to measure peak memory for KV cache sizing. On Lunar Lake iGPU (Arc 140V / Xe2), this forward pass hangs indefinitely at 100% CPU, making the server unresponsive. Add VLLM_SKIP_PROFILE_RUN=1 env var to bypass the dummy forward pass and estimate peak memory from current allocation + 20% overhead instead. Includes the xpu_worker.py patch and updates lunar_lake_serve.sh. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
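The fallback estimate described above (current allocation plus 20% overhead) is simple arithmetic; a sketch using the 5.69 GiB load figure reported earlier in this PR as an illustrative input:

```shell
# When VLLM_SKIP_PROFILE_RUN=1, peak memory is estimated as
# current allocation + 20% instead of running the dummy forward pass.
current_gib="5.69"
estimate_gib="$(awk -v a="$current_gib" 'BEGIN { printf "%.2f", a * 1.2 }')"
echo "estimated peak = ${estimate_gib} GiB (current ${current_gib} GiB + 20%)"
```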
Previous patch was exported from vllm source repo which only had the VLLM_FALLBACK_PROFILE refactoring but not the VLLM_SKIP_PROFILE_RUN skip logic. Re-exported from the venv's installed copy which has the complete fix including the profile_run() bypass on Lunar Lake. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Reusable recipe: single-user (128/1024/2048) + batched (5 concurrent). Usage: ./benchmark.sh <model_path> <port> https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
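The recipe's shape can be sketched as below; the actual request driver (how each pass hits the server) is an internal detail and is elided here:

```shell
# Sketch of benchmark.sh: single-user passes at the three input lengths
# used throughout this PR, then one batched pass with 5 concurrent requests.
run_benchmark() {
    local model_path="$1" port="$2"
    for ctx in 128 1024 2048; do
        echo "single-user: ${ctx}-token input -> $model_path on port $port"
    done
    echo "batched: 5 concurrent requests -> $model_path on port $port"
}

run_benchmark "/models/example" 8000
```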
- Move gpt-oss-20b from "Does NOT Work" to "Recommended Models" (resolved)
- Add VLLM_SKIP_PROFILE_RUN=1 to the environment variables section
- Add a full running recipe with quick start, manual launch, expected startup log, and key notes

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
First MoE model successfully running on Lunar Lake iGPU via VLLM_SKIP_PROFILE_RUN=1 patch. 13.27 GiB MXFP4 with ipex marlin backend. 1.53-1.63x faster than 9B sym_int4 in single-user decode. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
I love this.
- Upgrade the recipe from 8K to 32K context (88,576 KV cache tokens auto-allocated)
- Remove --num-gpu-blocks-override (not needed; vLLM auto-estimates correctly)
- Add tool calling (--tool-call-parser openai) and reasoning (--reasoning-parser openai_gptoss)
- Document the 3 thinking levels (low/medium/high) set via system prompt
- Fix the GLM-4.7 entry: 30B-A3B MoE, not 27B dense
- Correct the memory analysis: layer-by-layer processing (not a 2x bulk spike), ~13 GiB practical ceiling
- Fix stale "2x memory spike" references throughout the limitations section

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…s test
- Re-benchmarked with the production config: --max-model-len 32768, tool calling, reasoning parser (openai_gptoss)
- Added a 16K×5 concurrent stress test: 48.2% KV cache, no OOM
- Confirmed --num-gpu-blocks-override is not needed (88,576 tokens auto-allocated)
- The 32K config has no measurable perf impact vs. the 8K config
- Fixed the comparison table with updated numbers

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
- 16384/512 single-user: TPOT 63.5ms (15.7 tok/s), slight KV-bandwidth slowdown
- Prefix cache effect documented: 398ms TTFT with a warm cache vs. ~9-15s cold
- KV cache usage: 9.6% single-user, 48.2% at 5-concurrent

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Cold start (0% cache): 20,337ms TTFT for 16K input (8 chunked prefill steps) Warm cache (57.7% hit): 398ms TTFT — 51x faster with prefix caching Decode speed identical: 64ms TPOT regardless of cache state https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
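The 51x figure above follows directly from the two TTFT measurements; a quick arithmetic check:

```shell
# Prefix-cache speedup: cold TTFT / warm TTFT for the 16K-input request.
cold_ttft_ms=20337
warm_ttft_ms=398
awk -v c="$cold_ttft_ms" -v w="$warm_ttft_ms" \
    'BEGIN { printf "speedup: %.0fx\n", c / w }'
```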
- Promote gpt-oss-20b as the default OpenClaw agent model (replaces Qwen3.5-4B)
- Reorganize running recipes: gpt-oss-20b primary, Qwen3.5-4B fallback
- Update the memory budget table with gpt-oss-20b configurations
- Fix the stale "Qwen3.5-4B remains best" verdict
- Correct the cold-start TTFT reference (confirmed 20.3s, not estimated 9-15s)
- Update the date to 2026-04-01

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…ility-CB5w6 MSI Claw 8 AI+: a feasible OpenClaw local running device
Author
Actually, I built it from the dev version and did not use the released version. It takes around 3 hours to build. Will try to figure out whether it can run on the 155H and 255H.
Summary
Adds full Lunar Lake (Intel Core Ultra 7 258V / Arc 140V iGPU / 32GB LPDDR5x) support for vLLM SYCL, tested on MSI Claw 8 AI+ running Nobara 43.
Changes
Benchmark Results (Qwen3-8B INT4 AutoRound on Arc 140V)
Key Findings
Test plan