docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU#334
Open
MegaStood wants to merge 39 commits into intel:main from
Conversation
…for Lunar Lake XPU
985ca80 to 5535bc7
…uffle Key discovery: gpt-oss-20b (also MoE) loads fine because MXFP4 stores weights pre-formatted for Marlin kernels. INT4 AutoRound requires runtime reshuffling that OOMs on 32GB shared memory. Added quantization format compatibility table and MXFP4 re-quantization as most promising path. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Key findings:
- Usable GPU memory (28.57 GiB) != physical RAM (32 GiB)
- gpu-memory-utilization is a fraction of usable memory, not physical
- Skip-profile 1.05x multiplier yields 2.67x more KV cache vs the 1.2x default
- Multi-model util values must sum < 1.0 (same shared pool)
- KV cost ~48 KB/token for gpt-oss-20b
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
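As a sanity check on the numbers above, a minimal Python sketch of the KV-cache budgeting arithmetic. The usable-pool figure and ~48 KB/token cost come from the findings; the weight footprint and utilization values in the usage example are illustrative, not measured:

```python
def kv_cache_tokens(usable_gib: float, util: float,
                    weights_gib: float, kv_kb_per_token: float) -> int:
    """Tokens of KV cache that fit after model weights are loaded.

    `usable_gib` is the driver-reported usable pool (28.57 GiB here),
    not physical RAM; `util` is the gpu-memory-utilization fraction,
    which applies to that usable pool.
    """
    budget_gib = usable_gib * util - weights_gib
    if budget_gib <= 0:
        return 0
    # GiB -> KB, then divide by per-token KV cost
    return int(budget_gib * 2**20 / kv_kb_per_token)
```

For example, `kv_cache_tokens(28.57, 0.9, 13.0, 48)` (assuming a ~13 GiB weight footprint) leaves roughly a 12.7 GiB budget, i.e. a few hundred thousand tokens of cache, while an overly large model simply returns 0.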
RAM-based MAX_JOBS auto-detection for native builds on shared-memory iGPUs: 3 jobs for 16GB (Claw A1M), 6 for 32GB (Claw 8 AI+). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Auto-detect system RAM at build time to set parallel compile jobs:
- 16GB (Claw A1M, Meteor Lake): MAX_JOBS=3
- 32GB (Claw 8 AI+, Lunar Lake): MAX_JOBS=6
- 64GB+: MAX_JOBS=8
Prevents OOM during vLLM compilation on shared-memory iGPU systems.
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
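The RAM-to-MAX_JOBS mapping above can be sketched as a small helper; the 24/48 GiB band edges are assumed cut points for illustration, not values taken from the script:

```python
def max_jobs_for_ram(ram_gib: float) -> int:
    """Parallel compile jobs for vLLM builds on shared-memory iGPUs.

    Mirrors the commit's tiers: 3 jobs on 16 GB systems, 6 on 32 GB,
    8 on 64 GB and up. The band edges (24/48 GiB) are assumptions.
    """
    if ram_gib < 24:
        return 3
    if ram_gib < 48:
        return 6
    return 8
```

At build time one could feed it `os.sysconf("SC_PHYS_PAGES") * os.sysconf("SC_PAGE_SIZE") / 2**30` on Linux and export the result as `MAX_JOBS`.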
Extracts Dockerfile build steps into a standalone bash script for bare-metal installation on Lunar Lake / Meteor Lake / Arrow Lake. Key features:
- RAM-based MAX_JOBS: 3 for 16GB (Claw A1M), 6 for 32GB (Claw 8 AI+)
- Applies vllm_for_multi_arc.patch from repo
- Builds vLLM v0.14.0 + vllm-xpu-kernels + triton-xpu
- Configures production env vars in ~/.bashrc
- Skips already-completed steps on re-run
Usage: sudo bash vllm/scripts/install_vllm_native.sh
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Documents planned setup: 2x DGX Spark for 120B models, Claw 8 for portable 20B MoE, Claw A1M as edge client with NPU offloading ASR/TTS. Includes OpenVINO runtime details, NPU vs iGPU memory separation, and model compatibility table for NPU inference on Meteor Lake. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Documents the 8-step native build order with --no-build-isolation explanation, .so library table, and Docker vs native build comparison with per-device recommendations. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Backports all best practices discovered during Meteor Lake 16GB testing:
- Add Level Zero + ocloc/IGC packages (prevents torch.xpu.device_count()=0)
- Add llvm-foreach symlink (fixes SYCL AOT linker failure at object 925/933)
- Switch from break-system-packages to Python 3.12 venv isolation
- Redirect TMPDIR to disk (prevents tmpfs overflow on 16GB systems)
- Use --no-deps for the transformers git install (prevents CUDA torch overwrite)
- Add torch XPU verification after dependency installs
- Add idempotent skip checks (pip show) so re-runs don't rebuild
- Add chown ownership fix before the xpu-kernels build
- Clean stale build artifacts and corrupted pip metadata
- Add post-build verification for vLLM and xpu-kernels
- Improve the bashrc block with venv activation and oneAPI ordering note
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
On 16GB systems (Meteor Lake), the xpu-kernels build peaks at ~21GB and will OOM without disk swap. Replaces zram with 16GB disk swapfile. On 32GB systems (Lunar Lake), offers optional overflow swapfile and temporarily disables zram during the build to free all RAM. Ported from claw-post-install-vllm.sh Phase 0, adapted for Ubuntu. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Tries downloading precompiled vLLM and xpu-kernels wheels from GitHub Releases before falling back to source compilation. Saves 1-2 hours on 16GB systems. Set VLLM_BUILD_FROM_SOURCE=1 to force a source build.
Wheel upload instructions:
cd ~/llm/vllm && pip wheel --no-build-isolation --no-deps -w dist/ .
cd ~/llm/vllm-xpu-kernels && pip wheel --no-build-isolation --no-deps -w dist/ .
gh release create vllm-xpu-wheels-v0.14.0 dist/*.whl
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Documents that xccl and ccl distributed backends hang during Level Zero initialization on Meteor Lake's Xe-LPG iGPU. Workaround: set VLLM_XPU_DIST_BACKEND=gloo. Includes architecture comparison table (Xe-LPG vs Xe2-LPG vs Xe-HPG) and references. Discovered during Claw A1M (16GB Meteor Lake) testing. Lunar Lake and discrete Arc GPUs are not affected. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
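A minimal sketch of the documented workaround, applied from Python before vLLM initializes. Using `setdefault` is a choice here so that an explicit user setting still wins:

```python
import os

# Pin vLLM's XPU distributed backend to gloo so the xccl/ccl backends
# never reach Level Zero initialization (which hangs on Meteor Lake's
# Xe-LPG iGPU, per the commit above). Must run before importing vLLM.
os.environ.setdefault("VLLM_XPU_DIST_BACKEND", "gloo")
```

The equivalent shell form would be exporting the same variable before launching the server.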
Meteor Lake Xe-LPG lacks XMX (matrix extensions), causing vLLM to crash on first inference with "SDP kernel requires XMX". This is a hardware limitation with no software workaround in vLLM. Documents IPEX-LLM (Ollama) as the only viable alternative for LLM inference on Meteor Lake. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Add potential future workaround via PyTorch >= 2.9 SDPBackend.MATH on XPU (PR #156669). Document that SYCL joint_matrix has no XMX fallback per Intel docs, and that no IPEX env var exists to disable XMX kernels. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Add detailed alternatives section: IPEX-LLM detects Meteor Lake as "mtl" with architecture-specific optimizations, llama.cpp SYCL backend is verified, OpenVINO also works. Add DP4a vs DPAS performance context and first-run JIT compilation note. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
IPEX-LLM/Ollama is GGUF-only, so support for new model architectures is delayed. Added OpenVINO GenAI as the recommended option for new model support (auto-converts HuggingFace models, no XMX needed). Added HF Transformers with eager attention as a prototyping option. Added a decision table for choosing between alternatives. Added llama.cpp Vulkan as a minimal-dependency option. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…on Lunar Lake Qwen3-4B INT4 on MSI Claw A1M (Meteor Lake, no XMX): 26.4 tokens/s decode, 176ms TTFT. Comparable to vLLM on Lunar Lake with XMX. Bottleneck is memory bandwidth, not compute — XMX absence has less impact than expected for 4B-class models. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…Lake Root cause: quadratic attention memory growth (O(n²)) during prefill causes GPU OOM at 8K tokens on 16GB shared memory systems. Documents workarounds: ContinuousBatchingPipeline with chunked prefill (dynamic_split_fuse), CPU fallback for long context, sliding window models (Qwen3.5-4B), KV cache eviction, BIOS GPU memory settings. Includes memory calculation tables and path to 32K context for OpenClaw. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
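For intuition on the O(n²) growth described above, a small sketch of the score-matrix footprint that unchunked prefill materializes. The 32-head count in the usage note is an assumption for illustration, not a measured model config:

```python
def attn_scratch_gib(n_tokens: int, n_heads: int,
                     bytes_per_el: int = 2) -> float:
    """FP16 size of the full [n, n] attention-score matrix across all
    heads -- the O(n^2) term that blows up unchunked prefill."""
    return n_tokens**2 * n_heads * bytes_per_el / 2**30
```

At 8K tokens with an assumed 32 heads this is already 4 GiB of scratch on top of weights and KV cache, and doubling the context quadruples it, which is exactly why chunked prefill (dynamic_split_fuse) is the workaround.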
…issue refs
- Critical: enable_prefix_caching=False avoids integer overflow in openvino.genai#2406 (Qwen3-4B specific memory explosion)
- Add CacheEvictionConfig parameters (max_cache_size, start_size, recent_size)
- Add NPU prefill option with NPUW_LLM_PREFILL_CHUNK_SIZE
- Reference openvino#31781, #32665, openvino_notebooks#2632
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…lete 32K path Key additions:
- KV_CACHE_PRECISION="u4" reduces the 32K KV cache from 4.5GB to ~1.1GB
- Correct max_num_batched_tokens default (256, not 2048)
- Explain that StatefulLLMPipeline has NO chunked prefill (must use SchedulerConfig)
- Complete recommended 32K config with memory budget calculation
- Note that CPU/GPU KV cache transfer is not supported in OpenVINO GenAI
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
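The u4 saving is just a bit-width rescale of the 16-bit footprint; a one-liner sketch that reproduces the 4.5 GB → ~1.1 GB figure:

```python
def kv_cache_gib(fp16_gib: float, precision_bits: int) -> float:
    """Scale a 16-bit KV-cache footprint to a lower-precision format
    (u8 -> half the size, u4 -> a quarter)."""
    return fp16_gib * precision_bits / 16
```

`kv_cache_gib(4.5, 4)` gives 1.125 GiB, matching the ~1.1GB quoted above; u8 would land at 2.25 GiB.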
OpenAI-compatible API server (FastAPI) using ContinuousBatchingPipeline with chunked prefill, INT4/INT8 KV cache, and prefix caching disabled. Supports /v1/chat/completions, /v1/completions, /v1/models endpoints. Designed for OpenClaw integration on MSI Claw handhelds (16-32GB). Default config: GPU device, u8 KV cache, 3GB cache budget, 512-token chunks — handles up to 32K input context on 16GB Meteor Lake iGPU. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
PR #15307 merged March 2026 — OpenVINO is now a native GGML backend. GGUF files work directly (no conversion), supports CPU/GPU/NPU. Benchmark: ~12.8 tok/s for 8B INT4, ~25 tok/s estimated for 4B INT4 on Meteor Lake iGPU (memory-bandwidth-bound, scales with model size). Updated choosing table with OpenVINO backend and NPU option. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
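The ~25 tok/s estimate follows from the memory-bandwidth-bound scaling mentioned above; a hedged back-of-envelope sketch (function name is mine, and the inverse-size model is only a first-order approximation):

```python
def est_decode_tps(ref_tps: float, ref_params_b: float,
                   params_b: float) -> float:
    """Memory-bandwidth-bound decode: quantized weights are streamed
    once per token, so throughput scales roughly inversely with
    parameter count at a fixed bit width."""
    return ref_tps * ref_params_b / params_b
```

Scaling the measured ~12.8 tok/s at 8B down to 4B gives 25.6 tok/s, consistent with the ~25 tok/s estimate in the commit.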
…penVINO) Critical finding: Meteor Lake + Vulkan = crippled prefill (~14x slower than SYCL) because Xe-LPG lacks VK_KHR_cooperative_matrix support. OpenVINO backend recommended over Vulkan for Meteor Lake iGPU. Includes benchmark table with pp512/tg128 numbers across backends, sources from llama.cpp discussions, and Intel maintainer's recommendation. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…L benchmarks Tested on MSI Claw A1M (Meteor Lake iGPU) — OpenVINO backend fails with all K-quant models (Q4_K_M, Q4_K_XL): "failed to decode prompt batch, res = -3". The CPY operation is not implemented, breaking flash attention. Replaces the speculative benchmark table with real measurements:
- Gemma 4 E4B Q4_K_M: SYCL pp512=309, pp32768=258, tg128=15.1
- Vulkan: pp512=318, pp32768=172, tg128=14.2
- OpenVINO: completely non-functional with standard quants
Updates recommendation table: SYCL replaces OpenVINO for GGUF workflows.
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Previous data was single-run. Updated with proper 5-run averages showing SYCL faster at all context lengths. Added extended TG columns. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
AOT compilation with GGML_SYCL_DEVICE_ARCH=mtl_h shows no improvement over default JIT build: prefill identical, token generation ~4% slower. Recommendation updated to use default SYCL build (JIT). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
GGML_SYCL_F16=ON gives 463 tok/s prefill (vs 311 FP32) at pp512, and 345 vs 257 at pp32768 (1.8x faster than Vulkan at 32K). TG slightly slower (~5%) — tradeoff is compute-bound vs memory-bound. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
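The throughput deltas above reduce to simple relative-change arithmetic; a tiny helper (name is mine) for reproducing the percentages:

```python
def speedup_pct(new_rate: float, old_rate: float) -> float:
    """Relative throughput change, in percent (positive = faster)."""
    return (new_rate / old_rate - 1) * 100
```

From the pp512 numbers, 463 vs 311 tok/s works out to about +49% prefill; the small TG regression shows up the same way as a negative value.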
SYCL FP16 benchmarked at +49% faster prefill. Updated build example and recommendation text in Option 4 (llama.cpp SYCL/Vulkan). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…14.0 Gemma 4 model architecture (Gemma4ForCausalLM) requires vLLM v0.19.0+. Current llm-scaler patches are based on v0.14.0 and only support up to Gemma 3. Documents the AutoRound MoE code path analysis showing that XPUGPTQMarlinMoEMethod (IPEX GatedMLPMOE, no marlin_shuffle) would work once the architecture is backported, and the missing FusedMoE handling in apply_ipex_quant_layer that needs to be added. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…_vllm_native.sh Brings the Ubuntu install script up to parity with install_lunar_lake.sh and install_meteor_arrow_lake.sh:
- Add GPU detection (Lunar/Meteor/Arrow Lake PCI device IDs)
- Add xe vs i915 driver check with GRUB migration instructions
- Add memory-based gpu-memory-utilization recommendations
- Add MKL LD_PRELOAD fix for PyTorch venv RPATH breakage
- Add xpu_worker.py CCL all_reduce warmup patch (single-GPU fix)
- Add vllm-activate and oneapi convenience aliases
- Fix setvars.sh sourcing with set +e (unbound vars in setvars.sh)
- Add Python 3.14+ detection with auto-fallback to python3.12
- Improve summary with dynamic platform/memory recommendations
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
vLLM v0.19.0 has native Gemma 4 + Intel XPU support upstream — no vllm_for_multi_arc.patch needed. Installs alongside v0.14.0 in a separate directory (~/llm-scaler-vllm-v19/) so both can coexist. Key differences from the v0.14.0 script:
- No multi-arc patch (XPU support is upstream)
- vllm-xpu-kernels v0.1.5 (latest, vs 4c83144)
- transformers>=5.5.0 force-installed (vLLM pins <5, Gemma 4 needs >=5.5)
- torch XPU overwrite protection after pip installs
- Separate aliases (vllm-v19-activate) to coexist with v0.14.0
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…r INT4 MoE Bug H is distinct from Bug E (shuffle OOM, now fixed). Even with the Bug E+F patches applied, INT4 AutoRound MoE models with 40+ layers hit Level Zero error 40 (OUT_OF_RESOURCES) during warmup or first inference.

Root cause: IPEX creates SYCL kernel objects per layer with no caching. Each MoE layer generates ~7-8 queue.submit() calls per token. At 40+ layers, the ~320+ kernel objects per token exceed the Level Zero driver's resource pool on the Lunar Lake iGPU. IPEX has zero device-adaptive resource management:
- No iGPU detection or shared-memory awareness
- No kernel sharing across layers
- No periodic synchronize() to flush and reclaim resources
- DPCPP_Q_SUBMIT macro creates new submissions with no reuse

Documented potential mitigations:
- Level Zero env vars (immediate command lists, cleanup threshold)
- Periodic torch.xpu.synchronize() between layers
- compute-runtime upgrade to 26.09.37435.1
- Kernel template instantiation reduction (not feasible on frozen IPEX)
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
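The "periodic torch.xpu.synchronize() between layers" mitigation can be sketched generically. Here `sync` stands in for torch.xpu.synchronize, and the function name and flush interval are hypothetical, not from the patches:

```python
def run_layers_with_flush(layers, x, sync, flush_every=8):
    """Run decoder layers, invoking `sync` every `flush_every` layers
    so the driver gets a chance to reclaim kernel objects instead of
    accumulating hundreds of submissions per token across 40+ layers."""
    for i, layer in enumerate(layers):
        x = layer(x)
        if (i + 1) % flush_every == 0:
            sync()  # e.g. torch.xpu.synchronize() on a real XPU
    return x
```

This trades some latency (each sync drains the queue) for a bounded in-flight kernel count, which is the point of the mitigation.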
…not resource exhaustion Per-layer synchronize() diagnostic revealed that the FIRST torch.xpu.moe_gemm() call with is_int4=True crashes with DEVICE_LOST. This is not resource pool exhaustion from 40+ layers — the xetla group_mm_int4_out_marlin kernel itself does not work on Lunar Lake's Xe2-LPG iGPU architecture. The "40+ layers" correlation was a red herring: all INT4 MoE models we tested happened to have 40+ layers. A 1-layer INT4 model would also crash. MXFP4 works because it uses a different xetla kernel (group_mm_mxfp4_out_marlin).

Root cause: IPEX xetla_arch.h maps Lunar Lake to gpu_arch::XeHpc (a data center GPU config) with Xe2Lpg commented out. The INT4 kernel dispatches tile configurations designed for PVC/BMG onto a mobile iGPU with different capabilities.

Ruled out: Level Zero env vars (no effect), resource accumulation (crashes at layer 1), memory pressure (model loads fine).

Added minimal reproducer: vllm/test/test_int4_moe_gemm_xpu.py
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…kernels, not kernel incompatibility Exhaustive testing proved the INT4 xetla kernel works perfectly in isolation on Lunar Lake Xe2-LPG. The crash only occurs when IPEX attention kernels (flash_attn_varlen_func / chunked_prefill) execute before moe_gemm in the same SYCL queue. Cloned tensors, synchronize(), empty_cache() all fail to clear the pollution. The fix must come from Intel (IPEX or Level Zero driver). https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Add pure Python routing test results: even with zero IPEX ops between attention and moe_gemm (only repeat_interleave + manual indexing), INT4 MoE GEMM still crashes. Also added sync+empty_cache+gc test — full cleanup before moe_gemm doesn't help. This conclusively proves the Level Zero SYCL context is irrecoverably poisoned by attention kernels. New file: issues/bug-h-intel-report.md — ready-to-file bug report for Intel IPEX/compute-runtime GitHub with full reproduction steps and elimination evidence. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…stic Reading IPEX xetla kernel source reveals 5 critical differences between INT4 and MXFP4 MoE GEMM: different kernel dispatch model (non-persistent vs persistent), 4x larger workgroup tiles (256x256 vs 128x128), different input dtype (FP16 vs BF16), different compute policies, and different scale types. Primary suspect: INT4's 256x256 GEMM tile exceeds Xe2-LPG resources after attention leaves residual SYCL state. New test script tests each tile policy threshold to determine if smaller tiles (GEMV 8x64) work after attention while GEMM (256x256) crashes. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…INT4 MoE Critical finding: IPEX #838 reports identical OUT_OF_RESOURCES crash with GatedMLPMOE on Qwen3-30B-A3B (128 experts) on discrete GPU Max 1550. This proves Bug H is NOT specific to Lunar Lake iGPU or attention state pollution — it's a GatedMLPMOE scaling bug affecting all Intel GPUs. Issue intel#324 was a red herring: that user ran dense Qwen3.5-27B (not MoE), which uses IPEXGPTQLinearMethod instead of GatedMLPMOE. #838 closed 2026-01-05 with no public fix. Intel has internal fix but hasn't released it in any pip package or Docker image. https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…, cross-platform bugs
- IPEX #838 had zero PRs; fix was never released, repo archived March 30 2026
- vllm-xpu-kernels CUTLASS XE PRs intel#88/intel#98/intel#114 replace GatedMLPMOE
- Expert scaling fixes in progress: PRs intel#252 (1024 experts), intel#253 (128-256)
- Cross-platform: 128-expert Qwen3-30B-A3B crashes on NVIDIA too (vLLM #35922, SGLang #9872)
- Added Llama-4-Scout-17B-16E (16 experts, works) to the threshold table
- Other unfixed IPEX issues: #864 (GPT-OSS-20B-Int4), #869 (CPU offload)
- Updated path forward: vLLM v0.16+ with vllm-xpu-kernels is the recommended migration
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
…LM integration pending
- Repo moved from intel/vllm-xpu-kernels to vllm-project/vllm-xpu-kernels
- INT4 MoE CUTLASS kernels exist since v0.1.0 (PRs intel#88, intel#98, intel#114 merged)
- vLLM v0.19.0 pins vllm_xpu_kernels==0.1.4 in requirements/xpu.txt
- BUT: vLLM model-layer INT4 MoE integration NOT done yet (RFC #33214)
- INT4 dense GEMM is WIP (vllm#33662), INT4 MoE is "Planned" with no PR
- MXFP4 MoE, FP8 MoE, unquantized MoE all fully working in v0.19
- Added vllm-xpu-kernels release history (v0.1.0 through v0.1.5)
- Added two-layer status table (kernel library vs vLLM integration)
- Corrected migration options: writing the integration PR is now option #1
https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Summary
Files
issues/glm4-mla-xpu-bugs.md — 3-bug MLA issue writeup
issues/glm4_moe_lite_int4_xpu_marlin_shuffle.md — MoE OOM investigation
issues/vllm-30359-comment.md — upstream vLLM issue comment draft
scripts/fix_glm4_mla.sh — auto-fix script
vllm/patches/glm4_moe_lite_mla_xpu.patch — unified patch