
docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU (#334)

Open
MegaStood wants to merge 39 commits into intel:main from
MegaStood:claude/check-lunar-lake-compatibility-CB5w6

Conversation


@MegaStood MegaStood commented Mar 25, 2026

Summary

  • 3-fix patch to enable MLA (Multi-head Latent Attention) for GLM-4.7-Flash on XPU: whitelist fix, TRITON_MLA routing, XPU flash_attn import
  • MLA reduces KV cache 17.5x (3.67 GiB → 0.21 GiB for 4096 tokens)
  • MoE marlin_shuffle_weight OOM investigation: 5 approaches tested, all blocked by the 32GB shared-memory limit
  • Auto-fix script and unified patch file included
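
The 17.5x figure can be sanity-checked from the two cache sizes quoted above:

```python
# Back-of-envelope check of the MLA KV-cache reduction claimed above.
# Both GiB figures are taken from the PR summary; the rest is arithmetic.
full_kv_gib = 3.67  # standard KV cache for 4096 tokens on GLM-4.7-Flash
mla_kv_gib = 0.21   # MLA latent KV cache for the same context
reduction = full_kv_gib / mla_kv_gib
print(f"{reduction:.1f}x")  # 17.5x
```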

Files

  • issues/glm4-mla-xpu-bugs.md — 3-bug MLA issue writeup
  • issues/glm4_moe_lite_int4_xpu_marlin_shuffle.md — MoE OOM investigation
  • issues/vllm-30359-comment.md — upstream vLLM issue comment draft
  • scripts/fix_glm4_mla.sh — auto-fix script
  • vllm/patches/glm4_moe_lite_mla_xpu.patch — unified patch

@MegaStood MegaStood force-pushed the claude/check-lunar-lake-compatibility-CB5w6 branch from 985ca80 to 5535bc7 on April 2, 2026 at 12:38
@MegaStood MegaStood changed the title Claude/check lunar lake compatibility cb5w6 docs: GLM-4.7-Flash MLA bug analysis, patches, and MoE investigation for Lunar Lake XPU Apr 2, 2026
claude added 26 commits April 2, 2026 15:25
…uffle

Key discovery: gpt-oss-20b (also MoE) loads fine because MXFP4 stores
weights pre-formatted for Marlin kernels. INT4 AutoRound requires runtime
reshuffling that OOMs on 32GB shared memory. Added a quantization-format
compatibility table and identified MXFP4 re-quantization as the most promising path.

https://claude.ai/code/session_01JyMJU94Dq32vYBGMoMJM34
Key findings:
- Usable GPU memory (28.57 GiB) != physical RAM (32 GiB)
- gpu-memory-utilization is fraction of usable, not physical
- Skip-profile 1.05x multiplier yields 2.67x more KV cache vs 1.2x default
- Multi-model util values must sum to < 1.0 (all models draw from the same shared pool)
- KV cost ~48 KB/token for gpt-oss-20b
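
As a rough illustration of how the ~48 KB/token figure translates into context capacity (the 3 GiB KV budget below is a hypothetical example value, not one measured in this PR):

```python
# Estimate how many tokens fit in a given KV-cache budget for gpt-oss-20b,
# using the ~48 KB/token cost noted above. The 3 GiB budget is an assumed
# example: whatever usable memory is left after model weights.
KV_BYTES_PER_TOKEN = 48 * 1024   # ~48 KB/token (gpt-oss-20b)
kv_budget_bytes = 3 * 1024**3    # hypothetical 3 GiB KV budget
max_tokens = kv_budget_bytes // KV_BYTES_PER_TOKEN
print(max_tokens)  # 65536
```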

RAM-based MAX_JOBS auto-detection for native builds on shared-memory
iGPUs: 3 jobs for 16GB (Claw A1M), 6 for 32GB (Claw 8 AI+).

Auto-detect system RAM at build time to set parallel compile jobs:
- 16GB (Claw A1M, Meteor Lake): MAX_JOBS=3
- 32GB (Claw 8 AI+, Lunar Lake): MAX_JOBS=6
- 64GB+: MAX_JOBS=8

Prevents OOM during vLLM compilation on shared-memory iGPU systems.
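
A minimal sketch of the RAM-to-MAX_JOBS mapping described above; the thresholds are from the commit, while the function shape and the `free` probe are illustrative:

```shell
#!/usr/bin/env bash
# Map total system RAM (GB) to a safe parallel-compile job count for
# shared-memory iGPU builds. Thresholds follow the commit message above.
pick_max_jobs() {
  local ram_gb=$1
  if   [ "$ram_gb" -ge 64 ]; then echo 8   # 64GB+ workstations
  elif [ "$ram_gb" -ge 32 ]; then echo 6   # Claw 8 AI+ / Lunar Lake
  else                            echo 3   # Claw A1M / Meteor Lake 16GB
  fi
}

# In the install script, ram_gb would come from something like:
#   ram_gb=$(free -g | awk '/^Mem:/{print $2}')
pick_max_jobs 16   # prints 3
pick_max_jobs 32   # prints 6
```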

Extracts Dockerfile build steps into a standalone bash script for
bare-metal installation on Lunar Lake / Meteor Lake / Arrow Lake.

Key features:
- RAM-based MAX_JOBS: 3 for 16GB (Claw A1M), 6 for 32GB (Claw 8 AI+)
- Applies vllm_for_multi_arc.patch from repo
- Builds vLLM v0.14.0 + vllm-xpu-kernels + triton-xpu
- Configures production env vars in ~/.bashrc
- Skips already-completed steps on re-run

Usage: sudo bash vllm/scripts/install_vllm_native.sh

Documents planned setup: 2x DGX Spark for 120B models, Claw 8 for
portable 20B MoE, Claw A1M as edge client with NPU offloading ASR/TTS.

Includes OpenVINO runtime details, NPU vs iGPU memory separation,
and model compatibility table for NPU inference on Meteor Lake.

Documents the 8-step native build order with --no-build-isolation
explanation, .so library table, and Docker vs native build comparison
with per-device recommendations.

Backports all best practices discovered during Meteor Lake 16GB testing:
- Add Level Zero + ocloc/IGC packages (prevents torch.xpu.device_count()=0)
- Add llvm-foreach symlink (fixes SYCL AOT linker failure at object 925/933)
- Switch from break-system-packages to Python 3.12 venv isolation
- Redirect TMPDIR to disk (prevents tmpfs overflow on 16GB systems)
- Use --no-deps for transformers git install (prevents CUDA torch overwrite)
- Add torch XPU verification after dependency installs
- Add idempotent skip checks (pip show) so re-runs don't rebuild
- Add chown ownership fix before xpu-kernels build
- Clean stale build artifacts and corrupted pip metadata
- Add post-build verification for vLLM and xpu-kernels
- Improve bashrc block with venv activation and oneAPI ordering note

On 16GB systems (Meteor Lake), the xpu-kernels build peaks at ~21GB
and will OOM without disk swap. Replaces zram with 16GB disk swapfile.
On 32GB systems (Lunar Lake), offers optional overflow swapfile and
temporarily disables zram during the build to free all RAM.

Ported from claw-post-install-vllm.sh Phase 0, adapted for Ubuntu.

Tries downloading precompiled vLLM and xpu-kernels wheels from GitHub
Releases before falling back to source compilation. Saves 1-2 hours
on 16GB systems. Set VLLM_BUILD_FROM_SOURCE=1 to force source build.

Wheel upload instructions:
  cd ~/llm/vllm && pip wheel --no-build-isolation --no-deps -w dist/ .
  cd ~/llm/vllm-xpu-kernels && pip wheel --no-build-isolation --no-deps -w dist/ .
  gh release create vllm-xpu-wheels-v0.14.0 dist/*.whl

Documents that xccl and ccl distributed backends hang during Level Zero
initialization on Meteor Lake's Xe-LPG iGPU. Workaround: set
VLLM_XPU_DIST_BACKEND=gloo. Includes architecture comparison table
(Xe-LPG vs Xe2-LPG vs Xe-HPG) and references.

Discovered during Claw A1M (16GB Meteor Lake) testing. Lunar Lake and
discrete Arc GPUs are not affected.
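
Applying the workaround is a one-line environment change before launching vLLM (the model name below is a placeholder):

```shell
# Force the gloo distributed backend on Meteor Lake to avoid the xccl/ccl
# hang during Level Zero initialization described above.
export VLLM_XPU_DIST_BACKEND=gloo

# Then launch as usual, e.g. (placeholder model name):
#   vllm serve my-org/my-model --device xpu
echo "$VLLM_XPU_DIST_BACKEND"   # prints gloo
```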

Meteor Lake Xe-LPG lacks XMX (matrix extensions), causing vLLM to crash
on first inference with "SDP kernel requires XMX". This is a hardware
limitation with no software workaround in vLLM. Documents IPEX-LLM
(Ollama) as the only viable alternative for LLM inference on Meteor Lake.

Add potential future workaround via PyTorch >= 2.9 SDPBackend.MATH on
XPU (PR #156669). Document that SYCL joint_matrix has no XMX fallback
per Intel docs, and that no IPEX env var exists to disable XMX kernels.

Add detailed alternatives section: IPEX-LLM detects Meteor Lake as "mtl"
with architecture-specific optimizations, llama.cpp SYCL backend is
verified, OpenVINO also works. Add DP4a vs DPAS performance context and
first-run JIT compilation note.

IPEX-LLM/Ollama is GGUF-only, so new model architectures are delayed.
Added OpenVINO GenAI as recommended option for new model support (auto-
converts HuggingFace models, no XMX needed). Added HF Transformers with
eager attention as prototyping option. Added decision table for choosing
between alternatives. Added llama.cpp Vulkan as minimal-dependency option.

…on Lunar Lake

Qwen3-4B INT4 on MSI Claw A1M (Meteor Lake, no XMX): 26.4 tokens/s
decode, 176ms TTFT. Comparable to vLLM on Lunar Lake with XMX.
Bottleneck is memory bandwidth, not compute — XMX absence has less
impact than expected for 4B-class models.

…Lake

Root cause: quadratic attention memory growth (O(n²)) during prefill
causes GPU OOM at 8K tokens on 16GB shared memory systems.

Documents workarounds: ContinuousBatchingPipeline with chunked prefill
(dynamic_split_fuse), CPU fallback for long context, sliding window
models (Qwen3.5-4B), KV cache eviction, BIOS GPU memory settings.

Includes memory calculation tables and path to 32K context for OpenClaw.
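
To see why 8K tokens is the breaking point, consider the memory a naively materialized attention score matrix needs (the head count below is an illustrative assumption, not taken from a specific model in this PR):

```python
# O(n^2) prefill memory: a materialized fp16 attention score matrix at the
# 8K context where OOM was observed. num_heads is an assumed example value.
n = 8192                # prompt length at which prefill OOMs
bytes_per_elem = 2      # fp16
num_heads = 16          # hypothetical head count, for illustration only
score_bytes = n * n * bytes_per_elem * num_heads
print(score_bytes / 2**30)  # 2.0 (GiB) for score matrices alone
```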

…issue refs

- Critical: enable_prefix_caching=False avoids integer overflow in
  openvino.genai#2406 (Qwen3-4B specific memory explosion)
- Add CacheEvictionConfig parameters (max_cache_size, start_size, recent_size)
- Add NPU prefill option with NPUW_LLM_PREFILL_CHUNK_SIZE
- Reference openvino#31781, #32665, openvino_notebooks#2632
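
A sketch of the resulting settings as a plain dict; real code would set these fields on an `openvino_genai.SchedulerConfig`, so treat the field names as assumptions to verify against your installed openvino_genai version:

```python
# Scheduler settings implied by the bullets above. Field names mirror what
# the commit describes; check them against the openvino_genai API you have.
scheduler_settings = {
    "enable_prefix_caching": False,  # critical: sidesteps openvino.genai#2406
    "dynamic_split_fuse": True,      # chunked prefill
    "max_num_batched_tokens": 256,   # library default per a later finding
}
print(scheduler_settings["enable_prefix_caching"])  # False
```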

…lete 32K path

Key additions:
- KV_CACHE_PRECISION="u4" reduces 32K KV cache from 4.5GB to ~1.1GB
- Correct max_num_batched_tokens default (256, not 2048)
- Explain StatefulLLMPipeline has NO chunked prefill (must use SchedulerConfig)
- Complete recommended 32K config with memory budget calculation
- Note CPU/GPU KV cache transfer not supported in OpenVINO GenAI
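
The u4 figure is consistent with simple bit-width arithmetic:

```python
# fp16 -> u4 shrinks each KV element from 16 bits to 4 bits, a 4x cut
# (ignoring per-group scale/zero-point overhead), which matches the
# commit's ~1.1GB figure.
fp16_kv_gb = 4.5                # 32K-context KV cache at fp16 (from above)
u4_kv_gb = fp16_kv_gb * 4 / 16  # 4-bit elements
print(u4_kv_gb)  # 1.125
```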

OpenAI-compatible API server (FastAPI) using ContinuousBatchingPipeline
with chunked prefill, INT4/INT8 KV cache, and prefix caching disabled.

Supports /v1/chat/completions, /v1/completions, /v1/models endpoints.
Designed for OpenClaw integration on MSI Claw handhelds (16-32GB).

Default config: GPU device, u8 KV cache, 3GB cache budget, 512-token
chunks — handles up to 32K input context on 16GB Meteor Lake iGPU.

PR #15307 merged March 2026 — OpenVINO is now a native GGML backend.
GGUF files work directly (no conversion), supports CPU/GPU/NPU.
Benchmark: ~12.8 tok/s for 8B INT4, ~25 tok/s estimated for 4B INT4
on Meteor Lake iGPU (memory-bandwidth-bound, scales with model size).

Updated choosing table with OpenVINO backend and NPU option.

…penVINO)

Critical finding: Meteor Lake + Vulkan = crippled prefill (~14x slower
than SYCL) because Xe-LPG lacks VK_KHR_cooperative_matrix support.
OpenVINO backend recommended over Vulkan for Meteor Lake iGPU.

Includes benchmark table with pp512/tg128 numbers across backends,
sources from llama.cpp discussions, and Intel maintainer's recommendation.

…L benchmarks

Tested on MSI Claw A1M (Meteor Lake iGPU) — OpenVINO backend fails with
all K-quant models (Q4_K_M, Q4_K_XL): "failed to decode prompt batch,
res = -3". CPY operation not implemented, breaking flash attention.

Replaces speculative benchmark table with real measurements:
- Gemma 4 E4B Q4_K_M: SYCL pp512=309, pp32768=258, tg128=15.1
- Vulkan: pp512=318, pp32768=172, tg128=14.2
- OpenVINO: completely non-functional with standard quants

Updates recommendation table: SYCL replaces OpenVINO for GGUF workflows.

Previous data was single-run. Updated with proper 5-run averages
showing SYCL faster at all context lengths. Added extended TG columns.

AOT compilation with GGML_SYCL_DEVICE_ARCH=mtl_h shows no improvement
over default JIT build: prefill identical, token generation ~4% slower.
Recommendation updated to use default SYCL build (JIT).

GGML_SYCL_F16=ON gives 463 tok/s prefill (vs 311 FP32) at pp512,
and 345 vs 257 at pp32768 (1.8x faster than Vulkan at 32K).
TG slightly slower (~5%) — tradeoff is compute-bound vs memory-bound.
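
The "+49%" prefill speedup cited later in this PR follows directly from these pp512 numbers:

```python
# Derive the relative prefill speedup from the measured pp512 throughput.
fp32_pp512 = 311  # tok/s, default FP32 SYCL build
f16_pp512 = 463   # tok/s, GGML_SYCL_F16=ON build
speedup = f16_pp512 / fp32_pp512 - 1
print(f"{speedup:+.0%}")  # +49%
```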

claude added 12 commits April 10, 2026 10:13
SYCL FP16 benchmarked at +49% faster prefill. Updated build example
and recommendation text in Option 4 (llama.cpp SYCL/Vulkan).

…14.0

Gemma 4 model architecture (Gemma4ForCausalLM) requires vLLM v0.19.0+.
Current llm-scaler patches are based on v0.14.0 and only support up to
Gemma 3. Documents the AutoRound MoE code path analysis showing that
XPUGPTQMarlinMoEMethod (IPEX GatedMLPMOE, no marlin_shuffle) would work
once the architecture is backported, and the missing FusedMoE handling
in apply_ipex_quant_layer that needs to be added.

…_vllm_native.sh

Brings the Ubuntu install script up to parity with install_lunar_lake.sh
and install_meteor_arrow_lake.sh:

- Add GPU detection (Lunar/Meteor/Arrow Lake PCI device IDs)
- Add xe vs i915 driver check with GRUB migration instructions
- Add memory-based gpu-memory-utilization recommendations
- Add MKL LD_PRELOAD fix for PyTorch venv RPATH breakage
- Add xpu_worker.py CCL all_reduce warmup patch (single-GPU fix)
- Add vllm-activate and oneapi convenience aliases
- Fix setvars.sh sourcing with set +e (unbound vars in setvars.sh)
- Add Python 3.14+ detection with auto-fallback to python3.12
- Improve summary with dynamic platform/memory recommendations

vLLM v0.19.0 has native Gemma 4 + Intel XPU support upstream — no
vllm_for_multi_arc.patch needed. Installs alongside v0.14.0 in a
separate directory (~/llm-scaler-vllm-v19/) so both can coexist.

Key differences from v0.14.0 script:
- No multi-arc patch (XPU support is upstream)
- vllm-xpu-kernels v0.1.5 (latest, vs 4c83144)
- transformers>=5.5.0 force-installed (vLLM pins <5, Gemma 4 needs >=5.5)
- torch XPU overwrite protection after pip installs
- Separate aliases (vllm-v19-activate) to coexist with v0.14.0

…r INT4 MoE

Bug H is distinct from Bug E (shuffle OOM, now fixed). Even with Bug E+F
patches applied, INT4 AutoRound MoE models with 40+ layers hit Level Zero
error 40 (OUT_OF_RESOURCES) during warmup or first inference.

Root cause: IPEX creates SYCL kernel objects per-layer with no caching.
Each MoE layer generates ~7-8 queue.submit() calls per token. At 40+ layers,
the ~320+ kernel objects per token exceed the Level Zero driver's resource
pool on Lunar Lake iGPU.

IPEX has zero device-adaptive resource management:
- No iGPU detection or shared-memory awareness
- No kernel sharing across layers
- No periodic synchronize() to flush and reclaim resources
- DPCPP_Q_SUBMIT macro creates new submissions with no reuse

Documented potential mitigations:
- Level Zero env vars (immediate command lists, cleanup threshold)
- Periodic torch.xpu.synchronize() between layers
- compute-runtime upgrade to 26.09.37435.1
- Kernel template instantiation reduction (not feasible on frozen IPEX)
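
The periodic-synchronize idea can be sketched as below; it is purely illustrative (the helper name is invented here), and a later commit in this PR found it does not actually fix Bug H:

```python
# Illustrative sketch of "periodic torch.xpu.synchronize() between layers":
# flush the SYCL queue every few MoE layers so the Level Zero driver can
# reclaim kernel objects. maybe_sync() is a hypothetical helper, not vLLM API.
import torch

def maybe_sync(layer_idx: int, every: int = 8) -> bool:
    """Return True when a queue flush was scheduled for this layer."""
    if layer_idx % every == every - 1:
        if hasattr(torch, "xpu") and torch.xpu.is_available():
            torch.xpu.synchronize()  # skipped on non-XPU hosts
        return True
    return False

print([maybe_sync(i, every=4) for i in range(8)])
# [False, False, False, True, False, False, False, True]
```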

…not resource exhaustion

Per-layer synchronize() diagnostic revealed the FIRST torch.xpu.moe_gemm()
call with is_int4=True crashes with DEVICE_LOST. This is not resource pool
exhaustion from 40+ layers — the xetla group_mm_int4_out_marlin kernel
itself does not work on Lunar Lake's Xe2-LPG iGPU architecture.

The "40+ layers" correlation was a red herring: all INT4 MoE models we
tested happened to have 40+ layers. A 1-layer INT4 model would also crash.
MXFP4 works because it uses a different xetla kernel (group_mm_mxfp4_out_marlin).

Root cause: IPEX xetla_arch.h maps Lunar Lake to gpu_arch::XeHpc (data
center GPU config) with Xe2Lpg commented out. The INT4 kernel dispatches
tile configurations designed for PVC/BMG onto a mobile iGPU with different
capabilities.

Ruled out: Level Zero env vars (no effect), resource accumulation (crashes
at layer 1), memory pressure (model loads fine).

Added minimal reproducer: vllm/test/test_int4_moe_gemm_xpu.py

…kernels, not kernel incompatibility

Exhaustive testing proved the INT4 xetla kernel works perfectly in isolation on
Lunar Lake Xe2-LPG. The crash only occurs when IPEX attention kernels
(flash_attn_varlen_func / chunked_prefill) execute before moe_gemm in the same
SYCL queue. Cloned tensors, synchronize(), empty_cache() all fail to clear the
pollution. The fix must come from Intel (IPEX or Level Zero driver).

Add pure Python routing test results: even with zero IPEX ops between
attention and moe_gemm (only repeat_interleave + manual indexing), INT4
MoE GEMM still crashes. Also added sync+empty_cache+gc test — full
cleanup before moe_gemm doesn't help. This conclusively proves the
Level Zero SYCL context is irrecoverably poisoned by attention kernels.

New file: issues/bug-h-intel-report.md — ready-to-file bug report for
Intel IPEX/compute-runtime GitHub with full reproduction steps and
elimination evidence.

…stic

Reading IPEX xetla kernel source reveals 5 critical differences between
INT4 and MXFP4 MoE GEMM: different kernel dispatch model (non-persistent
vs persistent), 4x larger workgroup tiles (256x256 vs 128x128), different
input dtype (FP16 vs BF16), different compute policies, and different
scale types. Primary suspect: INT4's 256x256 GEMM tile exceeds Xe2-LPG
resources after attention leaves residual SYCL state.

New test script tests each tile policy threshold to determine if smaller
tiles (GEMV 8x64) work after attention while GEMM (256x256) crashes.

…INT4 MoE

Critical finding: IPEX #838 reports identical OUT_OF_RESOURCES crash with
GatedMLPMOE on Qwen3-30B-A3B (128 experts) on discrete GPU Max 1550.
This proves Bug H is NOT specific to Lunar Lake iGPU or attention state
pollution — it's a GatedMLPMOE scaling bug affecting all Intel GPUs.

Issue intel#324 was a red herring: that user ran dense Qwen3.5-27B (not MoE),
which uses IPEXGPTQLinearMethod instead of GatedMLPMOE.

#838 was closed 2026-01-05 with no public fix. Intel has an internal fix but
hasn't released it in any pip package or Docker image.

…, cross-platform bugs

- IPEX #838 had zero PRs; fix was never released, repo archived March 30 2026
- vllm-xpu-kernels CUTLASS XE PRs intel#88/intel#98/intel#114 replace GatedMLPMOE
- Expert scaling fixes in progress: PRs intel#252 (1024 experts), intel#253 (128-256)
- Cross-platform: 128-expert Qwen3-30B-A3B crashes on NVIDIA too (vLLM #35922, SGLang #9872)
- Added Llama-4-Scout-17B-16E (16 experts, works) to threshold table
- Other unfixed IPEX issues: #864 (GPT-OSS-20B-Int4), #869 (CPU offload)
- Updated path forward: vLLM v0.16+ with vllm-xpu-kernels is recommended migration

…LM integration pending

- Repo moved from intel/vllm-xpu-kernels to vllm-project/vllm-xpu-kernels
- INT4 MoE CUTLASS kernels exist since v0.1.0 (PRs intel#88, intel#98, intel#114 merged)
- vLLM v0.19.0 pins vllm_xpu_kernels==0.1.4 in requirements/xpu.txt
- BUT: vLLM model-layer INT4 MoE integration NOT done yet (RFC #33214)
- INT4 dense GEMM is WIP (vllm#33662), INT4 MoE is "Planned" with no PR
- MXFP4 MoE, FP8 MoE, unquantized MoE all fully working in v0.19
- Added vllm-xpu-kernels release history (v0.1.0 through v0.1.5)
- Added two-layer status table (kernel library vs vLLM integration)
- Corrected migration options: writing integration PR is now option #1

