Closed

Changes from all commits (66 commits)

- `1ef4516` Add Lunar Lake Xe2 140V compatibility report (claude, Mar 20, 2026)
- `5593891` Add experimental Lunar Lake Xe2 iGPU support (claude, Mar 25, 2026)
- `1d4c46d` Add native install script for Lunar Lake on Nobara/Fedora (claude, Mar 25, 2026)
- `3ad2329` Fix install script: move Level-Zero after Intel repo setup (claude, Mar 25, 2026)
- `6cd5763` Fix install script: skip MPI in setvars.sh + handle Python 3.14 (claude, Mar 26, 2026)
- `dc61425` Fix setvars.sh killing script: disable strict mode during source (claude, Mar 26, 2026)
- `cc3f5af` Fix setvars.sh source running in subshell due to pipe (claude, Mar 26, 2026)
- `9aedeff` Fix PyTorch XPU install: use correct index URL without +xpu suffix (claude, Mar 26, 2026)
- `efbf626` Pin PyTorch XPU to 2.10.0 with matching torchvision/torchaudio versions (claude, Mar 26, 2026)
- `16507ff` Add intel-compute-runtime to fix XPU device count zero (claude, Mar 26, 2026)
- `a235798` Remove tail pipes so build progress is visible (claude, Mar 26, 2026)
- `58b7a7c` Fix vLLM detection: use pip show instead of import (claude, Mar 26, 2026)
- `adf3638` Remove remaining tail pipes and retry stale xpu-kernels build (claude, Mar 26, 2026)
- `4f6d9be` Fix MKL fatal error and xpu-kernels OOM on Lunar Lake (claude, Mar 26, 2026)
- `9543306` Update xpu-kernels build: MAX_JOBS=6, add build time warning (claude, Mar 26, 2026)
- `0ae7717` Fix CCL init failure, --device xpu bug, and add OOM workarounds for L… (claude, Mar 26, 2026)
- `25ba10c` Document Xe2 compatibility blockers: Triton, Marlin, and model testin… (claude, Mar 26, 2026)
- `8840b82` bench: add vllm benchmark results 2026-03-27 (13.6 tok/s, qwen3-8b-int4) (claude, Mar 26, 2026)
- `fcfb119` bench: add concurrency 2 and 5 benchmark results (25.5 / 19.3 tok/s) (claude, Mar 26, 2026)
- `e6262f3` bench: add concurrency 5 capped test + final summary (36.4 tok/s, no … (claude, Mar 26, 2026)
- `1f2ff7f` Add vLLM SYCL benchmark results for Qwen3-8B on Lunar Lake Arc 140V (claude, Mar 27, 2026)
- `56ae5a5` Merge pull request #1 from MegaStood/claude/check-lunar-lake-compatib… (MegaStood, Mar 27, 2026)
- `505b3cf` bench: add Qwen3-ASR-1.7B benchmark results (81.9 tok/s at concurrenc… (Mar 27, 2026)
- `99bf79d` Add Qwen3-TTS setup recipe for Lunar Lake XPU (claude, Mar 27, 2026)
- `c6dd7cb` Add running recipes and Qwen3-ASR setup for Lunar Lake (claude, Mar 27, 2026)
- `968fcaf` Add Qwen3-30B MoE GPTQ and Coder AWQ to failed models list (claude, Mar 27, 2026)
- `6667f99` Add Meteor Lake and Arrow Lake iGPU support (claude, Mar 29, 2026)
- `07a32b9` Add server-side benchmark data, Qwen3.5/GLM-4.7 failure analysis, and… (claude, Mar 29, 2026)
- `90397b7` Document transformers 5.x max_pixels fix and Intel's vllm_for_multi_a… (claude, Mar 29, 2026)
- `f18d261` Update Qwen3.5-4B finding: loads at 3.68 GiB but hits Triton wall (claude, Mar 29, 2026)
- `fb75c9f` Confirm Triton Xe2 blocker is hardware-level, not a missing patch (claude, Mar 29, 2026)
- `87ad195` Fix triton-xpu packaging bug: uninstall plain triton before installin… (claude, Mar 29, 2026)
- `ff577f5` Add Qwen3.5-4B benchmark results: 23.4 tok/s single, 159 tok/s batche… (claude, Mar 29, 2026)
- `4291ca6` Add Qwen3.5-4B 32K context config and gpu-memory-utilization guide (claude, Mar 30, 2026)
- `d887f49` Add Qwen3.5-4B batched benchmarks at 0.35 util / 32K context (claude, Mar 30, 2026)
- `38873d4` Add Qwen3.5-4B benchmarks at 0.42 gpu-memory-utilization (OpenClaw mode) (claude, Mar 30, 2026)
- `3ed156e` Add complete side-by-side benchmark: 0.35 vs 0.42 vs 0.8 gpu-memory-u… (claude, Mar 30, 2026)
- `d61bdb7` Add tool calling and reasoning parser to Qwen3.5-4B OpenClaw recipe (claude, Mar 30, 2026)
- `e2f7700` Fill in 0.8 util benchmarks with standardized 8K context data (claude, Mar 30, 2026)
- `c419a03` Update 0.8 util batched data with warmed-up run: high TTFT is real (claude, Mar 30, 2026)
- `fce82b6` Add KV headroom and batched TTFT to configuration summary table (claude, Mar 30, 2026)
- `09da442` Update GLM-4.7-flash failure modes with detailed root cause analysis (claude, Mar 30, 2026)
- `2eaf5df` Add Qwen3.5-9B distilled benchmark results (BF16, too slow for intera… (claude, Mar 30, 2026)
- `632eb52` Update Qwen3-8B benchmarks with verified re-run data (2048 context ad… (claude, Mar 30, 2026)
- `8dd8fda` Add Qwen3.5-9B distilled FP8 benchmark results (37% smaller, 1.75x fa… (claude, Mar 30, 2026)
- `61bdba8` Add sym_int4 build instructions and update quantization status (claude, Mar 31, 2026)
- `eb63930` Add sym_int4 benchmark results and correct .so requirements (claude, Mar 31, 2026)
- `493e177` Correct sym_int4 findings: .so IS required, loaded via ctypes (claude, Mar 31, 2026)
- `bdfa16f` Add pre-built vllm_int4_for_multi_arc.so for Lunar Lake sym_int4 (MegaStood, Mar 31, 2026)
- `04f7a33` Document the 12KB .so discovery that saved 10GB of Docker downloads (claude, Mar 31, 2026)
- `4baab9a` Complete sym_int4 benchmarks with matching I/O (1024/1024, 2048/2048) (claude, Mar 31, 2026)
- `0104f50` Add gpt-oss-20b findings, model size limits, and upstream fix status (claude, Mar 31, 2026)
- `90c050f` Add vLLM #30359 comment with Lunar Lake AutoRound loading peak findings (claude, Mar 31, 2026)
- `7edc58c` Correct loading peak analysis: layer-by-layer processing, not 2x bulk (claude, Mar 31, 2026)
- `8a84a8b` fix: skip profile_run() on Lunar Lake to prevent hang during KV cache… (MegaStood, Mar 31, 2026)
- `25cf073` fix: re-export xpu_worker patch with VLLM_SKIP_PROFILE_RUN logic (MegaStood, Mar 31, 2026)
- `e33dbe2` Add standard benchmark script for Lunar Lake vLLM testing (claude, Mar 31, 2026)
- `7c3aba6` docs: add gpt-oss-20b running recipe for Lunar Lake (MegaStood, Mar 31, 2026)
- `451806a` Add gpt-oss-20b MXFP4 benchmark results: 22.5 tok/s single-user (claude, Mar 31, 2026)
- `7c1d8ab` Update gpt-oss-20b recipe: 32K context, tool calling, reasoning parser (claude, Apr 1, 2026)
- `5796fe1` Fix KV cache concurrency: 88,576 / 32,768 = ~2.7x, not 5x (claude, Apr 1, 2026)
- `f9d3ec2` Update gpt-oss-20b benchmarks: 32K context + tool calling + 16K stres… (claude, Apr 1, 2026)
- `45316b7` Add 16K single-user benchmark: 15.7 tok/s decode, prefix cache analysis (claude, Apr 1, 2026)
- `02163dc` Add cold-start 16K prefill benchmark: 20.3s TTFT, ~807 tok/s prefill (claude, Apr 1, 2026)
- `0a4bb5f` Organize and finalize Lunar Lake compatibility doc (claude, Apr 1, 2026)
- `c0065fc` Merge pull request #2 from MegaStood/claude/check-lunar-lake-compatib… (MegaStood, Apr 1, 2026)
986 changes: 986 additions & 0 deletions LUNAR_LAKE_COMPATIBILITY.md

Large diffs are not rendered by default.

Binary file added artifacts/vllm_int4_for_multi_arc.so
70 changes: 70 additions & 0 deletions benchmarks/qwen3-asr-benchmark-2026-03-27.md
# Qwen3-ASR-1.7B Benchmark — 2026-03-27

## Model Info
- **Model:** Qwen3-ASR-1.7B (`/shared/models/qwen3-asr-1.7b`)
- **Server:** vLLM (Intel Arc XPU, port 8000)
- **Audio input:** espeak-ng generated WAV files (~2-3s speech clips)
- **Language:** English (auto-detected)
- **GPU:** Intel Arc 140V (28.5 GB unified memory)

## Single Transcription Test

**Input text:** "Hello, this is a test of the Qwen 3 ASR model. The quick brown fox jumps over the lazy dog. One two three four five."

**Transcription output:** `language English<asr_text>Hello. This is a test of the QN3ASR model. The quick brown fox jumps over the lazy dog. One, two, three, four, five.`

- Latency: **8.67s**
- Accuracy: Near-perfect (only "Qwen 3 ASR" came through as "QN3ASR", expected with a synthetic TTS voice)
- Output tokens: 39

## Concurrency Benchmark (RAM-Monitored)

**Test sentences used:**
1. "The weather today is sunny with a high of twenty five degrees celsius."
2. "Artificial intelligence is transforming the way we work and communicate."
3. "Please confirm your reservation for three guests arriving on Friday evening."
4. "The stock market closed higher today driven by technology sector gains."
5. "Can you recommend a good restaurant near the city center for dinner tonight."

### Results

| Concurrency | Wall time | Total tokens | Agg tok/s | Peak RAM | Delta RAM |
|---|---|---|---|---|---|
| 1 | 2.26s | 19 | 8.4 | 4 MB | +0 MB |
| 2 | 1.01s | 35 | 34.8 | 4 MB | +0 MB |
| 5 | 1.05s | 86 | **81.9** | 4 MB | +0 MB |

### Per-Worker Detail (Concurrency 5)

| Worker | Tokens | Time | Transcription |
|---|---|---|---|
| 0 | 19 | 1.05s | "The weather today is sunny with a high of 25 degrees C..." |
| 1 | 16 | 0.95s | "Artificial intelligence is transforming the way we wor..." |
| 2 | 16 | 0.95s | "Please confirm your reservation for three guests arriv..." |
| 3 | 17 | 0.99s | "The stock market closed higher today, driven by techno..." |
| 4 | 18 | 1.04s | "Can you recommend a good restaurant near the city cent..." |

## Comparison vs Qwen3-8B-INT4

| Model | Concurrency | Agg tok/s | RAM | Notes |
|---|---|---|---|---|
| Qwen3-8B-INT4 | 1 | 13.6 | ~20 GB | Text generation |
| Qwen3-8B-INT4 | 2 | 25.5 | ~20 GB | Text generation |
| Qwen3-8B-INT4 | 5 (capped) | 36.4 | ~20 GB | Text generation |
| **Qwen3-ASR-1.7B** | 1 | 8.4 | **4 MB** | Speech transcription |
| **Qwen3-ASR-1.7B** | 2 | 34.8 | **4 MB** | Speech transcription |
| **Qwen3-ASR-1.7B** | 5 | **81.9** | **4 MB** | Speech transcription |

## Key Observations

- **Tiny RAM footprint:** Only 4 MB RSS (GPU VRAM handles everything) vs ~20 GB for 8B model
- **Excellent concurrency scaling:** Near-linear scaling up to 5 concurrent requests (~1s wall time)
- **No degradation at concurrency 5:** All workers complete in ~1s, no timeouts
- **High throughput:** 81.9 aggregate tok/s at concurrency 5
- **Fits alongside other models:** At 1.7B parameters, the model leaves most of the 28.5 GB unified memory free for other workloads

## Setup Notes

- Required `vllm[audio]` extra: `pip install "vllm[audio]"`
- Audio input format: `audio_url` content type with base64-encoded WAV
- Note: Qwen3-VL-8B not yet supported in vLLM 0.14 (Intel Arc build) — awaiting upstream update
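The request shape above can be sketched as follows. The model path and endpoint are the ones used in this benchmark; the helper name and the "Transcribe this audio." prompt are illustrative, and the payload is a sketch of the OpenAI-compatible chat completions format rather than a verified request.

```python
import base64

def build_asr_request(wav_path: str,
                      model: str = "/shared/models/qwen3-asr-1.7b") -> dict:
    """Build an OpenAI-compatible chat completion payload with a
    base64-encoded WAV attached as an audio_url content part."""
    with open(wav_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "audio_url",
                 "audio_url": {"url": f"data:audio/wav;base64,{b64}"}},
                {"type": "text", "text": "Transcribe this audio."},
            ],
        }],
    }

# POST the payload to http://127.0.0.1:8000/v1/chat/completions,
# e.g. requests.post(url, json=build_asr_request("clip.wav")).
```
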
117 changes: 117 additions & 0 deletions benchmarks/vllm-benchmark-2026-03-27.md
# vLLM Server Benchmark — 2026-03-27

**Test Configuration:**
- Model: /shared/models/qwen3-8b-int4-autoround
- Input tokens: ~128 (random data, actual: 337)
- Max output tokens: 7800
- Concurrency: 1
- Endpoint: http://127.0.0.1:8000/v1/chat/completions

**Results:**
- Status: OK
- Finish reason: stop (natural end)
- Prompt tokens: 337
- Completion tokens: 2,923
- Time elapsed: 215.36s
- Throughput: 13.6 tok/s

**Notes:**
- Context window is 8192 tokens total; max usable output with ~337 input tokens is ~7,849
- Model stopped naturally at 2,923 tokens rather than hitting the 7,800 limit
- INT4 quantized 8B model on local inference server

---

## Concurrency Tests

### Concurrency 2

| Metric | Value |
|---|---|
| Wall time | 184.81s |
| Total output tokens | 4,714 |
| Aggregate throughput | 25.5 tok/s |

| Worker | Tokens | Time | Tok/s | Finish |
|---|---|---|---|---|
| 0 | 2,304 | 176.5s | 13.1 | stop |
| 1 | 2,410 | 184.8s | 13.0 | stop |

### Concurrency 5

| Metric | Value |
|---|---|
| Wall time | 600.08s |
| Total output tokens | 11,565 |
| Aggregate throughput | 19.3 tok/s |

> Note: Worker 1 timed out at 600s and did not return results.

| Worker | Tokens | Time | Tok/s | Finish |
|---|---|---|---|---|
| 0 | 1,396 | 128.0s | 10.9 | stop |
| 2 | 3,565 | 335.7s | 10.6 | stop |
| 3 | 2,613 | 244.1s | 10.7 | stop |
| 4 | 3,991 | 374.3s | 10.7 | stop |

## Summary

| Concurrency | Aggregate tok/s | Per-worker tok/s |
|---|---|---|
| 1 | 13.6 | 13.6 |
| 2 | 25.5 | ~13.0 |
| 5 | 19.3 | ~10.7 |

**Observations:**
- Concurrency 2 nearly doubles aggregate throughput vs single (25.5 vs 13.6 tok/s), per-worker speed unchanged
- At concurrency 5, aggregate throughput drops to 19.3 tok/s and per-worker throughput degrades to ~10.7 tok/s, suggesting GPU memory/compute saturation
- 1 worker timed out at concurrency 5 (>600s), indicating queue pressure at high concurrency

---

## Concurrency 5 — Capped Output Test (max_tokens=3000, timeout=900s)

**Goal:** Verify no worker timeouts when output length is capped.

**Test Configuration:**
- Model: /shared/models/qwen3-8b-int4-autoround
- Max output tokens: 3,000 (capped)
- Timeout: 900s
- Concurrency: 5

**Results:**

| Metric | Value |
|---|---|
| Wall time | 362.52s |
| Total output tokens | 13,183 |
| Aggregate throughput | 36.4 tok/s |
| Peak RAM (vLLM) | 45 MB (baseline: 18 MB, delta: +28 MB) |

| Worker | Tokens | Time | Tok/s | Finish |
|---|---|---|---|---|
| 0 | 3,000 | 362.5s | 8.3 | length (hit cap) |
| 1 | 2,994 | 362.0s | 8.3 | stop |
| 2 | 1,237 | 141.4s | 8.7 | stop |
| 3 | 2,952 | 356.5s | 8.3 | stop |
| 4 | 3,000 | 362.5s | 8.3 | length (hit cap) |

**Outcome:** ✅ No timeouts — all 5 workers completed successfully.

**Observations:**
- Capping max_tokens=3000 eliminates timeout risk at concurrency 5
- Aggregate throughput jumps to 36.4 tok/s (vs 19.3 tok/s uncapped) due to shorter wall time
- Per-worker speed drops to ~8.3 tok/s under 5-way concurrency (vs 13.6 tok/s single)
- RAM delta only +28 MB — GPU VRAM is the real constraint, not system RAM
- Workers 0 and 4 hit the length cap (3000 tok), indicating the model wanted to generate more

## Final Summary

| Concurrency | max_tokens | Aggregate tok/s | Per-worker tok/s | Timeouts | Peak RAM |
|---|---|---|---|---|---|
| 1 | 7800 | 13.6 | 13.6 | 0/1 | 79 MB |
| 2 | 7800 | 25.5 | ~13.0 | 0/2 | 79 MB |
| 5 | 7800 | 19.3 | ~10.7 | 1/5 | 80 MB |
| 5 | 3000 | 36.4 | ~8.3 | 0/5 | 45 MB |

**Recommendation:** Concurrency 2 with uncapped output for quality; concurrency 5 with max_tokens≤3000 for maximum throughput.
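The harness behind these numbers can be approximated with a short script. The endpoint and model path are the ones listed above; the function names and prompt handling are illustrative, not the actual benchmark code.

```python
import json
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor

ENDPOINT = "http://127.0.0.1:8000/v1/chat/completions"
MODEL = "/shared/models/qwen3-8b-int4-autoround"

def aggregate_tok_s(total_tokens: int, wall_s: float) -> float:
    """Aggregate throughput: completion tokens summed over workers / wall time."""
    return round(total_tokens / wall_s, 1)

def one_request(prompt: str, max_tokens: int, timeout: float) -> int:
    """Send one chat completion and return its completion token count."""
    body = json.dumps({
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }).encode()
    req = urllib.request.Request(
        ENDPOINT, data=body, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=timeout) as resp:
        return json.load(resp)["usage"]["completion_tokens"]

def run_bench(prompts, max_tokens=3000, timeout=900):
    """Fire all prompts concurrently and report aggregate tok/s."""
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=len(prompts)) as pool:
        tokens = list(pool.map(
            lambda p: one_request(p, max_tokens, timeout), prompts))
    return aggregate_tok_s(sum(tokens), time.monotonic() - start)
```

Plugging in the measured totals reproduces the table: 11,565 tokens over 600.08s gives 19.3 tok/s, and 13,183 tokens over 362.52s gives 36.4 tok/s.
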
49 changes: 49 additions & 0 deletions issues/vllm-30359-comment.md
# Comment for vLLM Issue #30359 (QeRL RFC)

**Post at:** https://github.com/vllm-project/vllm/issues/30359

---

## Real-world impact on Intel Lunar Lake (32GB shared memory)

We've been benchmarking LLMs on an **MSI Claw 8 AI+** (Intel Core Ultra 7 258V, Arc 140V Xe2 iGPU, 32GB LPDDR5x shared between CPU and GPU) using vLLM's XPU backend. The pre-quantized model loading peak memory problem described in this RFC is a key blocker for running larger models on shared-memory iGPU platforms.

### The problem in practice

On shared-memory iGPU, CPU and GPU share the same 32GB physical memory pool. Both `process_weights_after_loading` (layer-by-layer repacking) and initial model loading contribute to peak memory. While the layer-by-layer processing itself adds minimal overhead (~one layer's worth), the total memory pressure comes from the initial load plus runtime allocations (KV cache, IPEX kernel buffers, MoE expert shuffling).

**Models that OOM:**
- Qwen3.5-35B-A3B AutoRound INT4: ~18 GiB on disk → OOM + GPU DEVICE_LOST on 32GB
- GLM-4.7-flash AutoRound INT4: 27B model → OOM → DEVICE_LOST
- Qwen3-30B-A3B GPTQ INT4: Loads 15.7 GiB weights, OOMs during MoE expert weight shuffle

### Loading peak analysis (corrected)

Both AutoRound INT4 and sym_int4 online use **layer-by-layer** `process_weights_after_loading` — each layer is processed and the old format is freed before the next layer. So the peak is NOT "old format + new format for the entire model" but rather **initial load size + ~one layer overhead**:

| Quant Method | Initial Load | Loading Peak (shared mem) | Final Model | Single-user tok/s |
|---|---|---|---|---|
| BF16 (none) | 17.66 GiB | ~18 GiB | 17.66 GiB | ~5 |
| FP8 (online) | 17.66 GiB | ~18 GiB | 11.22 GiB | ~8.5 |
| **sym_int4 (online)** | 17.66 GiB (BF16) | **~18 GiB** | **8.11 GiB** | **14.7** |
| **AutoRound INT4** | ~9 GiB | **~9 GiB** | ~9 GiB | ~14 (est.) |

**AutoRound INT4 is the preferred quantization** when it fits — smaller on disk, lower initial load (~9 vs ~18 GiB), better quality (calibration-based), and same inference speed. sym_int4 online quantization is the fallback when no pre-quantized INT4 model exists.
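
The difference between bulk and layer-by-layer repacking can be illustrated with a toy memory-accounting model (not vLLM code; layer count and sizes are made up, and repacked size is assumed equal to the original for simplicity):

```python
def repack_peak(layer_sizes_gib: list[float], layerwise: bool) -> float:
    """Toy peak-memory model for weight repacking after load.

    Bulk: old-format and new-format copies of the whole model coexist.
    Layer-by-layer: only one layer's old and new copies coexist at a time,
    so the peak is the initial load plus roughly one layer of overhead.
    """
    total = sum(layer_sizes_gib)
    if layerwise:
        return total + max(layer_sizes_gib)  # initial load + one layer in flight
    return 2 * total  # old + new format for the entire model

layers = [0.5] * 36  # a hypothetical ~18 GiB model with 36 uniform layers
bulk_peak = repack_peak(layers, layerwise=False)       # 36.0 GiB: would OOM on 32 GB
layerwise_peak = repack_peak(layers, layerwise=True)   # 18.5 GiB: fits
```
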

### Where the OOM actually happens

For larger models like Qwen3-30B-A3B (~18 GiB INT4 on disk), the layer-by-layer weight processing itself isn't the problem — the initial load fits in 32GB. The OOM likely occurs during:
1. **MoE expert weight shuffle** — reshaping/permuting expert weights may temporarily duplicate large tensors
2. **KV cache pre-allocation** — vLLM pre-allocates KV cache blocks after model loading
3. **IPEX kernel buffer allocation** — internal buffers for quantized inference kernels
4. **CUDA graph / warmup profiling** — additional transient memory during engine startup

### What would help

1. **Memory-efficient MoE weight loading** — process MoE experts one at a time rather than shuffling all experts simultaneously
2. **Lazy KV cache allocation** — defer KV cache pre-allocation or reduce default block count on memory-constrained platforms
3. **The layerwise approach in this RFC** — while current `process_weights_after_loading` is already layer-by-layer, extending this to cover the full load-process-allocate pipeline would help memory-constrained platforms

This is especially important for **integrated GPU platforms** (Lunar Lake, Meteor Lake, Arrow Lake) where CPU and GPU share the same memory pool — every byte of transient allocation during startup competes with the model weights and KV cache.

Full findings documented at: https://github.com/MegaStood/llm-scaler/blob/claude/check-lunar-lake-compatibility-CB5w6/LUNAR_LAKE_COMPATIBILITY.md
143 changes: 143 additions & 0 deletions vllm/docker/Dockerfile.lunar-lake
# Copyright (C) 2025 Intel Corporation
# SPDX-License-Identifier: Apache-2.0
#
# Lunar Lake Xe2 iGPU variant — lightweight single-GPU image
# Target: Intel Core Ultra (Lunar Lake) with Arc 140V integrated GPU
# Memory: Shared LPDDR5x (~24GB usable for GPU from 32GB system)

# ======== Base Stage ========
FROM intel/deep-learning-essentials:2025.2.2-0-devel-ubuntu24.04 AS vllm-lunar-lake-base

ARG https_proxy
ARG http_proxy

ENV DEBIAN_FRONTEND=noninteractive

# Install base dependencies (lighter than discrete GPU stack)
RUN set -eux; \
    apt-get update; \
    apt-get install -y --no-install-recommends \
        pciutils \
        sudo \
        curl \
        wget \
        vim \
        git \
        libdrm2 \
        libpciaccess0 \
        xz-utils \
        numactl \
    && apt-get clean && rm -rf /var/lib/apt/lists/*

# Add Intel oneAPI repo and GPU PPA
RUN wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null && \
    echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | tee /etc/apt/sources.list.d/oneAPI.list && \
    add-apt-repository -y ppa:kobuk-team/intel-graphics

RUN apt-get update -y && \
    apt-get install -y python3.12 python3.12-dev python3-pip && \
    update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.12 1 && \
    update-alternatives --install /usr/bin/python python /usr/bin/python3.12 1 && \
    apt-get install -y --no-install-recommends --fix-missing \
        curl \
        ffmpeg \
        git \
        libsndfile1 \
        libsm6 \
        libxext6 \
        libaio-dev \
        libgl1 \
        lsb-release \
        wget \
        linux-libc-dev \
        intel-oneapi-dpcpp-ct=2025.2.0-517 && \
    apt-get clean && rm -rf /var/lib/apt/lists/*

WORKDIR /llm
COPY ./patches/vllm_for_multi_arc.patch /tmp/

# Environment for single iGPU — no multi-GPU, no P2P
ENV LD_LIBRARY_PATH="$LD_LIBRARY_PATH:/usr/local/lib/"
ENV VLLM_TARGET_DEVICE=xpu
ENV VLLM_WORKER_MULTIPROC_METHOD=spawn
# Disable multi-GPU features not applicable to iGPU
ENV CCL_TOPO_P2P_ACCESS=0
# CCL single-GPU workaround: force local TCP transport to avoid
# "fill_local_host_ip: can't find non-loopback interface" on handhelds
ENV MASTER_ADDR=127.0.0.1
ENV CCL_ZE_ENABLE=0
ENV CCL_ATL_TRANSPORT=ofi
ENV FI_PROVIDER=tcp
# Memory-aware: offload weights to CPU before quantization (essential for shared memory)
ENV VLLM_OFFLOAD_WEIGHTS_BEFORE_QUANT=1
ENV VLLM_ALLOW_LONG_MAX_MODEL_LEN=1
# Expand PyTorch memory segments for shared memory efficiency
ENV PYTORCH_ALLOC_CONF="expandable_segments:True"

RUN python3 -m pip config set global.break-system-packages true

# Clone + patch vllm
RUN --mount=type=cache,target=/root/.cache/pip \
    git clone -b v0.14.0 https://github.com/vllm-project/vllm.git && \
    cd vllm && \
    git apply /tmp/vllm_for_multi_arc.patch && \
    pip install -r requirements/xpu.txt && \
    pip install arctic-inference==0.1.1 && \
    export CPATH=/opt/intel/oneapi/dpcpp-ct/2025.2/include/:${CPATH} && \
    pip install --no-build-isolation .

# Patch xpu_worker.py: disable CCL all_reduce warmup for single-GPU
# oneCCL's KVS init fails in containers without real network interfaces.
RUN XPU_WORKER=$(python3 -c "import vllm; import os; print(os.path.join(os.path.dirname(vllm.__file__), 'v1/worker/xpu_worker.py'))") && \
    if grep -q "torch.distributed.all_reduce" "$XPU_WORKER"; then \
        sed -i '/torch\.distributed\.all_reduce(/,/)/s/^/#/' "$XPU_WORKER"; \
    fi

# Install pypi dependencies
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install bigdl-core==2.4.0b2

RUN rm -rf /tmp/*

SHELL ["bash", "-c"]

# ======== Serving Stage ========
FROM vllm-lunar-lake-base AS vllm-lunar-lake

ARG http_proxy
ARG https_proxy

RUN --mount=type=cache,target=/root/.cache/pip \
    pip install accelerate hf_transfer 'modelscope!=1.15.0'

RUN --mount=type=cache,target=/root/.cache/pip \
    pip install librosa soundfile decord && \
    pip install git+https://github.com/huggingface/transformers.git && \
    pip install ijson

RUN --mount=type=cache,target=/root/.cache/pip \
    cd /llm && \
    git clone https://github.com/vllm-project/vllm-xpu-kernels.git && \
    cd vllm-xpu-kernels && \
    git checkout 4c83144 && \
    sed -i 's|^--extra-index-url=https://download.pytorch.org/whl/xpu|# --extra-index-url=https://download.pytorch.org/whl/xpu|' requirements.txt && \
    sed -i 's|^torch==2.10.0+xpu|# torch==2.10.0+xpu|' requirements.txt && \
    sed -i 's|^triton-xpu|# triton-xpu|' requirements.txt && \
    sed -i 's|^transformers|# transformers|' requirements.txt && \
    pip install -r requirements.txt && \
    pip install --no-build-isolation .

RUN --mount=type=cache,target=/root/.cache/pip \
    pip uninstall triton triton-xpu -y && \
    pip install triton-xpu==3.6.0 --extra-index-url=https://download.pytorch.org/whl/test/xpu

ENV VLLM_QUANTIZE_Q40_LIB="/usr/local/lib/python3.12/dist-packages/vllm_int4_for_multi_arc.so"

RUN pip uninstall oneccl oneccl-devel -y || true
RUN rm /usr/lib/python3/dist-packages/PyJWT-2.7.0.dist-info/ -rf || true
RUN echo "source /opt/intel/oneapi/setvars.sh --force" >> /root/.bashrc

# Copy Lunar Lake launch helper
COPY ./scripts/lunar_lake_serve.sh /llm/

ENTRYPOINT ["bash", "-c", "source /root/.bashrc && exec bash"]
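
A typical build-and-run invocation for this image might look like the following sketch. The image tag and the model mount path are illustrative; `--device /dev/dri` is the usual way to expose an Intel iGPU to a container.

```shell
# Build the Lunar Lake image from the repo root (tag name is illustrative)
docker build -f vllm/docker/Dockerfile.lunar-lake -t vllm-lunar-lake .

# Run with the iGPU exposed; /dev/dri is the Intel GPU device node
docker run -it --rm \
  --device /dev/dri \
  -v /shared/models:/shared/models \
  -p 8000:8000 \
  vllm-lunar-lake
```
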