Merge upstream→gfx11 by eble-amd · Pull Request #1034 · ROCm/vllm

eble-amd · 2026-06-29T19:27:32Z

Upstream and local Gemma 4 code had some similar changes to fix issues. Where there were conflicts, I kept the upstream version.

I'll record my manual testing in comments.

eble-amd · 2026-06-29T19:52:33Z

The Gemma 4 merge conflicts that I resolved involved #1016, so I ran the following local tests with and without --dynamic-lm-head-quantization int8:

With:

Arguments: --model cyankiwi/gemma-4-E4B-it-AWQ-INT4 --num-prompts 10 --max-model-len 4096 --kill-existing-vllm-processes --input-len 100 --output-len 200 --dynamic-lm-head-quantization int8 --target-gpu-memory-gb 20 --max-num-seqs 1 --synthetic-mm --synthetic-mm-width 1024 --synthetic-mm-height 800 --synthetic-mm-num-images 1 --synthetic-mm-base-items-per-request 1 --mm-encoder-attn-backend TRITON_ATTN -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE -e TORCH_BLAS_PREFER_HIPBLASLT=1
Prefill 1423.23 tokens/s; TTFT 255 ms
Decode 57.2 tokens/s (TPOT 17.49 ms)
End-to-end latency 3735 ms (median)
Average CPU utilization 6%

Without:

Arguments: --model cyankiwi/gemma-4-E4B-it-AWQ-INT4 --num-prompts 10 --max-model-len 4096 --kill-existing-vllm-processes --input-len 100 --output-len 200 --target-gpu-memory-gb 16 --max-num-seqs 1 --synthetic-mm --synthetic-mm-width 1024 --synthetic-mm-height 800 --synthetic-mm-num-images 1 --synthetic-mm-base-items-per-request 1 --mm-encoder-attn-backend TRITON_ATTN -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE -e TORCH_BLAS_PREFER_HIPBLASLT=1
Prefill 1403.39 tokens/s; TTFT 259 ms
Decode 49.5 tokens/s (TPOT 20.22 ms)
End-to-end latency 4282 ms (median)
Average CPU utilization 6%
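To make the comparison easier to read, here is a minimal sketch that computes the relative change between the two runs. The figures are copied from the results above; the dictionary keys and the `percent_change` helper are illustrative names, not part of any benchmark tool. Note that the two runs also differ in `--target-gpu-memory-gb` (20 vs 16), so the deltas may not be attributable to `--dynamic-lm-head-quantization int8` alone.

```python
# Figures copied from the "With" and "Without" runs reported above.
# Key names are illustrative only.
with_int8 = {
    "prefill_tok_s": 1423.23,
    "ttft_ms": 255,
    "decode_tok_s": 57.2,
    "tpot_ms": 17.49,
    "e2e_median_ms": 3735,
}
without_int8 = {
    "prefill_tok_s": 1403.39,
    "ttft_ms": 259,
    "decode_tok_s": 49.5,
    "tpot_ms": 20.22,
    "e2e_median_ms": 4282,
}


def percent_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


# Positive values mean the metric is higher with int8 LM-head quantization.
changes = {
    name: round(percent_change(with_int8[name], without_int8[name]), 1)
    for name in with_int8
}

for name, delta in changes.items():
    print(f"{name}: {delta:+.1f}%")
```

By this calculation, decode throughput is about 15.6% higher and median end-to-end latency about 12.8% lower with the flag, while prefill throughput and TTFT move by under 2%.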

mgoin and others added 2 commits June 4, 2026 07:40

[Quant] Support compressed-tensors WNA8O8Int linears and WNInt embeddings (vllm-project#44340)

06ee2d8

Signed-off-by: mgoin <mgoin64@gmail.com>

Merge commit '06ee2d8433831f69d5de3a6d9fa3d7d042dd394f' into wip

61812c8

Upstream and local Gemma 4 code had some similar changes to fix issues. Where there were conflicts, I kept the upstream version. Signed-off-by: Dan Eble <Dan.Eble@amd.com>

eble-amd requested review from marcusr-amd and mgehre-amd June 29, 2026 19:53

eble-amd marked this pull request as ready for review June 29, 2026 19:53

eble-amd requested a review from AndreasKaratzas as a code owner June 29, 2026 19:53

mgehre-amd approved these changes Jun 29, 2026

mgehre-amd merged commit da898a4 into ROCm:gfx11 Jun 29, 2026
7 checks passed

eble-amd deleted the merge-from-upstream branch June 29, 2026 20:12

mgehre-amd merged 2 commits into ROCm:gfx11 from eble-amd:merge-from-upstream
