Merge upstream→gfx11 #1034

Merged: mgehre-amd merged 2 commits into ROCm:gfx11 from eble-amd:merge-from-upstream on Jun 29, 2026
eble-amd commented Jun 29, 2026

Upstream and local Gemma 4 code had some similar changes to fix issues. Where there were conflicts, I kept the upstream version.

I'll record my manual testing in comments.

mgoin and others added 2 commits June 4, 2026 07:40
Upstream and local Gemma 4 code had some similar changes to fix issues.
Where there were conflicts, I kept the upstream version.

Signed-off-by: Dan Eble <Dan.Eble@amd.com>
eble-amd (Author) commented Jun 29, 2026

The Gemma 4 merge conflicts that I resolved involved #1016, so I ran the following local tests with and without --dynamic-lm-head-quantization int8:

With:

Arguments: --model cyankiwi/gemma-4-E4B-it-AWQ-INT4 --num-prompts 10 --max-model-len 4096 --kill-existing-vllm-processes --input-len 100 --output-len 200 --dynamic-lm-head-quantization int8 --target-gpu-memory-gb 20 --max-num-seqs 1 --synthetic-mm --synthetic-mm-width 1024 --synthetic-mm-height 800 --synthetic-mm-num-images 1 --synthetic-mm-base-items-per-request 1 --mm-encoder-attn-backend TRITON_ATTN -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE -e TORCH_BLAS_PREFER_HIPBLASLT=1
Prefill 1423.23 tokens/s; TTFT 255 ms
Decode 57.2 tokens/s (TPOT 17.49 ms)
End-to-end latency 3735 ms (median)
Average CPU utilization 6%

Without:

Arguments: --model cyankiwi/gemma-4-E4B-it-AWQ-INT4 --num-prompts 10 --max-model-len 4096 --kill-existing-vllm-processes --input-len 100 --output-len 200 --target-gpu-memory-gb 16 --max-num-seqs 1 --synthetic-mm --synthetic-mm-width 1024 --synthetic-mm-height 800 --synthetic-mm-num-images 1 --synthetic-mm-base-items-per-request 1 --mm-encoder-attn-backend TRITON_ATTN -e TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1 -e FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE -e TORCH_BLAS_PREFER_HIPBLASLT=1
Prefill 1403.39 tokens/s; TTFT 259 ms
Decode 49.5 tokens/s (TPOT 20.22 ms)
End-to-end latency 4282 ms (median)
Average CPU utilization 6%
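
To put the two runs side by side, here is a rough sketch that computes the relative change in decode throughput and median end-to-end latency from the figures reported above. It also checks the numbers against a simple latency model, TTFT plus one TPOT per token after the first. That model is an assumption on my part, not something the benchmark tool is confirmed to use, and the dictionary keys and helper names are made up for illustration.

```python
# Figures are copied from the two runs above; the formulas are assumptions.
OUTPUT_LEN = 200  # from --output-len

runs = {
    "with_int8": {"decode_tps": 57.2, "tpot_ms": 17.49, "ttft_ms": 255, "e2e_ms": 3735},
    "without": {"decode_tps": 49.5, "tpot_ms": 20.22, "ttft_ms": 259, "e2e_ms": 4282},
}


def pct_change(new: float, old: float) -> float:
    """Relative change of `new` versus `old`, in percent."""
    return (new - old) / old * 100.0


def estimated_e2e_ms(run: dict, output_len: int = OUTPUT_LEN) -> float:
    """Assumed model: the first token costs TTFT, each later token costs TPOT."""
    return run["ttft_ms"] + (output_len - 1) * run["tpot_ms"]


decode_gain = pct_change(runs["with_int8"]["decode_tps"], runs["without"]["decode_tps"])
latency_change = pct_change(runs["with_int8"]["e2e_ms"], runs["without"]["e2e_ms"])

print(f"Decode throughput change: {decode_gain:+.1f}%")
print(f"Median end-to-end latency change: {latency_change:+.1f}%")
for name, run in runs.items():
    print(f"{name}: reported {run['e2e_ms']} ms, estimated {estimated_e2e_ms(run):.0f} ms")
```

With these inputs the int8 run shows roughly 15.6% higher decode throughput and roughly 12.8% lower median latency, and the estimated latencies land within 1 ms of the reported medians. Note that the two runs also differ in --target-gpu-memory-gb (20 versus 16), so the comparison is not strictly single-variable.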

eble-amd marked this pull request as ready for review June 29, 2026 19:53
mgehre-amd merged commit da898a4 into ROCm:gfx11 Jun 29, 2026
7 checks passed
eble-amd deleted the merge-from-upstream branch June 29, 2026 20:12
3 participants