fix(gptq): fix Hessian computation, variable-length sequence support, and layer output type handling #318

Merged: linchuanxie merged 1 commit into Tencent:main from sunnyxiaohu:fix/gptq-core-bugs on Jun 10, 2026

@sunnyxiaohu

Contributor

Summary

Fix critical bugs in the GPTQ quantization pipeline that cause incorrect quantization results, especially for MoE models and variable-length calibration data.

Problems

  1. Hessian matrix corruption for MoE experts: add_batch() uses a parameterless squeeze(), which collapses [1, dim] to [dim] when an expert receives only one routed token, causing nsamples to be accumulated as feature_dim instead of 1.

  2. Variable-length sequence incompatibility: Catcher pre-allocates a fixed-size tensor [nsamples, seq_length, hidden_size], which requires all samples to have an identical seq_len. Shorter sequences get zero-padded (introducing Hessian noise) and longer sequences are silently truncated.

  3. Layer output type mismatch: the unconditional layer(...)[0] assumes a tuple output, but some decoder layers return a plain tensor, in which case [0] incorrectly indexes the batch dimension.

  4. ignore_layers exact match fails for MoE: nested module names like mlp.experts.0.gate_proj cannot match the ignore pattern gate_proj under exact equality.

  5. _make_quant AttributeError on non-standard Linear: modules like TopKRouter lack the in_features/out_features/bias attributes, causing crashes during weight replacement.

  6. g_idx generation uses a slow Python list comprehension: replaced with vectorized tensor operations.
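
Problem 1 is easy to reproduce in isolation. The following is a minimal sketch, not the AngelSlim code: count_samples is a hypothetical stand-in for any logic that reads the sample count from the leading dimension, and feature_dim = 8 is an arbitrary choice.

```python
import torch

feature_dim = 8  # arbitrary size, chosen for illustration

# A single token routed to one expert: activations have shape [1, feature_dim].
x = torch.randn(1, feature_dim)

# Parameterless squeeze() removes every size-1 dimension, including the token axis.
squeezed = x.squeeze()

# Reshaping to [-1, feature_dim] keeps the token axis for any token count.
reshaped = x.reshape(-1, x.shape[-1])


def count_samples(inp: torch.Tensor) -> int:
    """Hypothetical stand-in: read the sample count from the leading dimension."""
    return inp.shape[0]


print(tuple(squeezed.shape), count_samples(squeezed))  # (8,) 8
print(tuple(reshaped.shape), count_samples(reshaped))  # (1, 8) 1

# With the token axis intact, X^T X has the expected [feature_dim, feature_dim] shape.
hessian_update = reshaped.t().matmul(reshaped)
print(tuple(hessian_update.shape))  # (8, 8)
```

With the squeezed tensor, the leading dimension is the feature dimension, so a counter based on it is inflated by a factor of feature_dim for every single-token batch.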

Changes

| File | Fix |
| --- | --- |
| gptq_module.py | Remove dangerous squeeze(), fix add_batch() reshape logic, vectorize g_idx |
| catcher.py | Rewrite to dynamic list storage with per-sample kwargs and max_seq_length VRAM guard |
| gptq.py | Add _extract_hidden_states() helper, per-sample forward loop, substring ignore_layers matching |
| helper_layer.py | Use getattr(linear, "bias", None) for non-standard Linear modules |
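
Two of the gptq.py fixes can be illustrated together. This is a hedged sketch: the PR names _extract_hidden_states() but does not show its body, so extract_hidden_states and is_ignored below are hypothetical helpers written for illustration, and the tensor shape is arbitrary.

```python
import torch


def extract_hidden_states(layer_output):
    """Return the hidden states whether the layer gave a tuple/list or a bare tensor."""
    if isinstance(layer_output, (tuple, list)):
        return layer_output[0]
    return layer_output


def is_ignored(module_name, ignore_layers):
    """Substring match, so a short pattern also covers nested module names."""
    return any(pattern in module_name for pattern in ignore_layers)


hidden = torch.zeros(2, 5, 16)  # [batch, seq_len, hidden_size]

# Tuple output and bare-tensor output both yield the full hidden-state tensor.
print(tuple(extract_hidden_states((hidden, None)).shape))  # (2, 5, 16)
print(tuple(extract_hidden_states(hidden).shape))          # (2, 5, 16)

# An unconditional [0] on a bare tensor silently drops the batch dimension.
print(tuple(hidden[0].shape))  # (5, 16)

# Exact equality misses the nested expert name; substring matching finds it.
print("mlp.experts.0.gate_proj" == "gate_proj")             # False
print(is_ignored("mlp.experts.0.gate_proj", ["gate_proj"]))  # True
```

Substring matching is deliberately broader than equality: a pattern also matches any longer module name that contains it, so ignore patterns should be chosen with that in mind.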

Testing

  • Verified on Qwen3-30B-A3B (MoE, variable expert routing)
  • Verified on standard dense models (Qwen3-4B)
  • No regression on existing quantization quality metrics

yghstill previously approved these changes Jun 1, 2026

yghstill commented Jun 1, 2026

Collaborator

@sunnyxiaohu Please run pre-commit code formatting:

pip3 install pre-commit black isort flake8
cd AngelSlim
pre-commit install

… and output type handling

- Fix add_batch() squeeze() bug that corrupts Hessian for MoE experts with single-token routing
- Rewrite Catcher to use dynamic list storage, supporting variable-length sequences
- Fix layer output type handling: use _extract_hidden_states() instead of unconditional [0]
- Fix ignore_layers matching: use substring match for nested MoE module names
- Fix _make_quant: support non-standard Linear modules lacking in_features/bias attributes
- Fix g_idx generation: use vectorized tensor ops instead of Python list comprehension
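
The g_idx change in the last bullet can be sketched as follows, assuming the common GPTQ convention that g_idx maps input channel i to group i // group_size. The function names and sizes are hypothetical; the PR does not show the actual code.

```python
import torch


def g_idx_loop(in_features: int, group_size: int) -> torch.Tensor:
    """Baseline: build the group index with a Python list comprehension."""
    return torch.tensor(
        [i // group_size for i in range(in_features)], dtype=torch.int32
    )


def g_idx_vectorized(in_features: int, group_size: int) -> torch.Tensor:
    """Same result from a single tensor op, with no per-element Python loop."""
    channels = torch.arange(in_features, dtype=torch.int32)
    return torch.div(channels, group_size, rounding_mode="floor")


print(g_idx_vectorized(10, 4).tolist())  # [0, 0, 0, 0, 1, 1, 1, 1, 2, 2]
```

Both functions return the same int32 tensor; the vectorized form avoids creating one Python integer per input channel, which matters when this runs for every quantized layer.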

sunnyxiaohu commented Jun 8, 2026

Contributor Author

> @sunnyxiaohu Please run pre-commit code formatting:
>
> pip3 install pre-commit black isort flake8
> cd AngelSlim
> pre-commit install

fixed

sunnyxiaohu closed this Jun 8, 2026
sunnyxiaohu reopened this Jun 8, 2026
linchuanxie merged commit 86479db into Tencent:main Jun 10, 2026
5 checks passed