fix(gptq): fix Hessian computation, variable-length sequence support, and layer output type handling #318
Merged
yghstill
previously approved these changes
Jun 1, 2026
Collaborator
@sunnyxiaohu
pip3 install pre-commit black isort flake8
cd AngelSlim
pre-commit install
… and output type handling
- Fix add_batch() squeeze() bug that corrupts Hessian for MoE experts with single-token routing
- Rewrite Catcher to use dynamic list storage, supporting variable-length sequences
- Fix layer output type handling: use _extract_hidden_states() instead of unconditional [0]
- Fix ignore_layers matching: use substring match for nested MoE module names
- Fix _make_quant: support non-standard Linear modules lacking in_features/bias attributes
- Fix g_idx generation: use vectorized tensor ops instead of Python list comprehension
3e66200 to
5d482fd
Contributor
Author
fixed
yghstill
approved these changes
Jun 8, 2026
Summary
Fix critical bugs in the GPTQ quantization pipeline that cause incorrect quantization results, especially for MoE models and variable-length calibration data.
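To make the MoE failure mode concrete, here is a minimal sketch of how a parameterless `squeeze()` can miscount tokens when an expert receives a single routed token. The function names `rows_after_squeeze` and `rows_after_reshape` are hypothetical and are not taken from AngelSlim; the reshape-based variant is one assumed way to keep the feature dimension intact, not necessarily the exact code in this PR.

```python
import torch


def rows_after_squeeze(inp: torch.Tensor) -> int:
    """Hypothetical stand-in for the pre-fix logic: squeeze, then read dim 0."""
    flat = inp.squeeze()  # drops EVERY size-1 dimension, including the token dim
    return flat.shape[0]


def rows_after_reshape(inp: torch.Tensor) -> int:
    """Hypothetical safer variant: always flatten to [tokens, features]."""
    flat = inp.reshape(-1, inp.shape[-1])  # the feature dim is preserved
    return flat.shape[0]


feature_dim = 8
single_token = torch.zeros(1, feature_dim)    # expert received one routed token
several_tokens = torch.zeros(4, feature_dim)  # expert received four routed tokens

print(rows_after_squeeze(single_token))    # 8: feature dim mistaken for token count
print(rows_after_reshape(single_token))    # 1
print(rows_after_squeeze(several_tokens))  # 4
print(rows_after_reshape(several_tokens))  # 4
```

The two variants agree whenever more than one token is routed, which is why a bug of this kind can go unnoticed on dense models and only surface with sparse expert routing.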
Problems
- Hessian matrix corruption for MoE experts: `add_batch()` uses a parameterless `squeeze()`, which collapses `[1, dim]` to `[dim]` when an expert receives only 1 routed token, causing `nsamples` to be incorrectly accumulated as `feature_dim` instead of `1`.
- Variable-length sequence incompatibility: `Catcher` pre-allocates a fixed-size tensor `[nsamples, seq_length, hidden_size]`, requiring all samples to have identical `seq_len`. Shorter sequences get zero-padded (introducing Hessian noise) and longer sequences are silently truncated.
- Layer output type mismatch: unconditional `layer(...)[0]` assumes tuple output, but some decoder layers return a plain tensor. `[0]` then incorrectly indexes the batch dimension.
- `ignore_layers` exact match fails for MoE: nested module names like `mlp.experts.0.gate_proj` cannot match the ignore pattern `gate_proj` with exact equality.
- `_make_quant` AttributeError on non-standard Linear: modules like `TopKRouter` lack `in_features`/`out_features`/`bias` attributes, causing crashes during weight replacement.
- `g_idx` generation uses slow Python list comprehension: replaced with vectorized tensor operations.
Changes
- `gptq_module.py`: `squeeze()`, fix `add_batch()` reshape logic, vectorize `g_idx`
- `catcher.py`: `max_seq_length` VRAM guard
- `gptq.py`: `_extract_hidden_states()` helper, per-sample forward loop, substring `ignore_layers` matching
- `helper_layer.py`: `getattr(linear, "bias", None)` for non-standard Linear modules
Testing
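The following is not the PR's own test code. As a hedged sketch, the `ignore_layers` matching change described above could be sanity-checked along these lines. `is_ignored_exact` and `is_ignored_substring` are hypothetical helpers written for this illustration, not functions from `gptq.py`.

```python
from typing import Iterable


def is_ignored_exact(name: str, ignore_layers: Iterable[str]) -> bool:
    """Assumed pre-fix behaviour: the full module name must equal a pattern."""
    return name in set(ignore_layers)


def is_ignored_substring(name: str, ignore_layers: Iterable[str]) -> bool:
    """Assumed post-fix behaviour: a pattern may appear anywhere in the name."""
    return any(pattern in name for pattern in ignore_layers)


ignore_layers = ["gate_proj"]
nested_name = "mlp.experts.0.gate_proj"  # nested MoE module name
other_name = "mlp.experts.0.up_proj"     # should never be ignored here

print(is_ignored_exact(nested_name, ignore_layers))      # False: exact match misses it
print(is_ignored_substring(nested_name, ignore_layers))  # True
print(is_ignored_substring(other_name, ignore_layers))   # False
```

Substring matching is deliberately broader than equality, so a short pattern such as `proj` would match every projection layer. Patterns should be chosen to be as specific as the module naming allows.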