[quantization] Support SpinQuant #607
Conversation
This commit supports SpinQuant. TICO-DCO-1.0-Signed-off-by: seongwoo <mhs4670go@naver.com>
| model = prepare(model, SpinQuantConfig()) |
| model = convert(model) |
Just a suggestion: maybe we can have an option to turn it on/off (like the no_GPTQ option that turns GPTQ on/off)?
| # Apply the SpinQuant rotation only when the source model provides it. |
| if self.rotate_embedding is not None: |
|     hidden_states = self.rotate_embedding(hidden_states) |
So these are not free? I mean the embedding/lm_head rotations.
Yes. This is for tied embeddings; it's a trade-off between memory footprint and linear-layer overhead.
Overhead comparison (LLaMA 3.2-3B)
Assumptions:
- hidden size = 3072
- vocab size ≈ 128k
- tied embeddings (default LLaMA behavior)
Summary
| Aspect | Option A (tie + rotate) | Option B (untie + fuse) |
|---|---|---|
| Param overhead | ~18.9M (~0.6%) | ~394M (~12%) |
| Memory | Low | Very high |
| Compute | Extra matmul (2x) | None |
| Latency | Worse (prefill) | Better |
Option A — Keep tied embeddings + add rotation layers
Add:
- `rotate_embedding`: 3072 × 3072
- `rotate_lm_head`: 3072 × 3072
Parameter overhead
- Per rotation layer: ~9.44M params
- Total: ~18.9M params (~0.6% of 3.2B)
Memory overhead
| dtype | memory |
|---|---|
| FP32 | ~75 MB |
| FP16/BF16 | ~38 MB |
| INT8 | ~19 MB |
| INT4 | ~9 MB |
Compute overhead
- Per token: ~18.9M MACs (2 matmuls)
- Prefill (seq = S): ~18.9M × S
- Decode: constant per token
Option B — Untie embeddings + fuse rotation into weights
- Remove rotation layers
- Duplicate embedding weights
Parameter overhead
- Extra embedding copy:
- 128256 × 3072 ≈ 394M params (~12%)
Memory overhead
| dtype | memory |
|---|---|
| FP32 | ~1.58 GB |
| FP16/BF16 | ~788 MB |
| INT8 | ~394 MB |
| INT4 | ~197 MB |
Compute overhead
- None
Got it. Thank you very much for a detailed answer.
| Parameters: |
| module: Target linear module with weight shape [out_features, in_features]. |
| rotation: Square rotation matrix of shape [in_features, in_features]. |
(Very Optional) How about adding these comments?
| rotation: Square rotation matrix of shape [in_features, in_features]. |
| Notes: |
| PyTorch linear computes y = x @ W.T (row-vector convention). |
| Right-multiplying W by R gives W_new = W @ R, so: |
| y = x @ W_new.T = x @ R.T @ W.T |
| This is equivalent to rotating the input by R.T before the original weight. |
(Background) It took me a while to understand why we multiply by R on the right.
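The identity in the suggested note can be checked numerically. A minimal sketch, assuming an arbitrary orthogonal R (the names here are illustrative, not the PR's actual helpers):

```python
import torch

# Check: with W_new = W @ R, y = x @ W_new.T equals (x @ R.T) @ W.T,
# i.e. fusing R into the weight is the same as rotating the input by R.T.
torch.manual_seed(0)
in_f, out_f = 8, 4
W = torch.randn(out_f, in_f)

# Build a random orthogonal rotation via QR decomposition.
Q, _ = torch.linalg.qr(torch.randn(in_f, in_f))
R = Q

x = torch.randn(3, in_f)
W_new = W @ R                         # rotation fused into the weight
y_fused = x @ W_new.T
y_rotated_input = (x @ R.T) @ W.T     # input rotated, original weight
assert torch.allclose(y_fused, y_rotated_input, atol=1e-5)
```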
Thank you for the suggestion. It'd definitely be helpful.
| Parameters: |
| module: Target linear module with weight shape [out_features, in_features]. |
| rotation_t: Transposed square rotation matrix of shape [out_features, out_features]. |
(Very Optional) How about adding these comments for future readers?
| rotation_t: Transposed square rotation matrix of shape [out_features, out_features]. |
| Notes: |
| The caller passes r1.T (i.e., rotation_t = R^T). |
| This stores W_new = R^T @ W, so: |
| y = x @ W_new.T = x @ W.T @ R |
| Equivalent to rotating the output by R after the original weight. |
(Background) Same reason as https://github.com/Samsung/TICO/pull/607/changes#r3037530480.
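This output-side identity can also be checked numerically. A minimal sketch, again with an arbitrary orthogonal R (names are illustrative, not the PR's helpers):

```python
import torch

# Check: storing W_new = R.T @ W means y = x @ W_new.T equals
# (x @ W.T) @ R, i.e. the original output rotated by R afterwards.
torch.manual_seed(0)
in_f, out_f = 8, 4
W = torch.randn(out_f, in_f)

# Random orthogonal rotation over the output dimension.
Q, _ = torch.linalg.qr(torch.randn(out_f, out_f))
R = Q

x = torch.randn(3, in_f)
W_new = R.T @ W                        # what the helper stores (rotation_t = R.T)
y_fused = x @ W_new.T
y_rotated_output = (x @ W.T) @ R       # original output, then rotate by R
assert torch.allclose(y_fused, y_rotated_output, atol=1e-5)
```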
| - `model.rotate_lm_head` |
| This preserves tied embedding behavior while still applying the same logical transforms |
| during inference. |
Just out of curiosity, do you know how the Hadamard matrix is lowered in our internal NPU?
The paper says that Hadamard matrix multiplication can be implemented with an efficient algorithm, so it has only marginal overhead.
I was just wondering if we can implement it in a similar way.
(From the SpinQuant paper:)
Hadamard rotation can be computed with fast hadamard transform and introduce marginal overhead to the inference latency.
This is because online Hadamard transforms can be efficiently implemented without significant overhead.
I can't write down the details :) But there's no difference between the Hadamard matrix and other nn.Linear layers; rotate_embedding and rotate_lm_head are just linear layers.
FYI, the Hadamard matrix computation (x @ H) has O(n^2) complexity, while the FHT (fast Hadamard transform) computes the same thing in only O(n log n).
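For reference, the O(n log n) transform mentioned above can be sketched with the classic butterfly recurrence and compared against the naive x @ H matmul. A standalone sketch (not the PR's `hadamard_utils` code), valid for n a power of two:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: returns x @ H (Sylvester ordering)."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        # Butterfly step: combine pairs h apart, O(n) work per level,
        # log2(n) levels => O(n log n) total.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[..., j], x[..., j + h] = (
                    x[..., j] + x[..., j + h],
                    x[..., j] - x[..., j + h],
                )
        h *= 2
    return x

def hadamard(n):
    """Naive Sylvester construction of the n x n Hadamard matrix."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# The fast transform matches the O(n^2) matmul (H is symmetric).
x = np.random.default_rng(0).standard_normal(8)
assert np.allclose(fwht(x), x @ hadamard(8))
```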
| SpinQuant is a rotation-based pre-quantization algorithm for large language models. |
| Its goal is to make a model more quantization-friendly by applying offline orthogonal |
| transformations to the weight space before downstream quantization. |
(Very Optional) Adding a comment about the original paper might be helpful.
For me, reading the paper really helped me understand the overall concept, where the terms R1 and R2 come from, and why they are inserted at specific positions.
| > **Reference**: Liu et al., *"SpinQuant: LLM Quantization with Learned Rotations"*, |
| > arXiv:2405.16406, 2024. — R1 and R2 follow the rotation notation used in this paper. |
zetwhite
left a comment
LGTM 👍
I read the code except `hadamard_utils.py` and it all looks good 😄
I added some comments to improve the documentation for future readers, but it's totally optional!