
[quantization] Support SpinQuant#607

Merged
mhs4670go merged 2 commits into Samsung:main from mhs4670go:spin
Apr 6, 2026

Conversation

Contributor

@mhs4670go mhs4670go commented Apr 3, 2026

This commit supports SpinQuant.

┌── Wikitext-2 test perplexity ─────────────
│ FP32  :    11.05
│ int16 :    11.92
└───────────────────────────────────────────

TICO-DCO-1.0-Signed-off-by: seongwoo <mhs4670go@naver.com>

Comment on lines +368 to +369
model = prepare(model, SpinQuantConfig())
model = convert(model)
Contributor


Just a suggestion: maybe we can have an option to turn it on/off (like the no_GPTQ option that turns GPTQ on/off)?


# Apply the SpinQuant rotation only when the source model provides it.
if self.rotate_embedding is not None:
hidden_states = self.rotate_embedding(hidden_states)
Contributor

So these are not for free? I mean embedding/lm_head.

Contributor Author

Yes. This is for tied embeddings. It's a trade-off between memory footprint and linear-layer overhead.

Overhead comparison (LLaMA 3.2-3B)

Assumptions:

  • hidden size = 3072
  • vocab size ≈ 128k
  • tied embeddings (default LLaMA behavior)

Summary

| Aspect | Option A (tie + rotate) | Option B (untie + fuse) |
|---|---|---|
| Param overhead | ~18.9M (~0.6%) | ~394M (~12%) |
| Memory | Low | Very high |
| Compute | Extra matmul (2x) | None |
| Latency | Worse (prefill) | Better |

Option A — Keep tied embeddings + add rotation layers

Add:

  • rotate_embedding: 3072 × 3072
  • rotate_lm_head: 3072 × 3072

Parameter overhead

  • Per rotation layer: ~9.44M params (3072 × 3072)
  • Total: ~18.9M params (~0.6% of 3.2B)

Memory overhead

| dtype | memory |
|---|---|
| FP32 | ~75 MB |
| FP16/BF16 | ~38 MB |
| INT8 | ~19 MB |
| INT4 | ~9 MB |

Compute overhead

  • Per token: ~18.9M MACs (2 matmuls)
  • Prefill (seq = S): ~18.9M × S
  • Decode: constant per token

Option B — Untie embeddings + fuse rotation into weights

  • Remove rotation layers
  • Duplicate embedding weights

Parameter overhead

  • Extra embedding copy:
    • 128256 × 3072 ≈ 394M params (~12%)

Memory overhead

| dtype | memory |
|---|---|
| FP32 | ~1.58 GB |
| FP16/BF16 | ~788 MB |
| INT8 | ~394 MB |
| INT4 | ~197 MB |

Compute overhead

  • None
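The figures above can be verified with quick back-of-the-envelope arithmetic (assuming, as stated, hidden size 3072, vocab size 128256, and FP32 at 4 bytes per parameter):

```python
# Back-of-the-envelope check of the overhead figures quoted above.
hidden = 3072
vocab = 128256

# Option A: two rotation layers, each hidden x hidden.
option_a = 2 * hidden * hidden            # 18,874,368 ≈ 18.9M params
# Option B: an extra, untied copy of the embedding table.
option_b = vocab * hidden                 # 394,002,432 ≈ 394M params

print(f"Option A: {option_a / 1e6:.1f}M params "
      f"({option_a / 3.2e9:.1%} of 3.2B)")     # ~18.9M (~0.6%)
print(f"Option B: {option_b / 1e6:.0f}M params "
      f"({option_b / 3.2e9:.1%} of 3.2B)")     # ~394M (~12%)

# FP32 memory (4 bytes per param).
print(f"Option A FP32: {option_a * 4 / 1e6:.0f} MB")   # ~75 MB
print(f"Option B FP32: {option_b * 4 / 1e9:.2f} GB")   # ~1.58 GB
```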

Contributor

Got it. Thank you very much for a detailed answer.

stamalakhov previously approved these changes Apr 3, 2026
Contributor

@stamalakhov stamalakhov left a comment


LGTM! Thank you!

Contributor

@Torrero Torrero left a comment


LGTM


Parameters:
module: Target linear module with weight shape [out_features, in_features].
rotation: Square rotation matrix of shape [in_features, in_features].

(Very Optional) How about adding these comments?

Suggested change
rotation: Square rotation matrix of shape [in_features, in_features].
rotation: Square rotation matrix of shape [in_features, in_features].
Notes:
PyTorch linear computes y = x @ W.T (row-vector convention).
Right-multiplying W by R gives W_new = W @ R, so:
y = x @ W_new.T = x @ R.T @ W.T
This is equivalent to rotating the input by R.T before the original weight.

(Background) It took me a while to understand why we multiply by $R$ instead of $R_1^T$, unlike what's shown in the diagram. I think adding these comments will be helpful for future readers, most likely my future self!

(Image attached: rotation diagram)
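For readers who want to convince themselves numerically, a minimal sketch of the identity in the suggested note, using a random orthogonal R from a QR decomposition (the shapes are illustrative, not the model's):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4

# Random orthogonal matrix R (the Q factor of a QR decomposition).
R, _ = np.linalg.qr(rng.standard_normal((d_in, d_in)))
W = rng.standard_normal((d_out, d_in))   # linear weight [out_features, in_features]
x = rng.standard_normal((3, d_in))       # batch of input row vectors

# Fusing R into the weight (W_new = W @ R) ...
y_fused = x @ (W @ R).T
# ... is equivalent to rotating the input by R.T before the original weight.
y_rotated = (x @ R.T) @ W.T

assert np.allclose(y_fused, y_rotated)
```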

Contributor Author

Thank you for the suggestion. It'd definitely be helpful.


Parameters:
module: Target linear module with weight shape [out_features, in_features].
rotation_t: Transposed square rotation matrix of shape [out_features, out_features].

(Very Optional) How about adding these comments for future readers?

Suggested change
rotation_t: Transposed square rotation matrix of shape [out_features, out_features].
rotation_t: Transposed square rotation matrix of shape [out_features, out_features].
Notes:
The caller passes r1.T (i.e., rotation_t = R^T).
This stores W_new = R^T @ W, so:
y = x @ W_new.T = x @ W.T @ R
Equivalent to rotating the output by R after the original weight.

(Background) Same reason as https://github.com/Samsung/TICO/pull/607/changes#r3037530480.
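The output-side identity can be checked the same way (again with an illustrative random orthogonal R, not the model's actual shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out = 8, 4

# Random orthogonal matrix R acting on the output dimension.
R, _ = np.linalg.qr(rng.standard_normal((d_out, d_out)))
W = rng.standard_normal((d_out, d_in))   # linear weight [out_features, in_features]
x = rng.standard_normal((3, d_in))       # batch of input row vectors

# Storing W_new = R.T @ W (i.e. rotation_t = R.T) ...
y_fused = x @ (R.T @ W).T
# ... is equivalent to rotating the output by R after the original weight.
y_rotated = (x @ W.T) @ R

assert np.allclose(y_fused, y_rotated)
```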

- `model.rotate_lm_head`

This preserves tied embedding behavior while still applying the same logical transforms
during inference.

@zetwhite zetwhite Apr 6, 2026


Just out of curiosity, do you know how the Hadamard matrix is lowered in our internal NPU?

The paper says that Hadamard matrix multiplication can be implemented with an efficient algorithm, so it has only marginal overhead.

I was just wondering if we could implement it in a similar way.

(from the SpinQuant paper)

Hadamard rotation can be computed with fast hadamard transform and introduce marginal overhead to the inference latency.

This is because online Hadamard transforms can be efficiently implemented without significant overhead.

Contributor Author

Can't write down the details :) But there's no difference between the Hadamard matrix and other nn.Linear layers: rotate_embedding and rotate_lm_head are just linear layers.

FYI, computing the Hadamard rotation as a dense matmul (x @ H) has O(n^2) complexity, but the FHT (fast Hadamard transform) computes the same thing in only O(n log n).
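To illustrate the O(n log n) claim, here is a minimal unnormalized fast Walsh–Hadamard transform, assuming n is a power of two and Sylvester ordering; this is a sketch for intuition, not the PR's hadamard_utils implementation:

```python
import numpy as np

def fht(x):
    """Fast (Walsh-)Hadamard transform in O(n log n).

    Equivalent to x @ H, where H is the unnormalized n x n
    Sylvester Hadamard matrix and n is a power of two.
    """
    x = np.asarray(x, dtype=np.float64).copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        # Butterfly step: combine adjacent length-h blocks.
        for i in range(0, n, h * 2):
            a = x[..., i:i + h].copy()
            b = x[..., i + h:i + 2 * h].copy()
            x[..., i:i + h] = a + b
            x[..., i + h:i + 2 * h] = a - b
        h *= 2
    return x

# Cross-check against the explicit O(n^2) matmul x @ H.
n = 8
H = np.array([[1.0]])
while H.shape[0] < n:                 # Sylvester construction
    H = np.block([[H, H], [H, -H]])
x = np.arange(n, dtype=np.float64)
assert np.allclose(fht(x), x @ H)
```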

SpinQuant is a rotation-based pre-quantization algorithm for large language models.
Its goal is to make a model more quantization-friendly by applying offline orthogonal
transformations to the weight space before downstream quantization.


(Very Optional) Adding a comment about the original paper might be helpful.

For me, reading the paper really helped me understand the overall concept: where the terms R1 and R2 come from, and why they are inserted at specific positions.

Suggested change
> **Reference**: Liu et al., *"SpinQuant: LLM Quantization with Learned Rotations"*,
> arXiv:2405.16406, 2024. — R1 and R2 follow the rotation notation used in this paper.

zetwhite previously approved these changes Apr 6, 2026

@zetwhite zetwhite left a comment


LGTM 👍

I read the code except hadamard_utils.py and it all looks good 😄
I added some comments to improve the docs for future readers, but it's totally optional!

@mhs4670go mhs4670go merged commit 31d8e93 into Samsung:main Apr 6, 2026
7 checks passed
@mhs4670go mhs4670go deleted the spin branch April 6, 2026 04:54