[quantization] Support SpinQuant #607
Conversation
This commit supports SpinQuant. TICO-DCO-1.0-Signed-off-by: seongwoo <mhs4670go@naver.com>
| model = prepare(model, SpinQuantConfig()) |
| model = convert(model) |
Just a suggestion: maybe we can have an option to turn it on/off (like the no_GPTQ option that turns GPTQ on/off)?
| # Apply the SpinQuant rotation only when the source model provides it. |
| if self.rotate_embedding is not None: |
|     hidden_states = self.rotate_embedding(hidden_states) |
So these are not free? I mean the embedding/lm_head rotations.
Yes. This is for tied embeddings; it's a trade-off between memory footprint and linear-layer overhead.
Overhead comparison (LLaMA 3.2-3B)
Assumptions:
- hidden size = 3072
- vocab size ≈ 128k
- tied embeddings (default LLaMA behavior)
Summary
| Aspect | Option A (tie + rotate) | Option B (untie + fuse) |
|---|---|---|
| Param overhead | ~18.9M (~0.6%) | ~394M (~12%) |
| Memory | Low | Very high |
| Compute | Extra matmul (2x) | None |
| Latency | Worse (prefill) | Better |
Option A — Keep tied embeddings + add rotation layers
Add:
- `rotate_embedding`: 3072 × 3072
- `rotate_lm_head`: 3072 × 3072
Parameter overhead
- Per rotation layer: ~9.44M params
- Total: ~18.9M params (~0.6% of 3.2B)
Memory overhead
| dtype | memory |
|---|---|
| FP32 | ~75 MB |
| FP16/BF16 | ~38 MB |
| INT8 | ~19 MB |
| INT4 | ~9 MB |
Compute overhead
- Per token: ~18.9M MACs (2 matmuls)
- Prefill (seq = S): ~18.9M × S
- Decode: constant per token
Option B — Untie embeddings + fuse rotation into weights
- Remove rotation layers
- Duplicate embedding weights
Parameter overhead
- Extra embedding copy:
- 128256 × 3072 ≈ 394M params (~12%)
Memory overhead
| dtype | memory |
|---|---|
| FP32 | ~1.58 GB |
| FP16/BF16 | ~788 MB |
| INT8 | ~394 MB |
| INT4 | ~197 MB |
Compute overhead
- None
Got it. Thank you very much for a detailed answer.
| Parameters: |
| module: Target linear module with weight shape [out_features, in_features]. |
| rotation: Square rotation matrix of shape [in_features, in_features]. |
(Very Optional) How about adding these comments?
| rotation: Square rotation matrix of shape [in_features, in_features]. |
| Notes: |
| PyTorch linear computes y = x @ W.T (row-vector convention). |
| Right-multiplying W by R gives W_new = W @ R, so: |
| y = x @ W_new.T = x @ R.T @ W.T |
| This is equivalent to rotating the input by R.T before the original weight. |
(Background) It took me a while to understand why we multiply by R on the right.
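The identity in the suggested note can be checked numerically. A minimal sketch, assuming an arbitrary orthogonal R (the names here are illustrative, not the PR's actual helpers):

```python
import torch

# Check: with W_new = W @ R, y = x @ W_new.T equals (x @ R.T) @ W.T,
# i.e. fusing R into the weight is the same as rotating the input by R.T.
torch.manual_seed(0)
in_f, out_f = 8, 4
W = torch.randn(out_f, in_f)

# Build a random orthogonal rotation via QR decomposition.
Q, _ = torch.linalg.qr(torch.randn(in_f, in_f))
R = Q

x = torch.randn(3, in_f)
W_new = W @ R                         # rotation fused into the weight
y_fused = x @ W_new.T
y_rotated_input = (x @ R.T) @ W.T     # input rotated, original weight
assert torch.allclose(y_fused, y_rotated_input, atol=1e-5)
```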
Thank you for the suggestion. It'd definitely be helpful.
| Parameters: |
| module: Target linear module with weight shape [out_features, in_features]. |
| rotation_t: Transposed square rotation matrix of shape [out_features, out_features]. |
(Very Optional) How about adding these comments for future readers?
| rotation_t: Transposed square rotation matrix of shape [out_features, out_features]. |
| Notes: |
| The caller passes r1.T (i.e., rotation_t = R^T). |
| This stores W_new = R^T @ W, so: |
| y = x @ W_new.T = x @ W.T @ R |
| Equivalent to rotating the output by R after the original weight. |
(Background) Same reason as https://github.com/Samsung/TICO/pull/607/changes#r3037530480.
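This output-side identity can also be checked numerically. A minimal sketch, again with an arbitrary orthogonal R (names are illustrative, not the PR's helpers):

```python
import torch

# Check: storing W_new = R.T @ W means y = x @ W_new.T equals
# (x @ W.T) @ R, i.e. the original output rotated by R afterwards.
torch.manual_seed(0)
in_f, out_f = 8, 4
W = torch.randn(out_f, in_f)

# Random orthogonal rotation over the output dimension.
Q, _ = torch.linalg.qr(torch.randn(out_f, out_f))
R = Q

x = torch.randn(3, in_f)
W_new = R.T @ W                        # what the helper stores (rotation_t = R.T)
y_fused = x @ W_new.T
y_rotated_output = (x @ W.T) @ R       # original output, then rotate by R
assert torch.allclose(y_fused, y_rotated_output, atol=1e-5)
```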
| - `model.rotate_lm_head` |
| This preserves tied embedding behavior while still applying the same logical transforms |
| during inference. |
Just out of curiosity, do you know how the Hadamard matrix is lowered in our internal NPU?
The paper says that Hadamard matrix multiplication can be implemented with an efficient algorithm, so it has only marginal overhead.
I was just wondering if we can implement it in a similar way.
(From the SpinQuant paper:)
Hadamard rotation can be computed with fast hadamard transform and introduce marginal overhead to the inference latency.
This is because online Hadamard transforms can be efficiently implemented without significant overhead.
I can't write down the details :) But there's no difference between the Hadamard matrix and other nn.Linear layers; rotate_embedding and rotate_lm_head are just linear layers.
FYI, the Hadamard matrix computation (x @ H) has O(n^2) complexity, while the FHT (fast Hadamard transform) computes the same thing in only O(n log n).
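For reference, the O(n log n) transform mentioned above can be sketched with the classic butterfly recurrence and compared against the naive x @ H matmul. A standalone sketch (not the PR's `hadamard_utils` code), valid for n a power of two:

```python
import numpy as np

def fwht(x):
    """Fast Walsh-Hadamard transform: returns x @ H (Sylvester ordering)."""
    x = x.copy()
    n = x.shape[-1]
    h = 1
    while h < n:
        # Butterfly step: combine pairs h apart, O(n) work per level,
        # log2(n) levels => O(n log n) total.
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                x[..., j], x[..., j + h] = (
                    x[..., j] + x[..., j + h],
                    x[..., j] - x[..., j + h],
                )
        h *= 2
    return x

def hadamard(n):
    """Naive Sylvester construction of the n x n Hadamard matrix."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

# The fast transform matches the O(n^2) matmul (H is symmetric).
x = np.random.default_rng(0).standard_normal(8)
assert np.allclose(fwht(x), x @ hadamard(8))
```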
| SpinQuant is a rotation-based pre-quantization algorithm for large language models. |
| Its goal is to make a model more quantization-friendly by applying offline orthogonal |
| transformations to the weight space before downstream quantization. |
(Very Optional) Adding a comment about the original paper might be helpful.
For me, reading the paper really helped me understand the overall concept, where the terms R1 and R2 come from, and why they are inserted at specific positions.
| > **Reference**: Liu et al., *"SpinQuant: LLM Quantization with Learned Rotations"*, |
| > arXiv:2405.16406, 2024. — R1 and R2 follow the rotation notation used in this paper. |
zetwhite
left a comment
LGTM 👍
I read the code except `hadamard_utils.py` and it all looks good 😄
I added some comments to improve the documentation for future readers, but it's totally optional!