
[quantization] Move prefill logic#613

Draft
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:refactor_gptq

Conversation

@stamalakhov
Contributor

@stamalakhov stamalakhov commented Apr 6, 2026

This PR moves the prefill logic into `PrefillQModelProcessor` to make the script ready for a prefill-decode pipeline.
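To illustrate the shape of the refactoring, here is a minimal sketch. This is not the actual TICO code: only the class name `PrefillQModelProcessor` comes from this PR, while the method names, arguments, and the fake quantize/evaluate bodies below are assumptions made purely to show how bundling the prefill steps behind one processor object lets a future decode-phase counterpart expose the same interface.

```python
# Illustrative sketch only -- not the actual TICO implementation.
# Groups the prefill quantization steps (GPTQ, PTQ wrapping/calibration,
# evaluation) behind a single processor object.

class PrefillQModelProcessor:
    """Hypothetical processor bundling the prefill quantization steps."""

    def __init__(self, args):
        self.args = args  # parsed CLI namespace (bit widths, nsamples, ...)

    def quantize(self, q_m, calib_inputs):
        # The real script would run GPTQ, wrap layers with PTQWrapper,
        # and calibrate observers; here we just record what happened.
        q_m["quantized"] = True
        q_m["nsamples"] = len(calib_inputs)
        return q_m

    def evaluate_quantized(self, q_m, dataset_test):
        # Placeholder perplexity-style evaluation over the test set.
        assert q_m.get("quantized"), "quantize() must run first"
        return {"ppl": float(len(dataset_test))}


proc = PrefillQModelProcessor(args={"linear_weight_bits": 4})
model = proc.quantize({}, calib_inputs=[0] * 128)
metrics = proc.evaluate_quantized(model, dataset_test=range(159))
print(model["nsamples"], metrics["ppl"])  # 128 159.0
```

The point of the interface is symmetry: a later `DecodeQModelProcessor` (hypothetical name) could implement the same two methods, so the driver script calls either phase identically.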

Log of `python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model Maykeye/TinyLLama-v0 --gptq_mse mse --save_circle_to_folder . --save_layers_to_folder . --eval_tasks openbookqa ... `:
Namespace(model='Maykeye/TinyLLama-v0', device='cuda', dtype='float32', seed=42, trust_remote_code=False, hf_token=None, no_tqdm=False, no_GPTQ=False, no_spinquant=False, no_PTQ=False, save_circle_to_folder='.', save_layers_to_folder='.', cache_dir='/mnt/storage/transformers_cache', nsamples_for_qcalibration=128, linear_weight_bits=4, gptq_mse='mse', max_seq_len=2048, calibrate_seq_len=2048, embedding_weight_bits=8, lm_head_weight_bits=4, eval_tasks='openbookqa', sensitivity_path=None)
=== Config ===
Model            : Maykeye/TinyLLama-v0
Device           : cuda
DType            : float32

Loading FP model …
Applying SpinQuant preprocessing …
Applying SpinQuant rotations: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 295.41it/s]

Calculating perplexities …
Token indices sequence length is longer than the specified maximum sequence length for this model (324381 > 2048). Running this sequence through the model will result in indexing errors
PPL:  99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 158/159 [00:07<00:00, 20.31it/s]

┌── Wikitext-2 test perplexity ─────────────
│ FP32 :  7584.31
└───────────────────────────────────────────
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 131.73it/s]
Running loglikelihood requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:51<00:00, 38.95it/s]
Original RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.114|±  |0.0142|
|          |       |none  |     0|acc_norm|↑  |0.210|±  |0.0182|

Applying GPTQ …
Quantizing layers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:09<00:00,  1.18s/layer]
Wrapping layers with PTQWrapper …
Calibrating PTQ observers…
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:48<00:00,  2.66it/s]

Calculating perplexities …
PPL:  99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 158/159 [00:30<00:00,  5.13it/s]

┌── Wikitext-2 test perplexity ─────────────
│ int16 :  7378.89
└───────────────────────────────────────────
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:02<00:00, 220.49it/s]
Running loglikelihood requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [03:59<00:00,  8.33it/s]
Quantized RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.112|±  |0.0141|
|          |       |none  |     0|acc_norm|↑  |0.222|±  |0.0186|

Saving model layer_0 to /mnt/storage/slow_repos/TICO/decoder_layer_0.q.circle
Saving model layer_1 to /mnt/storage/slow_repos/TICO/decoder_layer_1.q.circle
Saving model layer_2 to /mnt/storage/slow_repos/TICO/decoder_layer_2.q.circle
Saving model layer_3 to /mnt/storage/slow_repos/TICO/decoder_layer_3.q.circle
Saving model layer_4 to /mnt/storage/slow_repos/TICO/decoder_layer_4.q.circle
Saving model layer_5 to /mnt/storage/slow_repos/TICO/decoder_layer_5.q.circle
Saving model layer_6 to /mnt/storage/slow_repos/TICO/decoder_layer_6.q.circle
Saving model layer_7 to /mnt/storage/slow_repos/TICO/decoder_layer_7.q.circle
saving the whole model to /mnt/storage/slow_repos/TICO/model.q.circle
Log of `python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model HuggingFaceTB/SmolLM2-135M-Instruct --gptq_mse mse --save_circle_to_folder . --save_layers_to_folder . --eval_tasks openbookqa --no_spinquant ... `:
Namespace(model='HuggingFaceTB/SmolLM2-135M-Instruct', device='cuda', dtype='float32', seed=42, trust_remote_code=False, hf_token=None, no_tqdm=False, no_GPTQ=False, no_spinquant=True, no_PTQ=False, save_circle_to_folder='.', save_layers_to_folder='.', cache_dir=None, nsamples_for_qcalibration=128, linear_weight_bits=4, gptq_mse='mse', max_seq_len=2048, calibrate_seq_len=2048, embedding_weight_bits=8, lm_head_weight_bits=4, eval_tasks='openbookqa', sensitivity_path=None)
=== Config ===
Model            : HuggingFaceTB/SmolLM2-135M-Instruct
Device           : cuda
DType            : float32

Loading FP model …
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 272/272 [00:00<00:00, 2114.06it/s]
Skipping SpinQuant preprocessing …

Calculating perplexities …
Token indices sequence length is longer than the specified maximum sequence length for this model (304986 > 8192). Running this sequence through the model will result in indexing errors
PPL:  99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 148/149 [00:25<00:00,  5.72it/s]

┌── Wikitext-2 test perplexity ─────────────
│ FP32 :    17.38
└───────────────────────────────────────────
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 132.42it/s]
Running loglikelihood requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [01:22<00:00, 24.38it/s]
Original RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.224|±  |0.0187|
|          |       |none  |     0|acc_norm|↑  |0.332|±  |0.0211|

Applying GPTQ …
Quantizing layers: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:17<00:00,  4.58s/layer]
Wrapping layers with PTQWrapper …
Calibrating PTQ observers…
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [04:26<00:00,  2.08s/it]

Calculating perplexities …
PPL:  99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 148/149 [01:15<00:00,  1.96it/s]

┌── Wikitext-2 test perplexity ─────────────
│ int16 :    27.66
└───────────────────────────────────────────
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 132.59it/s]
Running loglikelihood requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [11:42<00:00,  2.85it/s]
Quantized RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.204|±  |0.0180|
|          |       |none  |     0|acc_norm|↑  |0.284|±  |0.0202|

Saving model layer_0 to /home/s.malakhov/projects/TICO/decoder_layer_0.q.circle
Saving model layer_1 to /home/s.malakhov/projects/TICO/decoder_layer_1.q.circle
Saving model layer_2 to /home/s.malakhov/projects/TICO/decoder_layer_2.q.circle
Saving model layer_3 to /home/s.malakhov/projects/TICO/decoder_layer_3.q.circle
Saving model layer_4 to /home/s.malakhov/projects/TICO/decoder_layer_4.q.circle
Saving model layer_5 to /home/s.malakhov/projects/TICO/decoder_layer_5.q.circle
Saving model layer_6 to /home/s.malakhov/projects/TICO/decoder_layer_6.q.circle
Saving model layer_7 to /home/s.malakhov/projects/TICO/decoder_layer_7.q.circle
Saving model layer_8 to /home/s.malakhov/projects/TICO/decoder_layer_8.q.circle
Saving model layer_9 to /home/s.malakhov/projects/TICO/decoder_layer_9.q.circle
Saving model layer_10 to /home/s.malakhov/projects/TICO/decoder_layer_10.q.circle
Saving model layer_11 to /home/s.malakhov/projects/TICO/decoder_layer_11.q.circle
Saving model layer_12 to /home/s.malakhov/projects/TICO/decoder_layer_12.q.circle
Saving model layer_13 to /home/s.malakhov/projects/TICO/decoder_layer_13.q.circle
Saving model layer_14 to /home/s.malakhov/projects/TICO/decoder_layer_14.q.circle
Saving model layer_15 to /home/s.malakhov/projects/TICO/decoder_layer_15.q.circle
Saving model layer_16 to /home/s.malakhov/projects/TICO/decoder_layer_16.q.circle
Saving model layer_17 to /home/s.malakhov/projects/TICO/decoder_layer_17.q.circle
Saving model layer_18 to /home/s.malakhov/projects/TICO/decoder_layer_18.q.circle
Saving model layer_19 to /home/s.malakhov/projects/TICO/decoder_layer_19.q.circle
Saving model layer_20 to /home/s.malakhov/projects/TICO/decoder_layer_20.q.circle
Saving model layer_21 to /home/s.malakhov/projects/TICO/decoder_layer_21.q.circle
Saving model layer_22 to /home/s.malakhov/projects/TICO/decoder_layer_22.q.circle
Saving model layer_23 to /home/s.malakhov/projects/TICO/decoder_layer_23.q.circle
Saving model layer_24 to /home/s.malakhov/projects/TICO/decoder_layer_24.q.circle
Saving model layer_25 to /home/s.malakhov/projects/TICO/decoder_layer_25.q.circle
Saving model layer_26 to /home/s.malakhov/projects/TICO/decoder_layer_26.q.circle
Saving model layer_27 to /home/s.malakhov/projects/TICO/decoder_layer_27.q.circle
Saving model layer_28 to /home/s.malakhov/projects/TICO/decoder_layer_28.q.circle
Saving model layer_29 to /home/s.malakhov/projects/TICO/decoder_layer_29.q.circle
saving the whole model to /home/s.malakhov/projects/TICO/model.q.circle

Draft: #570
Related: #586
TICO-DCO-1.0-Signed-off-by: s.malakhov &lt;s.malakhov@partner.samsung.com&gt;

@stamalakhov stamalakhov self-assigned this Apr 6, 2026
@stamalakhov stamalakhov force-pushed the refactor_gptq branch 2 times, most recently from 0ec605b to 0c68070 Compare April 6, 2026 08:23
This PR moves prefill logic to `PrefillQModelProcessor` to make the script ready for a prefill-decode pipeline.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@stamalakhov stamalakhov marked this pull request as ready for review April 6, 2026 09:38
@stamalakhov stamalakhov requested a review from a team April 6, 2026 09:38
@stamalakhov stamalakhov changed the title [quantization] [draft] Move prefill logic [quantization] Move prefill logic Apr 6, 2026
@mhs4670go
Contributor

@stamalakhov This PR was opened before #612 was resolved. Does it need to be modified according to that change? I'm just asking because #570 is based on quant_XX_prefill/decode.py.

"""
return quantize_using_PTQ(q_m, calib_inputs, self.args)

def evaluate_quantized(self, model, dataset_test):
Contributor

This api spec doesn't match with QModelProcessor. Is it intentional?

Contributor Author

Ahhh. Sorry. I'll fix it.

@stamalakhov
Contributor Author

> @stamalakhov This PR was opened before #612 was resolved. Does it need to be modified according to that change? I'm just asking because #570 is based on quant_XX_prefill/decode.py.

@mhs4670go
Sorry for the late reply. #570 will be reworked due to #612 (only a single quantized model should be saved), but this one is just a refactoring.

@stamalakhov stamalakhov marked this pull request as draft April 12, 2026 17:36