
[quantization] Move prefill logic#613

Draft
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:refactor_gptq

Conversation

@stamalakhov
Contributor

@stamalakhov stamalakhov commented Apr 6, 2026

This PR moves the prefill logic into `PrefillQModelProcessor` to make the script ready for a prefill-decode pipeline.
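To illustrate the shape of the refactoring, here is a minimal sketch. This is not the actual TICO code: only the class name `PrefillQModelProcessor` comes from this PR, while the method names, arguments, and the fake quantize/evaluate bodies below are assumptions made purely to show how bundling the prefill steps behind one processor object lets a future decode-phase counterpart expose the same interface.

```python
# Illustrative sketch only -- not the actual TICO implementation.
# Groups the prefill quantization steps (GPTQ, PTQ wrapping/calibration,
# evaluation) behind a single processor object.

class PrefillQModelProcessor:
    """Hypothetical processor bundling the prefill quantization steps."""

    def __init__(self, args):
        self.args = args  # parsed CLI namespace (bit widths, nsamples, ...)

    def quantize(self, q_m, calib_inputs):
        # The real script would run GPTQ, wrap layers with PTQWrapper,
        # and calibrate observers; here we just record what happened.
        q_m["quantized"] = True
        q_m["nsamples"] = len(calib_inputs)
        return q_m

    def evaluate_quantized(self, q_m, dataset_test):
        # Placeholder perplexity-style evaluation over the test set.
        assert q_m.get("quantized"), "quantize() must run first"
        return {"ppl": float(len(dataset_test))}


proc = PrefillQModelProcessor(args={"linear_weight_bits": 4})
model = proc.quantize({}, calib_inputs=[0] * 128)
metrics = proc.evaluate_quantized(model, dataset_test=range(159))
print(model["nsamples"], metrics["ppl"])  # 128 159.0
```

The point of the interface is symmetry: a later `DecodeQModelProcessor` (hypothetical name) could implement the same two methods, so the driver script calls either phase identically.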

Log of `python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model Maykeye/TinyLLama-v0 --gptq_mse mse --save_circle_to_folder . --save_layers_to_folder . --eval_tasks openbookqa ... `:
Namespace(model='Maykeye/TinyLLama-v0', device='cuda', dtype='float32', seed=42, trust_remote_code=False, hf_token=None, no_tqdm=False, no_GPTQ=False, no_spinquant=False, no_PTQ=False, save_circle_to_folder='.', save_layers_to_folder='.', cache_dir='/mnt/storage/transformers_cache', nsamples_for_qcalibration=128, linear_weight_bits=4, gptq_mse='mse', max_seq_len=2048, calibrate_seq_len=2048, embedding_weight_bits=8, lm_head_weight_bits=4, eval_tasks='openbookqa', sensitivity_path=None)
=== Config ===
Model            : Maykeye/TinyLLama-v0
Device           : cuda
DType            : float32

Loading FP model …
Applying SpinQuant preprocessing …
Applying SpinQuant rotations: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 295.41it/s]

Calculating perplexities …
Token indices sequence length is longer than the specified maximum sequence length for this model (324381 > 2048). Running this sequence through the model will result in indexing errors
PPL:  99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 158/159 [00:07<00:00, 20.31it/s]

┌── Wikitext-2 test perplexity ─────────────
│ FP32 :  7584.31
└───────────────────────────────────────────
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 131.73it/s]
Running loglikelihood requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [00:51<00:00, 38.95it/s]
Original RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.114|±  |0.0142|
|          |       |none  |     0|acc_norm|↑  |0.210|±  |0.0182|

Applying GPTQ …
Quantizing layers: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:09<00:00,  1.18s/layer]
Wrapping layers with PTQWrapper …
Calibrating PTQ observers…
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [00:48<00:00,  2.66it/s]

Calculating perplexities …
PPL:  99%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏| 158/159 [00:30<00:00,  5.13it/s]

┌── Wikitext-2 test perplexity ─────────────
│ int16 :  7378.89
└───────────────────────────────────────────
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:02<00:00, 220.49it/s]
Running loglikelihood requests: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [03:59<00:00,  8.33it/s]
Quantized RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.112|±  |0.0141|
|          |       |none  |     0|acc_norm|↑  |0.222|±  |0.0186|

Saving model layer_0 to /mnt/storage/slow_repos/TICO/decoder_layer_0.q.circle
Saving model layer_1 to /mnt/storage/slow_repos/TICO/decoder_layer_1.q.circle
Saving model layer_2 to /mnt/storage/slow_repos/TICO/decoder_layer_2.q.circle
Saving model layer_3 to /mnt/storage/slow_repos/TICO/decoder_layer_3.q.circle
Saving model layer_4 to /mnt/storage/slow_repos/TICO/decoder_layer_4.q.circle
Saving model layer_5 to /mnt/storage/slow_repos/TICO/decoder_layer_5.q.circle
Saving model layer_6 to /mnt/storage/slow_repos/TICO/decoder_layer_6.q.circle
Saving model layer_7 to /mnt/storage/slow_repos/TICO/decoder_layer_7.q.circle
saving the whole model to /mnt/storage/slow_repos/TICO/model.q.circle
Log of `python tico/quantization/wrapq/examples/quantize_full_qmodel_with_gptq.py --model HuggingFaceTB/SmolLM2-135M-Instruct --gptq_mse mse --save_circle_to_folder . --save_layers_to_folder . --eval_tasks openbookqa --no_spinquant ... `:
Namespace(model='HuggingFaceTB/SmolLM2-135M-Instruct', device='cuda', dtype='float32', seed=42, trust_remote_code=False, hf_token=None, no_tqdm=False, no_GPTQ=False, no_spinquant=True, no_PTQ=False, save_circle_to_folder='.', save_layers_to_folder='.', cache_dir=None, nsamples_for_qcalibration=128, linear_weight_bits=4, gptq_mse='mse', max_seq_len=2048, calibrate_seq_len=2048, embedding_weight_bits=8, lm_head_weight_bits=4, eval_tasks='openbookqa', sensitivity_path=None)
=== Config ===
Model            : HuggingFaceTB/SmolLM2-135M-Instruct
Device           : cuda
DType            : float32

Loading FP model …
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading weights: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 272/272 [00:00<00:00, 2114.06it/s]
Skipping SpinQuant preprocessing …

Calculating perplexities …
Token indices sequence length is longer than the specified maximum sequence length for this model (304986 > 8192). Running this sequence through the model will result in indexing errors
PPL:  99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 148/149 [00:25<00:00,  5.72it/s]

┌── Wikitext-2 test perplexity ─────────────
│ FP32 :    17.38
└───────────────────────────────────────────
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 132.42it/s]
Running loglikelihood requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [01:22<00:00, 24.38it/s]
Original RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.224|±  |0.0187|
|          |       |none  |     0|acc_norm|↑  |0.332|±  |0.0211|

Applying GPTQ …
Quantizing layers: 100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 30/30 [02:17<00:00,  4.58s/layer]
Wrapping layers with PTQWrapper …
Calibrating PTQ observers…
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 128/128 [04:26<00:00,  2.08s/it]

Calculating perplexities …
PPL:  99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▏ | 148/149 [01:15<00:00,  1.96it/s]

┌── Wikitext-2 test perplexity ─────────────
│ int16 :    27.66
└───────────────────────────────────────────
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 500/500 [00:03<00:00, 132.59it/s]
Running loglikelihood requests: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2000/2000 [11:42<00:00,  2.85it/s]
Quantized RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.204|±  |0.0180|
|          |       |none  |     0|acc_norm|↑  |0.284|±  |0.0202|

Saving model layer_0 to /home/s.malakhov/projects/TICO/decoder_layer_0.q.circle
Saving model layer_1 to /home/s.malakhov/projects/TICO/decoder_layer_1.q.circle
Saving model layer_2 to /home/s.malakhov/projects/TICO/decoder_layer_2.q.circle
Saving model layer_3 to /home/s.malakhov/projects/TICO/decoder_layer_3.q.circle
Saving model layer_4 to /home/s.malakhov/projects/TICO/decoder_layer_4.q.circle
Saving model layer_5 to /home/s.malakhov/projects/TICO/decoder_layer_5.q.circle
Saving model layer_6 to /home/s.malakhov/projects/TICO/decoder_layer_6.q.circle
Saving model layer_7 to /home/s.malakhov/projects/TICO/decoder_layer_7.q.circle
Saving model layer_8 to /home/s.malakhov/projects/TICO/decoder_layer_8.q.circle
Saving model layer_9 to /home/s.malakhov/projects/TICO/decoder_layer_9.q.circle
Saving model layer_10 to /home/s.malakhov/projects/TICO/decoder_layer_10.q.circle
Saving model layer_11 to /home/s.malakhov/projects/TICO/decoder_layer_11.q.circle
Saving model layer_12 to /home/s.malakhov/projects/TICO/decoder_layer_12.q.circle
Saving model layer_13 to /home/s.malakhov/projects/TICO/decoder_layer_13.q.circle
Saving model layer_14 to /home/s.malakhov/projects/TICO/decoder_layer_14.q.circle
Saving model layer_15 to /home/s.malakhov/projects/TICO/decoder_layer_15.q.circle
Saving model layer_16 to /home/s.malakhov/projects/TICO/decoder_layer_16.q.circle
Saving model layer_17 to /home/s.malakhov/projects/TICO/decoder_layer_17.q.circle
Saving model layer_18 to /home/s.malakhov/projects/TICO/decoder_layer_18.q.circle
Saving model layer_19 to /home/s.malakhov/projects/TICO/decoder_layer_19.q.circle
Saving model layer_20 to /home/s.malakhov/projects/TICO/decoder_layer_20.q.circle
Saving model layer_21 to /home/s.malakhov/projects/TICO/decoder_layer_21.q.circle
Saving model layer_22 to /home/s.malakhov/projects/TICO/decoder_layer_22.q.circle
Saving model layer_23 to /home/s.malakhov/projects/TICO/decoder_layer_23.q.circle
Saving model layer_24 to /home/s.malakhov/projects/TICO/decoder_layer_24.q.circle
Saving model layer_25 to /home/s.malakhov/projects/TICO/decoder_layer_25.q.circle
Saving model layer_26 to /home/s.malakhov/projects/TICO/decoder_layer_26.q.circle
Saving model layer_27 to /home/s.malakhov/projects/TICO/decoder_layer_27.q.circle
Saving model layer_28 to /home/s.malakhov/projects/TICO/decoder_layer_28.q.circle
Saving model layer_29 to /home/s.malakhov/projects/TICO/decoder_layer_29.q.circle
saving the whole model to /home/s.malakhov/projects/TICO/model.q.circle

Draft: #570
Related: #586
TICO-DCO-1.0-Signed-off-by: s.malakhov &lt;s.malakhov@partner.samsung.com&gt;

@stamalakhov stamalakhov self-assigned this Apr 6, 2026
@stamalakhov stamalakhov force-pushed the refactor_gptq branch 2 times, most recently from 0ec605b to 0c68070 Compare April 6, 2026 08:23
This PR moves prefill logic to `PrefillQModelProcessor` to make the script ready for a prefill-decode pipeline.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@stamalakhov stamalakhov marked this pull request as ready for review April 6, 2026 09:38
@stamalakhov stamalakhov requested a review from a team April 6, 2026 09:38
@stamalakhov stamalakhov changed the title [quantization] [draft] Move prefill logic [quantization] Move prefill logic Apr 6, 2026
@mhs4670go
Contributor

@stamalakhov This PR was opened before #612 was resolved. Does it need to be modified according to that change? I'm just asking because #570 is based on quant_XX_prefill/decode.py.

"""
return quantize_using_PTQ(q_m, calib_inputs, self.args)

def evaluate_quantized(self, model, dataset_test):
Contributor

This api spec doesn't match with QModelProcessor. Is it intentional?

Contributor Author

Ahhh. Sorry. I'll fix it.

@stamalakhov
Contributor Author

> @stamalakhov This PR was opened before #612 was resolved. Does it need to be modified according to that change? I'm just asking because #570 is based on quant_XX_prefill/decode.py.

@mhs4670go
Sorry for the late reply. #570 will be reworked due to #612 (only a single quantized model should be saved), but this one is just a refactoring.

@stamalakhov stamalakhov marked this pull request as draft April 12, 2026 17:36