
[quantization] Evaluate fk llama model#626

Open
stamalakhov wants to merge 3 commits into Samsung:main from stamalakhov:evaluate_fk_model

Conversation

@stamalakhov
Contributor

This PR adds a script to evaluate a fake-quantized Llama-based model.

Sample run on the model produced by #625:
Namespace(model='Maykeye/TinyLLama-v0', device='cuda', dtype='float32', hf_token=None, trust_remote_code=False, cache_dir='/mnt/storage/transformers_cache', fk_model_path='./PTQ_Maykeye_TinyLLama-v0_SpinQuant_GPTQ_smse_32_42.pt', eval_tasks='openbookqa')
=== Config ===
Model            : Maykeye/TinyLLama-v0
Device           : cuda
DType            : float32
fk_model_path    : ./PTQ_Maykeye_TinyLLama-v0_SpinQuant_GPTQ_smse_32_42.pt

Loading FP model …
Loading weights: 100%|…| 75/75 [00:00<00:00, 1893.81it/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
Generating train split: 100%|…| 4957/4957 [00:00<00:00, 18822.43 examples/s]
Generating validation split: 100%|…| 500/500 [00:00<00:00, 67593.37 examples/s]
Generating test split: 100%|…| 500/500 [00:00<00:00, 118946.85 examples/s]
100%|…| 500/500 [00:04<00:00, 106.94it/s]
Running loglikelihood requests: 100%|…| 2000/2000 [00:32<00:00, 61.30it/s]
Original RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.114|±  |0.0142|
|          |       |none  |     0|acc_norm|↑  |0.210|±  |0.0182|

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|…| 500/500 [00:04<00:00, 109.21it/s]
Running loglikelihood requests:   0%|…| 0/2000 [00:00<?, ?it/s]
`use_return_dict` is deprecated! Use `return_dict` instead!
Running loglikelihood requests: 100%|…| 2000/2000 [09:06<00:00,  3.66it/s]
Quantized RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.118|±  |0.0144|
|          |       |none  |     0|acc_norm|↑  |0.210|±  |0.0182|
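The `fk_model_path` checkpoint above is restored into the FP model before the second evaluation pass. A minimal sketch of that restore step, assuming the checkpoint is a plain `state_dict` saved with `torch.save()` (the function name and structure below are illustrative, not the PR's actual code):

```python
# Hypothetical sketch: restore a saved fake-quantized state dict into an
# FP model before evaluation. Assumes the .pt file holds a plain state_dict.
import torch
import torch.nn as nn


def load_fk_weights(model: nn.Module, fk_model_path: str, device: str = "cpu") -> nn.Module:
    """Load fake-quantized weights and put the model in eval mode."""
    state_dict = torch.load(fk_model_path, map_location=device)
    model.load_state_dict(state_dict)
    return model.to(device).eval()
```

Because fake quantization bakes the quantization error into ordinary float weights, the restored model runs through the same evaluation path as the FP baseline.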

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov self-assigned this Apr 13, 2026
This PR adds script to evaluate fake-quantized Llama-based model.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
if torch.cuda.is_available():
    torch.cuda.empty_cache()

if args.eval_tasks is not None:
Contributor


(Optional) Loading the FP model every time may be unnecessarily expensive if the main goal is to re-evaluate a previously saved quantized model. Would it make sense to support a quantized-only evaluation mode? (e.g. --skip_fp_eval)
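A quantized-only mode along those lines could be wired up with one `store_true` flag; a hypothetical sketch (the flag name and structure are the reviewer's suggestion, not code from this PR):

```python
# Hypothetical sketch of a --skip_fp_eval flag: evaluate only the saved
# quantized model without loading the FP baseline first.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluate a fake-quantized model")
    parser.add_argument("--fk_model_path", required=True)
    parser.add_argument(
        "--skip_fp_eval",
        action="store_true",
        help="Re-evaluate only the saved quantized model; skip the FP baseline",
    )
    return parser


# Demo invocation with an illustrative path:
args = build_parser().parse_args(["--fk_model_path", "fk.pt", "--skip_fp_eval"])
if not args.skip_fp_eval:
    pass  # load the FP model and run the baseline evaluation here
```

With `store_true` the flag defaults to `False`, so existing invocations keep their current behavior.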

)

parser.add_argument(
"--eval_tasks",
Contributor


If eval_tasks is omitted, the script effectively loads both models and exits without evaluation. It may be better to either make this argument required or fail with a clearer message.
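A fail-fast check could look like this hypothetical helper (names are illustrative, not code from this PR):

```python
# Hypothetical sketch: reject a missing --eval_tasks value up front instead
# of loading both models and exiting without evaluating anything.
def parse_eval_tasks(eval_tasks):
    """Split a comma-separated task list, failing loudly on a missing value."""
    if not eval_tasks:
        raise ValueError(
            "--eval_tasks must be provided; without it the script loads "
            "both models and exits without evaluation"
        )
    return [t.strip() for t in eval_tasks.split(",") if t.strip()]
```

Calling this right after argument parsing surfaces the problem before any model weights are downloaded or loaded.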

stamalakhov and others added 2 commits April 13, 2026 16:16
Co-authored-by: seongwoo chae <mhs4670go@naver.com>
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@stamalakhov stamalakhov requested a review from mhs4670go April 13, 2026 14:01