
[quantization] Evaluate fk llama model#626

Open
stamalakhov wants to merge 3 commits into Samsung:main from stamalakhov:evaluate_fk_model

Conversation

@stamalakhov
Contributor

This PR adds a script to evaluate a fake-quantized Llama-based model.

Sample run on the model produced by #625:
Namespace(model='Maykeye/TinyLLama-v0', device='cuda', dtype='float32', hf_token=None, trust_remote_code=False, cache_dir='/mnt/storage/transformers_cache', fk_model_path='./PTQ_Maykeye_TinyLLama-v0_SpinQuant_GPTQ_smse_32_42.pt', eval_tasks='openbookqa')
=== Config ===
Model            : Maykeye/TinyLLama-v0
Device           : cuda
DType            : float32
fk_model_path    : ./PTQ_Maykeye_TinyLLama-v0_SpinQuant_GPTQ_smse_32_42.pt

Loading FP model …
Loading weights: 100%|…| 75/75 [00:00<00:00, 1893.81it/s]
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
Generating train split: 100%|…| 4957/4957 [00:00<00:00, 18822.43 examples/s]
Generating validation split: 100%|…| 500/500 [00:00<00:00, 67593.37 examples/s]
Generating test split: 100%|…| 500/500 [00:00<00:00, 118946.85 examples/s]
100%|…| 500/500 [00:04<00:00, 106.94it/s]
Running loglikelihood requests: 100%|…| 2000/2000 [00:32<00:00, 61.30it/s]
Original RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.114|±  |0.0142|
|          |       |none  |     0|acc_norm|↑  |0.210|±  |0.0182|

`pretrained` model kwarg is not of type `str`. Many other model arguments may be ignored. Please do not launch via accelerate or use `parallelize=True` if passing an existing model this way.
Passed an already-initialized model through `pretrained`, assuming single-process call to evaluate() or custom distributed integration
100%|…| 500/500 [00:04<00:00, 109.21it/s]
Running loglikelihood requests:   0%|…| 0/2000 [00:00<?, ?it/s]
`use_return_dict` is deprecated! Use `return_dict` instead!
Running loglikelihood requests: 100%|…| 2000/2000 [09:06<00:00,  3.66it/s]
Quantized RESULTS ARE:
|  Tasks   |Version|Filter|n-shot| Metric |   |Value|   |Stderr|
|----------|------:|------|-----:|--------|---|----:|---|-----:|
|openbookqa|      1|none  |     0|acc     |↑  |0.118|±  |0.0144|
|          |       |none  |     0|acc_norm|↑  |0.210|±  |0.0182|
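The `fk_model_path` checkpoint above is restored into the FP model before the second evaluation pass. A minimal sketch of that restore step, assuming the checkpoint is a plain `state_dict` saved with `torch.save()` (the function name and structure below are illustrative, not the PR's actual code):

```python
# Hypothetical sketch: restore a saved fake-quantized state dict into an
# FP model before evaluation. Assumes the .pt file holds a plain state_dict.
import torch
import torch.nn as nn


def load_fk_weights(model: nn.Module, fk_model_path: str, device: str = "cpu") -> nn.Module:
    """Load fake-quantized weights and put the model in eval mode."""
    state_dict = torch.load(fk_model_path, map_location=device)
    model.load_state_dict(state_dict)
    return model.to(device).eval()
```

Because fake quantization bakes the quantization error into ordinary float weights, the restored model runs through the same evaluation path as the FP baseline.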

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>

@stamalakhov stamalakhov self-assigned this Apr 13, 2026
This PR adds script to evaluate fake-quantized Llama-based model.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
if torch.cuda.is_available():
    torch.cuda.empty_cache()

if args.eval_tasks is not None:
Contributor


(Optional) Loading the FP model every time may be unnecessarily expensive if the main goal is to re-evaluate a previously saved quantized model. Would it make sense to support a quantized-only evaluation mode? (e.g. --skip_fp_eval)
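A quantized-only mode along those lines could be wired up with one `store_true` flag; a hypothetical sketch (the flag name and structure are the reviewer's suggestion, not code from this PR):

```python
# Hypothetical sketch of a --skip_fp_eval flag: evaluate only the saved
# quantized model without loading the FP baseline first.
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Evaluate a fake-quantized model")
    parser.add_argument("--fk_model_path", required=True)
    parser.add_argument(
        "--skip_fp_eval",
        action="store_true",
        help="Re-evaluate only the saved quantized model; skip the FP baseline",
    )
    return parser


# Demo invocation with an illustrative path:
args = build_parser().parse_args(["--fk_model_path", "fk.pt", "--skip_fp_eval"])
if not args.skip_fp_eval:
    pass  # load the FP model and run the baseline evaluation here
```

With `store_true` the flag defaults to `False`, so existing invocations keep their current behavior.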

)

parser.add_argument(
"--eval_tasks",
Contributor


If eval_tasks is omitted, the script effectively loads both models and exits without evaluation. It may be better to either make this argument required or fail with a clearer message.
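A fail-fast check could look like this hypothetical helper (names are illustrative, not code from this PR):

```python
# Hypothetical sketch: reject a missing --eval_tasks value up front instead
# of loading both models and exiting without evaluating anything.
def parse_eval_tasks(eval_tasks):
    """Split a comma-separated task list, failing loudly on a missing value."""
    if not eval_tasks:
        raise ValueError(
            "--eval_tasks must be provided; without it the script loads "
            "both models and exits without evaluation"
        )
    return [t.strip() for t in eval_tasks.split(",") if t.strip()]
```

Calling this right after argument parsing surfaces the problem before any model weights are downloaded or loaded.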

stamalakhov and others added 2 commits April 13, 2026 16:16
Co-authored-by: seongwoo chae <mhs4670go@naver.com>
TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@stamalakhov stamalakhov requested a review from mhs4670go April 13, 2026 14:01