
[quantization] Save torch artifacts#625

Open
stamalakhov wants to merge 1 commit into Samsung:main from stamalakhov:save_torch_artifacts

Conversation

Contributor

@stamalakhov commented Apr 13, 2026

This PR saves torch artifacts to be used later in evaluation or quantization.

sample output for `Maykeye/TinyLLama-v0`
Namespace(model='Maykeye/TinyLLama-v0', device='cuda', dtype='float32', seed=42, trust_remote_code=False, hf_token=None, no_tqdm=False, no_GPTQ=False, no_spinquant=False, no_PTQ=False, save_circle_to_folder=None, save_layers_to_folder=None, save_torch_artifacts_to_folder='.', cache_dir='/mnt/storage/transformers_cache', nsamples_for_qcalibration=32, linear_weight_bits=4, gptq_mse='smse', max_seq_len=2048, calibrate_seq_len=2048, embedding_weight_bits=8, lm_head_weight_bits=4, eval_tasks=None, sensitivity_path=None)
=== Config ===
Model            : Maykeye/TinyLLama-v0
Device           : cuda
DType            : float32

Loading FP model …
Warning: You are sending unauthenticated requests to the HF Hub. Please set a HF_TOKEN to enable higher rate limits and faster downloads.
Loading weights: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 75/75 [00:00<00:00, 1714.24it/s]
Applying SpinQuant preprocessing …
Applying SpinQuant rotations: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:00<00:00, 274.02it/s]

Calculating original perplexities …
Token indices sequence length is longer than the specified maximum sequence length for this model (324381 > 2048). Running this sequence through the model will result in indexing errors
PPL:   0%|                                                     | 0/159 [00:00<?, ?it/s]
/mnt/storage/slow_repos/TICO_1/TICO/tico/quantization/algorithm/spinquant/spin_llama.py:150: FutureWarning: `input_embeds` is deprecated and will be removed in version 5.6.0 for `create_causal_mask`. Use `inputs_embeds` instead.
  causal_mask = create_causal_mask(
PPL:  99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 158/159 [00:04<00:00, 33.52it/s]

┌── Wikitext-2 test perplexity ─────────────
│ FP32 :  7584.31
└───────────────────────────────────────────
Applying GPTQ …
Computing calibration set
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:00<00:00, 34.30it/s]
Calibrating sensitivity
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:04<00:00,  6.53it/s]
Saving calibrated_sensitivities to sensitivities_for_Maykeye_TinyLLama-v0_wikitext_32_42.pt
Quantizing layers: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 8/8 [00:12<00:00,  1.62s/layer]
Wrapping layers with PTQWrapper …                                                                                                                                                                                                                        
Calibrating PTQ obeservers…
  0%|                                                           | 0/32 [00:00<?, ?it/s]
`use_return_dict` is deprecated! Use `return_dict` instead!
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 32/32 [00:17<00:00,  1.83it/s]
Saving PTQ model to PTQ_Maykeye_TinyLLama-v0_SpinQuant_GPTQ_smse_32_42.pt

Calculating perplexities …
PPL:  99%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▋ | 158/159 [00:59<00:00,  2.66it/s]

┌── Wikitext-2 test perplexity ─────────────
│ int16 :  7181.82
└───────────────────────────────────────────

TICO-DCO-1.0-Signed-off-by: s.malakhov s.malakhov@partner.samsung.com

@stamalakhov self-assigned this Apr 13, 2026
This PR saves torch artifacts to be used later in evaluation or quantization.

TICO-DCO-1.0-Signed-off-by: s.malakhov <s.malakhov@partner.samsung.com>
@stamalakhov force-pushed the save_torch_artifacts branch from 71fc131 to 750ab2d on April 13, 2026 09:46
@stamalakhov marked this pull request as ready for review April 13, 2026 09:51
@stamalakhov requested a review from mhs4670go April 13, 2026 09:51
"--save_torch_artifacts_to_folder",
type=str,
default=None,
help="Save all layers to the folder specified",
Contributor

The help message should be changed; the current one seems to have been copied from another option.
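One possible rewording (a sketch only; the exact help text is up to the author):

```python
import argparse

parser = argparse.ArgumentParser()
parser.add_argument(
    "--save_torch_artifacts_to_folder",
    type=str,
    default=None,
    # Suggested wording (an assumption, not the PR's final text):
    # describe the torch artifacts being saved, not "all layers".
    help="Save torch artifacts (PTQ checkpoint, GPTQ sensitivities) to this folder",
)

args = parser.parse_args(["--save_torch_artifacts_to_folder", "."])
```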

save_name = get_ptq_model_name(model, args)
save_path = pathlib.Path(args.save_torch_artifacts_to_folder, save_name)
print(f"Saving PTQ model to {save_path}")
torch.save(q_m, save_path)
Contributor

Could you let me know what this saving is for?

Additionally, saving the entire module object with torch.save(q_m, ...) may be fragile across code changes and environments.
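A common, more robust pattern (a sketch, not code from this PR; `q_m` here is a hypothetical stand-in for the PTQ-wrapped model) is to persist only the state dict and rebuild the module in code before loading:

```python
import pathlib
import tempfile

import torch
import torch.nn as nn

# Hypothetical stand-in for the PTQ-wrapped model in this PR.
q_m = nn.Linear(4, 2)

save_path = pathlib.Path(tempfile.mkdtemp()) / "ptq_model.pt"

# Fragile: torch.save(q_m, save_path) pickles the module's class by
# reference, so loading breaks if the class moves or its code changes.

# More robust: persist only tensors; the loading side reconstructs the
# module from code and then restores the weights.
torch.save({"state_dict": q_m.state_dict()}, save_path)

# Later / in another process: rebuild the module, then load weights only.
rebuilt = nn.Linear(4, 2)
ckpt = torch.load(save_path, weights_only=True)
rebuilt.load_state_dict(ckpt["state_dict"])
```

`weights_only=True` (available in recent PyTorch releases) also avoids executing arbitrary pickled code at load time.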

Contributor Author

@stamalakhov commented Apr 13, 2026
@mhs4670go
please see #626; we can use it to evaluate the saved model. GPTQ can be time-consuming, as can evaluation of the quantized model. If we have a saved model, we can later re-evaluate it against a different set of benchmarks. Currently a single environment is assumed.

Contributor

Got it. How about adding a comment like this above the function call?

# Save quantized model for later re-evaluation (used by eval script to skip re-quantization)
torch.save(q_m, save_path)

Contributor Author

ok.

help="Save all layers to the folder specified",
)
parser.add_argument(
"--save_torch_artifacts_to_folder",
Contributor

save_torch_artifacts_to_folder feels too broad and a bit misleading for what this option actually does.

Currently, this single flag is used to save multiple outputs (e.g., GPTQ sensitivities and a PTQ model checkpoint), which have different purposes and lifecycles.

Also, this is inconsistent with the existing saving options (save_circle_to_folder, save_layers_to_folder), which are all explicit about what they save. In contrast, this new option is abstract and mixes multiple artifact types under one flag.

It would be better to use a more specific name (e.g., save_quant_artifacts_to_folder) so that the behavior is more explicit, predictable, and consistent with the existing CLI design.

Contributor

I’m also a bit concerned about the overall save interface as it seems to be growing option by option.

At the moment, the save-related flags do not follow a single clear abstraction: some are named by format (circle), some by granularity (layers), and this new one by a very broad implementation-oriented term (torch artifacts). As more save targets get added, this may become harder to understand and maintain from the CLI side.

It may be worth considering a more structured interface, for example:

  • one output directory option, and
  • a separate argument that explicitly lists which artifacts to save

This would likely scale better than continuing to add one save flag per artifact/output type.

One possible example is shown below.

--output_dir ./outputs
--save sensitivity,ptq_checkpoint
# OR
--save model_circle sensitivity ptq_checkpoint

If you don't have enough time for refactoring, you can just add a comment as TODO.
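The proposed interface could be sketched with argparse roughly as below (the flag names and artifact keys are illustrative assumptions, not an agreed design):

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser()
    # One output directory for everything that gets written to disk.
    parser.add_argument(
        "--output_dir",
        type=str,
        default="./outputs",
        help="Directory where all requested artifacts are written",
    )
    # One flag listing which artifacts to save, instead of one flag per artifact.
    parser.add_argument(
        "--save",
        nargs="*",
        default=[],
        choices=["model_circle", "sensitivity", "ptq_checkpoint"],
        help="Artifacts to save into --output_dir",
    )
    return parser


args = build_parser().parse_args(
    ["--output_dir", "./outputs", "--save", "sensitivity", "ptq_checkpoint"]
)
```

With `nargs="*"` each listed value is validated against `choices`, so a typo in an artifact name fails fast at parse time rather than silently skipping a save.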

Contributor Author

Ahh. Ok. Let it be one folder with multiple options. I'll rework this PR.
