Kimi-k2 calib+export #655
base: main
Conversation
Signed-off-by: Jingyu Xin <[email protected]>
Signed-off-by: Jingyu Xin <[email protected]>
```python
NVFP4_MLP_EXPERTS_ONLY_CFG = {
    "quant_cfg": {
        "*mlp.experts*weight_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "pass_through_bwd": True,
        },
        "*mlp.experts*input_quantizer": {
            "num_bits": (2, 1),
            "block_sizes": {-1: 16, "type": "dynamic", "scale_bits": (4, 3)},
            "enable": True,
            "pass_through_bwd": True,
        },
        **_default_disabled_quantizer_cfg,
    },
    "algorithm": "max",
}
```
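For context, here is a small, hedged illustration (not from the PR) of which quantizer names a shell-style wildcard like `*mlp.experts*weight_quantizer` selects; the example module names are assumptions modeled on a Kimi/DeepSeek-style MoE decoder layer, and modelopt's exact matching behavior may differ slightly:

```python
from fnmatch import fnmatch

# Hypothetical quantizer names modeled on a Kimi/DeepSeek-style MoE decoder layer.
names = [
    "model.layers.3.mlp.experts.7.down_proj.weight_quantizer",       # routed expert
    "model.layers.3.mlp.shared_experts.down_proj.weight_quantizer",  # shared expert
    "model.layers.3.self_attn.q_proj.weight_quantizer",              # attention
]

# Only the routed-expert quantizer matches; shared experts and attention fall
# through to the disabled defaults in _default_disabled_quantizer_cfg.
for name in names:
    print(name, "->", fnmatch(name, "*mlp.experts*weight_quantizer"))
# ...experts.7.down_proj.weight_quantizer -> True
# ...shared_experts.down_proj.weight_quantizer -> False
# ...self_attn.q_proj.weight_quantizer -> False
```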
We already have this config. See NVFP4_MLP_ONLY_CFG
That (NVFP4_MLP_ONLY_CFG) will also quantize mlp.shared_experts.
This is great. Let's not create more configs.
> *This is a subset of the models supported. For the full list please check the [TensorRT-LLM support matrix](https://nvidia.github.io/TensorRT-LLM/reference/precision.html#support-matrix)*
> We recommend upcasting Kimi-K2-Thinking from INT4 to BF16 before running quantization.
Is it a recommendation, or is it something we have to do? An alternative is to upcast the INT4 weights to BF16 during calibration, like we did with DS.
But there's no INT4 support in PyTorch, as we discussed. People have to use vLLM if they want INT4. Zhiyu and I are looking into vLLM calibration of this model.
> We recommend upcasting Kimi-K2-Thinking from INT4 to BF16 before running quantization.

```python
from transformers import AutoModelForCausalLM
```
Could you move these to example_utils.py, like https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/llm_ptq/example_utils.py#L303?
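For reference, a hedged sketch of what such an upcast helper might look like; the model ID, output directory, and whether `from_pretrained` actually dequantizes the INT4 checkpoint to BF16 are assumptions here, not details taken from this PR (a dedicated dequantization script may be required instead):

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed model ID and output path; the real example may differ.
MODEL_ID = "moonshotai/Kimi-K2-Thinking"
OUTPUT_DIR = "kimi-k2-thinking-bf16"

# Load the checkpoint in BF16. Whether this dequantizes the INT4 weights
# automatically depends on how the checkpoint is packaged.
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

# Save the upcast BF16 checkpoint for later calibration and quantization.
model.save_pretrained(OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
```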
examples/llm_ptq/hf_ptq.py (Outdated)
| "w4a8_nvfp4_fp8": mtq.W4A8_NVFP4_FP8_CFG, | ||
| "w4a8_mxfp4_fp8": mtq.W4A8_MXFP4_FP8_CFG, | ||
| "nvfp4_mlp_only": mtq.NVFP4_MLP_ONLY_CFG, | ||
| "nvfp4_mlp_experts_only": mtq.NVFP4_MLP_EXPERTS_ONLY_CFG, |
I'm still against adding more configs here. I think in this MR we should just stick with MLP_only if we have to. People can tune the recipe themselves if they want experts-only.
If you really want to add this config, let's name it experts_only for short; experts are always in the MLP.
Renamed it to nvfp4_experts_only. Let’s keep this config for now; once the YAML config system is released, we can avoid using these recipe dictionaries.
Signed-off-by: Jingyu Xin <[email protected]>
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files

```
@@           Coverage Diff            @@
##             main     #655    +/-   ##
========================================
- Coverage   74.58%   74.57%   -0.02%
========================================
  Files         183      183
  Lines       18451    18452       +1
========================================
- Hits        13762    13760       -2
- Misses       4689     4692       +3
```
Signed-off-by: Jingyu Xin <[email protected]>
What does this PR do?
Type of change: new example
Overview: Support Kimi-K2 calibration and export
Usage
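A minimal sketch of the intended calibrate-and-export flow, assuming the standard `mtq.quantize` and `export_hf_checkpoint` APIs; the checkpoint paths, calibration prompts, and config name below are illustrative assumptions, not commands taken from this PR:

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

# Assumed input path (e.g. the BF16 upcast of Kimi-K2-Thinking).
ckpt = "kimi-k2-thinking-bf16"
model = AutoModelForCausalLM.from_pretrained(ckpt, torch_dtype="auto", trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(ckpt, trust_remote_code=True)

def forward_loop(model):
    # Hypothetical calibration pass over a few prompts; a real run would use a
    # proper calibration dataset (e.g. cnn_dailymail) as in the llm_ptq example.
    for prompt in ["Hello, world!", "Explain NVFP4 quantization."]:
        inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
        model(**inputs)

# Calibrate with the experts-only NVFP4 recipe added in this PR (the qformat key
# was later renamed to nvfp4_experts_only), then export an HF-style checkpoint.
model = mtq.quantize(model, mtq.NVFP4_MLP_EXPERTS_ONLY_CFG, forward_loop)
export_hf_checkpoint(model, export_dir="kimi-k2-nvfp4-experts-only")
```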
Testing
Before your PR is "Ready for review"
Additional Information