Enable post-RHT amax estimation with separate amax scale kernel #2521
negvet wants to merge 5 commits into NVIDIA:main
Signed-off-by: Evgeny <etsykunov@nvidia.com>
for more information, see https://pre-commit.ci
/te-ci
Having a separate amax scale API automatically means that we need a grouped API for MoE, so it is not a good choice moving forward. Can we fuse the amax scaling into the FP4 quantize kernels via the quantization config struct? Otherwise, whenever we change the numerics for dense, we also have to change them for MoE.
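To make the suggestion above concrete, here is a minimal sketch of what fusing the amax scaling into the quantize step through a config struct could look like. It is not Transformer Engine code: `QuantConfig`, its `amax_scale` field, `quantize_scale`, and `grouped_quantize_scale` are hypothetical names, and the single-level scale computation is a simplification. It also assumes that the post-RHT amax is estimated as the pre-RHT amax times a scalar factor.

```python
from dataclasses import dataclass
from typing import List, Sequence

FP4_E2M1_MAX = 6.0  # largest magnitude representable in FP4 E2M1


@dataclass(frozen=True)
class QuantConfig:
    # Hypothetical field: factor applied to the pre-RHT amax to
    # estimate the post-RHT amax (assumption, not the real struct).
    amax_scale: float = 1.0


def quantize_scale(values: Sequence[float], config: QuantConfig) -> float:
    """Return the scale a quantize step would use, with amax scaling fused in."""
    amax = max((abs(v) for v in values), default=0.0)
    estimated_amax = amax * config.amax_scale  # fused: no separate scaling step
    if estimated_amax == 0.0:
        return 1.0  # avoid division by zero for all-zero or empty input
    return FP4_E2M1_MAX / estimated_amax


def grouped_quantize_scale(
    splits: Sequence[Sequence[float]], config: QuantConfig
) -> List[float]:
    """Grouped (MoE-style) path: reuses the dense path, so numerics stay in sync."""
    return [quantize_scale(split, config) for split in splits]
```

With this shape, a change to the numerics is made once in `quantize_scale`, and the grouped path picks it up because it shares the same config.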
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Signed-off-by: Evgeny <etsykunov@nvidia.com>
This is functional with a common entry point:
This is a single explicit step that applies everywhere and reduces the chance that we forget to scale in a new op or refactor. I agree that the extra launch cost in the loop is an issue (now, What about computing all per-split pre-RHT amaxes in one grouped call instead of calling
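As a rough illustration of the grouped alternative raised in this comment, the sketch below computes every per-split amax in one call over a flat buffer and then applies the scaling as one explicit step. The names `grouped_amax` and `scale_amaxes` are hypothetical, the Python loop only stands in for what would be a single kernel launch, and the scalar `factor` is an assumed form of the post-RHT estimate.

```python
from typing import List, Sequence


def grouped_amax(flat: Sequence[float], split_sizes: Sequence[int]) -> List[float]:
    """Compute the amax of each contiguous split of `flat` in one call."""
    if sum(split_sizes) != len(flat):
        raise ValueError("split_sizes must sum to len(flat)")
    amaxes = []
    offset = 0
    for size in split_sizes:
        chunk = flat[offset:offset + size]
        # An empty split gets amax 0.0 rather than raising.
        amaxes.append(max((abs(v) for v in chunk), default=0.0))
        offset += size
    return amaxes


def scale_amaxes(amaxes: Sequence[float], factor: float) -> List[float]:
    """Single explicit scaling step applied to all splits at once."""
    return [a * factor for a in amaxes]
```

For example, `scale_amaxes(grouped_amax(buf, sizes), factor)` replaces one amax call plus one scale call per split with two calls in total, independent of the number of splits.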