The CORRECTNESS_ONLY_STAGES set in optimize_stage is currently hardcoded to {} (empty), which disables the skip-speedup-check behavior for all stages.
It was originally intended to let stages like DTYPE_FIX and ALGORITHMIC produce kernels that are temporarily slower: the correctness fix lands first, and later performance stages recover the speed. This makes sense when casting from fp16→fp32 or fp16→bf16: the cast itself may be slightly slower but is required for correctness.
Proposal: Either remove the dead code entirely, or expose it as CORRECTNESS_ONLY_STAGES = {OptimizationStage.DTYPE_FIX} with a comment explaining why, and add a --skip-speedup-stages CLI option.
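A rough sketch of what the second option could look like. Only CORRECTNESS_ONLY_STAGES, OptimizationStage.DTYPE_FIX, ALGORITHMIC and the --skip-speedup-stages flag name come from the text above. The enum definition, the helper names (parse_stage_list, build_parser, accept_kernel) and the 1.0 speedup threshold are assumptions for illustration, not the existing optimize_stage code.

```python
import argparse
from enum import Enum


class OptimizationStage(Enum):
    # Hypothetical stand-in for the project's enum; only the member names
    # DTYPE_FIX and ALGORITHMIC are taken from the issue text.
    DTYPE_FIX = "dtype_fix"
    ALGORITHMIC = "algorithmic"


# Stages allowed to land a slower kernel. A dtype fix (e.g. fp16 -> fp32) is
# required for correctness, and later performance stages are expected to
# recover the lost speed.
CORRECTNESS_ONLY_STAGES = frozenset({OptimizationStage.DTYPE_FIX})


def parse_stage_list(value):
    """Parse a comma-separated list of stage names into a frozenset."""
    names = [n.strip().upper() for n in value.split(",") if n.strip()]
    try:
        return frozenset(OptimizationStage[n] for n in names)
    except KeyError as exc:
        raise argparse.ArgumentTypeError(f"unknown stage: {exc.args[0]}") from None


def build_parser():
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--skip-speedup-stages",
        type=parse_stage_list,
        default=CORRECTNESS_ONLY_STAGES,
        help="comma-separated stages exempt from the speedup check",
    )
    return parser


def accept_kernel(stage, is_correct, speedup, skip_speedup_stages, min_speedup=1.0):
    """Decide whether a candidate kernel is kept (min_speedup is an assumed threshold)."""
    if not is_correct:
        return False
    if stage in skip_speedup_stages:
        # Correctness-only stage: accept even if the kernel got slower.
        return True
    return speedup >= min_speedup
```

With this shape, running without the flag keeps DTYPE_FIX exempt by default, and passing something like --skip-speedup-stages dtype_fix,algorithmic would also exempt ALGORITHMIC.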