
SwiGLU Gating, QAT, Residual Attention Scaling, EMA, Sliding Window: Optimizations for 10min/16MB Track #2159

Open
visin109 wants to merge 3 commits into openai:main from visin109:main

Conversation


visin109 commented May 6, 2026

Submission for "track_10min_16mb"

This PR contains my experimental submission for the "track_10min_16mb" Parameter Golf challenge.

The work focuses on improving compression-aware transformer training under strict constraints:

  • ≤10 minute training budget
  • ≤16MB compressed artifact
  • Optimized for validation BPB (bits per byte; see the sketch after this list)
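
Since everything below is tuned against this metric, here is a minimal sketch of how validation BPB (bits per byte) can be computed; the function name and the assumption that per-token losses arrive summed in nats are illustrative, not taken from the submission code.

```python
import math

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Validation BPB: total cross-entropy over the validation text in nats,
    converted to bits and divided by the raw byte count of that text."""
    return total_nll_nats / math.log(2) / n_bytes

# Example: bits_per_byte(sum_of_token_losses, len(validation_text_bytes))
```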

Key Modifications Explored

  1. Quantization-Aware Training (STE-based fake quantization; sketched below)
  2. SwiGLU / gated MLP replacement for the standard ReLU² block (sketched below)
  3. Adaptive residual mixing with learnable skip blending
  4. Learnable per-head attention query scaling ("q_gain")
  5. Per-channel residual scaling for attention and MLP outputs
  6. EMA-based weight tracking for smoother evaluation
  7. SWA stabilization during late training stages
  8. Encoder-decoder style skip connection reuse
  9. Sliding-window BPB validation strategy
  10. Multi-optimizer training setup (Muon + Adam parameter grouping)
  11. Muon momentum warmup scheduling
  12. Warmup / warmdown learning-rate scheduling
  13. Int8 + zlib compression tuning
  14. Per-row / per-tensor quantization strategy
  15. Percentile clipping for stable int8 export
  16. Small-tensor FP16 passthrough optimization
  17. GQA (Grouped Query Attention) efficiency experiments
  18. Rotary positional embeddings (RoPE)
  19. Hyperparameter tuning for short-budget convergence efficiency
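
A minimal sketch of item 1, fake quantization trained through a straight-through estimator; the class names and the symmetric per-tensor int8 scaling are illustrative assumptions rather than the exact submission code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FakeQuantSTE(torch.autograd.Function):
    """Symmetric int8 fake quantization. The forward pass snaps weights to the
    int8 grid; the backward pass is a straight-through estimator (identity)."""

    @staticmethod
    def forward(ctx, w):
        scale = w.abs().max().clamp(min=1e-8) / 127.0
        return torch.round(w / scale).clamp(-127, 127) * scale

    @staticmethod
    def backward(ctx, grad_output):
        return grad_output  # straight-through: treat rounding as identity

class QATLinear(nn.Linear):
    """Linear layer that trains against the quantized weights it will export."""
    def forward(self, x):
        return F.linear(x, FakeQuantSTE.apply(self.weight), self.bias)
```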

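And a sketch of item 2, the SwiGLU / gated MLP block that replaces the ReLU² feed-forward; the hidden width and bias-free projections here are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLU(nn.Module):
    """Gated MLP: SiLU(x @ W_gate) * (x @ W_up), then project back to dim."""
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.w_gate = nn.Linear(dim, hidden, bias=False)
        self.w_up = nn.Linear(dim, hidden, bias=False)
        self.w_down = nn.Linear(hidden, dim, bias=False)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.w_down(F.silu(self.w_gate(x)) * self.w_up(x))
```
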
Improvements over the Baseline

The initial baseline configuration converged unstably and compressed poorly, staying above roughly 2.0 BPB on validation. Iterative architectural experimentation and hyperparameter tuning significantly improved both training stability and post-quantization consistency.

Key improvements achieved:

  • Lower quantization degradation after export
  • Faster convergence under strict wallclock limits
  • Improved gradient flow using adaptive residual scaling (sketched below)
  • Better attention stability using learnable query scaling (sketched below)
  • More stable evaluation through EMA-based weight tracking
  • Improved compression efficiency under int8 export constraints
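
Two of these stability tweaks, sketched minimally (module names, shapes, and initialization are illustrative assumptions): a learnable per-head scale on the attention queries ("q_gain") and a learnable per-channel scale on sublayer outputs before the residual add.

```python
import torch
import torch.nn as nn

class QueryGain(nn.Module):
    """Learnable per-head scale applied to attention queries ("q_gain")."""
    def __init__(self, n_heads: int):
        super().__init__()
        self.gain = nn.Parameter(torch.ones(n_heads, 1, 1))

    def forward(self, q: torch.Tensor) -> torch.Tensor:
        # q: (batch, n_heads, seq_len, head_dim)
        return q * self.gain

class ResidualScale(nn.Module):
    """Per-channel learnable scale on a sublayer output before the residual add."""
    def __init__(self, dim: int, init: float = 1.0):
        super().__init__()
        self.scale = nn.Parameter(torch.full((dim,), init))

    def forward(self, x: torch.Tensor, sublayer_out: torch.Tensor) -> torch.Tensor:
        return x + self.scale * sublayer_out
```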

Final Observed Results

  • Validation BPB: ~1.599
  • Compressed model size: ~7.3 MB
  • Training budget: ~10 minutes
  • Successful int8 + zlib export validation (export path sketched below)
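
A minimal sketch of that export path (per-row symmetric int8 with percentile clipping, then zlib), with illustrative names and thresholds rather than the exact submission code:

```python
import zlib
import numpy as np

def export_int8_zlib(weight: np.ndarray, clip_pct: float = 99.9) -> bytes:
    """Per-row symmetric int8 quantization with percentile clipping,
    followed by zlib compression of the packed scales and weights."""
    # Clip outliers so the int8 grid is not wasted on a few extreme values.
    clip = np.percentile(np.abs(weight), clip_pct, axis=1, keepdims=True)
    clipped = np.clip(weight, -clip, clip)
    # One scale per output row (per-row quantization).
    scales = np.maximum(clip, 1e-8) / 127.0
    q = np.round(clipped / scales).astype(np.int8)
    payload = scales.astype(np.float16).tobytes() + q.tobytes()
    return zlib.compress(payload, level=9)
```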

Resource Note

Due to limited H100 GPU availability during the final phase, several validation iterations were performed on a local consumer GPU setup using reduced validation subsets for faster experimentation.

This branch represents an active experimental pipeline, and further architectural exploration is ongoing.

visin109 and others added 3 commits May 1, 2026 02:00