SwiGLU gating / QAT / Residual Attention Scaling / EMA / Sliding-window Optimizations for 10min/16MB Track #2159
Open
visin109 wants to merge 3 commits into
Conversation
Updated script name and added evaluation notes regarding resource constraints during testing.
Submission for "track_10min_16mb"
This PR contains my experimental submission for the "track_10min_16mb" Parameter Golf challenge.
The work focuses on improving compression-aware transformer training under the track's strict constraints (10 minutes of training time and a 16 MB model budget).
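For context, the bits-per-byte (BPB) metric used to report compression efficiency below can be derived from the mean token-level cross-entropy loss. A minimal sketch of the conversion (function name and the example numbers are illustrative, not taken from this PR):

```python
import math

def bits_per_byte(nats_per_token: float, total_tokens: int, total_bytes: int) -> float:
    """Convert mean cross-entropy (nats/token) to bits per byte of raw text."""
    total_bits = nats_per_token * total_tokens / math.log(2)  # nats -> bits
    return total_bits / total_bytes

# Example: a loss of 1.2 nats/token over 1000 tokens covering 4000 bytes of text
bpb = bits_per_byte(1.2, 1000, 4000)
```

Note that BPB normalizes by raw bytes rather than tokens, so it stays comparable across tokenizers of different granularity.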
Key Explored Modifications
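Of the modifications named in the PR title, SwiGLU gating is the most self-contained. A minimal NumPy sketch of a SwiGLU feed-forward block (weight names and the hidden size are illustrative assumptions, not read from this PR's code):

```python
import numpy as np

def silu(x):
    """SiLU/Swish activation: x * sigmoid(x)."""
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    """SwiGLU feed-forward: (SiLU(x @ W_gate) * (x @ W_up)) @ W_down."""
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

rng = np.random.default_rng(0)
d_model, d_ff = 8, 16  # illustrative sizes only
x = rng.standard_normal((4, d_model))
w_gate = rng.standard_normal((d_model, d_ff))
w_up = rng.standard_normal((d_model, d_ff))
w_down = rng.standard_normal((d_ff, d_model))
y = swiglu_ffn(x, w_gate, w_up, w_down)  # shape (4, d_model)
```

Compared with a plain two-matrix FFN, the gate path adds a third projection, which is relevant under a 16 MB parameter budget: `d_ff` is usually shrunk to keep the parameter count roughly constant.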
Improvements over the Baseline
The initial baseline configuration converged unstably and compressed poorly (above roughly 2.0 bits per byte, BPB). Iterative architectural experimentation and hyperparameter tuning significantly improved both training stability and post-quantization consistency.
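The post-quantization consistency mentioned above is what quantization-aware training (QAT) typically targets: weights pass through a quantize/dequantize round-trip in the forward pass so the training loss already sees quantization error. A minimal symmetric int8 fake-quantization sketch (the bit width and per-tensor scaling scheme are assumptions, not read from this PR):

```python
import numpy as np

def fake_quant_int8(w: np.ndarray) -> np.ndarray:
    """Symmetric per-tensor int8 quantize -> dequantize round-trip."""
    scale = np.abs(w).max() / 127.0
    if scale == 0.0:
        return w
    q = np.clip(np.round(w / scale), -127, 127)
    return q * scale  # dequantized weights, as seen by the forward pass

w = np.array([-1.0, -0.5, 0.0, 0.3, 1.0])
w_q = fake_quant_int8(w)  # close to w, but restricted to 255 discrete levels
```

In an actual training loop the rounding step is non-differentiable, so gradients are usually passed through unchanged (the straight-through estimator); the sketch above only shows the forward-pass round-trip.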
Key improvements achieved:
Final Observed Results
Resource Note
Due to limited H100 GPU availability during the final phase, several validation iterations were run on a local consumer-GPU setup with reduced validation subsets to speed up experimentation.
This branch is an active experimental pipeline; further architectural exploration is ongoing.
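The PR title also lists EMA among the explored directions; in this context it usually means an exponential moving average of model weights, evaluated in place of the raw weights to smooth out late-training noise. A minimal sketch (decay value and parameter layout are illustrative assumptions):

```python
def ema_update(ema_params: dict, params: dict, decay: float = 0.999) -> dict:
    """One EMA step over a dict of parameters: ema <- decay*ema + (1-decay)*param."""
    for k in params:
        ema_params[k] = decay * ema_params[k] + (1.0 - decay) * params[k]
    return ema_params

# Toy example: a single scalar weight held fixed at 1.0 for three steps
ema = {"w": 0.0}
for _ in range(3):
    ema_update(ema, {"w": 1.0}, decay=0.5)
# ema["w"] approaches 1.0 geometrically: 0.5, 0.75, 0.875
```

Evaluating (and exporting) the EMA copy rather than the live weights is a common way to improve checkpoint-to-checkpoint consistency under a short 10-minute training budget.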