Non-record: Progressive context growth precursor to PR 2014, 12 hours on RTX 4090, val_bpb 0.9697 pre-quant #2144
Progressive context growth precursor to PR 2014, 12 hours on RTX 4090, val_bpb 0.9697 pre-quant
This is an archival non-record submission package for a 12-hour RTX 4090 run.
I ran this for a personal project, but I think the result is interesting, so I decided to share it even though we are past the deadline.
Main differences with PR 2014
Result
The exact logged final metric is:
Notes:
- `EMA_ENABLED=0` in the config, despite the historical log string saying `post-ema`.
- `SKIP_FINAL_PACKAGING=1`, so no final compressed 16MB package was produced. The `final_int6_roundtrip_exact` line should be read as a no-packaging roundtrip/check value, not as a produced compressed int6 submission artifact (a generic illustration of such a check is sketched below).
- The retained weights are kept locally (paths and hashes in `ARTIFACTS.md`) and are not committed to this folder.
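The packaging/quantization code itself is not part of this folder, so purely for orientation, here is a generic sketch of what a pack/unpack roundtrip-exactness check for int6 weight codes can look like. The helper names and packing layout are made up for this sketch; they are not the repository's implementation.

```python
import numpy as np

def pack_int6(codes: np.ndarray) -> np.ndarray:
    """Pack int6 codes (values 0..63) into bytes, 4 codes -> 3 bytes."""
    assert codes.dtype == np.uint8 and (codes < 64).all()
    pad = (-len(codes)) % 4
    c = np.concatenate([codes, np.zeros(pad, dtype=np.uint8)]).reshape(-1, 4)
    b0 = (c[:, 0] << 2) | (c[:, 1] >> 4)
    b1 = ((c[:, 1] & 0x0F) << 4) | (c[:, 2] >> 2)
    b2 = ((c[:, 2] & 0x03) << 6) | c[:, 3]
    return np.stack([b0, b1, b2], axis=1).reshape(-1).astype(np.uint8)

def unpack_int6(packed: np.ndarray, n: int) -> np.ndarray:
    """Inverse of pack_int6: recover the first n int6 codes."""
    p = packed.reshape(-1, 3)
    c0 = p[:, 0] >> 2
    c1 = ((p[:, 0] & 0x03) << 4) | (p[:, 1] >> 4)
    c2 = ((p[:, 1] & 0x0F) << 2) | (p[:, 2] >> 6)
    c3 = p[:, 2] & 0x3F
    return np.stack([c0, c1, c2, c3], axis=1).reshape(-1)[:n].astype(np.uint8)

# Roundtrip-exact check: quantize a stand-in weight tensor to signed 6-bit codes,
# pack them, unpack them, and confirm the recovered codes match bit-for-bit.
w = np.random.randn(10_000).astype(np.float32)
scale = np.abs(w).max() / 31.0
codes = np.clip(np.round(w / scale), -32, 31).astype(np.int8)
codes_u = (codes.astype(np.int16) + 32).astype(np.uint8)   # shift to 0..63 for packing
roundtrip_exact = bool(np.array_equal(unpack_int6(pack_int6(codes_u), len(codes_u)), codes_u))
print("int6_roundtrip_exact:", roundtrip_exact)
```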
Model and Training Setup
- Parameters: 35,944,536
- Vocab size: 8192, tokenizer `fineweb_8192_bpe.model`
- 11 layers, model dim 512, 8 attention heads, KV heads: 4
- 4.0, 0.35
- Looped block: `loop_start=3`, `loop_end=5`, `num_loops=2`
- Time budget: 43200 s (12 hours)
- Steps: 38707/100000
- 262144 / 131072 / 8192 / 4096
- 8 epochs, 32768 chunk tokens, SGD LR 0.005

Progressive context schedule:
Midrun LR cap schedule:
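The actual breakpoints for both schedules live in `l7grow_v4_castor_12h.env` and are not repeated in this summary. Purely as an illustration of the mechanism, a step-indexed context schedule combined with a mid-run LR cap can be expressed like this; every breakpoint and value below is a placeholder, and only the base SGD LR of 0.005 comes from the run config.

```python
# Hypothetical schedules: the breakpoints, context lengths, and LR caps below
# are placeholders, not the values from l7grow_v4_castor_12h.env.
CONTEXT_SCHEDULE = [   # (first_step, context length in tokens)
    (0,      1024),
    (10_000, 2048),
    (20_000, 4096),
    (30_000, 8192),
]
LR_CAP_SCHEDULE = [    # (first_step, maximum LR allowed from that step on)
    (0,      0.005),
    (25_000, 0.003),
    (35_000, 0.001),
]

def value_at(schedule, step):
    """Return the value of the last breakpoint whose first_step <= step."""
    current = schedule[0][1]
    for first_step, value in schedule:
        if step >= first_step:
            current = value
    return current

def lr_at(step, base_lr=0.005):
    """Base SGD LR clipped by whichever mid-run cap is active at this step."""
    return min(base_lr, value_at(LR_CAP_SCHEDULE, step))

# A trainer using this would re-chunk the pretokenized token stream to the
# scheduled context length and clamp the optimizer LR at each step, e.g.:
for step in (0, 15_000, 28_000, 38_707):
    print(step, value_at(CONTEXT_SCHEDULE, step), lr_at(step))
```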
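Likewise, the `loop_start=3`, `loop_end=5`, `num_loops=2` settings describe a slice of blocks that is applied repeatedly in each forward pass; the authoritative implementation is in the included trainer code. A minimal PyTorch sketch of that control flow, with placeholder layer sizes and a generic block type standing in for the real one, might look like:

```python
import torch
import torch.nn as nn

class LoopedStack(nn.Module):
    """Stack of blocks where the slice [loop_start, loop_end) is applied num_loops times."""
    def __init__(self, n_layers=11, d_model=512, loop_start=3, loop_end=5, num_loops=2):
        super().__init__()
        self.blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model, nhead=8, dim_feedforward=4 * d_model,
                                       batch_first=True)
            for _ in range(n_layers)
        ])
        self.loop_start, self.loop_end, self.num_loops = loop_start, loop_end, num_loops

    def forward(self, x):
        for block in self.blocks[: self.loop_start]:
            x = block(x)
        # The looped slice reuses the same weights on every pass, so effective
        # depth grows without adding parameters.
        for _ in range(self.num_loops):
            for block in self.blocks[self.loop_start : self.loop_end]:
                x = block(x)
        for block in self.blocks[self.loop_end :]:
            x = block(x)
        return x

x = torch.randn(1, 16, 512)
print(LoopedStack()(x).shape)  # torch.Size([1, 16, 512])
```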
Dataset
The run used a pretrain mixture described in `castor_pretrain_mix_v0.yaml`. The pretokenized output path in the original run was:
The tokenizer path was:
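Neither path is reproduced in this summary. For orientation on what the mixture config controls, here is a loose sketch of weight-proportional sampling from such a config; the source names, paths, and weights are invented and do not reflect the contents of `castor_pretrain_mix_v0.yaml`.

```python
import random
import yaml  # pyyaml

# Invented example structure; the real castor_pretrain_mix_v0.yaml may differ.
EXAMPLE_MIX = """
sources:
  - name: fineweb_edu
    path: /data/pretok/fineweb_edu
    weight: 0.8
  - name: code
    path: /data/pretok/code
    weight: 0.2
"""

def sample_source(mix_cfg, rng):
    """Pick a source with probability proportional to its weight."""
    sources = mix_cfg["sources"]
    weights = [s["weight"] for s in sources]
    return rng.choices(sources, weights=weights, k=1)[0]

cfg = yaml.safe_load(EXAMPLE_MIX)
rng = random.Random(1337)
counts = {}
for _ in range(10_000):
    name = sample_source(cfg, rng)["name"]
    counts[name] = counts.get(name, 0) + 1
print(counts)  # roughly 8000 / 2000 with these example weights
```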
Reproduction Command
From a workspace that contains the raw data and tokenizer:
The wrapper prepares the pretokenized shards if needed, then launches:
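The concrete invocations are in the included scripts and are not repeated here. Purely as a sketch of that prepare-then-launch flow, the pretokenization command, paths, and arguments below are placeholders rather than the wrapper's actual contents:

```python
import subprocess
from pathlib import Path

# Placeholder paths and commands, not the arguments used by
# train_l7grow_v4_castor_12h.sh; only the prepare-then-launch flow is the point.
PRETOK_DIR = Path("data/pretok/castor_mix_v0")
TOKENIZER = Path("tokenizers/fineweb_8192_bpe.model")

def shards_ready(pretok_dir: Path) -> bool:
    """Treat the shards as prepared if the output directory contains any .bin files."""
    return pretok_dir.is_dir() and any(pretok_dir.glob("*.bin"))

if not shards_ready(PRETOK_DIR):
    # Pretokenize the raw mixture once; later runs reuse the shards.
    subprocess.run(["python", "pretokenize.py",               # hypothetical script name
                    "--mix", "castor_pretrain_mix_v0.yaml",
                    "--tokenizer", str(TOKENIZER),
                    "--out", str(PRETOK_DIR)], check=True)

# Hand off to the underlying Castor launch script for the actual 12-hour run.
subprocess.run(["bash", "train_l7grow_v4_castor.sh"], check=True)
```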
Included Files
- `train_seed1337.log`: exact historical trainer log
- `l7grow_v4_castor_12h.env`: exact run environment/config
- `castor_pretrain_mix_v0.yaml`: dataset mixture config
- `train_l7grow_v4_castor_12h.sh`: wrapper entrypoint
- `train_l7grow_v4_castor.sh`: underlying Castor launch script
- `train_gpt.py`: Wrapper
- `train_gpt_human.py`: Code
- `env_utils.py`: env-file loader used by the trainer
- `ARTIFACTS.md`: local paths and hashes for retained uncommitted weights
- `submission.json`: metadata for this non-record archive
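`env_utils.py` in this folder is the authoritative loader. As an illustration of the KEY=VALUE env-file pattern that settings such as `EMA_ENABLED=0` and `SKIP_FINAL_PACKAGING=1` rely on, a minimal reader under assumed parsing rules could look like:

```python
import os
from pathlib import Path

def load_env_file(path: str, override: bool = False) -> dict[str, str]:
    """Parse KEY=VALUE lines (ignoring blanks and # comments) and export them.

    Minimal stand-in for the included env_utils.py, which is the loader the
    trainer actually uses; quoting and override rules here are assumptions.
    """
    values: dict[str, str] = {}
    for raw in Path(path).read_text().splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip('"').strip("'")
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value
    return values

# Example: flags would then be read as strings, e.g.
# cfg = load_env_file("l7grow_v4_castor_12h.env")
# ema_on = cfg.get("EMA_ENABLED") == "1"
```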