@@ -0,0 +1,13 @@
# Retained Local Artifacts

Large model artifacts are not committed in this non-record folder. They are
currently retained locally at:

| Artifact | Path | Bytes | SHA256 |
|---|---|---:|---|
| final full-precision model | `/home/simon/castorv2/runs/castor_l7grow_v4_12h_seed1337/final_model.pt` | 135431355 | `02959aa988dd1668ca696ce1a0058309ea4fe52d3505f2a560f5240d74f6bac9` |
| final full-precision snapshot | `/home/simon/castorv2/logs/castor_l7grow_v4_12h_seed1337.final_model_snapshot.pt` | 135431355 | `02959aa988dd1668ca696ce1a0058309ea4fe52d3505f2a560f5240d74f6bac9` |
| latest training checkpoint | `/home/simon/castorv2/runs/castor_l7grow_v4_12h_seed1337/checkpoints/latest.pt` | 286390027 | `734a0f69d377a96439ae3ba8a4814741c5b270f6ef921e912f10ed48f93e4466` |

The run used `SKIP_FINAL_PACKAGING=1`, so no final compressed int6 `.ptz`
artifact was produced for this archive.
@@ -0,0 +1,114 @@
# L7 growth v4 precursor to PR 2014, 12 hours on RTX 4090, val_bpb 0.9697 pre-quant

This is an archival **non-record** submission package for a 12-hour Castor
pretraining run based on the l7 growth v4 recipe.

I ran this for a personal project, but I think the result is interesting, so I decided to share it even though we are past the deadline.

## Main Differences From PR 2014

- Max context size was 8k instead of 3k.
- I didn't pre-compile for each context size, since the cost of compilation on a 12-hour run is not significant.
- I used a customized LR curve that I didn't include in PR 2014, since it doesn't quantize well.
- No EMA.
- The datasets used are different; they are detailed in the included .yaml file.

## Result

The exact logged final metrics are:

```text
pre-quantization post-ema val_loss:1.83671792 val_bpb:0.96976490 eval_time:184477ms
final_int6_roundtrip_exact val_loss:1.83671792 val_bpb:0.96976490 skipped_packaging:1
```

Notes:

- `EMA_ENABLED=0` in the config, despite the historical log string saying
`post-ema`.
- `SKIP_FINAL_PACKAGING=1`, so no final compressed 16MB package was produced.
- Because packaging was skipped, the `final_int6_roundtrip_exact` line should be
read as a no-packaging roundtrip/check value, not as a produced compressed
int6 submission artifact.
- The retained full-precision model is 135,431,355 bytes and is intentionally
not committed to this folder.
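
For reference, assuming `val_bpb` is the usual bits-per-byte metric (cross-entropy converted to bits per token, divided by the average bytes per token of the validation text), the two logged numbers imply roughly 2.73 bytes per token for the SP8192 tokenizer on this validation set. A minimal sketch of that arithmetic:

```python
import math

# Logged values; the bits-per-byte definition used here is the standard one
# and is assumed, not taken from the trainer code.
val_loss = 1.83671792                    # mean cross-entropy, nats per token
val_bpb = 0.96976490                     # bits per byte

bits_per_token = val_loss / math.log(2)              # ~2.650 bits/token
implied_bytes_per_token = bits_per_token / val_bpb   # ~2.73 bytes/token
print(round(bits_per_token, 3), round(implied_bytes_per_token, 3))
```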

## Model And Training Setup

- Parameters: `35,944,536` (a rough accounting sketch follows this list)
- Vocabulary: `8192`, tokenizer `fineweb_8192_bpe.model`
- Layers: `11`
- Model dim: `512`
- Heads: `8`, KV heads: `4`
- MLP multiplier: `4.0`
- Looping: enabled at `0.35`, `loop_start=3`, `loop_end=5`, `num_loops=2`
- Training wallclock cap: `43200s`
- Stopped at step `38707/100000`
- Training batch tokens: `262144`
- Validation batch tokens: `131072`
- Eval context: `8192`
- Eval stride: `4096`
- TTT: enabled, `8` epochs, `32768` chunk tokens, SGD LR `0.005`
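
As a rough sanity check on the parameter count, the back-of-the-envelope sketch below assumes tied input/output embeddings, standard bias-free GQA projection shapes, and 11 uniquely parameterized blocks (i.e. the looping reuses existing blocks rather than adding parameters). None of these assumptions are confirmed by the included code, but the total lands within about 0.1% of the logged figure, with the remainder plausibly norms and other small parameters.

```python
# Back-of-the-envelope parameter accounting (assumptions noted above; this is
# not the actual model code): vocab 8192, d_model 512, 11 blocks,
# 8 heads / 4 KV heads, MLP multiplier 4.0, tied embeddings.
vocab, d, layers, heads, kv_heads, mlp_mult = 8192, 512, 11, 8, 4, 4.0

head_dim = d // heads                 # 64
kv_dim = kv_heads * head_dim          # 256

embed = vocab * d                     # 4,194,304 (assumed tied with the LM head)
attn = d * d + d * kv_dim + d * kv_dim + d * d   # Wq + Wk + Wv + Wo = 786,432
mlp = 2 * d * int(mlp_mult * d)       # up + down projections = 2,097,152
total = embed + layers * (attn + mlp)  # 35,913,728 vs. logged 35,944,536

print(f"{total:,}")
```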

Progressive context schedule:

```text
1024@0.200,2048@0.750,4096@0.850,8192@1.000
```

Midrun LR cap schedule:

```text
1.000@0.000,1.000@0.400,0.500@0.400,0.300@0.500,0.180@0.600,0.110@0.700,0.090@0.800,0.070@1.000
```
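
Both schedules use the same `value@fraction` string format as the `TRAIN_SEQ_SCHEDULE` and `MIDRUN_CAP_SCHEDULE` keys in the env file. The trainer's exact semantics are not reproduced here; the sketch below assumes each entry means "use this value until the given fraction of the run" (the env sets `TRAIN_SEQ_SCHEDULE_MODE=wallclock`), and the LR cap may well be interpolated between points rather than stepped.

```python
# Minimal sketch of the assumed "value@fraction" schedule semantics; the real
# trainer may interpolate (e.g. for the LR cap) rather than step.
def parse_schedule(spec: str) -> list[tuple[float, float]]:
    """Parse 'v@f,v@f,...' into (value, fraction) pairs sorted by fraction."""
    pairs = []
    for item in spec.split(","):
        value, frac = item.split("@")
        pairs.append((float(value), float(frac)))
    return sorted(pairs, key=lambda p: p[1])


def value_at(pairs: list[tuple[float, float]], progress: float) -> float:
    """Step-wise lookup: first entry whose fraction covers the current progress."""
    for value, frac in pairs:
        if progress <= frac:
            return value
    return pairs[-1][0]


ctx = parse_schedule("1024@0.200,2048@0.750,4096@0.850,8192@1.000")
print(value_at(ctx, 0.5))   # -> 2048.0 under the assumed step-wise reading
```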

## Dataset

The run used a pretrain mixture described in
`castor_pretrain_mix_v0.yaml`:

- FineWeb English
- FineWeb2 French
- FineWeb-Edu English
- optional CommitPack code shards

The pretokenized output path in the original run was:

```text
./data/datasets/castor_pretrain_sp8192_v0
```

The tokenizer path was:

```text
./data/tokenizers/fineweb_8192_bpe.model
```

## Reproduction Command

From a workspace that contains the raw data and tokenizer:

```bash
CASTOR_TRAIN_ENV=./configs/train/l7grow_v4_castor_12h.env \
./scripts/train_l7grow_v4_castor_12h.sh
```

The wrapper prepares the pretokenized shards if needed, then launches:

```bash
SIMON_ENV_FILE=./configs/train/l7grow_v4_castor_12h.env \
./.venv/bin/python -u trainers/l7_grow/train_gpt.py
```
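
The included `env_utils.py` is the env-file loader the trainer uses. Exactly how `SIMON_ENV_FILE` is consumed is not shown in this archive, but a plausible (hypothetical) call site would pass its value, falling back to `.env`, to `load_env_file`:

```python
# Hypothetical call site (not copied from train_gpt_human.py): load the env
# file named by SIMON_ENV_FILE before reading any configuration values.
import os
from pathlib import Path

from env_utils import load_env_file

script_dir = Path(__file__).resolve().parent
load_env_file(script_dir, os.environ.get("SIMON_ENV_FILE", ".env"))
```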


## Included Files

- `train_seed1337.log`: exact historical trainer log
- `l7grow_v4_castor_12h.env`: exact run environment/config
- `castor_pretrain_mix_v0.yaml`: dataset mixture config
- `train_l7grow_v4_castor_12h.sh`: wrapper entrypoint
- `train_l7grow_v4_castor.sh`: underlying Castor launch script
- `train_gpt.py`: thin wrapper that just calls `main` from `train_gpt_human.py`
- `train_gpt_human.py`: the trainer implementation
- `env_utils.py`: env-file loader used by the trainer
- `ARTIFACTS.md`: local paths and hashes for retained uncommitted weights
- `submission.json`: metadata for this non-record archive
@@ -0,0 +1,34 @@
version: 1
name: castor_pretrain_mix_v0
description: |
  Conversion of the Castor JSONL sources to the .bin format expected by l7.grow.

tokenizer_path: data/tokenizers/fineweb_8192_bpe.model
vocab_size: 8192
output_dir: data/datasets/castor_pretrain_sp8192_v0
shard_size_tokens: 100000000
val_ratio: 0.005
append_eos: false
batch_size: 1024
seed: 1337

sources:
- name: fineweb_en_v0
glob: data/pretrain/raw/fineweb_en_v0/shards/*.jsonl
required: true
kind: text

- name: fineweb2_fr_v0
glob: data/pretrain/raw/fineweb2_fr_v0/shards/*.jsonl
required: true
kind: text

- name: fineweb_edu_en_v0
glob: data/pretrain/raw/fineweb_edu_en_v0/shards/*.jsonl
required: true
kind: text

- name: commitpack_code_v0
glob: data/pretrain/raw/commitpack_code_v0/**/*.jsonl
required: false
kind: code
@@ -0,0 +1,75 @@
from __future__ import annotations

import os
from pathlib import Path

PATH_LIKE_KEYS = frozenset(
{
"DATA_DIR",
"TOKENIZER_PATH",
"SAMPLE_CHECKPOINT",
"DATASETS_DIR",
"TRAIN_FILES",
"VAL_FILES",
"RUN_DIR",
"CHECKPOINT_DIR",
"RESUME_CHECKPOINT",
"INIT_MODEL_PATH",
"MODEL_PATH",
"QUANTIZED_MODEL_PATH",
"LOGFILE",
}
)


def resolve_path_value(script_dir: Path, raw_value: str) -> str:
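    """Return raw_value as an absolute path string: absolute inputs pass through;
    relative ones resolve against script_dir, then its parent, falling back to
    the script_dir candidate when neither exists."""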
path = Path(raw_value.strip())
if path.is_absolute():
return str(path)

candidates = [
(script_dir / path).resolve(),
(script_dir.parent / path).resolve(),
]
for candidate in candidates:
if candidate.exists():
return str(candidate)
return str(candidates[0])


def load_env_file(script_dir: Path, filename: str = ".env") -> None:
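    """Load KEY=VALUE pairs from an env file into os.environ without overriding
    variables that are already set; values of path-like keys are resolved to
    absolute paths via resolve_path_value."""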
env_path = Path(filename)
if not env_path.is_absolute():
candidates = [
(script_dir / env_path).resolve(),
(script_dir.parent / env_path).resolve(),
(Path.cwd() / env_path).resolve(),
]
for candidate in candidates:
if candidate.is_file():
env_path = candidate
break
else:
env_path = candidates[0]
if not env_path.is_file():
return

for raw_line in env_path.read_text(encoding="utf-8").splitlines():
line = raw_line.strip()
if not line or line.startswith("#"):
continue
if line.startswith("export "):
line = line[7:].lstrip()
if "=" not in line:
continue

key, value = line.split("=", 1)
key = key.strip()
value = value.strip()
if not key or key in os.environ:
continue
if len(value) >= 2 and value[0] == value[-1] and value[0] in {"'", '"'}:
value = value[1:-1]
if key in PATH_LIKE_KEYS and value:
value = resolve_path_value(script_dir, value)
os.environ[key] = value
@@ -0,0 +1,52 @@
# Castor v2 12-hour pretrain phase starting from Simon's l7.grow v4 snapshot
# on the Castor EN/FR + code pretrain mix.

RUN_ID=castor_l7grow_v4_12h_seed1337
SEED=1337

DATA_DIR=./data
DATASETS_DIR=./data/datasets/castor_pretrain_sp8192_v0
TOKENIZER_PATH=./data/tokenizers/fineweb_8192_bpe.model

RUN_DIR=./runs/castor_l7grow_v4_12h_seed1337
CHECKPOINT_DIR=./runs/castor_l7grow_v4_12h_seed1337/checkpoints
RESUME_CHECKPOINT=./runs/castor_l7grow_v4_12h_seed1337/checkpoints/latest.pt
INIT_MODEL_PATH=./checkpoints/bootstrap/l7grow_v4_seed1337_init.pt
MODEL_PATH=./runs/castor_l7grow_v4_12h_seed1337/final_model.pt
QUANTIZED_MODEL_PATH=./runs/castor_l7grow_v4_12h_seed1337/final_model.int6.ptz
LOGFILE=./logs/castor_l7grow_v4_12h_seed1337.txt

VOCAB_SIZE=8192
MAX_WALLCLOCK_SECONDS=43200
ITERATIONS=100000
SAVE_CHECKPOINT_EVERY=1000
KEEP_STEP_CHECKPOINTS=0

TRAIN_BATCH_TOKENS=262144
VAL_BATCH_TOKENS=131072
VAL_LOSS_EVERY=8000
TRAIN_LOG_EVERY=500

TRAIN_SEQ_LEN=8192
ROPE_TRAIN_SEQ_LEN=8192
TRAIN_SEQ_SCHEDULE=1024@0.200,2048@0.750,4096@0.850,8192@1.000
TRAIN_SEQ_SCHEDULE_MODE=wallclock
SEQ_CHANGE_WARMUP_STEPS=32

MIDRUN_CAP_SCHEDULE=1.000@0.000,1.000@0.400,0.500@0.400,0.300@0.500,0.180@0.600,0.110@0.700,0.090@0.800,0.070@1.000
WARMDOWN_ITERS=12800

EMA_ENABLED=0
COMPILE_MODEL=1
COMPILE_DYNAMIC=1
DYNAMO_CACHE_SIZE_LIMIT=64

EVAL_SEQ_LEN=8192
EVAL_STRIDE=4096
SLIDING_WINDOW_ENABLED=1
TTT_ENABLED=1
TTT_EPOCHS=8
TTT_CHUNK_TOKENS=32768

SKIP_FINAL_PACKAGING=1
SAVE_PRE_QUANT_SNAPSHOT=1
@@ -0,0 +1,18 @@
{
"author": "Simon Bissonnette",
"github_id": "simonbissonnette",
"name": "L7 growth v4 precursor to PR 2014, 12 hours on RTX 4090, 0.9697 val_bpb",
"track": "non-record-unlimited-compute-archival",
"date": "2026-04-15",
"val_loss": 1.83671792,
"val_bpb": 0.96976490,
"model_params": 35944536,
"step_stop": 38707,
"iterations": 100000,
"wallclock_seconds": 43189.612,
"max_wallclock_seconds": 43200,
"artifact_bytes_full_precision": 135431355,
"bytes_total": null,
"packaging_skipped": true,
"blurb": "Archival non-record experiment from a personal Castor run: l7 growth v4 recipe, 35.9M params, SP8192 tokenizer, progressive context growth 1k->2k->4k->8k, a midrun LR cap, no EMA, and legal TTT. Final logged val_bpb is 0.96976490 pre-quant/check-only; final packaging was skipped, so this is shared as an interesting long-compute reference rather than a main-track 16MB record package."
}
@@ -0,0 +1,5 @@
from train_gpt_human import main


if __name__ == "__main__":
main()