Dolma 3 consists of three datasets constructed for the OLMo 3 family of models: Dolma 3 Mix, a diverse 5.9T-token pre-training dataset; Dolma 3 Dolmino Mix, a 100B-token mid-training dataset targeting performance improvements in math, code, QA, instruction, and thinking; and Dolma 3 Longmino Mix, a 50B-token long-context dataset. This repository contains the descriptions and code needed to reconstruct the Dolma 3 datasets.
For further details, please refer to the OLMo 3 paper and the OLMo 3 website.