
Fueling Next-Gen LLMs: Data-Driven Hyperparameter Setups Revealed

25 Jul 2025

Table of Links

Abstract and 1. Introduction

2. Method

3. Experiments on real data

4. Ablations on synthetic data

5. Why does it work? Some speculation

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

M. Training hyperparameters

Table S13: Overview of all training hyperparameters used. All learning rates are scheduled with a linear warmup followed by cosine decay (Loshchilov and Hutter, 2017) to a fraction of the peak learning rate, shown in the last column (“decay ratio”). All experiments use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.95 and decoupled L2 weight decay (Loshchilov and Hutter, 2019) with coefficient 0.1. Gradients are clipped to a maximal Euclidean norm of 1.0 in all experiments except the CodeContests finetunings, where we use 0.1 instead. Summarization finetunings correspond to three epochs on each dataset except BigPatent (one epoch). Byte-level models use the architecture with replicated unembeddings from Appendix B.
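For readers who want to reproduce this setup, the recipe in the caption maps directly onto standard PyTorch components. Below is a minimal sketch, not code from the paper: the peak learning rate, warmup length, and total step count are hypothetical placeholders (the actual per-experiment values live in Table S13), while the betas, weight decay coefficient, decay-ratio schedule, and clipping norm follow the caption.

```python
import math
import torch

# Placeholder values for illustration only; Table S13 lists the real
# peak learning rates, warmup steps, and decay ratios per experiment.
peak_lr = 3e-4
warmup_steps = 2_000
total_steps = 100_000
decay_ratio = 0.1   # final LR = decay_ratio * peak_lr (last column of Table S13)
clip_norm = 1.0     # 0.1 for the CodeContests finetunings

model = torch.nn.Linear(16, 16)  # stand-in for the actual transformer

# Adam (Kingma and Ba, 2015) with decoupled L2 weight decay
# (Loshchilov and Hutter, 2019) is AdamW in PyTorch.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_factor(step: int) -> float:
    """Linear warmup, then cosine decay to decay_ratio * peak_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return decay_ratio + (1.0 - decay_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# Inside the training loop, after loss.backward():
#     torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
#     optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Note that `LambdaLR` multiplies the base learning rate by the returned factor, so the factor itself interpolates from the warmup ramp down to `decay_ratio` rather than to zero, matching the “decay to a fraction of the peak learning rate” described in the caption.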

Table S14: Overview of model architectures used for scaling analyses.

Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech, and equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay, and equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and shared last author;

(5) Gabriel Synnaeve, FAIR at Meta and shared last author.


This paper is available on arXiv under a CC BY 4.0 DEED license.


