
Fueling Next-Gen LLMs: Data-Driven Hyperparameter Setups Revealed

25 Jul 2025

Table of Links

Abstract and 1. Introduction

2. Method

3. Experiments on real data

4. Ablations on synthetic data

5. Why does it work? Some speculation

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

M. Training hyperparameters

Table S13: Overview of all training hyperparameters used. All learning rates are scheduled with a linear warmup followed by cosine decay (Loshchilov and Hutter, 2017) to a fraction of the peak learning rate, shown in the last column (“decay ratio”). All experiments use the Adam optimizer (Kingma and Ba, 2015) with β1 = 0.9, β2 = 0.95 and decoupled L2 weight decay (Loshchilov and Hutter, 2019) with coefficient 0.1. Gradients are clipped to a maximal Euclidean norm of 1.0 in all experiments except the CodeContests finetunings, where we use 0.1 instead. Summarization finetunings correspond to three epochs on each dataset except BigPatent (one epoch). Byte-level models use the architecture with replicated unembeddings from Appendix B.
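For readers who want to reproduce this setup, the recipe in the caption maps directly onto standard PyTorch components. Below is a minimal sketch, not code from the paper: the peak learning rate, warmup length, and total step count are hypothetical placeholders (the actual per-experiment values live in Table S13), while the betas, weight decay coefficient, decay-ratio schedule, and clipping norm follow the caption.

```python
import math
import torch

# Placeholder values for illustration only; Table S13 lists the real
# peak learning rates, warmup steps, and decay ratios per experiment.
peak_lr = 3e-4
warmup_steps = 2_000
total_steps = 100_000
decay_ratio = 0.1   # final LR = decay_ratio * peak_lr (last column of Table S13)
clip_norm = 1.0     # 0.1 for the CodeContests finetunings

model = torch.nn.Linear(16, 16)  # stand-in for the actual transformer

# Adam (Kingma and Ba, 2015) with decoupled L2 weight decay
# (Loshchilov and Hutter, 2019) is AdamW in PyTorch.
optimizer = torch.optim.AdamW(
    model.parameters(), lr=peak_lr, betas=(0.9, 0.95), weight_decay=0.1
)

def lr_factor(step: int) -> float:
    """Linear warmup, then cosine decay to decay_ratio * peak_lr."""
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return decay_ratio + (1.0 - decay_ratio) * cosine

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# Inside the training loop, after loss.backward():
#     torch.nn.utils.clip_grad_norm_(model.parameters(), clip_norm)
#     optimizer.step(); scheduler.step(); optimizer.zero_grad()
```

Note that `LambdaLR` multiplies the base learning rate by the returned factor, so the factor itself interpolates from the warmup ramp down to `decay_ratio` rather than to zero, matching the “decay to a fraction of the peak learning rate” described in the caption.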

Table S14: Overview of model architectures used for scaling analyses.

Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech, and equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay, and equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and shared last author;

(5) Gabriel Synnaeve, FAIR at Meta and shared last author.


This paper is available on arXiv under a CC BY 4.0 DEED license.


