
Fueling Next-Gen LLMs: Data-Driven Hyperparameter Setups Revealed
25 Jul 2025
This section unveils the data-driven hyperparameter configurations essential for training powerful LLMs, covering specific setups for model scaling, byte-level

Decoding the Magic: Multi-Token Prediction's Information-Theoretic Edge & Beyond
25 Jul 2025
Explore the sophisticated mechanisms driving multi-token prediction. This section rigorously explains its edge via information-theoretic mutual information

Multi-Token Prediction: Mastering Algorithmic Reasoning with Enhanced Resource Use
23 Jul 2025
Discover how multi-token prediction improves LLM algorithmic reasoning, potentially by learning to allocate computational resources more efficiently

Redefining Induction: Multi-Token vs. Next-Token on High-Quality LLM Data
23 Jul 2025
This figure shows how training on higher-quality data leads induction capability to emerge early.

Strategic LLM Training: Multi-Token Prediction's Data Efficiency in Mathematical Reasoning
23 Jul 2025
This figure illustrates the profound impact of training scale on multi-token prediction models' performance on GSM8K, highlighting critical data efficiency

Igniting Generative Power: Multi-Token LLMs for Advanced Text Summarization
23 Jul 2025
This section reports on the significant strides made in abstractive text summarization by 7B multi-token prediction models.

Multi-Token Prediction: Exploring Performance on NLP Benchmarks
23 Jul 2025
Discover how multi-token prediction models fare on specific NLP benchmark categories, including those testing common sense reasoning and factual knowledge.

Real-World Code Performance: Multi-Token Finetuning on CodeContests
22 Jul 2025
This section outlines the practical evaluation of multi-token pretrained models finetuned on the CodeContests dataset, assessing their real-world coding

Deep Dive into LLM Scaling: Multi-Token Prediction's Impact on Coding Accuracy
22 Jul 2025
This section provides a meticulous analysis (Table S7) of multi-token prediction's influence on LLM scaling behavior, detailing pass@k metrics on MBPP