
Fueling Next-Gen LLMs: Data-Driven Hyperparameter Setups Revealed

25 Jul 2025

This section unveils the data-driven hyperparameter configurations essential for training powerful LLMs, covering specific setups for model scaling and byte-level models.


Decoding the Magic: Multi-Token Prediction's Information-Theoretic Edge & Beyond

25 Jul 2025

Explore the sophisticated mechanisms driving multi-token prediction. This section rigorously explains its edge via an information-theoretic argument based on mutual information.
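The mutual-information argument rests on estimating how much knowing one token tells you about another. As a minimal sketch (the function name and plug-in frequency estimator are illustrative, not taken from the article), empirical mutual information between paired observations can be computed like this:

```python
import math
from collections import Counter

def mutual_information(pairs):
    """Plug-in estimate of I(X; Y) in nats from a list of (x, y) pairs,
    using empirical joint and marginal frequencies."""
    n = len(pairs)
    joint = Counter(pairs)
    px = Counter(x for x, _ in pairs)
    py = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        p_xy = c / n  # empirical joint probability
        # p_xy * log( p_xy / (p_x * p_y) ), with marginals px[x]/n and py[y]/n
        mi += p_xy * math.log(p_xy * n * n / (px[x] * py[y]))
    return mi
```

Independent pairs give a value near zero, while perfectly correlated pairs give the entropy of one variable; the intuition is that multi-token prediction exploits exactly this shared information between nearby tokens.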


Multi-Token Prediction: Mastering Algorithmic Reasoning with Enhanced Resource Use

23 Jul 2025

Discover how multi-token prediction improves LLM algorithmic reasoning, potentially by learning to allocate computational resources more efficiently.


Redefining Induction: Multi-Token vs. Next-Token on High-Quality LLM Data

23 Jul 2025

This figure showcases how training on higher-quality data accelerates the emergence of induction capability.


Strategic LLM Training: Multi-Token Prediction's Data Efficiency in Mathematical Reasoning

23 Jul 2025

This figure illustrates the profound impact of training scale on multi-token prediction models' performance on GSM8K, highlighting critical data-efficiency gains.


Igniting Generative Power: Multi-Token LLMs for Advanced Text Summarization

23 Jul 2025

This section reports on the significant strides made in abstractive text summarization by 7B multi-token prediction models.


Multi-Token Prediction: Exploring Performance on NLP Benchmarks

23 Jul 2025

Discover how multi-token prediction models fare on specific NLP benchmark categories, including those testing common sense reasoning and factual knowledge.


Real-World Code Performance: Multi-Token Finetuning on CodeContests

22 Jul 2025

This section outlines the practical evaluation of multi-token pretrained models finetuned on the CodeContests dataset, assessing their real-world coding ability.


Deep Dive into LLM Scaling: Multi-Token Prediction's Impact on Coding Accuracy

22 Jul 2025

This section provides a meticulous analysis (Table S7) of multi-token prediction's influence on LLM scaling behavior, detailing pass@k metrics on the MBPP benchmark.
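The pass@k metric reported in such tables is usually the unbiased estimator introduced with HumanEval: generate n samples per problem, count the c correct ones, and compute the probability that at least one of k drawn samples passes. A minimal sketch (the function name is illustrative):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator: probability that at least one of k
    samples, drawn without replacement from n generations of which c
    are correct, passes. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any k-subset
        # must contain a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with n = 2 generations of which c = 1 is correct, pass@1 is 0.5, matching the naive per-sample success rate; the estimator's value is that it stays unbiased for k > 1.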