Igniting Generative Power: Multi-Token LLMs for Advanced Text Summarization

23 Jul 2025

Abstract and 1. Introduction

2. Method

3. Experiments on real data

4. Ablations on synthetic data

5. Why does it work? Some speculation

6. Related work

7. Conclusion, Impact statement, Environmental impact, Acknowledgements and References

A. Additional results on self-speculative decoding

B. Alternative architectures

C. Training speeds

D. Finetuning

E. Additional results on model scaling behavior

F. Details on CodeContests finetuning

G. Additional results on natural language benchmarks

H. Additional results on abstractive text summarization

I. Additional results on mathematical reasoning in natural language

J. Additional results on induction learning

K. Additional results on algorithmic reasoning

L. Additional intuitions on multi-token prediction

M. Training hyperparameters

H. Additional results on abstractive text summarization

In this section, we report comprehensive evaluation results on summarization tasks for the 7B parameter models trained on 200B and 500B tokens of natural language from Section 3.7.

Table S8: Comprehensive evaluation on abstractive text summarization. ROUGE-n (n-gram overlap) and ROUGE-L (longest common subsequence overlap) F1 scores for 7B models trained on 200B and 500B tokens of natural language. The last three columns correspond to models trained on 500B tokens, the previous three to models trained on 200B tokens. Shown are the scores of the n = 1 baseline and the absolute differences of the n = 2 and n = 4 models trained on the same number of tokens. Summary-level ROUGE-L (“ROUGE-Lsum”) is reported where it differs from ROUGE-L. Model checkpoints with maximal validation ROUGE-L F1 are selected separately for each dataset and model type and reported in the first row corresponding to each dataset. Boldface marks numbers within 0.05 of the best one for each dataset size separately.
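For readers unfamiliar with the metrics named in the caption, the following is a minimal sketch of ROUGE-n and ROUGE-L precision, recall and F1. It is an illustration only, not the scoring code behind the tables: it assumes lowercased whitespace tokenization without stemming, the function names are our own, and ROUGE-Lsum (which works at the level of summary sentences) is not covered.

```python
from collections import Counter


def _f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 if both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def _ngrams(tokens, n):
    """Multiset of n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(candidate, reference, n=1):
    """Return (precision, recall, F1) of clipped n-gram overlap."""
    cand = _ngrams(candidate.lower().split(), n)
    ref = _ngrams(reference.lower().split(), n)
    overlap = sum((cand & ref).values())  # counts clipped to the minimum
    precision = overlap / max(sum(cand.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    return precision, recall, _f1(precision, recall)


def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate, reference):
    """Return (precision, recall, F1) based on the LCS length."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    precision = lcs / max(len(cand), 1)
    recall = lcs / max(len(ref), 1)
    return precision, recall, _f1(precision, recall)
```

As a toy example, comparing the candidate "the cat sat on the mat" with the reference "the cat lay on the mat" gives five matching unigrams out of six on each side, so ROUGE-1 F1 is 5/6. The longest common subsequence is "the cat on the mat" (length 5), so ROUGE-L F1 is also 5/6, while only three of five bigrams match, giving a ROUGE-2 F1 of 0.6.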

Table S9: Performance on abstractive text summarization. ROUGE-L (longest common subsequence overlap) F1 score for 7B models trained on 200B and 500B tokens of natural language. We finetune the respective models on each task’s training data separately for a given number of epochs and select the checkpoints with maximal ROUGE-L F1 on the validation dataset. The second and fifth columns report the numbers for a next-token prediction model, while the third, fourth, sixth and seventh columns report the absolute improvements for 2-token and 4-token prediction models trained on the same amount of data, respectively. Boldface marks numbers within 0.05 of the best one for each dataset size separately.
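The reporting protocol in this caption (pick the checkpoint with the best validation ROUGE-L F1, report the baseline score, report multi-token models as absolute differences, and mark near-best entries) can be sketched as below. The helper names and all numbers used with them are hypothetical placeholders for illustration, not results from the paper.

```python
def select_checkpoint(val_rouge_l_by_epoch):
    """Return the epoch whose validation ROUGE-L F1 is maximal."""
    return max(val_rouge_l_by_epoch, key=val_rouge_l_by_epoch.get)


def table_row(baseline_score, multi_token_scores):
    """Baseline score followed by absolute differences to the baseline."""
    return [baseline_score] + [
        round(score - baseline_score, 2) for score in multi_token_scores
    ]


def near_best(scores_by_model, tol=0.05):
    """Models whose score is within `tol` of the best (boldface candidates)."""
    best = max(scores_by_model.values())
    # Small epsilon guards against floating-point noise at the boundary.
    return {
        name for name, score in scores_by_model.items()
        if best - score <= tol + 1e-9
    }


# Placeholder values, chosen only to exercise the helpers.
best_epoch = select_checkpoint({1: 24.1, 2: 25.3, 3: 25.0})
row = table_row(25.0, [25.5, 25.25])
bold = near_best({"n=1": 25.0, "n=2": 25.5, "n=4": 25.48})
```

Reporting differences rather than raw scores makes the sign and size of the multi-token effect readable at a glance, at the cost of requiring the baseline column to recover absolute values.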

Table S10: Summary statistics for abstractive text summarization evaluations. Reported are averages for ROUGE-n and ROUGE-L metrics across all datasets from Table S8, separately for precision, recall and F1 score. Both 2-token and 4-token prediction models outperform the next-token prediction baseline. When trained on 500B tokens, 4-token prediction models appear better at recall metrics while 2-token prediction models appear better at precision metrics. Model checkpoints are selected as described in Table S8. Boldface marks numbers within 0.05 of the best one for each dataset size separately.
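The aggregation described here can be illustrated with a short sketch that averages each metric across datasets, keeping precision, recall and F1 separate. It assumes an unweighted mean over datasets, which the caption does not state explicitly; the data layout, the function name and the numbers are placeholders of our own.

```python
from statistics import mean

PARTS = ("precision", "recall", "f1")


def average_metrics(per_dataset):
    """Average each metric over datasets, separately per score type.

    per_dataset maps dataset name -> metric name -> {"precision", "recall", "f1"}.
    Assumption: every dataset contributes with equal weight.
    """
    metric_names = sorted({m for scores in per_dataset.values() for m in scores})
    return {
        metric: {
            part: mean(
                scores[metric][part]
                for scores in per_dataset.values()
                if metric in scores
            )
            for part in PARTS
        }
        for metric in metric_names
    }


# Placeholder scores for two hypothetical datasets.
example = {
    "dataset_a": {"rouge_l": {"precision": 0.25, "recall": 0.5, "f1": 0.375}},
    "dataset_b": {"rouge_l": {"precision": 0.5, "recall": 0.25, "f1": 0.375}},
}
summary = average_metrics(example)
```

Keeping precision and recall apart is what allows a statement such as "4-token models appear better at recall" to be read off the averages; an F1-only summary would hide that distinction.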

Authors:

(1) Fabian Gloeckle, FAIR at Meta, CERMICS Ecole des Ponts ParisTech and Equal contribution;

(2) Badr Youbi Idrissi, FAIR at Meta, LISN Université Paris-Saclay and Equal contribution;

(3) Baptiste Rozière, FAIR at Meta;

(4) David Lopez-Paz, FAIR at Meta and a last author;

(5) Gabriel Synnaeve, FAIR at Meta and a last author.


This paper is available on arXiv under the CC BY 4.0 DEED license.