Optimizing Pretraining Datasets for Large Language Models Through Recursive Perplexity Correlations
Abstract
The exponential growth of textual data has driven the development of increasingly capable language models, enabling unprecedented levels of language comprehension and generation. This work introduces perplexity correlations as a mechanism for optimizing pretraining datasets, substantially improving the quality and effectiveness of the data used to train language models. By analyzing perplexity scores in relation to various dataset attributes, the approach identifies and refines high-quality data segments, yielding a more coherent and representative training corpus. Applying this methodology to the Mistral large language model produced notable reductions in perplexity and marked gains on standard benchmark evaluations. These results underscore the critical importance of data quality in language model training and show that strategic dataset optimization via perplexity correlations can substantially improve model performance and reliability. The study thus provides a robust framework for continued advances in language model development, emphasizing the central role of data refinement in achieving stronger natural language processing capabilities.
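The abstract does not spell out how perplexity correlations translate into a selection rule, so the following is only a minimal sketch under assumptions: it supposes access to per-domain log-perplexities from several reference models and one benchmark score per model, ranks candidate domains by the correlation between the two, and keeps the domains where low perplexity most strongly tracks high benchmark performance. The function name `select_domains`, the `keep_fraction` parameter, and the single-pass (non-recursive) selection are illustrative choices, not the paper's method.

```python
# Hypothetical sketch of perplexity-correlation-based data selection.
# Assumes a matrix of per-domain log-perplexities from several reference
# models and one downstream benchmark score per model. The variable names
# and the selection rule are illustrative, not taken from the paper.
import numpy as np
from scipy.stats import spearmanr

def select_domains(log_ppl: np.ndarray,
                   bench: np.ndarray,
                   keep_fraction: float = 0.5) -> np.ndarray:
    """Rank pretraining domains by how strongly low perplexity on the
    domain tracks high benchmark scores across reference models.

    log_ppl : (n_models, n_domains) log-perplexity of each model on each domain
    bench   : (n_models,) benchmark accuracy of each model
    Returns indices of the selected domains.
    """
    n_domains = log_ppl.shape[1]
    corr = np.empty(n_domains)
    for j in range(n_domains):
        # Rank correlation between domain perplexity and benchmark score;
        # a strongly negative value means models that fit this domain well
        # also tend to score well downstream.
        corr[j], _ = spearmanr(log_ppl[:, j], bench)
    n_keep = max(1, int(keep_fraction * n_domains))
    return np.argsort(corr)[:n_keep]  # most negative correlations first

# Toy usage: 5 reference models, 8 candidate domains.
rng = np.random.default_rng(0)
log_ppl = rng.normal(3.0, 0.5, size=(5, 8))
bench = rng.uniform(0.3, 0.7, size=5)
print(select_domains(log_ppl, bench, keep_fraction=0.25))
```

In practice the selected domains would be reweighted or filtered in the pretraining mixture, and the "recursive" aspect referenced in the title presumably repeats this scoring as the corpus is refined; that loop is omitted here for brevity.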