Architectures for Data-Aware LLMs: Models that Reason About Their Own Training Signal
Abstract
Current large language models (LLMs) are fundamentally data-blind: they inherit structure, biases, and gaps from their training corpora, yet they lack any explicit representation of where their knowledge came from, how reliable it is, or which parts of the world they have effectively “seen”. As a result, standard models struggle to answer basic epistemic questions: How confident should I be about this query, given my training signal? Which domains or populations are under-represented in my experience? Which new documents, experiments, or user interactions would most efficiently reduce my uncertainty? In this Perspective, we outline an emerging paradigm of data-aware LLMs that treat training data and learning history as first-class objects of computation. We propose architectural mechanisms for encoding data provenance, density, diversity, and conflict into persistent meta-representations that are accessible at inference time. These meta-layers enable models to expose calibrated uncertainty, surface data gaps, and condition their behavior on an explicit epistemic state. We then discuss how data-aware models can drive active learning loops (proposing targeted data acquisitions, negotiating access with human and machine partners, and continuously updating their own meta-representations) to remain aligned with evolving domains and standards. Finally, we highlight applications to domain-shift detection, robustness, and scientific discovery, and we analyze open challenges in privacy, governance, and the standardization of data meta-layers. We argue that making models explicitly aware of their epistemic roots is a necessary next step toward trustworthy deployment in high-stakes scientific, industrial, and societal contexts.
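To make the notion of a queryable meta-layer concrete, the minimal Python sketch below shows one shape such a representation could take. It is an illustration under our own assumptions, not an interface proposed in the paper: the `DataMetaLayer` fields, the `confidence` heuristic, and `answer_with_epistemic_state` are all invented names, and the numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class DataMetaLayer:
    """Per-domain summary of the training signal (all fields hypothetical)."""
    provenance: dict[str, float]  # source name -> fraction of tokens from that source
    density: float                # relative corpus coverage of this domain, in [0, 1]
    diversity: float              # e.g. normalized entropy over sources, in [0, 1]
    conflict: float               # fraction of documents with contradictory claims

    def confidence(self) -> float:
        """Toy scalar confidence: rewards density and diversity, penalizes conflict."""
        return max(0.0, min(1.0, self.density * self.diversity * (1.0 - self.conflict)))

def answer_with_epistemic_state(domain: str,
                                meta: dict[str, DataMetaLayer],
                                threshold: float = 0.5) -> str:
    """Condition behavior on explicit epistemic state: hedge or abstain on data gaps."""
    layer = meta.get(domain)
    if layer is None:
        return f"[data gap] no meta-layer record for domain '{domain}'"
    c = layer.confidence()
    if c < threshold:
        return f"[low confidence {c:.2f}] domain '{domain}' is under-represented"
    return f"[confidence {c:.2f}] answering normally for domain '{domain}'"

# Illustrative usage with invented numbers.
meta = {
    "organic_chemistry": DataMetaLayer(
        provenance={"journals": 0.7, "web": 0.3}, density=0.8, diversity=0.9, conflict=0.1),
    "rare_disease_dosing": DataMetaLayer(
        provenance={"web": 1.0}, density=0.2, diversity=0.3, conflict=0.4),
}
print(answer_with_epistemic_state("organic_chemistry", meta))   # high confidence
print(answer_with_epistemic_state("rare_disease_dosing", meta)) # surfaces the data gap
```

In this reading, the abstract's "condition their behavior on explicit epistemic state" corresponds to branching on the meta-layer before answering; a real system would of course replace the toy confidence product with calibrated estimates learned alongside the model.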