Architectures for Data-Aware LLMs: Models that Reason About Their Own Training Signal
Abstract
Current large language models (LLMs) are fundamentally data-blind: they inherit structure, biases, and gaps from their training corpora, yet they lack any explicit representation of where their knowledge came from, how reliable it is, or which parts of the world they have effectively “seen”. As a result, standard models struggle to answer basic epistemic questions: How confident should I be about this query, given my training signal? Which domains or populations are under-represented in my experience? Which new documents, experiments, or user interactions would most efficiently reduce my uncertainty? In this Perspective, we outline an emerging paradigm of data-aware LLMs that treat training data and learning history as first-class objects of computation. We propose architectural mechanisms for encoding data provenance, density, diversity, and conflict into persistent meta-representations that are accessible at inference time. These meta-layers enable models to expose calibrated uncertainty, surface data gaps, and condition their behavior on an explicit epistemic state. We then discuss how data-aware models can drive active learning loops (proposing targeted data acquisitions, negotiating access with human and machine partners, and continuously updating their own meta-representations) to remain aligned with evolving domains and standards. Finally, we highlight applications to domain-shift detection, robustness, and scientific discovery, and we analyze open challenges in privacy, governance, and the standardization of data meta-layers. We argue that making models explicitly aware of their epistemic roots is a necessary next step toward trustworthy deployment in high-stakes scientific, industrial, and societal contexts.
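To make the notion of a queryable meta-layer concrete, the minimal Python sketch below shows one shape such a representation could take. It is an illustration under our own assumptions, not an interface proposed in the paper: the `DataMetaLayer` fields, the `confidence` heuristic, and `answer_with_epistemic_state` are all invented names, and the numbers are made up.

```python
from dataclasses import dataclass

@dataclass
class DataMetaLayer:
    """Per-domain summary of the training signal (all fields hypothetical)."""
    provenance: dict[str, float]  # source name -> fraction of tokens from that source
    density: float                # relative corpus coverage of this domain, in [0, 1]
    diversity: float              # e.g. normalized entropy over sources, in [0, 1]
    conflict: float               # fraction of documents with contradictory claims

    def confidence(self) -> float:
        """Toy scalar confidence: rewards density and diversity, penalizes conflict."""
        return max(0.0, min(1.0, self.density * self.diversity * (1.0 - self.conflict)))

def answer_with_epistemic_state(domain: str,
                                meta: dict[str, DataMetaLayer],
                                threshold: float = 0.5) -> str:
    """Condition behavior on explicit epistemic state: hedge or abstain on data gaps."""
    layer = meta.get(domain)
    if layer is None:
        return f"[data gap] no meta-layer record for domain '{domain}'"
    c = layer.confidence()
    if c < threshold:
        return f"[low confidence {c:.2f}] domain '{domain}' is under-represented"
    return f"[confidence {c:.2f}] answering normally for domain '{domain}'"

# Illustrative usage with invented numbers.
meta = {
    "organic_chemistry": DataMetaLayer(
        provenance={"journals": 0.7, "web": 0.3}, density=0.8, diversity=0.9, conflict=0.1),
    "rare_disease_dosing": DataMetaLayer(
        provenance={"web": 1.0}, density=0.2, diversity=0.3, conflict=0.4),
}
print(answer_with_epistemic_state("organic_chemistry", meta))   # high confidence
print(answer_with_epistemic_state("rare_disease_dosing", meta)) # surfaces the data gap
```

In this reading, the abstract's "condition their behavior on explicit epistemic state" corresponds to branching on the meta-layer before answering; a real system would of course replace the toy confidence product with calibrated estimates learned alongside the model.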