Improving Large Language Models with Concept-Aware Fine-Tuning

Abstract

Large language models (LLMs) have become the cornerstone of modern AI. However, the current paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Specifically, an LLM first decomposes text into tokens, i.e., artificial text fragments. These tokens are then learned sequentially rather than as parts of a unified, coherent phrase or semantic entity [1]. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems [2–4]. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a multi-token training method that reshapes how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, the method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements over conventional next-token fine-tuning. CAFT applies to diverse tasks, from traditional LLM tasks such as coding to challenging scientific tasks involving domain-specific modalities such as de novo protein design. CAFT successfully leverages the multi-token setting for fine-tuning, an approach previously considered impossible [2,5,6], by introducing several technical innovations that address its inherent challenges. Our results challenge the machine learning research community to rethink the ubiquitous next-token prediction paradigm and enable the broader scientific community to develop more powerful scientific LLMs involving domain-specific modalities [7].
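The contrast the abstract draws between next-token and multi-token training can be made concrete with a small sketch. The code below is a hypothetical illustration, not the authors' CAFT implementation: it assumes a multi-token setup in which k auxiliary heads each predict the token i+1 steps ahead, so that the fragments of a multi-token span contribute to a joint loss rather than being learned one step at a time. All function names here are invented for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the vocabulary axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    # logits: (seq, vocab); targets: (seq,) integer token ids.
    probs = softmax(logits)
    return float(-np.log(probs[np.arange(len(targets)), targets]).mean())

def next_token_loss(logits, targets):
    # Conventional fine-tuning: each position predicts only the
    # single token one step ahead.
    return cross_entropy(logits, targets)

def multi_token_loss(head_logits, targets):
    # Hypothetical multi-token objective: head_logits[i] predicts the
    # token i+1 steps ahead at every position; averaging the per-head
    # losses trains multi-token spans jointly.
    losses = []
    for i, logits in enumerate(head_logits):
        shifted = targets[i:]                  # target i+1 steps ahead
        losses.append(cross_entropy(logits[:len(shifted)], shifted))
    return float(np.mean(losses))
```

Under this sketch, a phrase tokenized into several fragments is scored by every head within its span, which is one simple way to realize the "unified, coherent phrase" learning the abstract describes; the paper's actual mechanism may differ.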