Improving Large Language Models with Concept-Aware Fine-Tuning

Abstract

Large language models (LLMs) have become the cornerstone of modern AI. However, the current paradigm of next-token prediction fundamentally limits their ability to form coherent, high-level concepts, making it a critical barrier to human-like understanding and reasoning. Specifically, an LLM first decomposes text into tokens, i.e., artificial text fragments. These tokens are then learned sequentially rather than as parts of a unified, coherent phrase or semantic entity [1]. This fragmented representation hinders deeper conceptual understanding and, ultimately, the development of truly intelligent systems [2–4]. In response, we introduce Concept-Aware Fine-Tuning (CAFT), a multi-token training method that reshapes how LLMs are fine-tuned. By enabling the learning of sequences that span multiple tokens, the method fosters stronger concept-aware learning. Our experiments demonstrate significant improvements over conventional next-token fine-tuning. CAFT applies to diverse tasks, from traditional LLM tasks such as coding to challenging scientific tasks involving domain-specific modalities such as de novo protein design. CAFT successfully leverages the multi-token setting for fine-tuning, an approach previously considered impossible [2,5,6], by introducing several technical innovations that address its inherent challenges. Our results challenge the machine learning research community to rethink the ubiquitous next-token prediction paradigm and enable the broader scientific community to develop more powerful scientific LLMs involving domain-specific modalities [7].
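The contrast the abstract draws between next-token and multi-token training can be made concrete with a small sketch. The code below is a hypothetical illustration, not the authors' CAFT implementation: it assumes a multi-token setup in which k auxiliary heads each predict the token i+1 steps ahead, so that the fragments of a multi-token span contribute to a joint loss rather than being learned one step at a time. All function names here are invented for illustration.

```python
import numpy as np

def softmax(x):
    # Numerically stable softmax over the vocabulary axis.
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def cross_entropy(logits, targets):
    # logits: (seq, vocab); targets: (seq,) integer token ids.
    probs = softmax(logits)
    return float(-np.log(probs[np.arange(len(targets)), targets]).mean())

def next_token_loss(logits, targets):
    # Conventional fine-tuning: each position predicts only the
    # single token one step ahead.
    return cross_entropy(logits, targets)

def multi_token_loss(head_logits, targets):
    # Hypothetical multi-token objective: head_logits[i] predicts the
    # token i+1 steps ahead at every position; averaging the per-head
    # losses trains multi-token spans jointly.
    losses = []
    for i, logits in enumerate(head_logits):
        shifted = targets[i:]                  # target i+1 steps ahead
        losses.append(cross_entropy(logits[:len(shifted)], shifted))
    return float(np.mean(losses))
```

Under this sketch, a phrase tokenized into several fragments is scored by every head within its span, which is one simple way to realize the "unified, coherent phrase" learning the abstract describes; the paper's actual mechanism may differ.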