Reinforcement Learning for Large Language Model Fine-Tuning: A Systematic Literature Review
Abstract
Large Language Models (LLMs) have been developed for a wide range of language-based tasks, while Reinforcement Learning (RL) has primarily been applied to decision-making problems such as robotics, game playing, and control systems. These two paradigms are now increasingly integrated. In this literature review, we focus on RL4LLM fine-tuning, in which RL techniques are systematically leveraged to fine-tune LLMs and align them with various preferences. Our review provides a comprehensive analysis of 230 recent publications, presenting a methodological taxonomy that organizes current research into three primary method domains: Optimization Algorithm, concerning innovations in the core RL update rules; Training Framework, concerning the orchestration of the training process; and Reward Modeling, addressing how LLMs learn and represent preferences and feedback. Within these primary domains, we further analyze methods and innovations through more granular categories to provide an in-depth summary of RL4LLM fine-tuning research. We address three research questions: 1) an overview of recent methods, 2) their methodological innovations, and 3) their limitations and future directions. Our analysis demonstrates the breadth and impact of recent RL4LLM fine-tuning research while highlighting promising directions for future investigation.
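To make the core idea of RL4LLM fine-tuning concrete, the following toy sketch (ours, not from any paper surveyed) applies a REINFORCE-style policy-gradient update, one of the classic RL update rules underlying the Optimization Algorithm domain, to a tiny softmax "policy" over a three-token vocabulary. The reward function standing in for a reward model is a hand-written preference for one token; all names and hyperparameters here are illustrative assumptions.

```python
import math
import random

random.seed(0)

VOCAB = 3
theta = [0.0] * VOCAB  # policy logits (stand-in for LLM parameters)

def softmax(logits):
    # numerically stable softmax over the logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def reward(token):
    # assumed preference signal: the "reward model" prefers token 2
    return 1.0 if token == 2 else 0.0

def reinforce_step(lr=0.5):
    probs = softmax(theta)
    # sample one token from the current policy
    r, acc, token = random.random(), 0.0, VOCAB - 1
    for i, p in enumerate(probs):
        acc += p
        if r < acc:
            token = i
            break
    R = reward(token)
    # grad of log pi(token) w.r.t. logits is one-hot(token) - probs
    for i in range(VOCAB):
        grad = (1.0 if i == token else 0.0) - probs[i]
        theta[i] += lr * R * grad
    return token, R

for _ in range(200):
    reinforce_step()

print(softmax(theta))  # probability mass concentrates on the rewarded token
```

Methods surveyed in the review (e.g., PPO-style algorithms) replace this plain policy gradient with clipped, KL-regularized updates and learn the reward function from preference data, but the feedback loop of sample, score, update is the same.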