Meta-Thinking in LLMs via Multi-Agent Reinforcement Learning: A Survey


Abstract

This survey explores the development of meta-thinking capabilities in Large Language Models (LLMs) from a Multi-Agent Reinforcement Learning (MARL) perspective. Meta-thinking refers to the self-reflection, self-assessment, and regulation of internal reasoning processes, and represents a crucial step toward improving LLM reliability, adaptability, and performance, particularly in complex or high-stakes settings. The survey begins by examining current limitations of LLMs, including hallucinations and the absence of robust internal self-evaluation mechanisms. It then reviews contemporary approaches such as Reinforcement Learning from Human Feedback (RLHF), self-distillation, and Chain-of-Thought (CoT) prompting, highlighting both their contributions and limitations. The core focus of the survey is on multi-agent architectures, such as supervisor-agent hierarchies, debate-based systems, and theory-of-mind frameworks, that emulate human-like introspection and enhance robustness. By analyzing reward design, self-play dynamics, and continual learning strategies within MARL, the survey presents a structured roadmap for developing introspective, adaptive, and trustworthy LLM systems. It also discusses evaluation metrics, benchmark datasets, and future research directions, including neuroscience-inspired designs and hybrid symbolic-neural reasoning frameworks.
