Entropy-Driven Gradient Stability in Large Language Models: A Non-Equilibrium Thermodynamic Framework for Reinforcement Learning Optimization
Abstract
The optimization landscape of Large Language Models (LLMs) with extremely high parameter counts exhibits chaotic and unstable dynamics, particularly during reinforcement learning fine-tuning stages where sparse and heavy-tailed reward signals dominate. Existing approaches, such as Proximal Policy Optimization (PPO), rely on heuristic clipping mechanisms that impose rigid trust regions, often leading to gradient turbulence, mode collapse, and catastrophic updates. In this work, we introduce Thermodynamic Variational Optimization (TVO), a physics-informed framework that reformulates LLM optimization as a non-equilibrium thermodynamic process on a statistical manifold. By defining a Helmholtz free energy functional that balances reward maximization with entropy-driven dissipation, we derive a dissipative gradient flow that enforces monotonic stability without resorting to second-order curvature inversion. TVO introduces a dynamic viscosity term governed by a binary approximation of Total Variation divergence, enabling efficient, scalable control of gradient fluctuations with constant-time complexity relative to vocabulary size. We provide theoretical guarantees of stability using Lyapunov analysis and validate the framework empirically on challenging mathematical reasoning benchmarks, including MATH and AIME24. Experimental results demonstrate substantial reductions in gradient variance, elimination of training collapse, and significant improvements in sample efficiency compared to state-of-the-art proximal optimization baselines. This work positions thermodynamic principles as a foundational lens for understanding and stabilizing large-scale model optimization, offering a unifying framework that bridges reinforcement learning, information geometry, and non-equilibrium physics.
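To make the abstract's ingredients concrete, the following is a minimal toy sketch of the ideas it names: a free-energy objective (reward minus temperature-weighted entropy), a gradient step that descends it, and a viscosity factor that damps the update when a cheap two-event (binary) approximation of Total Variation divergence between the old and proposed policies is large. The paper's actual equations are not given here, so every function name, coefficient, and the specific form of the binary TV approximation below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def tv_binary(p, q):
    # Two-event ("binary") approximation of Total Variation:
    # compare only the probability mass of the current argmax token,
    # so the cost is O(1) in the vocabulary size |V| once that mass
    # is tracked. (Illustrative assumption about the paper's scheme.)
    i = int(np.argmax(p))
    return abs(p[i] - q[i])

def tvo_step(theta, reward_grad, temperature=0.1, eta=0.5, kappa=5.0):
    """One toy dissipative update on policy logits `theta`.

    Free energy F = -E[reward] - T * H(pi); the step descends F,
    and a viscosity factor shrinks the step when the proposed move
    would shift the policy too far (large binary-TV divergence).
    All names and coefficients here are hypothetical.
    """
    p = softmax(theta)
    # d(-T*H)/d(theta_i) = T * p_i * (log p_i + H)  (softmax identity)
    ent_grad = temperature * p * (np.log(p + 1e-12) + entropy(p))
    g = -reward_grad + ent_grad              # gradient of the free energy
    proposal = theta - eta * g
    visc = 1.0 / (1.0 + kappa * tv_binary(p, softmax(proposal)))
    return theta - eta * visc * g            # damped (dissipative) step
```

With a zero reward gradient the step descends the pure entropic part of the free energy, so the policy distribution flattens; the viscosity factor only rescales the step length, never its direction, which is what lets the sketch avoid second-order curvature inversion.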