Entropy-Driven Gradient Stability in Large Language Models: A Non-Equilibrium Thermodynamic Framework for Reinforcement Learning Optimization
Abstract
The optimization landscape of Large Language Models (LLMs) with extremely high parameter counts exhibits chaotic and unstable dynamics, particularly during reinforcement learning fine-tuning stages where sparse and heavy-tailed reward signals dominate. Existing approaches, such as Proximal Policy Optimization (PPO), rely on heuristic clipping mechanisms that impose rigid trust regions, often leading to gradient turbulence, mode collapse, and catastrophic updates. In this work, we introduce Thermodynamic Variational Optimization (TVO), a physics-informed framework that reformulates LLM optimization as a non-equilibrium thermodynamic process on a statistical manifold. By defining a Helmholtz free energy functional that balances reward maximization with entropy-driven dissipation, we derive a dissipative gradient flow that enforces monotonic stability without resorting to second-order curvature inversion. TVO introduces a dynamic viscosity term governed by a binary approximation of Total Variation divergence, enabling efficient, scalable control of gradient fluctuations with constant-time complexity relative to vocabulary size. We provide theoretical guarantees of stability using Lyapunov analysis and validate the framework empirically on challenging mathematical reasoning benchmarks, including MATH and AIME24. Experimental results demonstrate substantial reductions in gradient variance, elimination of training collapse, and significant improvements in sample efficiency compared to state-of-the-art proximal optimization baselines. This work positions thermodynamic principles as a foundational lens for understanding and stabilizing large-scale model optimization, offering a unifying framework that bridges reinforcement learning, information geometry, and non-equilibrium physics.
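To make the abstract's ingredients concrete, the following is a minimal toy sketch of the ideas it names: a free-energy objective (reward minus temperature-weighted entropy), a gradient step that descends it, and a viscosity factor that damps the update when a cheap two-event (binary) approximation of Total Variation divergence between the old and proposed policies is large. The paper's actual equations are not given here, so every function name, coefficient, and the specific form of the binary TV approximation below are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def entropy(p):
    return -np.sum(p * np.log(p + 1e-12))

def tv_binary(p, q):
    # Two-event ("binary") approximation of Total Variation:
    # compare only the probability mass of the current argmax token,
    # so the cost is O(1) in the vocabulary size |V| once that mass
    # is tracked. (Illustrative assumption about the paper's scheme.)
    i = int(np.argmax(p))
    return abs(p[i] - q[i])

def tvo_step(theta, reward_grad, temperature=0.1, eta=0.5, kappa=5.0):
    """One toy dissipative update on policy logits `theta`.

    Free energy F = -E[reward] - T * H(pi); the step descends F,
    and a viscosity factor shrinks the step when the proposed move
    would shift the policy too far (large binary-TV divergence).
    All names and coefficients here are hypothetical.
    """
    p = softmax(theta)
    # d(-T*H)/d(theta_i) = T * p_i * (log p_i + H)  (softmax identity)
    ent_grad = temperature * p * (np.log(p + 1e-12) + entropy(p))
    g = -reward_grad + ent_grad              # gradient of the free energy
    proposal = theta - eta * g
    visc = 1.0 / (1.0 + kappa * tv_binary(p, softmax(proposal)))
    return theta - eta * visc * g            # damped (dissipative) step
```

With a zero reward gradient the step descends the pure entropic part of the free energy, so the policy distribution flattens; the viscosity factor only rescales the step length, never its direction, which is what lets the sketch avoid second-order curvature inversion.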