Evaluating Layer-sharing in Transformers for Language and Reasoning Tasks
Abstract
Transformers achieve state-of-the-art performance across language and reasoning tasks, but their inherently feed-forward nature limits their computational potential. Introducing recurrence through layer sharing is one way to scale their computational capacity dynamically while reducing the number of parameters. In this work, we evaluate two approaches to layer sharing in transformers. First, we investigate whether layer sharing can be achieved by fine-tuning a pre-trained transformer. Specifically, we fine-tune a Large Language Model (LLM), Qwen 2.5 1.5B, on a next-token prediction task, with a similarity loss applied to the weights across layers to bring them closer together. We further experiment with expanding the width of the pre-trained model to compensate for the capacity lost to the weight-sharing constraint. While this approach achieves reasonable performance, it does not fully solve the problem of converting a pre-trained feed-forward model into a recurrent one while preserving its performance. Second, we study layer-shared models trained from scratch, focusing on the Hierarchical Reasoning Model (HRM), a novel and popular reasoning architecture that uses transformer modules recurrently. We find that with even more recurrent designs, we can match its performance while using as few as 25% of the parameters. Together, these results represent promising explorations into converting feed-forward transformers into recurrent forms (via layer sharing), with the goals of achieving strong performance, lowering the number of parameters, and making progress towards more brain-like model architectures.
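To make the first approach concrete, the sketch below illustrates one plausible form of the weight-similarity regularizer described above: a penalty on the difference between corresponding parameters of consecutive transformer layers, added to the standard next-token prediction loss. The exact loss form (mean squared difference between consecutive layers), the coefficient `lambda_sim`, and the `model.transformer_layers` handle are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def weight_similarity_loss(layers):
    """Mean squared difference between corresponding parameters of
    consecutive transformer layers (assumed loss form)."""
    loss = 0.0
    count = 0
    for prev, curr in zip(layers[:-1], layers[1:]):
        for p_prev, p_curr in zip(prev.parameters(), curr.parameters()):
            # Corresponding parameters in identical blocks share shapes,
            # so an elementwise MSE pulls the layers toward each other.
            loss = loss + F.mse_loss(p_curr, p_prev)
            count += 1
    return loss / max(count, 1)

def training_loss(model, input_ids, labels, lambda_sim=0.1):
    """Next-token prediction loss plus the similarity regularizer.
    `model.transformer_layers` is a hypothetical handle to the stack of
    decoder blocks (e.g., the layer list of a Qwen 2.5 1.5B model)."""
    outputs = model(input_ids=input_ids, labels=labels)
    sim = weight_similarity_loss(list(model.transformer_layers))
    return outputs.loss + lambda_sim * sim
```

As the similarity term drives the layer weights together, the stack approaches a single shared block applied repeatedly, i.e., a recurrent use of the same transformer module.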