Evaluating Layer-sharing in Transformers for Language and Reasoning Tasks
Abstract
Transformers achieve state-of-the-art performance across language and reasoning tasks, but their inherently feed-forward nature limits their computational potential. Introducing recurrence through layer sharing is one way to scale their computational capacity dynamically while reducing the number of parameters. In this work, we evaluate two approaches to layer sharing in transformers. First, we investigate whether layer sharing can be achieved by fine-tuning a pre-trained transformer. Specifically, we fine-tune a Large Language Model (LLM), Qwen 2.5 1.5B, on a next-token prediction task, with a similarity loss applied to the weights across layers to bring them closer together. We further experiment with expanding the width of the pre-trained model to compensate for the capacity lost to the weight-sharing constraint. While this approach achieves reasonable performance, it does not fully solve the problem of converting a pre-trained feed-forward model into a recurrent one while preserving its performance. Second, we study layer-shared models trained from scratch, focusing on the Hierarchical Reasoning Model (HRM), a novel and popular reasoning architecture that uses transformer modules recurrently. We find that with even more recurrent designs, we can match its performance while using as few as 25% of the parameters. Together, these results represent promising explorations into converting feed-forward transformers into recurrent forms (via layer sharing), with the goals of achieving strong performance, lowering the number of parameters, and making progress towards more brain-like model architectures.
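To make the first approach concrete, the sketch below illustrates one plausible form of the weight-similarity regularizer described above: a penalty on the difference between corresponding parameters of consecutive transformer layers, added to the standard next-token prediction loss. The exact loss form (mean squared difference between consecutive layers), the coefficient `lambda_sim`, and the `model.transformer_layers` handle are assumptions for illustration, not the paper's exact recipe.

```python
import torch
import torch.nn.functional as F

def weight_similarity_loss(layers):
    """Mean squared difference between corresponding parameters of
    consecutive transformer layers (assumed loss form)."""
    loss = 0.0
    count = 0
    for prev, curr in zip(layers[:-1], layers[1:]):
        for p_prev, p_curr in zip(prev.parameters(), curr.parameters()):
            # Corresponding parameters in identical blocks share shapes,
            # so an elementwise MSE pulls the layers toward each other.
            loss = loss + F.mse_loss(p_curr, p_prev)
            count += 1
    return loss / max(count, 1)

def training_loss(model, input_ids, labels, lambda_sim=0.1):
    """Next-token prediction loss plus the similarity regularizer.
    `model.transformer_layers` is a hypothetical handle to the stack of
    decoder blocks (e.g., the layer list of a Qwen 2.5 1.5B model)."""
    outputs = model(input_ids=input_ids, labels=labels)
    sim = weight_similarity_loss(list(model.transformer_layers))
    return outputs.loss + lambda_sim * sim
```

As the similarity term drives the layer weights together, the stack approaches a single shared block applied repeatedly, i.e., a recurrent use of the same transformer module.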