Transformer tricks: Removing weights for skipless transformers

Nils Graef

Abstract

He and Hofmann [1] detailed a skipless transformer without the V and P (post-attention projection) linear layers, which reduces the total number of weights. However, this scheme applies only to MHA (multi-head attention), not to MQA (multi-query attention) or GQA (grouped-query attention). The latter two schemes are used by many popular LLMs such as Llama 2, Mistral, Mixtral, PaLM, and Gemma. Therefore, this micro-paper proposes mathematically equivalent versions that are suitable for MQA and GQA. For example, removing Q and P from a skipless version of Mistral-7B would remove 15% of its weights (and thus reduce its compute and memory complexity).
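The 15% figure can be sanity-checked with a rough weight count. The sketch below is not taken from the paper. It assumes the hyperparameters of the publicly released Mistral-7B configuration (model dimension 4096, 32 layers, 32 query heads, 8 key/value heads, FFN hidden dimension 14336, vocabulary size 32000), none of which appear in the abstract, and it counts the normalization weights of the standard model even though a skipless variant may differ.

```python
# Rough weight count for Mistral-7B and the share held by the Q and P matrices.
# All hyperparameters below are ASSUMPTIONS taken from the public model config,
# not from the abstract.
DIM = 4096          # model (embedding) dimension
N_LAYERS = 32       # number of transformer blocks
N_HEADS = 32        # query heads
N_KV_HEADS = 8      # key/value heads (GQA)
HIDDEN_DIM = 14336  # FFN hidden dimension
VOCAB = 32000       # vocabulary size

head_dim = DIM // N_HEADS       # 128
kv_dim = N_KV_HEADS * head_dim  # 1024, so K and V are not square under GQA

q = DIM * DIM                   # query projection (square)
p = DIM * DIM                   # post-attention projection (square)
k = DIM * kv_dim                # key projection
v = DIM * kv_dim                # value projection
ffn = 3 * DIM * HIDDEN_DIM      # gate, up, and down projections
norms = 2 * DIM                 # two normalization weight vectors per block

per_layer = q + k + v + p + ffn + norms
# Input embedding + output head + final normalization.
total = N_LAYERS * per_layer + 2 * VOCAB * DIM + DIM

removed = N_LAYERS * (q + p)
fraction_removed = removed / total

print(f"total weights:   {total:,}")
print(f"Q and P weights: {removed:,}")
print(f"fraction:        {fraction_removed:.1%}")
```

Under these assumptions, Q and P account for about 14.8% of the roughly 7.24 billion weights, which is consistent with the 15% quoted in the abstract.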

Version published to 10.31224/3629
Mar 24, 2024
