AERO: Softmax-Only LLMs for Efficient Private Inference

Nandan Kumar Jha
Brandon Reagen

Abstract

The pervasiveness of proprietary language models has raised privacy concerns for users' sensitive data, emphasizing the need for private inference (PI), where inference is performed directly on encrypted inputs. However, current PI methods face prohibitively high communication and latency overheads, primarily due to nonlinear operations. In this paper, we present a comprehensive analysis of the role of nonlinearities in transformer-based decoder-only language models. We introduce AERO, a four-step architectural optimization framework that refines the existing LLM architecture for efficient PI by systematically removing nonlinearities such as LayerNorm and GELU and by reducing FLOP counts. For the first time, we propose a Softmax-only architecture with significantly fewer FLOPs, tailored for efficient PI. Furthermore, we devise a novel entropy regularization technique to improve the performance of Softmax-only models. AERO achieves up to a 4.23\(\times\) reduction in communication and a 1.94\(\times\) reduction in latency. We validate the effectiveness of AERO by benchmarking it against the state of the art.
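
As a rough illustration of the quantity an entropy regularizer acts on, the sketch below computes the Shannon entropy of softmax attention distributions in plain Python. It is not the formulation used in AERO, which the abstract does not specify. The function names, the squared-deviation penalty, and the `target` and `weight` parameters are all assumptions made for this example.

```python
import math
from typing import List


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over one row of attention scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def row_entropy(probs: List[float]) -> float:
    """Shannon entropy (in nats) of one attention distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def mean_attention_entropy(score_rows: List[List[float]]) -> float:
    """Average entropy over the rows of an attention score matrix."""
    return sum(row_entropy(softmax(r)) for r in score_rows) / len(score_rows)


def entropy_penalty(score_rows: List[List[float]],
                    target: float,
                    weight: float = 1.0) -> float:
    """Hypothetical penalty (an assumption, not the paper's loss):
    squared deviation of the mean attention entropy from a target value."""
    return weight * (mean_attention_entropy(score_rows) - target) ** 2


if __name__ == "__main__":
    uniform = [[0.0, 0.0, 0.0, 0.0]]   # every token attended equally
    peaked = [[100.0, 0.0, 0.0, 0.0]]  # nearly all mass on one token
    print(mean_attention_entropy(uniform))  # close to log(4)
    print(mean_attention_entropy(peaked))   # close to 0
```

Uniform attention gives the maximum entropy, log(n) for n tokens, while a sharply peaked row gives an entropy near zero. A regularizer of this general kind would add a term such as `entropy_penalty` to the training loss to steer attention heads away from either extreme.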

Version published to 10.32388/wwlt24.2
Dec 6, 2024
Version published to 10.32388/wwlt24
Oct 27, 2024

