Cross-Platform Inference of 1.58-bit State Space Models: ARM NEON vs x86 AVX-512 vs GPU CUD

Gabriel Zo-Hasina Rasatavohary

Abstract

Large language model inference remains predominantly GPU-dependent, limiting deployment on edge devices and in cost-sensitive environments. We investigate whether State Space Models (SSMs) with extreme quantization can achieve practical inference speeds on CPU architectures without GPU acceleration. We benchmark BitMamba-2, a 1.58-bit ternary-quantized Mamba model, across five hardware configurations spanning three instruction set architectures: ARM NEON (Apple M1), x86 AVX-512 (Intel Xeon Silver 4210R), x86 AVX2 (Intel i9-10980HK), and GPU CUDA (NVIDIA RTX 2070 Super). Our C++ implementation features hand-written SIMD kernels for both ARM NEON and x86 AVX-512, enabling direct architectural comparison on the same model weights. On the 255M-parameter model, the Xeon AVX-512 configuration achieves 112.9 tokens/s and ARM NEON reaches 82.5 tokens/s, both exceeding the throughput of several cloud-hosted API endpoints. The 1B-parameter model runs at 46.8 tokens/s (Xeon) and 29.6 tokens/s (M1), competitive with Transformer models of equivalent weight size in 4-bit quantization. We experimentally confirm the O(1) memory property of SSM recurrence: throughput remains constant across sequence lengths from 50 to 200 tokens, in contrast to the linear KV-cache growth of Transformer architectures. We further quantify the WSL2 virtualization overhead at 10–25× relative to native execution on identical hardware. These results demonstrate that the combination of SSM recurrence and ternary quantization constitutes a viable mathematical reformulation for GPU-free inference at interactive speeds.
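The two ideas the abstract combines can be illustrated with a small toy sketch. It is not the BitMamba-2 implementation or its SIMD kernels, and it uses Python for brevity where the authors use C++. It assumes a simplified diagonal linear recurrence rather than Mamba's actual selective scan. All function names and parameters (`ternary_matvec`, `ssm_step`, `run_sequence`, `decay`, `b`, `c`) are hypothetical. The sketch shows that a matrix of weights in {-1, 0, +1} can be applied with additions and subtractions only, and that a recurrent state keeps a fixed size however many tokens have been processed.

```python
import math

# Information content of one ternary weight: log2(3) is about 1.585 bits,
# which is where the "1.58-bit" label comes from.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternary_matvec(weights, x):
    """Multiply a ternary matrix (entries in {-1, 0, +1}) by a vector.

    Every weight is -1, 0, or +1, so each output element is built from
    additions and subtractions only; no multiplications are needed.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing
        out.append(acc)
    return out


def ssm_step(state, x_t, decay, b, c):
    """One step of a toy diagonal linear recurrence (an assumption, not Mamba).

    h_t = decay * h_{t-1} + b * x_t   (elementwise)
    y_t = sum(c * h_t)
    The state has the same length regardless of how many tokens came before.
    """
    new_state = [a * h + bi * x_t for a, h, bi in zip(decay, state, b)]
    y_t = sum(ci * hi for ci, hi in zip(c, new_state))
    return new_state, y_t


def run_sequence(xs, decay, b, c):
    """Process a sequence token by token, recording the state size each step."""
    state = [0.0] * len(decay)
    outputs, state_sizes = [], []
    for x_t in xs:
        state, y_t = ssm_step(state, x_t, decay, b, c)
        outputs.append(y_t)
        state_sizes.append(len(state))
    return outputs, state_sizes
```

Calling `run_sequence` on 50 tokens and on 200 tokens yields the same state size at every step, which is the toy analogue of the constant-memory behaviour the abstract reports. A Transformer's KV cache would instead gain one entry per token.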

Version published to 10.31224/6686
Mar 25, 2026

Related articles

State Space Models as CPU-Native Neural Network Architectures
This article has 1 author:
1. Gabriel Zo-Hasina Rasatavohary
This article has no evaluations
Latest version Mar 24, 2026

The Energy Efficiency Paradox: Lightweight CNNs Consume More Power than ResNets on Consumer GPUs
This article has 1 author:
1. Someyo Kamal Utsho
This article has no evaluations
Latest version Apr 14, 2026

I/O for LLM Inference: A Survey of Storage and Memory Bottlenecks
This article has 1 author:
1. Rajarshi Chowdhury
This article has no evaluations
Latest version Mar 19, 2026
