Cross-Platform Inference of 1.58-bit State Space Models: ARM NEON vs x86 AVX-512 vs GPU CUDA

Abstract

Large language model inference remains predominantly GPU-dependent, limiting deployment on edge devices and in cost-sensitive environments. We investigate whether State Space Models (SSMs) with extreme quantization can achieve practical inference speeds on CPU architectures without GPU acceleration. We benchmark BitMamba-2, a 1.58-bit ternary-quantized Mamba model, across five hardware configurations spanning three instruction set architectures: ARM NEON (Apple M1), x86 AVX-512 (Intel Xeon Silver 4210R), x86 AVX2 (Intel i9-10980HK), and GPU CUDA (NVIDIA RTX 2070 Super). Our C++ implementation features hand-written SIMD kernels for both ARM NEON and x86 AVX-512, enabling direct architectural comparison on the same model weights. On the 255M-parameter model, the Xeon AVX-512 configuration achieves 112.9 tokens/s and ARM NEON reaches 82.5 tokens/s, both exceeding the throughput of several cloud-hosted API endpoints. The 1B-parameter model runs at 46.8 tokens/s (Xeon) and 29.6 tokens/s (M1), competitive with Transformer models of equivalent weight size in 4-bit quantization. We experimentally confirm the O(1) memory property of SSM recurrence: throughput remains constant across sequence lengths from 50 to 200 tokens, in contrast to the linear KV-cache growth of Transformer architectures. We further quantify the WSL2 virtualization overhead at 10–25× relative to native execution on identical hardware. These results demonstrate that the combination of SSM recurrence and ternary quantization constitutes a viable mathematical reformulation for GPU-free inference at interactive speeds.
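The two properties the abstract relies on can be illustrated with a small sketch. It is not the authors' C++ SIMD code and not the Mamba-2 layer: `ternary_matvec`, `ssm_step`, and `run_sequence` are hypothetical names, and the recurrence is a toy diagonal linear one with scalar input, assumed here only for illustration. It shows that weights restricted to {-1, 0, +1} carry log2(3) ≈ 1.58 bits each and reduce a matrix-vector product to additions and subtractions, and that a recurrent state keeps the same size regardless of how many tokens have been processed.

```python
import math

# Information content of one ternary weight: log2(3) bits, about 1.58.
BITS_PER_TERNARY_WEIGHT = math.log2(3)


def ternary_matvec(weights, x, scale=1.0):
    """Multiply a ternary matrix (entries in {-1, 0, +1}) by a vector.

    Only additions and subtractions touch the inputs; `scale` is an
    assumed per-matrix dequantization factor applied once per output.
    """
    out = []
    for row in weights:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing and is skipped
        out.append(acc * scale)
    return out


def ssm_step(state, x_t, decay, b, c):
    """One step of a toy diagonal recurrence (illustrative, not Mamba-2).

    h_t = decay * h_{t-1} + b * x_t   (elementwise)
    y_t = sum(c * h_t)
    """
    new_state = [a * h + bi * x_t for a, h, bi in zip(decay, state, b)]
    y_t = sum(ci * hi for ci, hi in zip(c, new_state))
    return new_state, y_t


def run_sequence(xs, decay, b, c):
    """Process a whole sequence; return the outputs and the final state size."""
    state = [0.0] * len(decay)
    ys = []
    for x_t in xs:
        state, y_t = ssm_step(state, x_t, decay, b, c)
        ys.append(y_t)
    return ys, len(state)
```

Calling `run_sequence` with 50 inputs and with 200 inputs returns the same state size, since each step overwrites a fixed-length state, whereas a Transformer KV cache stores an entry per processed token.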
