State Space Models as CPU-Native Neural Network Architectures
Abstract
We present experimental evidence that State Space Models (SSMs) are structurally advantageous for neural network inference on CPUs, including ARM processors. By porting BitMamba-2, a 1.58-bit quantized Mamba implementation, from x86 AVX2 to ARM NEON, we achieve 82.5 tokens/sec (255M parameters) and 29.6 tokens/sec (1B parameters) on an Apple M1 processor. This is the first published ARM benchmark for this model family. We experimentally validate the O(1) memory property of SSMs with respect to sequence length: generation speed remains constant across sequence lengths from 50 to more than 200 tokens, whereas the memory of Transformer-based models grows linearly with context through the KV cache. At comparable model weight sizes (~600 MB), the SSM achieves throughput competitive with quantized Transformers (~30–40 tokens/sec) while offering a constant memory footprint and 1.58-bit compression (vs. 4-bit for Transformers). These results support the thesis that mathematical reformulations, here the combination of state space recurrence with ternary quantization, can make non-GPU inference structurally competitive rather than merely tolerable.
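As a rough illustration of the constant-memory claim, the following Python sketch contrasts a fixed-size recurrent state with a KV cache that grows with context, and computes the information content of a ternary weight. It is a simplified sketch under stated assumptions, not the BitMamba-2 implementation: the recurrence is a plain diagonal linear one (Mamba's is input-dependent), and all function names and layer dimensions are hypothetical, chosen only for illustration.

```python
import math


def ssm_step(h, x, a, b, c):
    """One step of a diagonal linear recurrence: h' = a*h + b*x, y = <c, h'>.

    Simplified stand-in for an SSM layer; a, b, c are fixed here, whereas a
    selective SSM such as Mamba derives them from the input.
    """
    h_new = [ai * hi + bi * x for ai, hi, bi in zip(a, h, b)]
    y = sum(ci * hi for ci, hi in zip(c, h_new))
    return h_new, y


def generate(xs, a, b, c):
    """Run the recurrence over a sequence; only the fixed-size state h is carried."""
    h = [0.0] * len(a)
    ys = []
    for x in xs:
        h, y = ssm_step(h, x, a, b, c)
        ys.append(y)
    return h, ys


def ssm_state_bytes(n_layers, d_inner, d_state, bytes_per_value=4):
    """Recurrent state size: one (d_inner x d_state) block per layer.

    Independent of how many tokens have been processed.
    """
    return n_layers * d_inner * d_state * bytes_per_value


def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_value=2):
    """KV cache size: keys and values for every past token in every layer."""
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_value


def ternary_bits_per_weight():
    """Information content of one weight drawn from {-1, 0, +1}."""
    return math.log2(3)


if __name__ == "__main__":
    # Hypothetical dimensions, for illustration only.
    for seq_len in (50, 100, 200):
        print(
            seq_len,
            ssm_state_bytes(n_layers=24, d_inner=1024, d_state=16),
            kv_cache_bytes(n_layers=24, n_kv_heads=8, head_dim=64, seq_len=seq_len),
        )
    print(round(ternary_bits_per_weight(), 3))
```

Under these assumed dimensions the recurrent state occupies the same number of bytes after 50 or 200 tokens, while the KV-cache estimate quadruples over the same range. The value log2(3) ≈ 1.585 is the origin of the "1.58-bit" figure for ternary weights.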