On-Device Multi-Type Disfluency Detection with Sub-Millisecond Inference on Apple Silicon
Abstract
Published multi-type disfluency detection systems achieve their best results with 300M+ parameter server-class backbones, leaving speech-therapy applications without a concrete reference for the detection performance and inference latency achievable on a smartphone. We present DisfluoSDK, a multi-type disfluency classifier running entirely on-device on Apple Silicon. On SEP-28K (20,131 clips, episode-grouped 5-fold cross-validation) a 617K-parameter CNN achieves macro-F1 0.382 (1.2 MB CoreML) and an adapted ResNet-18 achieves 0.404 (11.2M parameters, 21 MB), occupying an otherwise unpopulated region of the accuracy–efficiency Pareto frontier where on-device deployment is feasible. A four-way CoreML compute-unit sweep across four hardware generations (M1 Max, A19 Pro, A18, A15; 16,000+ timed trials) shows that the Neural Engine delivers sub-millisecond mean inference on all tested devices (CNN 0.225–0.635 ms), providing ample real-time headroom for speech processing. The sweep also surfaces a divergence in GPU routing between the desktop and mobile CoreML schedulers, with a direct consequence for deployment practice. PyTorch-to-CoreML export fidelity is verified numerically on 500 test-fold spectrograms (cell-level agreement 99.96%/100.00%, ΔF1 ≤ 0.003). As an auxiliary empirical result, voice-stress features show no practically meaningful linear association with any disfluency type across 14,645 clips (|r| < 0.05, all negligible by Cohen's conventions), supporting the architectural separation of the stress and disfluency modules.
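The compute-unit sweep reported above can be expressed with CoreML's public configuration API. The sketch below is a minimal illustration, not the paper's benchmarking harness: the compiled model name (`Disfluo.mlmodelc`), the input feature name (`melSpectrogram`), the input shape, and the trial counts are all assumed placeholders.

```swift
import CoreML
import Foundation

// Sweep the four MLComputeUnits settings and report mean prediction latency.
// Model path, input name, and input shape are illustrative placeholders.
func sweepComputeUnits() throws {
    let modelURL = URL(fileURLWithPath: "Disfluo.mlmodelc")  // hypothetical compiled model
    let settings: [(String, MLComputeUnits)] = [
        ("all", .all),
        ("cpuOnly", .cpuOnly),
        ("cpuAndGPU", .cpuAndGPU),
        ("cpuAndNeuralEngine", .cpuAndNeuralEngine),  // requires iOS 16 / macOS 13 or later
    ]

    for (label, units) in settings {
        let config = MLModelConfiguration()
        config.computeUnits = units
        let model = try MLModel(contentsOf: modelURL, configuration: config)

        // Placeholder 1x64x128 float32 spectrogram; the real shape comes
        // from the model description.
        let spectrogram = try MLMultiArray(shape: [1, 64, 128], dataType: .float32)
        let input = try MLDictionaryFeatureProvider(
            dictionary: ["melSpectrogram": MLFeatureValue(multiArray: spectrogram)])

        // Warm-up runs so one-time compilation and routing don't pollute the timings.
        for _ in 0..<10 { _ = try model.prediction(from: input) }

        let trials = 1_000
        var totalSeconds = 0.0
        for _ in 0..<trials {
            let start = CFAbsoluteTimeGetCurrent()
            _ = try model.prediction(from: input)
            totalSeconds += CFAbsoluteTimeGetCurrent() - start
        }
        print(String(format: "%@: mean %.3f ms", label, totalSeconds / Double(trials) * 1_000))
    }
}

do {
    try sweepComputeUnits()
} catch {
    print("Sweep failed: \(error)")
}
```

The warm-up loop matters for measurements like those in the abstract: the first few predictions absorb one-time model compilation and compute-unit routing, so timing only the steady state is what yields the sub-millisecond means the paper reports.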