On-Device Multi-Type Disfluency Detection with Sub-Millisecond Inference on Apple Silicon
Abstract
Published multi-type disfluency detection systems achieve their best results with 300M+ parameter server-class backbones, leaving speech-therapy applications without a concrete reference for the detection performance and inference latency achievable on a smartphone. We present DisfluoSDK, a multi-type disfluency classifier running entirely on-device on Apple Silicon. On SEP-28K (20,131 clips, episode-grouped 5-fold cross-validation) a 617K-parameter CNN achieves macro-F1 0.382 (1.2 MB CoreML) and an adapted ResNet-18 achieves 0.404 (11.2M parameters, 21 MB), occupying an otherwise unpopulated region of the accuracy–efficiency Pareto frontier where on-device deployment is feasible. A four-way CoreML compute-unit sweep across four hardware generations (M1 Max, A19 Pro, A18, A15; 16,000+ timed trials) shows that the Neural Engine delivers sub-millisecond mean inference on all tested devices (CNN 0.225–0.635 ms), providing ample real-time headroom for speech processing. The sweep also surfaces a divergence in GPU routing between the desktop and mobile CoreML schedulers, with a direct consequence for deployment practice. PyTorch-to-CoreML export fidelity is verified numerically on 500 test-fold spectrograms (cell-level agreement 99.96%/100.00%, ΔF1 ≤ 0.003). As an auxiliary empirical result, voice-stress features show no practically meaningful linear association with any disfluency type across 14,645 clips (|r| < 0.05, all negligible by Cohen's conventions), supporting the architectural separation of the stress and disfluency modules.
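The compute-unit sweep reported above can be expressed with CoreML's public configuration API. The sketch below is a minimal illustration, not the paper's benchmarking harness: the compiled model name (`Disfluo.mlmodelc`), the input feature name (`melSpectrogram`), the input shape, and the trial counts are all assumed placeholders.

```swift
import CoreML
import Foundation

// Sweep the four MLComputeUnits settings and report mean prediction latency.
// Model path, input name, and input shape are illustrative placeholders.
func sweepComputeUnits() throws {
    let modelURL = URL(fileURLWithPath: "Disfluo.mlmodelc")  // hypothetical compiled model
    let settings: [(String, MLComputeUnits)] = [
        ("all", .all),
        ("cpuOnly", .cpuOnly),
        ("cpuAndGPU", .cpuAndGPU),
        ("cpuAndNeuralEngine", .cpuAndNeuralEngine),  // requires iOS 16 / macOS 13 or later
    ]

    for (label, units) in settings {
        let config = MLModelConfiguration()
        config.computeUnits = units
        let model = try MLModel(contentsOf: modelURL, configuration: config)

        // Placeholder 1x64x128 float32 spectrogram; the real shape comes
        // from the model description.
        let spectrogram = try MLMultiArray(shape: [1, 64, 128], dataType: .float32)
        let input = try MLDictionaryFeatureProvider(
            dictionary: ["melSpectrogram": MLFeatureValue(multiArray: spectrogram)])

        // Warm-up runs so one-time compilation and routing don't pollute the timings.
        for _ in 0..<10 { _ = try model.prediction(from: input) }

        let trials = 1_000
        var totalSeconds = 0.0
        for _ in 0..<trials {
            let start = CFAbsoluteTimeGetCurrent()
            _ = try model.prediction(from: input)
            totalSeconds += CFAbsoluteTimeGetCurrent() - start
        }
        print(String(format: "%@: mean %.3f ms", label, totalSeconds / Double(trials) * 1_000))
    }
}

do {
    try sweepComputeUnits()
} catch {
    print("Sweep failed: \(error)")
}
```

The warm-up loop matters for measurements like those in the abstract: the first few predictions absorb one-time model compilation and compute-unit routing, so timing only the steady state is what yields the sub-millisecond means the paper reports.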