Per-Second, Explainable Obstructive Sleep Apnea Detection from Multimodal Time-Series using Vision Transformer
Abstract
Manual, second-by-second scoring of polysomnography (PSG) is the gold standard for diagnosing obstructive sleep apnea (OSA), yet it is time- and labor-intensive and prone to inter-scorer variability. Existing automated approaches analyze only ≤3 channels and skip second-level annotation, reporting instead the coarse Apnea-Hypopnea Index (AHI) and sacrificing clinical detail and transparency. We present VOSA, a Vision-Transformer (ViT)-based model that reproduces the technologist’s visual workflow: it ingests standardized PSG images containing all 21 biosignals, labels every second as normal, hypopnea, or apnea, computes AHI, and assigns four-level OSA severity while supplying attention heatmaps and calibrated confidence scores. Trained and evaluated on KISS, a PSG image dataset from 7,745 patients across four centers, VOSA achieved a per-second Macro F1 score of 82.6% and a severity Macro F1 score of 73.5%, placing 99.2% of patients in the correct or adjacent severity class. Testing on the public SHHS-2 dataset confirmed robust performance. Attention visualizations demonstrated VOSA’s alignment with AASM guidelines. Coupled with image-based sleep staging, VOSA marks the first attempt at fully automated generation of PSG reports and endotypic metrics, delivering an interpretable, scalable solution for precision sleep-medicine workflows.
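The pipeline the abstract describes (per-second labels → AHI → four-level severity) can be sketched in a few lines. The AHI is the number of apnea/hypopnea events per hour of sleep, where an event is a contiguous run of non-normal seconds lasting at least 10 s (the AASM minimum duration), and the standard severity cut-offs are AHI &lt;5 normal, 5–15 mild, 15–30 moderate, ≥30 severe. The function names and the event-merging details below are illustrative assumptions, not VOSA's actual post-processing:

```python
from itertools import groupby

def ahi_from_labels(labels, sleep_seconds):
    """Compute AHI from per-second labels ('normal'/'hypopnea'/'apnea').

    An event is a contiguous run of apnea or hypopnea seconds lasting
    >= 10 s (AASM minimum event duration). AHI = events per hour of sleep.
    Hypothetical sketch; VOSA's exact event aggregation may differ.
    """
    events = sum(
        1
        for label, run in groupby(labels)
        if label in ("apnea", "hypopnea") and sum(1 for _ in run) >= 10
    )
    return events * 3600.0 / sleep_seconds

def severity(ahi):
    # Standard four-level cut-offs: <5 normal, 5-15 mild, 15-30 moderate, >=30 severe
    if ahi < 5:
        return "normal"
    if ahi < 15:
        return "mild"
    if ahi < 30:
        return "moderate"
    return "severe"
```

For example, a single 12-second apnea run during one hour of sleep yields an AHI of 1.0 and a "normal" classification; the "adjacent severity class" metric in the abstract refers to misplacements by one step along this four-level scale.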