Per-Second, Explainable Obstructive Sleep Apnea Detection from Multimodal Time-Series using Vision Transformer

Abstract

Manual, second-by-second scoring of polysomnography (PSG) is the gold standard for diagnosing obstructive sleep apnea (OSA), yet it is time- and labor-intensive and prone to inter-scorer variability. Existing automated approaches analyze at most three channels and skip second-level annotation, instead reporting only the coarse Apnea-Hypopnea Index (AHI), sacrificing clinical detail and transparency. We present VOSA, a Vision Transformer (ViT)-based model that reproduces the technologist's visual workflow: it ingests standardized PSG images containing all 21 biosignals, labels every second as normal, hypopnea, or apnea, computes the AHI, and assigns a four-level OSA severity while supplying attention heatmaps and calibrated confidence scores. Trained and evaluated on KISS, a PSG image dataset from 7,745 patients across four centers, VOSA achieved a per-second Macro F1 score of 82.6% and a severity Macro F1 score of 73.5%, placing 99.2% of patients in the correct or adjacent severity class. Testing on the public SHHS-2 dataset confirmed robust performance. Attention visualizations demonstrated VOSA's alignment with AASM scoring guidelines. Coupled with image-based sleep staging, VOSA marks the first attempt at fully automated generation of PSG reports and endotypic metrics, delivering an interpretable, scalable solution for precision sleep-medicine workflows.
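The abstract's pipeline (per-second labels → AHI → four-level severity) follows standard clinical post-processing, which can be sketched as follows. This is not VOSA's published implementation; it is a minimal illustration assuming AASM conventions: a respiratory event must last at least 10 seconds, and severity uses the usual 5/15/30 events-per-hour cutoffs.

```python
# Hedged sketch of the standard clinical post-processing, not VOSA's own code.
# Input: one label per second of sleep ('normal', 'hypopnea', or 'apnea').

def label_runs(labels):
    """Collapse a per-second label sequence into (label, start, length) runs."""
    runs, start = [], 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            runs.append((labels[start], start, i - start))
            start = i
    return runs

def ahi_and_severity(per_second_labels, min_event_sec=10):
    """Compute AHI and four-level OSA severity from per-second labels.

    Per AASM scoring rules, apnea/hypopnea events must last >= 10 s;
    severity bins (<5 normal, 5-15 mild, 15-30 moderate, >=30 severe)
    follow standard clinical practice.
    """
    events = sum(
        1
        for lab, _, length in label_runs(per_second_labels)
        if lab in ("apnea", "hypopnea") and length >= min_event_sec
    )
    hours = len(per_second_labels) / 3600.0
    ahi = events / hours if hours > 0 else 0.0
    if ahi < 5:
        severity = "normal"
    elif ahi < 15:
        severity = "mild"
    elif ahi < 30:
        severity = "moderate"
    else:
        severity = "severe"
    return ahi, severity
```

For example, one hour of sleep containing a single 10-second apnea yields an AHI of 1.0 and a "normal" severity, while sub-10-second label runs are ignored as non-events.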
