Beyond Accuracy: A Framework for Evaluating Algorithmic Bias and Performance, Applied to Automated Sleep Scoring

Abstract

Recent advances in artificial intelligence (AI) have significantly improved sleep-scoring algorithms, bringing their performance close to the theoretical limit of approximately 80%, which corresponds to inter-scorer agreement levels. While this suggests the problem is technically solved, clinical adoption remains challenging due to ethical and regulatory requirements for rigorous validation, fairness, and human oversight. Existing validation methods, such as Bland-Altman analysis, often rely on simple correlation metrics and overlook potential non-linear influences of external factors (e.g., demographic or clinical variables) on systematic predictive errors (biases) in derived clinical markers. Additionally, performance metrics are typically reported as the mean of per-subject results, neglecting critical scenarios, such as different quantiles, that could better convey an algorithm's capabilities and limitations to clinicians as end users.

To address this gap, we propose a universal framework for quantifying both performance metrics and biases in predictive algorithmic tools. Our approach extends conventional validation methods by analyzing how external factors shape the entire distribution of predictive performance and errors, rather than just the expected mean. Applying it to the widely recognized U-Sleep and YASA sleep-scoring algorithms, we identify biases, such as age-related shifts, that indicate missing input information or imbalances in training data. Despite these biases, we show that both algorithms maintain non-inferior performance in the risk assessment of sleep apnea based on prediction-derived markers, highlighting the potential clinical utility of algorithmic insights.
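The abstract's central point, that reporting only the mean of per-subject results hides tail behaviour, can be illustrated with a minimal sketch. The data below are synthetic per-subject accuracies (not from the paper), used only to show how quantile-based reporting complements the mean:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-subject accuracies of a sleep-scoring algorithm
# (synthetic stand-in data; the paper uses U-Sleep and YASA predictions)
per_subject_acc = np.clip(rng.normal(loc=0.78, scale=0.06, size=200), 0.0, 1.0)

# Conventional reporting: a single mean over subjects
mean_acc = per_subject_acc.mean()

# Distribution-aware reporting: quantiles convey worst- and best-case behaviour
q05, q50, q95 = np.quantile(per_subject_acc, [0.05, 0.50, 0.95])

print(f"mean={mean_acc:.3f}  5th pct={q05:.3f}  median={q50:.3f}  95th pct={q95:.3f}")
```

Two algorithms with identical means can differ sharply in their 5th percentile, which is exactly the kind of scenario the proposed framework aims to surface for clinicians.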