Beyond the Hype? A Standardised Real-World Evaluation of Consumer Sleep Trackers (CST) in Extracting Sleep
Abstract
The rapid expansion of consumer sleep trackers (CST) in clinical and basic research has intensified the need for standardised performance evaluation frameworks capable of examining their limitations and capabilities against polysomnography (PSG). In this study, we systematically compared eight CST against home-based ambulatory PSG: sleep² with Polar Verity Sense (VS), sleep² with Polar H10, Oura Ring 3, Apple Watch Series 9, Fitbit Charge 6, Garmin Vivoactive 6 & Venu 3, WHOOP 4, and Circul+. Eighteen participants completed five consecutive nights (Monday to Friday) of home PSG while wearing two paired devices of each CST. Controlled sleep manipulations, such as restricted and extended sleep, were included in the protocol to challenge the devices’ algorithms under varied levels of difficulty. Performance evaluation followed a standardised framework incorporating both epoch-by-epoch and discrepancy analyses to assess multi-class sleep-staging performance and agreement across key sleep parameters. Epoch-by-epoch accuracy and Cohen’s κ were as follows: sleep²-Polar VS = 83.7% (κ = 0.76), sleep²-Polar H10 = 84.04% (κ = 0.76), Oura Ring 3 = 72.5% (κ = 0.59), Apple Watch Series 9 = 72.3% (κ = 0.56), Fitbit Charge 6 = 66.16% (κ = 0.47), Garmin Vivoactive 6 & Venu 3 = 63.4% (κ = 0.41), WHOOP 4 = 65.15% (κ = 0.48), and Circul+ = 55.59% (κ = 0.33). Discrepancy analysis revealed systematic and proportional biases across CST devices: most wrist-worn devices overestimated sleep time and underestimated wake after sleep onset, with error magnitude increasing markedly on atypical nights characterised by fragmented, restricted, or extended sleep. The cardiac-derived sleep metrics from sleep² with arm- and chest-band sensors showed only minor deviation from PSG, whereas Oura devices exhibited moderate but variable levels of bias.
Overall, these findings reveal substantial variability across CST and highlight the importance of standardised benchmarking to support the scientific and clinical use of wearable-derived sleep macrostructure.
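
For readers unfamiliar with the two kinds of analysis named in the abstract, the following is a minimal Python sketch of how epoch-by-epoch agreement (accuracy and Cohen's κ) and a simple discrepancy summary (mean bias with 95% limits of agreement) could be computed. It is not the authors' pipeline. The function names, the stage labels, and the choice of Bland-Altman-style limits are assumptions made for illustration, and the example values are invented.

```python
from collections import Counter
from statistics import mean, stdev


def epoch_agreement(reference, device):
    """Return (accuracy, Cohen's kappa) for two equal-length stage sequences.

    `reference` is the PSG scoring and `device` the tracker output, one label
    per epoch. The labels themselves (e.g. "W", "N2") are hypothetical.
    """
    if len(reference) != len(device) or not reference:
        raise ValueError("sequences must be non-empty and of equal length")
    n = len(reference)
    # Observed agreement: fraction of epochs with identical labels.
    observed = sum(r == d for r, d in zip(reference, device)) / n
    # Chance agreement: product of marginal label frequencies, summed over stages.
    ref_counts = Counter(reference)
    dev_counts = Counter(device)
    expected = sum(ref_counts[s] * dev_counts[s] for s in ref_counts) / (n * n)
    if expected == 1.0:
        # Both sequences contain a single identical label; kappa is undefined.
        return observed, float("nan")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


def bias_and_limits(reference_values, device_values):
    """Return (mean bias, lower limit, upper limit) of device minus reference.

    Assumes a Bland-Altman-style summary with limits at bias +/- 1.96 SD;
    needs at least two paired nights.
    """
    diffs = [d - r for r, d in zip(reference_values, device_values)]
    bias = mean(diffs)
    sd = stdev(diffs)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd


if __name__ == "__main__":
    # Invented example data, not taken from the study.
    psg = ["W", "N1", "N2", "N2", "N3", "R", "R", "W"]
    tracker = ["W", "N2", "N2", "N2", "N3", "R", "W", "W"]
    acc, kappa = epoch_agreement(psg, tracker)
    print(f"accuracy={acc:.3f}, kappa={kappa:.3f}")

    # Total sleep time in minutes for three hypothetical nights.
    bias, lower, upper = bias_and_limits([400, 300, 480], [420, 330, 470])
    print(f"bias={bias:.1f} min, limits=({lower:.1f}, {upper:.1f})")
```

A positive bias in this sketch means the device reports more of the parameter than PSG, which is the direction the abstract describes for sleep time in most wrist-worn devices. Detecting proportional bias would additionally require regressing the differences on the size of the measurement, which is omitted here.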