Beyond the Hype? A Standardised Real-World Evaluation of Consumer Sleep Trackers (CST) in Extracting Sleep

Abstract

The rapid expansion of consumer sleep trackers (CST) in clinical and basic research has intensified the need for standardised performance evaluation frameworks capable of examining their limitations and capabilities against polysomnography (PSG). In this study, we systematically compared eight CST against home-based ambulatory PSG: sleep² with Polar Verity Sense (VS) and Polar H10, Oura Ring 3, Apple Watch Series 9, Fitbit Charge 6, Garmin Vivoactive 5 & Venu 3, WHOOP 4, and Circul+. Eighteen participants completed five consecutive nights (Monday to Friday) of home PSG while wearing two paired devices of each CST. Controlled sleep disruptions, as well as restricted and extended sleep, were included in the protocol to challenge the devices' algorithms under varied conditions. Performance evaluation followed a standardised framework incorporating both epoch-by-epoch and discrepancy analyses to assess multi-class sleep-staging performance and agreement across key sleep parameters. Epoch-by-epoch accuracy and Cohen's κ were as follows: sleep²-Polar VS 83.7% (κ = 0.76), sleep²-Polar H10 84.04% (κ = 0.76), Oura Ring 3 72.5% (κ = 0.59), Apple Watch Series 9 72.3% (κ = 0.56), WHOOP 4 68.68% (κ = 0.54), Fitbit Charge 6 66.16% (κ = 0.47), Garmin Vivoactive 5 & Venu 3 63.4% (κ = 0.41), and Circul+ 55.59% (κ = 0.33). Discrepancy analysis revealed systematic and proportional biases across CST: most wrist-worn devices overestimated sleep time and underestimated wake after sleep onset, with error magnitude increasing drastically on atypical nights characterised by fragmented, restricted, or extended sleep. Sleep metrics derived by sleep² from arm- and chest-band cardiac signals showed minor deviation from PSG, whereas Oura exhibited moderate but variable levels of bias. Overall, these findings reveal substantial variability across CST and highlight the importance of standardised benchmarking to support the scientific and clinical use of wearable-derived sleep macrostructure.
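To make the two headline metrics concrete, the following is a minimal sketch of how epoch-by-epoch accuracy and Cohen's κ can be computed from two aligned sequences of stage labels, together with a simple mean bias of the kind a discrepancy analysis reports. It is an illustration under stated assumptions and not the authors' pipeline: the function names `accuracy_and_kappa` and `mean_bias`, the stage labels, and all example values are hypothetical.

```python
from collections import Counter
import math


def accuracy_and_kappa(reference, device):
    """Return (accuracy, Cohen's kappa) for two aligned label sequences.

    `reference` holds the PSG-scored stage of each epoch and `device`
    the tracker-scored stage of the same epoch (hypothetical inputs).
    """
    if len(reference) != len(device) or not reference:
        raise ValueError("sequences must be non-empty and equally long")
    n = len(reference)

    # Observed agreement: share of epochs with identical labels.
    observed = sum(r == d for r, d in zip(reference, device)) / n

    # Chance agreement: product of the marginal label frequencies,
    # summed over all stages.
    ref_counts = Counter(reference)
    dev_counts = Counter(device)
    expected = sum(ref_counts[s] * dev_counts[s] for s in ref_counts) / (n * n)

    # Kappa is undefined when chance agreement is already perfect.
    if expected == 1.0:
        return observed, math.nan
    return observed, (observed - expected) / (1.0 - expected)


def mean_bias(device_values, psg_values):
    """Mean of (device - PSG) differences, e.g. total sleep time in minutes.

    A positive result means the device overestimates on average.
    """
    if len(device_values) != len(psg_values) or not device_values:
        raise ValueError("sequences must be non-empty and equally long")
    diffs = [d - p for d, p in zip(device_values, psg_values)]
    return sum(diffs) / len(diffs)


# Hypothetical eight-epoch example (W = wake, N = NREM, R = REM).
psg = ["W", "W", "N", "N", "R", "R", "N", "N"]
tracker = ["W", "N", "N", "N", "R", "R", "N", "W"]
acc, kappa = accuracy_and_kappa(psg, tracker)
```

Because κ subtracts the agreement expected by chance from the observed agreement, it sits below raw accuracy whenever one stage dominates the night, which is consistent with the gap between the accuracy and κ values reported above.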
JEL Classification: D8, H51
MSC Classification: 35A01, 65L10
