An open-source, high-performance tool for automated sleep staging

Curation statements for this article:
  • Curated by eLife


    Evaluation Summary:

    This study describes a novel algorithm to perform fully automated sleep staging in humans. It is well validated and performs at the level of other state of the art algorithms although it does not yet include comparisons against the existing tools available in the field. Given the efforts made by the authors to ensure ease of use and accessibility it may help extend the use of automated methods in the study of sleep.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)


Abstract

The clinical and societal measurement of human sleep has increased exponentially in recent years. However, unlike other fields of medical analysis that have become highly automated, basic and clinical sleep research still relies on human visual scoring. Such human-based evaluations are time-consuming, tedious, and can be prone to subjective bias. Here, we describe a novel algorithm trained and validated on more than 30,000 hr of polysomnographic sleep recordings across heterogeneous populations around the world. This tool offers high sleep-staging accuracy that matches human scoring accuracy and interscorer agreement regardless of the population. The software is designed to be especially easy to use, computationally low-demanding, open source, and free. Our hope is that this software facilitates the broad adoption of an industry-standard automated sleep-staging package.
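
As a concrete illustration of the intended ease of use, the sketch below shows the kind of minimal Python call needed to stage a full-night recording with YASA. The file path and channel names are placeholders that depend on the recording; consult the YASA documentation for the current API.

    import mne
    import yasa

    # Placeholder file and channel names; adapt to the montage of your recording.
    raw = mne.io.read_raw_edf("night.edf", preload=True)

    sls = yasa.SleepStaging(raw, eeg_name="C4-A1", eog_name="LOC-A2", emg_name="EMG1-EMG2")
    hypno = sls.predict()          # one predicted stage (W, N1, N2, N3, R) per 30-s epoch
    proba = sls.predict_proba()    # per-epoch probabilities, usable as a confidence measure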

Article activity feed

  1. Author Response:

    Reviewer #2 (Public Review):

    1. The authors describe their algorithm as a tool that (i) was validated "across heterogeneous populations around the world"; (ii) has an "accuracy matching or exceeding human accuracy"; (iii) "is easy to use". I take issue with these three statements. First, the authors did not test the performance of their algorithm in clinical populations with sleep disorders, despite the fact that individuals with sleep disorders represent (logically) the vast majority of sleep recordings. Crucially, such a comparison was made in the best (to my knowledge) published automated sleep staging algorithm (Stephansen et al. Nature Communications 2018, doi: 10.1038/s41467-018-07229-3). The omission of this work is very surprising. Quantifying the impact of sleep disorders on a sleep scoring algorithm is critical for its deployment in sleep clinics.

    We apologize that we were not clear in describing our training and testing datasets. Both the training set and testing set 1 included a substantial number of individuals with sleep disorders: about 30% of the individuals had moderate to severe sleep apnea (AHI ≥ 15). The validation dataset (DOD, or testing set 2) also includes 55 nights from individuals with obstructive sleep apnea (average AHI = 18.5 ± 16.2). Furthermore, both the training set and testing set 1 included individuals with a medical diagnosis of insomnia, depression, diabetes and hypertension.

    The health status and demographics of the training and testing sets have now been clarified throughout the manuscript to avoid any such confusion:

    1. Methods: We have added an extensive description of each dataset in the training and testing sets, including data on health and sleep disorders.

    2. Results: We have added a new table to report and compare demographics/health data of the training and testing set, as suggested in a later comment by the reviewer.

    3. Results: Performance results of the testing set 2 are now reported separately for healthy individuals and individuals with sleep disorders.

    Second, the authors wrote that their algorithm is "matching or exceeding" human accuracy but seem to present uncorrected one-to-one comparisons to support their claim. The fact that an algorithm is better than some humans does not mean it exceeds human performance.

    Thanks for noting that. We have now removed all instances of “exceeding human accuracy”.

    Third, although I agree that the tool seems easy to use even for individuals with limited programming skills, it still requires some. I don't think someone who is used to software with graphical interfaces and who has never used (or heard of!) python would describe the tool as easy to use. This poses an important implementation challenge.

    2. An important limitation of this algorithm is that it captures only one part of the visual examination of sleep data. Indeed, especially in clinical settings, the data is not only examined to establish the hypnogram but also to identify markers of common sleep disorders (e.g. sleep apnea, leg movements, etc). Although this algorithm could significantly speed up sleep scoring, it does not allow the detection of these other important markers. Currently, and in line with the previous comment, the algorithm could not replace the visual inspection of the data for clinical diagnoses.

    We have now revised the manuscript such that we discuss this possibility in the “Limitations and future directions” subsection of the new Discussion:

    “The algorithm is not currently able to identify markers of common sleep disorders (such as sleep apnea, leg movements) and as such may not be suited for clinical purposes. It should be noted however that our software does include several other functions to quantify phasic events during sleep (slow-waves, spindles, REMs, artefacts) as well as sleep fragmentation of the hypnogram. Rather than replacing the crucial expertise of clinicians, YASA may thus provide a helpful starting point to accelerate clinical scoring of polysomnography recordings. Furthermore, future developments of the algorithm should prioritize automated scoring of clinical disorders, in particular apnea-hypopnea events. On the latter, YASA could implement some of the algorithms that have been developed over the last few years to detect apnea-hypopnea events from the ECG or respiratory channels (e.g. Varon et al. 2015; Koley and Dey 2013).”
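
    A minimal sketch of how these companion event-detection functions can be called alongside the automatic staging is shown below. The function names reflect the public YASA API; the simulated signal is purely a placeholder for a real central EEG channel in µV.

        import numpy as np
        import yasa

        # Placeholder data purely for illustration: in practice "data" is a single
        # central EEG channel in µV and "sf" its sampling frequency in Hz.
        sf = 100
        data = np.random.normal(0, 30, size=sf * 60 * 10)   # 10 minutes of fake EEG

        sp = yasa.spindles_detect(data, sf)   # sleep spindles (returns None if nothing is found)
        sw = yasa.sw_detect(data, sf)         # slow waves
        if sp is not None:
            print(sp.summary().head())        # per-event parameters (duration, amplitude, ...)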

    3. The data were curated with some recordings or portions of recordings being excluded (see p. 7). While I understand that this curation is important for the training set, I think it should not be applied to the test set. Indeed, it goes contrary to the logic of automating sleep staging. For example, cutting the beginning and end of the recording according to sleep start and end (p. 7) supposes that the start and end of sleep are already known (i.e. it has already been scored).

    This truncation step has now been removed from the pipeline and all the results have been updated accordingly. In addition, following the reviewer's suggestion, we have also removed all other exclusion criteria (e.g. PSG data quality, recording duration, etc.) to improve the generalization power of the algorithm.

    4. Two types of EEG derivations were used (C4-M1 or C4-Fpz). Was the performance impacted by this variable? Is it fair to assume that the choice of features (spectral features or summary statistics of time series data) could explain the absence of differences but that introducing new features (i.e. phase-sensitive features) could increase the influence of the choice of the derivation?

    Thanks for raising this. First, our choice of the EEG reference was determined by the datasets: the CFS, CCSHS, MrOS, CHAT and HomePAP datasets were all referenced to Fpz, while the MESA, SHHS and DOD datasets were referenced to the contralateral mastoid. The montage of each dataset has now been added to the Methods section.

    Second, as rightly pointed out by the reviewer, the features implemented in the algorithm were chosen to be robust to various recording montages. This is now explicitly discussed in the “Features extraction” subsection of the Methods:

    “The features included in the current algorithm were chosen to be robust to different recording montages. As such, we did not include features that are dependent on the phase of the signal, and/or that require the detection of specific events (e.g. slow-waves, rapid eye movements). However, the time-domain features are dependent upon the amplitude of the signal, and the algorithm may fail if the input data is not expressed in standard units (µV) or has been z-scored prior to applying the automatic sleep staging.”
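
    As an illustration of this caveat, a simple heuristic check can flag data that does not look like raw µV before staging. The thresholds below are arbitrary placeholders and are not part of the algorithm.

        import numpy as np

        def looks_like_microvolts(eeg, min_ptp=10.0, max_ptp=5000.0):
            """Illustrative heuristic only (thresholds are arbitrary placeholders):
            raw sleep EEG expressed in µV typically spans tens to hundreds of µV
            peak-to-peak, whereas z-scored data is on the order of a few units and
            data stored in volts is roughly a million times smaller."""
            ptp = np.ptp(np.asarray(eeg, dtype=float))
            return min_ptp <= ptp <= max_ptp

        print(looks_like_microvolts(np.random.normal(0, 30, 3000)))   # True for µV-like data
        print(looks_like_microvolts(np.random.normal(0, 1, 3000)))    # False for z-scored data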

    5. Given that markers of sleep stages are very different in EOG, EMG and EEG time series, could the authors explain the logic behind applying the same pre-processing and extracting the same features on these three very different types of data? Could this explain why the majority of the features in the top-20 features were EEG features?

    We now provide a more detailed explanation on the inclusion of EOG and EMG features in the “Features extraction” subsection of the Methods:

    “These features were selected based on prior work in feature-based classification algorithms for automatic sleep staging (Krakovská and Mezeiová 2011; Lajnef et al. 2015; Sun et al. 2017). For example, it was previously reported that the permutation entropy of the EOG/EMG as well as the EEG spectral powers in the traditional frequency bands are the most important features for accurate sleep staging (Lajnef et al. 2015), thus warranting their inclusion in the current algorithm. Several other features are derived from the authors’ previous work on entropy/fractal dimension metrics (https://github.com/raphaelvallat/antropy).”
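
    For readers unfamiliar with these metrics, the sketch below illustrates how per-epoch features of this kind can be computed with antropy and SciPy. The band edges and entropy settings are illustrative and not the exact manuscript parameters.

        import numpy as np
        import antropy as ant
        from scipy.signal import welch

        def epoch_features(x, sf):
            # Sketch of per-epoch features of the kind discussed above; band edges
            # and entropy settings are illustrative, not the manuscript parameters.
            freqs, psd = welch(x, sf, nperseg=int(4 * sf))
            bands = {"sdelta": (0.5, 1.25), "fdelta": (1.25, 4), "theta": (4, 8),
                     "alpha": (8, 12), "sigma": (12, 16), "beta": (16, 30)}
            broad = (freqs >= 0.5) & (freqs < 30)
            total = np.trapz(psd[broad], freqs[broad])
            feats = {}
            for name, (lo, hi) in bands.items():
                idx = (freqs >= lo) & (freqs < hi)
                feats[f"{name}_rel"] = np.trapz(psd[idx], freqs[idx]) / total
            feats["perm_entropy"] = ant.perm_entropy(x, normalize=True)
            feats["higuchi_fd"] = ant.higuchi_fd(x)
            return feats

        sf = 100
        epoch = np.random.normal(0, 30, 30 * sf)   # one fake 30-s epoch
        print(epoch_features(epoch, sf))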

    Furthermore, we have added a “Limitations and future directions” section in the Discussion in which we propose future improvements of the algorithm. One of these potential improvements is the development of EOG and EMG features that would provide a higher discrimination of the sleep stages:

    “This suggests that one way to improve performance on this population could be the inclusion of more EEG channels and/or bilateral EOGs. For instance, using the negative product of bilateral EOGs may increase sensitivity to rapid eye movements in REM sleep or slow eye movements in N1 sleep (Stephansen et al. 2018; Agarwal et al. 2005). Interestingly, the Perslev 2021 algorithm does not use an EMG channel, which is consistent with our observation of a negligible benefit on accuracy when adding EMG to the model. This may also indicate that while the current set of features implemented in the algorithm performs well for EEG and EOG channels, it does not fully capture the meaningful dynamic information nested within muscle activity during sleep.”
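
    To make the bilateral-EOG idea concrete, a toy illustration of the negative-product feature is sketched below. This is a possible future feature, not part of the current algorithm.

        import numpy as np

        def eog_negative_product(loc, roc):
            # Illustrative feature only: conjugate (mirror-image) eye movements
            # deflect the two EOG channels in opposite directions, so their
            # negative product is large and positive during eye movements and
            # close to zero otherwise.
            return -np.asarray(loc, dtype=float) * np.asarray(roc, dtype=float)

        # Toy example: mirror-image deflections yield large positive values.
        print(eog_negative_product([50.0, -80.0, 5.0], [-45.0, 75.0, 4.0]))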

    6. Sleep scoring guidelines incorporate not only what can be observed on a given epoch of data but also what is observed in the previous epoch(s). For example, an epoch can be scored as N2 even if there is no marker of N2 but there was (i) a marker of N2 in a previous epoch, (ii) no reason to change the score since. To reproduce this, the authors employed a symmetrical smoothing approach (a combination of a triangular-weighted rolling average and an asymmetrical rolling average). Why did the authors choose to incorporate data from following epochs, which is not implemented in established guidelines? How was the duration of the smoothing window chosen? Indeed, 5 minutes appears rather long and could explain the poor performance of the algorithm for fast-changing portions of the data (i.e. N1 or transitions). Importantly, these transitions can be very relevant in clinical settings and to establish a diagnosis.

    This is a great question. We have addressed this in the revised manuscript.

    Temporal smoothing

    We have also conducted a new analysis of the influence of the temporal smoothing on the performance. The results are described in Supplementary File 3a. Briefly, using a cross-validation approach, we have tested a total of 49 combinations of time lengths for the past and centered smoothing windows. Results demonstrated that the best performance is obtained when using a 2-min past rolling average in combination with a 7.5-min centered, triangular-weighted rolling average. Removing the centered rolling average resulted in poorer performance, suggesting that there is an added benefit of incorporating data from both before and after the current epoch. Removing both the past and centered rolling averages resulted in the worst performance (a 3.6% decrease in F1-macro). Therefore, the new version of the manuscript and algorithm now uses a 2-min past and a 7.5-min centered rolling average. All the results in the manuscript have been updated accordingly. We have now edited the “Smoothing and normalization” subsection of the Methods section as follows:

    “In particular, the features were first duplicated and then smoothed using two different rolling windows: 1) a 7.5-min centered, triangular-weighted rolling average (i.e. 15 epochs centered around the current epoch with the following weights: [0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1., 0.875, 0.75, 0.625, 0.5, 0.375, 0.25, 0.125]), and 2) a rolling average of the last 2 minutes prior to the current epoch. The optimal time length of these two rolling windows was found using a parameter search with cross-validation (Supplementary File 3a). [...] The final model includes the 30-sec based features in original units (no smoothing or scaling), as well as the smoothed and normalized versions of these raw features.”
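
    For readers who wish to reproduce this smoothing scheme outside of YASA, a minimal pandas sketch is given below. The column names, suffixes, and whether the current epoch is included in the past window are our own illustrative choices and may differ from the released implementation.

        import numpy as np
        import pandas as pd

        # feats: one row per 30-s epoch, one column per raw feature (toy data here).
        feats = pd.DataFrame(np.random.rand(960, 3),
                             columns=["beta_rel", "perm_entropy", "higuchi_fd"])

        # 7.5-min centered, triangular-weighted average = 15 epochs centered on the
        # current epoch; scipy's triangular window of length 15 reproduces the
        # weights quoted above ([0.125, 0.25, ..., 1, ..., 0.25, 0.125]).
        centered = (feats.rolling(15, center=True, win_type="triang", min_periods=1)
                         .mean().add_suffix("_c7min5"))

        # 2-min "past" average = the 4 epochs preceding the current one; shift(1)
        # excludes the current epoch (whether it is included is an implementation
        # detail of the released code).
        past = feats.shift(1).rolling(4, min_periods=1).mean().add_suffix("_p2min")

        X = pd.concat([feats, centered, past], axis=1)   # raw + smoothed copies of each feature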

    Reviewer #3 (Public Review):

    This study presents a new sleep scoring tool that is based on a classification algorithm using machine-learning approaches in which a set of features is extracted from the EEG signal. The algorithm was trained and validated on a very large number of nocturnal sleep datasets including participants of various ethnicities, ages and health statuses. Results show that the algorithm offers a high level of sensitivity, specificity and accuracy matching or sometimes even exceeding that of typical interscorer agreement. The conclusions are supported by the data. Importantly, a measure of the algorithm's confidence is provided for each scored epoch in order to guide users during their review of the output. The software is described as easy to use, computationally low-demanding, open source and free. This paper addresses an important need for the field of sleep research. There is indeed a lack of accurate, flexible and open source sleep scoring tools. I would like to commend the authors for their efforts in providing such a tool for the community and for their adherence to the open science framework as the data and codes related to the current manuscript are made available. I predict that this automated tool will be of use for a large number of researchers in the field. However, there are plenty of automated sleep scoring tools already available in the field (most of them are not open source and rather expensive, as noted by the authors). The current work does not provide a clear view on whether the new algorithm presented in this research performs better than algorithms already available in the field. No formal comparison between algorithms is provided and the matter is not discussed in the paper.

    Thanks so much for pointing this out. We have now added this relevant reference throughout the manuscript. To build on the reviewer’s point, the current algorithm and Stephansen’s algorithm did not use the same public data. The Stephansen 2018 algorithm was trained and validated on “10 different cohorts recorded at 12 sleep centers across 3 continents: SSC, WSC, IS-RC, JCTS, KHC1, AHC, IHC, DHC, FHC and CNC”, none of which are included in the training/testing sets of the current algorithm. Nevertheless, we certainly agree that the manuscript will benefit from a more extensive comparison against existing tools. To this end, we have made several major modifications to the manuscript. First, we have added a dedicated paragraph in the Introduction to review existing sleep staging algorithms:

    “Advances in machine-learning have led efforts to classify sleep with automated systems. Indeed, recent years have seen the emergence of several automatic sleep staging algorithms. While an exhaustive review of the existing sleep staging algorithms is out of the scope of this article, we review below — in chronological order — some of the most significant algorithms of the last five years. For a more in-depth review, we refer the reader to Fiorillo et al. 2019. The Sun et al. 2017 algorithm was trained on 2,000 PSG recordings from a single sleep clinic. The overall Cohen's kappa on the testing set was 0.68 (n=1,000 PSG nights). The “Z3Score” algorithm (Patanaik et al. 2018) was trained and evaluated on ~1,700 PSG recordings from four datasets, with an overall accuracy ranging from 89.8% in healthy adults/adolescents to 72.1% in patients with Parkinson’s disease. The freely available “Stanford-stage” algorithm (Stephansen et al. 2018) was trained and evaluated on 10 clinical cohorts (~3,000 recordings). The overall accuracy was 87% against the consensus scoring of several human experts in an independent testing set. The “SeqSleepNet” algorithm (Phan et al. 2019) was trained and tested using a 20-fold cross-validation on 200 nights (overall accuracy = 87.1%). Finally, the recent U-Sleep algorithm (Perslev et al. 2021) was trained and evaluated on PSG recordings from 15,660 participants of 16 clinical studies. While the overall accuracy was not reported, the mean F1-score against the consensus scoring of five human experts was 0.79 for healthy adults and 0.76 for patients with sleep apnea.”

    Second, and importantly, we now perform an in-depth comparison of YASA’s performance against the Stephansen 2018 algorithm and the Perslev 2021 algorithm using the same data for all three algorithms. Specifically, we have applied the three algorithms to each night of the Dreem Open Datasets (DOD) and compared their performance in dedicated tables in the Results section (Table 2 and Table 3). This procedure is fully described in a new “Comparison against existing algorithms” subsection of the Methods. None of these algorithms included nights from the DOD in their training set, thus ensuring a fair comparison of the three algorithms. Related to point 4 of the Essential Revisions, performance of the three algorithms is reported separately for healthy individuals (DOD-Healthy, n=25) and patients with sleep apnea (DOD-Obstructive, n=50). To facilitate future validation of our algorithm, we also provide the predicted hypnograms of each night in Supplementary File 1 (healthy) and Supplementary File 2 (patients).

    Overall, the comparison results show that YASA’s accuracy is not significantly different from the Stephansen 2018 algorithm for both healthy adults and patients with obstructive sleep apnea. The accuracy of the Perslev 2021 algorithm is not significantly different from YASA in healthy adults, but is higher in patients with sleep apnea. However, it should be noted that while the YASA algorithm uses only one central EEG, one EOG and one EMG, the Perslev 2021 algorithm uses all available EEGs as well as two EOGs. This suggests that adding more EEG channels and/or using the two EOGs may improve the performance of YASA in patients with sleep apnea. An important counterpoint is that YASA requires a far less extensive array of channels to achieve very similar levels of accuracy, which reduces computational and processing demands, improves analysis speed (a few seconds per recording versus ~10 min for the Stephansen 2018 algorithm), and makes the tool applicable to recordings that do not include additional EEG channels. All these points are now discussed in detail in the new “Limitations and future directions” subsection of the Discussion (see point 3 of the Essential Revisions).
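
    For clarity, the per-night agreement metrics underlying these comparisons can be computed with standard scikit-learn functions, as sketched below with toy labels. The values are placeholders, not data from the manuscript.

        from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

        # Toy per-night comparison: y_true is the consensus human scoring, y_pred
        # the hypnogram predicted by one of the algorithms, one label per 30-s epoch.
        y_true = ["W", "N1", "N2", "N2", "N3", "N3", "N2", "R", "R", "W"]
        y_pred = ["W", "N2", "N2", "N2", "N3", "N2", "N2", "R", "W", "W"]

        print("accuracy:", accuracy_score(y_true, y_pred))
        print("kappa:   ", cohen_kappa_score(y_true, y_pred))
        print("F1-macro:", f1_score(y_true, y_pred, average="macro"))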

    There are some overstatements in the manuscript. For example, the algorithm was trained and validated on nocturnal sleep data. Sleep characteristics (e.g. duration and distribution of sleep stages, etc.) are different, for example, during diurnal sleep (nap) and the algorithm might not perform as well on nap data. As such, the tool might not be as "universal" as stated in the title. Additionally, as human scores are used as the ground-truth for the validation step, it might be misleading to state that "this tool offers high sleep-staging accuracy matching or exceeding human accuracy". The algorithm exceeded the accuracy of some human scorers and matched the scores of the best scorer.

    We have now removed the word “universal” from the title and replaced “exceeded human accuracy” with “matched human accuracy”. Furthermore, we now state in the Limitations section of the Discussion that the algorithm was trained and validated only on nocturnal data, and note that, as such, it may not perform at the same accuracy level on daytime nap data.

    No reflection on further improvement is offered in the paper. The algorithm performs worse on N1 stage, older individuals and patients presenting sleep disorders (sleep fragmentation) and it is unclear how this could be improved in future research. In the same vein, the current work does not present performance accuracy separately for healthy individuals and patients when it is expected that accuracy would be poorer in the patient group.

    The revised manuscript now includes a dedicated section in the Discussion to propose ideas for improvements.

    First, we have now added a “Limitations and Future Directions” subsection in the Discussion to present ideas for improving the algorithm, with a particular focus on fragmented nights and/or nights from patients with sleep disorders:

    “Despite its numerous advantages, there are limitations to the algorithm that must be considered. These are discussed below, together with ideas for future improvements of the algorithm. First, while the accuracy of YASA against consensus scoring was not significantly different from the Stephansen 2018 and Perslev 2021 algorithms on healthy adults, it was significantly lower than the latter algorithm on patients with obstructive sleep apnea. The Perslev 2021 algorithm used all available EEGs and two (bilateral) EOGs, whereas YASA’s scoring was based on one central EEG, one EOG and one EMG. This suggests that one way to improve performance in this population could be the inclusion of more EEG channels and/or bilateral EOGs. For instance, using the negative product of bilateral EOGs may increase sensitivity to rapid eye movements in REM sleep or slow eye movements in N1 sleep (Stephansen et al. 2018; Agarwal et al. 2005). Interestingly, the Perslev 2021 algorithm does not use an EMG channel, which is consistent with our observation of a negligible benefit on accuracy when adding EMG to the model. This may also indicate that while the current set of features implemented in the algorithm performs well for EEG and EOG channels, it does not fully capture the meaningful dynamic information nested within muscle activity during sleep.”

    Second, we have now conducted a random forest analysis to identify the main contributors to variability in accuracy. The analysis is described in detail in the “Moderator Analyses” subsection of the Results as well as in Supplementary File 3b; the revision now states:

    “To better understand how these moderators influence variability in accuracy, we quantified the relative contribution of the moderators using a random forest analysis. Specifically, we included all aforementioned demographics variables in the model, together with medical diagnosis of depression, diabetes, hypertension and insomnia, and features extracted from the ground-truth sleep scoring such as the percentage of each sleep stage, the duration of the recording and the percentage of stage transitions in the hypnograms. The outcome variable of the model was the accuracy score of YASA against ground-truth sleep staging, calculated separately for each night. All the nights in the testing set 1 were included, leading to a sample size of 585 unique nights. Results are presented in Supplementary File 3b. The percentage of N1 sleep and percentage of stage transitions — both markers of sleep fragmentation — were the top two predictors of accuracy, accounting for 40% of the total relative importance. By contrast, the combined contribution of age, sex, race and medical diagnosis of insomnia, hypertension, diabetes and depression accounted for roughly 10% of the total importance.”
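
    A minimal sketch of this type of moderator analysis is given below using simulated data. The variable names, values and hyperparameters are illustrative only and do not reproduce the manuscript's analysis.

        import numpy as np
        import pandas as pd
        from sklearn.ensemble import RandomForestRegressor

        # Simulated data for illustration: one row per night, columns = moderators,
        # outcome = per-night accuracy of the algorithm against human scoring.
        rng = np.random.default_rng(42)
        moderators = pd.DataFrame({
            "age": rng.uniform(5, 90, 585),
            "perc_N1": rng.uniform(0, 25, 585),
            "perc_stage_transitions": rng.uniform(0, 30, 585),
            "insomnia": rng.integers(0, 2, 585),
        })
        accuracy = 0.92 - 0.004 * moderators["perc_N1"] + rng.normal(0, 0.02, 585)

        rf = RandomForestRegressor(n_estimators=500, random_state=42).fit(moderators, accuracy)
        importance = (pd.Series(rf.feature_importances_, index=moderators.columns)
                        .sort_values(ascending=False))
        print(importance)   # relative importance of each moderator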

    In addition, as requested by the reviewer, the performance of the algorithm on the DOD testing dataset is now analyzed and reported separately for healthy individuals (DOD-Healthy) and patients with obstructive sleep apnea (DOD-Obstructive), which can be found in the “Testing set 2” section.

    There is a series of methodological choices that are not justified. For example, nights were cropped to 15 minutes before and after sleep to remove irrelevant extra periods of wakefulness or artefacts on both ends of the recording. This represents an issue for the computation of important sleep measures such as sleep efficiency and latency as the onset/offset of sleep might be missed. It is also unclear how the features were selected and a description of said features is currently missing. The custom sleep stage weights procedure is unclear. The length of the time window for the smoothing procedure seems arbitrary. Last, it is currently unclear when / how the EEG and EMG data were analyzed.

    As recommended by the reviewers, the 15-min truncation step has now been removed from the pipeline. Furthermore, the Methods section has been improved to provide more details on the features. Finally, the best class-weights and smoothing windows are now found using a cross-validation analysis on the training set. For more details, we refer the reviewer to the “Justification for some methodological choices” section below.

  2. Evaluation Summary:

    This study describes a novel algorithm to perform fully automated sleep staging in humans. It is well validated and performs at the level of other state of the art algorithms although it does not yet include comparisons against the existing tools available in the field. Given the efforts made by the authors to ensure ease of use and accessibility it may help extend the use of automated methods in the study of sleep.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #1 agreed to share their name with the authors.)

  3. Reviewer #1 (Public Review):

    Sleep staging is the key first step in the analysis of human sleep architecture, which is fast becoming an important medical tool. Currently, scoring is overwhelmingly manual, but machine learning techniques have recently led to algorithms with human-level performance. They are as yet not widely used in the clinical setting. The current manuscript aims to remove some of the stumbling blocks for the use of automated methods by providing an easy-to-use, well-validated algorithm that relies on polysomnographic features familiar in the sleep literature so as to facilitate interpretation of results.

    The paper clearly shows that the algorithm performs within the range of the current state of the art for automated sleep scoring and is unbiased with regard to race, sex and age. It provides demanding cross-validation by using a completely external dataset for performance evaluation. The algorithm seems easy to use and is supplied with a well-documented code that should facilitate its application as the authors intended.

    Establishing whether automated methods tend to make similar "errors" as human scorers or whether they display different behaviour is essential if a transition is to be confidently made from one technique to the other. This is not systematically explored in the manuscript. A key advantage of the automated approach is the provision of a confidence measure for each epoch. However how this relates to human inter-scorer variability and accuracy is not fully explored and so the potential of this measure is not clear.

  4. Reviewer #2 (Public Review):

    1. The authors describe their algorithm as a tool that (i) was validated "across heterogeneous populations around the world"; (ii) has an "accuracy matching or exceeding human accuracy"; (iii) "is easy to use". I take issue with these three statements.
    First, the authors did not test the performance of their algorithm in clinical populations with sleep disorders, despite the fact that individuals with sleep disorders represent (logically) the vast majority of sleep recordings. Crucially, such a comparison was made in the best (to my knowledge) published automated sleep staging algorithm (Stephansen et al. Nature Communications 2018, doi: 10.1038/s41467-018-07229-3). The omission of this work is very surprising. Quantifying the impact of sleep disorders on a sleep scoring algorithm is critical for its deployment in sleep clinics.
    Second, the authors wrote that their algorithm is "matching or exceeding" human accuracy but seem to present uncorrected one-to-one comparisons to support their claim. The fact that an algorithm is better than some humans does not mean it exceeds human performance.
    Third, although I agree that the tool seems easy to use even for individuals with limited programming skills, it still requires some. I don't think someone who is used to software with graphical interfaces and who has never used (or heard of!) python would describe the tool as easy to use. This poses an important implementation challenge.

    2. An important limitation of this algorithm is that it captures only one part of the visual examination of sleep data. Indeed, especially in clinical settings, the data is not only examined to establish the hypnogram but also to identify markers of common sleep disorders (e.g. sleep apnea, leg movements, etc). Although this algorithm could significantly speed up sleep scoring, it does not allow the detection of these other important markers. Currently, and in line with the previous comment, the algorithm could not replace the visual inspection of the data for clinical diagnoses.

    3. The data were curated with some recordings or portions of recordings being excluded (see p. 7). While I understand that this curation is important for the training set, I think it should not be applied to the test set. Indeed, it goes contrary to the logic of automating sleep staging. For example, cutting the beginning and end of the recording according to sleep start and end (p. 7) supposes that the start and end of sleep are already known (i.e. it has already been scored).

    4. Two types of EEG derivations were used (C4-M1 or C4-Fpz). Was the performance impacted by this variable? Is it fair to assume that the choice of features (spectral features or summary statistics of time series data) could explain the absence of differences but that introducing new features (i.e. phase-sensitive features) could increase the influence of the choice of the derivation?

    5. Given that markers of sleep stages are very different in EOG, EMG and EEG time series, could the authors explain the logic behind applying the same pre-processing and extracting the same features on these three very different types of data? Could this explain why the majority of the features in the top-20 features were EEG features?

    6. Sleep scoring guidelines incorporate not only what can be observed on a given epoch of data but also what is observed in the previous epoch(s). For example, an epoch can be scored as N2 even if there is no marker of N2 but there was (i) a marker of N2 in a previous epoch, (ii) no reason to change the score since. To reproduce this, the authors employed a symmetrical smoothing approach (a combination of a triangular-weighted rolling average and an asymmetrical rolling average). Why did the authors choose to incorporate data from following epochs, which is not implemented in established guidelines? How was the duration of the smoothing window chosen? Indeed, 5 minutes appears rather long and could explain the poor performance of the algorithm for fast-changing portions of the data (i.e. N1 or transitions). Importantly, these transitions can be very relevant in clinical settings and to establish a diagnosis.

  5. Reviewer #3 (Public Review):

    This study presents a new sleep scoring tool that is based on a classification algorithm using machine-learning approaches in which a set of features is extracted from the EEG signal. The algorithm was trained and validated on a very large number of nocturnal sleep datasets including participants of various ethnicities, ages and health statuses. Results show that the algorithm offers a high level of sensitivity, specificity and accuracy matching or sometimes even exceeding that of typical interscorer agreement. The conclusions are supported by the data. Importantly, a measure of the algorithm's confidence is provided for each scored epoch in order to guide users during their review of the output. The software is described as easy to use, computationally low-demanding, open source and free.
    This paper addresses an important need for the field of sleep research. There is indeed a lack of accurate, flexible and open source sleep scoring tools. I would like to commend the authors for their efforts in providing such a tool for the community and for their adherence to the open science framework as the data and codes related to the current manuscript are made available. I predict that this automated tool will be of use for a large number of researchers in the field.
    However, there are plenty of automated sleep scoring tools already available in the field (most of them are not open source and rather expensive, as noted by the authors). The current work does not provide a clear view on whether the new algorithm presented in this research performs better than algorithms already available in the field. No formal comparison between algorithms is provided and the matter is not discussed in the paper.

    There are some overstatements in the manuscript. For example, the algorithm was trained and validated on nocturnal sleep data. Sleep characteristics (e.g. duration and distribution of sleep stages, etc.) are different, for example, during diurnal sleep (nap) and the algorithm might not perform as well on nap data. As such, the tool might not be as "universal" as stated in the title. Additionally, as human scores are used as the ground-truth for the validation step, it might be misleading to state that "this tool offers high sleep-staging accuracy matching or exceeding human accuracy". The algorithm exceeded the accuracy of some human scorers and matched the scores of the best scorer.

    No reflection on further improvement is offered in the paper. The algorithm performs worse on N1 stage, older individuals and patients presenting sleep disorders (sleep fragmentation) and it is unclear how this could be improved in future research. In the same vein, the current work does not present performance accuracy separately for healthy individuals and patients when it is expected that accuracy would be poorer in the patient group.

    There is a series of methodological choices that are not justified. For example, nights were cropped to 15 minutes before and after sleep to remove irrelevant extra periods of wakefulness or artefacts on both ends of the recording. This represents an issue for the computation of important sleep measures such as sleep efficiency and latency as the onset/offset of sleep might be missed. It is also unclear how the features were selected and a description of said features is currently missing. The custom sleep stage weights procedure is unclear. The length of the time window for the smoothing procedure seems arbitrary. Last, it is currently unclear when / how the EEG and EMG data were analyzed.