Rapid, Reference-Free human genotype imputation with denoising autoencoders

Curation statements for this article:
  • Curated by eLife

    eLife logo

    Evaluation Summary:

    The paper describes a novel neural-network-based strategy for imputing unmeasured genotypes, which is standard part of most association testing pipelines. The method, while computationally intensive to train, can perform the imputation quickly and conveniently, and has the potential to be a practically-appealing alternative to existing methods. However, further work will be required to realize this potential, and further data are required to support the accuracy of the method.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

This article has been Reviewed by the following groups

Read the full article See related articles

Abstract

Genotype imputation is a foundational tool for population genetics. Standard statistical imputation approaches rely on the co-location of large whole-genome sequencing-based reference panels, powerful computing environments, and potentially sensitive genetic study data. This results in computational resource and privacy-risk barriers to access to cutting-edge imputation techniques. Moreover, the accuracy of current statistical approaches is known to degrade in regions of low and complex linkage disequilibrium. Artificial neural network-based imputation approaches may overcome these limitations by encoding complex genotype relationships in easily portable inference models. Here, we demonstrate an autoencoder-based approach for genotype imputation, using a large, commonly used reference panel, and spanning the entirety of human chromosome 22. Our autoencoder-based genotype imputation strategy achieved superior imputation accuracy across the allele-frequency spectrum and across genomes of diverse ancestry, while delivering at least fourfold faster inference run time relative to standard imputation tools.

Article activity feed

  1. Author Response

    Reviewer #1 (Public Review):

    Dias et al proposed a new method for genotype imputation and evaluated its performance using a variety of metrics. Their method consistently produces better imputation accuracies across different allele frequency spectrums and ancestries. Surprisingly, this is achieved with superior computational speed, which is very impressive since competing imputation softwares had decades of experience in optimizing software performance.

    The main weakness in my opinion is the lack of software/pipeline descriptions, as detailed in my main points 36 below.

    We have made the source code and detailed instructions available publicly at Github. The computational pipeline for autoencoder training and validation is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline.

    1. In the neural network training workflow, I am worried it will be difficult to compute the n by n correlation matrix if n is large. If n=10^5, the matrix would be ~80GB in double precision, and if n=10^6, the matrix is ~2TB. I wonder what is n for HRC chromosome 1? Would this change for TOPMed (Taliun 2021 Nature) panel which has ~10x more variants? I hope the authors can either state that typical n is manageable even for dense sequencing data, or discuss a strategy for dealing with large n. Also, Figure 1 is a bit confusing, since steps E1-E2 supposedly precede A-D.

    We included more details in the methods section to address this question. It is true that computing the entirety of this matrix is computationally intensive, thus, in order to avoid this complexity, we calculated the correlations in a sliding box of 500 x 500 common variants (minor allele frequency (MAF) >=0.5%). In other words, no matter how dense the genomic data is, the n x n size will always be fixed to 500 x 500. Larger datasets will not influence this as the additional variants fall below the MAF>=0.5% threshold. Thus, memory utilization will be the same regardless of chromosome length or database size. Please note that this correlation calculation process is not necessary for the end-user to perform imputation, since we already provide the information on what genomic coordinates belong to the local minima or “cutting points” of the genome. This computational burden remains on the developer side. The reviewer is right to point out that Figure 1 is misleading in its ordering, we have corrected this in the revision.

    1. I have a number of questions/comments regarding equations 2-4. (a) There seems to be no discussion on how the main autoencoder weight parameters were optimized? Intuitively, I would think optimizing the autoencoder weights are conceptually much more important than tuning hyper-parameters, for which there are plenty of discussions.

    These parameters are optimized through the training process described in “Hyperparameter Initialization and Grid Search / Hyperparameter Tuning” - where both the hyperparameters and edge weights are determined for each autoencoder for each genomic segment. There are 256 genomic segments in chromosome 22, and each segment has a different number of input variables, sparsity, and correlation structure. Thus, there is a unique autoencoder model that best fits each genomic tile (e.g.: each autoencoder has different weights, architecture, loss function, regularizes, and optimization algorithms). Therefore, while there are some commonalities across genomic tiles, there is not a single answer for the number of dimensions of the weight matrix, or for how the weights were optimized. Instructions on how to access the unique information on the parameters and hyperparameters of each one of the 256 autoencoders is now shared through our source code repository at https://github.com/TorkamaniLab/imputator_inference.

    We included an additional explanation clarifying this point in the Hyperparameter Tuning subsection of the Methods.

    (b) I suppose t must index over each allele in a segment, but this was not explicit.

    That is correct, t represents the index of each allele in a genomic segment. We included this statement in the description of equation 2.

    (c) Please use standard notations for L1 and L2 norms (e.g. ||Z||_1 for L1 norm of Z). I also wonder if the authors meant ||Z||_1 or ||vec(Z)||_1 (vectorized Z)?

    We included a clarification in the description of equation 3. ‖𝑾‖𝟏 and ‖𝑾‖𝟐 are the standard L1 and L2 norms of the autoencoder weight matrix (W).

    (d) It would be great if the authors can more explicitly describe the auto-encoder matrices (e.g. their dimensions, sparsity patterns if any...etc).

    As we answered in comment 2.a, each one of the 256 autoencoders for each genomic segment is unique, so it would be unfeasible to describe the architecture, parameters, optimizers, loss function, regularizes, of each one of them. We realized it would be more suitable to share this information in a software repository and have now done so.

    1. It is not obvious if the authors intend to provide a downloadable software package that is user-friendly and scalable to large data (e.g. HRC). For the present paper to be useful to others, I imagine either (a) the authors provide software or example scripts so users can train their own neural network, or (b) the authors provide pretrained networks that are downloaded and can be easily combined with target genotype data for imputation. From the discussion, it seems like (b) would be the ultimate goal, but is only part dream and part reality. It would be helpful if the authors can clarify how current users can benefit from their work.

    We have now shared the pre-trained autoencoders (including model weights and inference source code) and instructions on how to use them for imputation. These resources are publicly available at https://github.com/TorkamaniLab/imputator_inference. We have added this information to the Data Availability subsection of the Methods.

    1. Along the same lines, I also found the description of the software/pipeline to be lacking (unless these information are available on the online GitHub page, which is currently inaccessible). For instance, I would like to know which of the major data imputation formats (VCF/BGEN..etc) are supported? Which operating systems (window/linux/mac) are supported? I also would like to know if it is possible to train the network or run imputation given pre-trained networks, if I don't have a GPU?

    We have now made the github repository publicly available. The description of the requirements and steps performed in the hyperparameter tuning pipeline is available at https://github.com/TorkamaniLab/Imputation_Autoencoder/tree/master/autoencoder_tuning_pipeline.

    1. Typically, imputation software supplies a per-SNP imputation quality score for use in downstream analysis. This is important for interpretability as it helps users decide which variants are confidently imputed and which ones are not. For example, such a quality score can be estimated from the posterior distribution of an HMM process (e.g. Browning 2009 AJHG). Would the proposed method be able to supply something similar? Alternatively, how would the users know which imputed variants to trust?

    We included further clarification in the data availability session of methods: Imputation data format. The imputation results are exported in variant calling format (VCF) containing the imputed genotypes and imputation quality scores in the form of class probabilities for each one of the three possible genotypes (homozygous reference, heterozygous, and homozygous alternate allele). The probabilities can be used for quality control of the imputation results.

    We included this clarification in the manuscript and in the readme file of the inference software repository https://github.com/TorkamaniLab/imputator_inference.

    1. I think the authors should clarify whether input genotypes must be prephased. That is, given a trained neural network and a genotype data that one wishes to impute, does the genotype data have to be phased? The discussion reads "our current encoding approach lacks phasing information..." which can be understood both ways. On a related note, I hope the authors can also clarify if the validation and testing data (page 7 lines 1423) were phased data, or if they were originally unphased but computationally phased via softwares like Eagle 2 or Beagle 5.

    The input genotypes are not phased, nor pre-phased, and no pre-phasing was performed before imputation. We included further clarification on the method section, stating “All input genotypes from all datasets utilized in this work are unphased, and no pre-phasing was performed.”. We also included further clarification in the Discussion session.

    1. It is unclear if the reported run times (Figure 6) includes model training time, or if they are simply imputing the missing genotypes given a pre-trained autoencoder? For the later, I think the comparison may still be fair if users never have to train models themselves. However, if users currently have to train their own network, I feel it is imperative to also report the model training time, even if in another figure/table.

    The end-users do not have to train the models, the computational burden of training the models remains on the developer side, so the runtimes refer to the task of imputing the missing genotypes given a pre-trained autoencoder set. This allows for distribution without reference datasets. We included further clarification on the Performance Testing and Comparisons subsection of Methods.

    Reviewer #2 (Public Review):

    In this manuscript the authors introduce a segment based autoencoder (AE) to perform genotype imputation. The authors compare performance of their AE to more traditional HMM-based methods (e.g. IMPUTE) and show that there is a slight but significant improvement on these methods using the AE strategy.

    In general the paper is clearly presently and the work in timely, but I have some concerns with respect to the framing of the advances presented here along with the performance comparisons.

    Specific Points:

    1. The authors aren't doing a good enough job presenting the work of others in using deep neural networks for imputation or using autoencoders for closely related tasks in population genetics. For instance, the authors say that the RNN method of Kojima et al 2020. is not applicable to real world scenarios, however they seem to have missed that in that paper the authors are imputing based on omni 2.5 at 97% masking, right in line with what is presented here. It strikes me that the RNNIMP method is a crucial comparison here, and the authors should expand their scholarship in the paper to cover work that has already been done on autoencoders for popgen.

    This is an important comparison that we erroneously misrepresented. We have now separated out this particular application of the RNN-IMP in the introduction of the manuscript. The major difference is that RNN-IMP needs to be retrained on different input genetic variants, much like a standard HMM-based method. The computational burden of RNN-IMP remains on the end-user side. It appears that computational complexity is tremendous in this model, given that the only example the authors provided with their software consists of 100 genomes from 1000 Genomes Project to perform the imputation on Omni by de novo training of the data. Given their approach does not achieve the benefits of distributing a generalizable pre-trained neural network, and the computational burden associated with training these models on the 60K+ genomes we use in our manuscript, we have opted for stating the benefits and downsides of their approach in the introduction.

    1. With respect to additional comparisons-Kenneth Lange's group recently released a new method for imputation which is not based on HMM but is extremely fast. The authors would be well served to extend their comparisons to include this method (MendelImpute)-it should be favorable for the authors as ModelImpute is less accurate than HMMs but much faster.

    We appreciate the reviewer pointing out this additional method, however their parent manuscript clearly shows substantially inferior imputation performance relative to BEAGLE/Minimac etc. which we already compare against. There is not much to gain by performing this comparison. Our autoencoder-based approach is already generating results that are competitive with the best and most cited imputation tools, which are all HMM-based and outperforming MendelImpute. The outcome of this comparison is forecasted based upon the parent manuscript.

    1. The description of HMM based methods in lines 19-21 isn't quite correct. Moreover-what is an "HMM parameter function?"

    Thank you for catching this. We were referring to parameter *estimation and have corrected this in the manuscript.

    1. Using tiled AEs across the genome makes sense given the limitations of AEs generally, but this means that tiling choices may affect downstream accuracy. In particular-how does the choice of the LD threshold determine accuracy of the method? e.g. if the snp correlation threshold were 0.3 rather than 0.45, how would performance be changed?

    This choice is driven by the limitations of cutting-edge GPUs. 0.45 is the threshold that returns the minimum number of tiles spanning chromosome 22 with an average size per tile that fits into the video memory of GPUs. While developing the tiling algorithm, we tested lower thresholds, which made the tiles smaller and more abundant, and thus made the GPU memory workload less efficient (e.g. many tiles resulted in many autoencoders per GPU, which thus caused a CPU-GPU communication overhead). Due to the obstacles related to computational inefficiency, CPUGPU communication overhangs, and GPU memory limits, we did not proceed with model training on tiles generated with other correlation thresholds. We’ve added a paragraph explaining this choice in the manuscript.

    1. How large is the set of trained AEs for chromosome 22? In particular, how much disk space does the complete description of all AEs (model + weights) take up? How does this compare to a reference panel for chr22? The authors claim that one advance is that this is a "reference-free" method - it's not - and that as such there are savings in that a reference panel doesn't have to be used along with the genome to be imputed. While the later claim is true, instead a reference panel is swapped out for a set of trained AEs, which might take up a lot of disk space themselves. This comparison should be given and perhaps extrapolated to the whole genome.

    This is an interesting point. For comparison, the total combined uncompressed size of all pre-trained autoencoders together is 120GB, or 469MB per autoencoder. The size of the reference data, HRC chromosome 22 across ~27,000 samples is 1GB after compression – or nearly 10X the autoencoder size. Moreover, unlike in HMM-based imputation, the size of the pre-trained autoencoders does not increase as a function of the reference panel sample size. The size of the autoencoders remains fixed since the number of model weights and parameters remains the same regardless of sample size – though it will expand somewhat with the addition of new genetic variants. Another point to consider is that privacy concerns associated with distribution of reference data are mitigated with these pretrained autoencoders.

    1. The results around runtime performance (Figure 6) are misleading. Specifically HMM training and decoding is being performed here, whereas for the AE only prediction (equivalent to decoding) is being done. To their credit, the authors do mention a bit of this in the discussion, however a real comparison should be done in Figure 6. There are two ways to proceed in my estimation - 1) separate training and decoding for the HMM methods (Beagle doesn't allow this, I'm not sure of the other software packages) 2) report the training times for the AE method. I would certainly like to see what the training times look like given that the results as present require 1) a separate AE for each genomic chunk, 2) a course grid search, 3) training XGBoost on the results from the course grid search, and 4) retraining of the individual AEs given the XGBoost predictions, and 5) finally prediction. This is a HUGE training effort. Showing prediction runtimes and comparing those to the HMMs is inappropriate.

    We consider the prediction only during the runtime comparisons because only the prediction side is done by the enduser, whereas the computational burden remains on the developer side. For the HMMs, we included only the prediction time as well (excluded the time for data loading/writing, computing model parameters and HMM iterations). The pre-trained autoencoders, when distributed, can take as input any set of genetic variants to produce the output without any additional training or fine-tuning required.

    1. One well known problem for DNN based methods including AEs is out-of-sample prediction. While Figure 5 (missing a label by the way) sort of gets to this, I would have the authors compare prediction in genotypes from populations which are absent from the training set and compare that performance to HMMs. Both methods should suffer, but I'm curious as to whether the AEs are more robust than the HMMs to this sort of pathology.

    Our test datasets in Figures 4 and 5 are independent of the reference dataset. MESA, Wellderly, and HGDP are all independent datasets, never used for training, nor model selection. Only HRC was used as reference panel or for training, and ARIC was used for model selection during tuning. We included a statement in the methods clarifying this point.

    Reviewer #3 (Public Review):

    Over the last 15 years or so genotype imputation has been an important and widely-used tool in genetic studies, with methods based on Hidden Markov Models (HMMs) and reference panels emerging as the dominant approach. This paper suggests a new approach to genotype imputation based on denoising autoencoders (DAE), a type of neural network. This approach has two nice advantages over existing methods based on Hidden Markov Models (HMMs): i) once the DAE is trained on a reference panel the reference panel can be discarded, and users do not need access to the reference panel to use the DAE; ii) imputation using a DAE is very fast (training is slow, but this step is done upfront so users do not need to worry about it). The paper also presents data showing that the tuned DAE is competitive in accuracy with HMM methods.

    I have two main concerns.

    First, it is unclear to me whether the accuracy presented for the tuned DAE (eg Figure 3, Table 4) is a reliable reflection of expected future accuracy. This is because the tuning process was quite extensive and complex, and involved at least some of the datasets used in these assessments. While the paper correctly attempts to guard against overfitting and related issues by using separate Training, Validation and Testing data (p7), it seems that the Testing data were used in at least some of the development of the methods and tuning (eg p14, "A preliminary comparison of the best performing autoencoder..."; Figure 2 and Table 2, all involve the Testing data). Because of the complexity of the process by which the final DAE was arrived at it is unclear to me whether there is a genuine concern here, but it would seem safest and most convincing at this point to do an entirely independent test of the methods on genotype data sets that were not used at all up to this point.

    MESA, Wellderly, and HGDP were not used for training, nor for tuning, they are completely independent. So all the results showing these datasets are completely independent. Only HRC and ARIC were used for training and validation/tuning, respectively. We included a statement in the methods session clarifying this point.

    Moreover, HGDP in particular includes 828 samples from 54 different populations representing all continental populations and including remote populations like Siberia, Oceania, etc. This reference panel is described in more detail in the reference below and likely represents the most diverse human genome dataset available. Thus, we have externally validated generalizability on a dataset with much greater diversity than our training dataset:

    Bergström A, et al. Insights into human genetic variation and population history from 929 diverse genomes. Science. 2020 Mar 20;367(6484):eaay5012.

    Second, there is a potentially tricky issue of to what extent distributing a black box DAE trained on a reference sample is consistent with data sharing policies. Standards of data sharing have evolved over the last decade. Generally there currently seems to be little hesitation to publicly share "single-SNP summary data" such as allele frequency information from large reference panels, whereas sharing of individual-level genotype data is usually explicitly forbidden. It is not quite clear to me where sharing the fit of a DAE falls here, or how much information on individual genotypes the trained DAE contains. The current manuscript does not adequately address this issue.

    Currently there are no official data sharing restrictions on deep learning data. We are aware that future policies may rise, and we have started a collaboration with Oak Ridge National Laboratory to explore differential privacy techniques and privacy concerns for these autoencoders. Another point to consider is that the autoencoders segment the genome, making reconstruction of an individual genome impossible even if reference data were somehow recoverable from the neural networks. Regardless, this is an interesting and important point that should be addressed in the manuscript and we have added a paragraph discussing this point.

    Reviewer #4 (Public Review):

    In this manuscript, Dias et al proposed a novel genotype imputation method using autoencoders (AE), which achieves comparable or superior accuracy relative to the state-of-the-art HMM-based imputation methods after tuning. The idea is innovative and provides an alternative solution to the important task of genotype imputation. The authors also conducted some experiments using three different datasets as targets to showcase the value of their approach. The overall framework of the method is clearly presented but more technical details are needed. The results presented showed slight advantage of AE imputation after tuning but more comprehensive evaluations are needed. In particular, the authors didn't consider post-imputation quality control. The reported overall performance (R2 in the range of 0.2-0.6) seems low and inconsistent with the imputation literature.

    Overall, the method has potential but is not sufficiently compelling in its current form.

    We show average accuracy of 0.2-0.6 in Table 4, but that is the average R2 per variant across all variants (no MAF filtering or binning applied). The reviewer points that the accuracy should be R2>0.8, but this R2>0.8 refers to common variants only (allele frequency >1%), and we have shown r2>0.8 for these variants (Figure 4). The aggregate accuracy displayed in Table 4 is lower because the vast majority of variants fall below 1% allele frequency threshold.

    The references bellow demonstrate this issue and agree with our results:

    References:

    Rubinacci S, Delaneau O, Marchini J. Genotype imputation using the positional burrows wheeler transform. PLoS genetics. 2020 Nov 16;16(11):e1009049.

    McCarthy S, Das S, Kretzschmar W, Delaneau O, Wood AR, Teumer A, Kang HM, Fuchsberger C, Danecek P, Sharp K, Luo Y. A reference panel of 64,976 haplotypes for genotype imputation. Nature genetics. 2016 Oct;48(10):1279.

    Vergara C, Parker MM, Franco L, Cho MH, Valencia-Duarte AV, Beaty TH, Duggal P. Genotype imputation performance of three reference panels using African ancestry individuals. Human genetics. 2018 Apr;137(4):281-92.

  2. Evaluation Summary:

    The paper describes a novel neural-network-based strategy for imputing unmeasured genotypes, which is standard part of most association testing pipelines. The method, while computationally intensive to train, can perform the imputation quickly and conveniently, and has the potential to be a practically-appealing alternative to existing methods. However, further work will be required to realize this potential, and further data are required to support the accuracy of the method.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. The reviewers remained anonymous to the authors.)

  3. Reviewer #1 (Public Review):

    Dias et al proposed a new method for genotype imputation and evaluated its performance using a variety of metrics. Their method consistently produces better imputation accuracies across different allele frequency spectrums and ancestries. Surprisingly, this is achieved with superior computational speed, which is very impressive since competing imputation software's had decades of experience in optimizing software performance.

    The main weakness in my opinion is the lack of software/pipeline descriptions, as detailed in my main points 3-6 below.

    1. In the neural network training workflow, I am worried it will be difficult to compute the n by n correlation matrix if n is large. If n=10^5, the matrix would be ~80GB in double precision, and if n=10^6, the matrix is ~2TB. I wonder what is n for HRC chromosome 1? Would this change for TOPMed (Taliun 2021 Nature) panel which has ~10x more variants? I hope the authors can either state that typical n is manageable even for dense sequencing data, or discuss a strategy for dealing with large n. Also, Figure 1 is a bit confusing, since steps E1-E2 supposedly precede A-D.

    2. I have a number of questions/comments regarding equations 2-4. (a) There seems to be no discussion on how the main autoencoder weight parameters were optimized? Intuitively, I would think optimizing the autoencoder weights are conceptually much more important than tuning hyper-parameters, for which there are plenty of discussions. (b) I suppose t must index over each allele in a segment, but this was not explicit. (c) Please use standard notations for L1 and L2 norms (e.g. ||Z||_1 for L1 norm of Z). I also wonder if the authors meant ||Z||_1 or ||vec(Z)||_1 (vectorized Z)? (d) It would be great if the authors can more explicitly describe the auto-encoder matrices (e.g. their dimensions, sparsity patterns if any...etc).

    3. It is not obvious if the authors intend to provide a downloadable software package that is user-friendly and scalable to large data (e.g. HRC). For the present paper to be useful to others, I imagine either (a) the authors provide software or example scripts so users can train their own neural network, or (b) the authors provide pre-trained networks that are downloaded and can be easily combined with target genotype data for imputation. From the discussion, it seems like (b) would be the ultimate goal, but is only part dream and part reality. It would be helpful if the authors can clarify how current users can benefit from their work.

    4. Along the same lines, I also found the description of the software/pipeline to be lacking (unless these information are available on the online GitHub page, which is currently inaccessible). For instance, I would like to know which of the major data imputation formats (VCF/BGEN..etc) are supported? Which operating systems (window/linux/mac) are supported? I also would like to know if it is possible to train the network or run imputation given pre-trained networks, if I don't have a GPU?

    5. Typically, imputation software supplies a per-SNP imputation quality score for use in downstream analysis. This is important for interpretability as it helps users decide which variants are confidently imputed and which ones are not. For example, such a quality score can be estimated from the posterior distribution of an HMM process (e.g. Browning 2009 AJHG). Would the proposed method be able to supply something similar? Alternatively, how would the users know which imputed variants to trust?

    6. I think the authors should clarify whether input genotypes must be prephased. That is, given a trained neural network and a genotype data that one wishes to impute, does the genotype data have to be phased? The discussion reads "our current encoding approach lacks phasing information..." which can be understood both ways. On a related note, I hope the authors can also clarify if the validation and testing data (page 7 lines 14-23) were phased data, or if they were originally unphased but computationally phased via software like Eagle 2 or Beagle 5.

    7. It is unclear if the reported run times (Figure 6) includes model training time, or if they are simply imputing the missing genotypes given a pre-trained autoencoder? For the later, I think the comparison may still be fair if users never have to train models themselves. However, if users currently have to train their own network, I feel it is imperative to also report the model training time, even if in another figure/table.

  4. Reviewer #2 (Public Review):

    In this manuscript the authors introduce a segment based autoencoder (AE) to perform genotype imputation. The authors compare performance of their AE to more traditional HMM-based methods (e.g. IMPUTE) and show that there is a slight but significant improvement on these methods using the AE strategy.

    In general the paper is clearly presently and the work in timely, but I have some concerns with respect to the framing of the advances presented here along with the performance comparisons.

    Specific Points:

    1. The authors aren't doing a good enough job presenting the work of others in using deep neural networks for imputation or using autoencoders for closely related tasks in population genetics. For instance, the authors say that the RNN method of Kojima et al 2020. is not applicable to real world scenarios, however they seem to have missed that in that paper the authors are imputing based on omni 2.5 at 97% masking, right in line with what is presented here. It strikes me that the RNNIMP method is a crucial comparison here, and the authors should expand their scholarship in the paper to cover work that has already been done on autoencoders for popgen.
    2. With respect to additional comparisons-Kenneth Lange's group recently released a new method for imputation which is not based on HMM but is extremely fast. The authors would be well served to extend their comparisons to include this method (MendelImpute)-it should be favorable for the authors as ModelImpute is less accurate than HMMs but much faster.
    3. The description of HMM based methods in lines 19-21 isn't quite correct. Moreover-what is an "HMM parameter function?"
    4. Using tiled AEs across the genome makes sense given the limitations of AEs generally, but this means that tiling choices may affect downstream accuracy. In particular-how does the choice of the LD threshold determine accuracy of the method? e.g. if the snp correlation threshold were 0.3 rather than 0.45, how would performance be changed?
    5. How large is the set of trained AEs for chromosome 22? In particular, how much disk space does the complete description of all AEs (model + weights) take up? How does this compare to a reference panel for chr22? The authors claim that one advance is that this is a "reference-free" method - it's not - and that as such there are savings in that a reference panel doesn't have to be used along with the genome to be imputed. While the later claim is true, instead a reference panel is swapped out for a set of trained AEs, which might take up a lot of disk space themselves. This comparison should be given and perhaps extrapolated to the whole genome.
    6. The results around runtime performance (Figure 6) are misleading. Specifically HMM training and decoding is being performed here, whereas for the AE only prediction (equivalent to decoding) is being done. To their credit, the authors do mention a bit of this in the discussion, however a real comparison should be done in Figure 6. There are two ways to proceed in my estimation - 1) separate training and decoding for the HMM methods (Beagle doesn't allow this, I'm not sure of the other software packages) 2) report the training times for the AE method. I would certainly like to see what the training times look like given that the results as present require 1) a separate AE for each genomic chunk, 2) a course grid search, 3) training XGBoost on the results from the course grid search, and 4) retraining of the individual AEs given the XGBoost predictions, and 5) finally prediction. This is a HUGE training effort. Showing prediction runtimes and comparing those to the HMMs is inappropriate.
    7. One well known problem for DNN based methods including AEs is out-of-sample prediction. While Figure 5 (missing a label by the way) sort of gets to this, I would have the authors compare prediction in genotypes from populations which are absent from the training set and compare that performance to HMMs. Both methods should suffer, but I'm curious as to whether the AEs are more robust than the HMMs to this sort of pathology.

  5. Reviewer #3 (Public Review):

    Over the last 15 years or so genotype imputation has been an important and widely-used tool in genetic studies, with methods based on Hidden Markov Models (HMMs) and reference panels emerging as the dominant approach. This paper suggests a new approach to genotype imputation based on denoising autoencoders (DAE), a type of neural network. This approach has two nice advantages over existing methods based on Hidden Markov Models (HMMs): i) once the DAE is trained on a reference panel the reference panel can be discarded, and users do not need access to the reference panel to use the DAE; ii) imputation using a DAE is very fast (training is slow, but this step is done upfront so users do not need to worry about it). The paper also presents data showing that the tuned DAE is competitive in accuracy with HMM methods.

    I have two main concerns.

    First, it is unclear to me whether the accuracy presented for the tuned DAE (eg Figure 3, Table 4) is a reliable reflection of expected future accuracy. This is because the tuning process was quite extensive and complex, and involved at least some of the datasets used in these assessments. While the paper correctly attempts to guard against overfitting and related issues by using separate Training, Validation and Testing data (p7), it seems that the Testing data were used in at least some of the development of the methods and tuning (eg p14, "A preliminary comparison of the best performing autoencoder..."; Figure 2 and Table 2, all involve the Testing data). Because of the complexity of the process by which the final DAE was arrived at it is unclear to me whether there is a genuine concern here, but it would seem safest and most convincing at this point
    to do an entirely independent test of the methods on genotype data sets that were not used at all up to this point.

    Second, there is a potentially tricky issue of to what extent distributing a black box DAE trained on a reference sample is consistent with data sharing policies. Standards of data sharing have evolved over the last decade. Generally there currently seems to be little hesitation to publicly share "single-SNP summary data" such as allele frequency information from large reference panels, whereas sharing of individual-level genotype data is usually explicitly forbidden. It is not quite clear to me where sharing the fit of a DAE falls here, or how much information on individual genotypes the trained DAE contains. The current manuscript does not adequately address this issue.

  6. Reviewer #4 (Public Review):

    In this manuscript, Dias et al proposed a novel genotype imputation method using autoencoders (AE), which achieves comparable or superior accuracy relative to the state-of-the-art HMM-based imputation methods after tuning. The idea is innovative and provides an alternative solution to the important task of genotype imputation. The authors also conducted some experiments using three different datasets as targets to showcase the value of their approach. The overall framework of the method is clearly presented but more technical details are needed. The results presented showed slight advantage of AE imputation after tuning but more comprehensive evaluations are needed. In particular, the authors didn't consider post-imputation quality control. The reported overall performance (R2 in the range of 0.2-0.6) seems low and inconsistent with the imputation literature. Overall, the method has potential but is not sufficiently compelling in its current form.