Real-time identification of epistatic interactions in SARS-CoV-2 from large genome collections

Abstract

The emergence and rapid spread of the SARS-CoV-2 virus has highlighted the importance of genomic epidemiology in understanding the evolution of pathogens and for guiding public health interventions. In particular, the Omicron variant underscored the role of epistasis in the evolution of lineages with both higher infectivity and immune escape, and therefore the necessity to update surveillance pipelines to detect them as soon as they emerge. In this study we applied a method based on mutual information (MI) between positions in a multiple sequence alignment (MSA), which is capable of scaling up to millions of samples. We showed how it could reliably predict known experimentally validated epistatic interactions, even when using as little as 10,000 sequences, which opens the possibility of making it a near real-time prediction system. We tested this possibility by modifying the method to account for sample collection date and applied it retrospectively to MSAs for each month between March 2020 and March 2023. We could detect a cornerstone epistatic interaction in the Spike protein between codons 498 and 501 as soon as 6 samples with a double mutation were present in the dataset, thus demonstrating the method’s sensitivity. Lastly we provide examples of predicted interactions between genes, which are harder to test experimentally and therefore more likely to be overlooked. This method could become part of continuous surveillance systems tracking present and future pathogen outbreaks.

Article activity feed

  1. Note: This response was posted by the corresponding author to Review Commons. The content has not been altered except for formatting.


    Reply to the reviewers

    Manuscript number: RC-2023-02154

    Corresponding author(s): Marco, Galardini

    General Statements

    We have carefully read the comments put forward by the two reviewers and we have produced a revised version of the manuscript that we believe addresses all the concerns expressed by the reviewers. In short, we have validated our approach against experimentally derived epistatic coefficients, compared our mutual information (MI) method against one that uses direct coupling analysis (DCA), and experimentally tested three interactions in the spike RBD that we have predicted and which emerged only in summer 2023, thus demonstrating the potential predictive power of this approach. We have also carefully reworded the manuscript to acknowledge the inherent limitation of a method based on MI to identify epistatic interactions. We believe that the revised manuscript is now more robust with these new in-silico and in-vitro validations, and more direct in exposing the advantages (speed) and caveats (higher false-positives) of this approach.

    Note: the line numbers referenced in the responses to reviewers below refer to the document in which the changes are highlighted.

    Point-by-point description of the revisions

    Reviewer #1 (Evidence, reproducibility and clarity (Required)):

    Summary: The authors inferred the pairwise epistasis through the Mutual Information provided by the spydrpick algorithm. They claim that the MIs could serve as a real-time identification of the epistatic interactions with the SARS-CoV-2 genomes due to the fast inference and high sensitivities.

    Major comments:

    1. The authors take a data-driven approach to infer the Mutual Information as the epistatic interactions between the mutations over different sites over SARS-CoV-2 genomes. However, it would be better to specify why this metric is reliable to be used as the representation of the pairwise epistatic interactions, and any theoretical explanations to support this.

    We agree that readers should be better informed on why MI can be used to estimate epistatic interactions from genomic data. We have therefore expanded the introduction (lines 93-98), methods (lines 540-543) and discussion (lines 453-457) sections to provide a proper theoretical and practical foundation on the use of a MI-based method. Furthermore, we have expanded the results section to add one additional in-silico validation (lines 244-249, Supplementary Figure 5, and updated Supplementary Figure 8) and an in-vitro one (Figure 5, see also reply to comment 2 from reviewer #2), which we believe give strong support to the MI-based method.
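
    As a minimal sketch of the quantity under discussion, the snippet below computes the mutual information between two alignment columns, with optional per-sequence weights. It is not the spydrpick implementation: the function name, the toy columns and the absence of pseudocounts or gap handling are all assumptions made for illustration. It also only measures co-variation, which, as both reviewers point out, is not by itself proof of epistasis.

```python
import math
from collections import defaultdict


def mutual_information(col_a, col_b, weights=None):
    """Weighted mutual information (in nats) between two alignment columns.

    Illustrative only: no pseudocounts, no gap handling.
    """
    if weights is None:
        weights = [1.0] * len(col_a)
    total = sum(weights)
    joint = defaultdict(float)
    marg_a = defaultdict(float)
    marg_b = defaultdict(float)
    # Accumulate weighted joint and marginal frequencies.
    for a, b, w in zip(col_a, col_b, weights):
        p = w / total
        joint[(a, b)] += p
        marg_a[a] += p
        marg_b[b] += p
    # MI = sum over observed pairs of p(a,b) * log(p(a,b) / (p(a) * p(b)))
    return sum(
        p_ab * math.log(p_ab / (marg_a[a] * marg_b[b]))
        for (a, b), p_ab in joint.items()
    )


# Toy columns (invented): two sites that always change together ...
linked = mutual_information("AAAATTTT", "GGGGCCCC")
# ... versus two sites that vary independently of each other.
independent = mutual_information("AATTAATT", "GGGGCCCC")
```

    In this toy case the perfectly linked columns reach the maximum possible value for two equally frequent alleles (log 2), while the independent columns give zero.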

    2. The authors claimed that the DCA method requires more computational resources and more time to complete. However, with a proper filtering procedure, the computational time could be reduced heavily. An example is Physical Review E 106 (4), 044409, 2022, in which the DCA was used to investigate the real-time pair-wise interactions (month-to-month). There the DCA results were compared with the correlation analysis. It would be nice to have comparisons of the inferred interactions between MIs and other methods.

    We agree that our MI-based approach should be compared against DCA-based methods. The original manuscript had in fact one such comparison (for the 2023-03 dataset, Figure 3C), which indicated a strong correlation between the two methods. To make this result more robust we have computed the DCA values for the complete time-series dataset and measured the correlation with the MI values (Supplementary Figure 4).

    We observed a relatively high correlation in estimated values between the two methods, with the exception of three time points, i.e., 2020-11, 2023-02 and 2023-03. We can explain these lower correlations with the low overall sequence diversity observed in the early phase of the pandemic (2020-11) and with the different weighting scheme of our approach, which would significantly alter the dataset when compared to the one used by the DCA method, especially towards the later timepoints (see also the reply to reviewer #2, comment 3, section iv). When those three timepoints are excluded, the two methods show a high degree of correlation, implying that they are comparably suitable in detecting coevolutionary signals.

    We have also used the 2nd order coefficients derived from experimental data in Moulana et al., 2022 (10.1038/s41467-022-34506-z) to validate both approaches (see methods, lines 624-631).

    The panels that we have combined to create the new Supplementary Figure 5 indicate how both approaches (MI for panels A and C, and DCA for panels B and D) correctly recover the interactions with a 2nd order epistatic coefficient > 0.15, based on the odds-ratio metric. Our MI-based approach has, however, a higher recall across multiple time points, which is especially visible when comparing panels A and B. The DCA-based method did correctly identify known epistatic interactions, but did so only at sporadic time points, even though the distribution of the underlying variants did not change significantly from month to month. We believe that the higher recall of the MI-based method is of higher value for genomic epidemiology, at least for SARS-CoV-2.

    3. In Figure 1C, the authors show that their spydrpick algorithm provides more pairwise MIs for longer distances, where the outliers are denser than those with short distances. How do we explain this phenomenon?

    We thank the reviewer for bringing this point up; we actually think that our data show the opposite, meaning that we observe a higher proportion of close interactions when normalizing by the number of possible interactions. If we take an arbitrary distance threshold of 1,000 bases to define "close" vs. "distant" interactions, we observe 194 and 280 interactions, respectively. It is true that distant interactions are more numerous in absolute terms, but the space of possible interactions is orders of magnitude larger for "distant" interactions, simply because there are more sites from which interactions can originate. As a crude estimate we can use the combinations between 1,000 sites (499,500 possible interactions) vs. those between 28,903 sites (the full SARS-CoV-2 genome length of 29,903 bp minus 1,000; 417,677,253 possible interactions). Based on these estimates, the fraction of possible interactions that we observe is far higher for "close" than for "distant" pairs.
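
    The crude estimate in this reply can be reproduced with a few lines of arithmetic. The sketch below uses only the numbers quoted above; the variable names are ours, and the calculation inherits the same simplification (pairs among 1,000 sites vs. pairs among the remaining 28,903 sites).

```python
from math import comb

GENOME_LENGTH = 29_903   # SARS-CoV-2 genome length in bp, as quoted above
THRESHOLD = 1_000        # arbitrary "close" vs. "distant" cutoff, as quoted above
observed_close = 194
observed_distant = 280

# Number of possible site pairs under the crude estimate.
possible_close = comb(THRESHOLD, 2)                     # 499,500
possible_distant = comb(GENOME_LENGTH - THRESHOLD, 2)   # 417,677,253

# Fraction of possible pairs that were actually flagged as interacting.
rate_close = observed_close / possible_close
rate_distant = observed_distant / possible_distant
ratio = rate_close / rate_distant

print(f"close:   {rate_close:.2e}")
print(f"distant: {rate_distant:.2e}")
print(f"ratio:   {ratio:.0f}x")
```

    Under this estimate the per-pair rate of "close" interactions is several hundred times that of "distant" ones (roughly 580-fold), even though the absolute count of "distant" interactions is higher.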

    Minor comments:

    4. The explanations of Fig. 1E could be in more detail. Say, the grey dots in Fig. 1E, which is marked as "other" and such "other"s are dominated here. Why?

    We thank the reviewer for pointing out a section where more clarity was needed. We have added the following sentence to the figure legend: "The category "other" indicates positions which are not known to have an impact on affinity to ACE2, immune escape or otherwise flagged as MOI/MOC.". This indicates that predicted interactions involving a site classified as "other" are either false positives or previously undiscovered interactions.

    5. On line 210, the authors mentioned that the weights of the old sequences are lower "at around six months (120 days)". It would be better to specify why six months is 120 days instead of 180 days.

    We have corrected this mistake and indicated 4 months. We thank the reviewer for spotting this error.

    Referees cross-commenting

    I agree with what Reviewer #2 presented in the Consults Comments. The authors should present the reasons why MIs can be explained as the epistatic interactions between sites, as both of us mentioned this point. I checked the other revision points that were raised by Reviewer #2. They would definitely be helpful for enhancing the quality of the manuscript.

    Reviewer #1 (Significance (Required)):

    The work in the current manuscript is interesting and presented nicely. However, the theoretical foundations that the MIs could be explained as epistatic interactions should be illustrated. Otherwise, the tools would be useful for SARS-CoV-2 and other potential pandemics caused by different viruses.

    Reviewer #2 (Evidence, reproducibility and clarity (Required)):

    The manuscript proposes an approach to identify epistatic interactions in the SARS-CoV-2 genome using the large amount of genomic data which accumulated during the COVID pandemics. They argue that due to a relatively low computational cost, this can be done online in any ongoing pandemics nowadays (i.e. in the situation where the viral spreading and evolution are closely monitored by massive sequencing). In principle, this is interesting, but in my opinion the manuscript has some strong problems and will require major rewriting:

    1. In difference to the claims of the manuscript, detected correlation does not necessarily imply epistatic couplings:
    • Even in a totally neutral setting, mutations may occur by chance together, and expand due to genetic drift or when encountering a susceptible population. Equally, two independent mutations may spread in different geographic regions, without the double mutant ever arising. Both cases lead to non-zero mutual information.
    • In evolution, frequently driver and passenger mutations are observed, in particular in settings of relatively high mutation rate. The passenger will rise in frequency with the driver, without any epistatic coupling.
    • The very unequal sequencing across geographic areas will enhance certain variants and leave others undetected. Even if the authors avoid double counting of identical sequences, more small variation is detected when sequencing deeper. The Omicron variant illustrates an extreme case here: it combined a large number of mutations, never detected before, but epistasis is not the most likely explanation, but rather lack of monitoring of the evolutionary path from the ancestral variants to Omicron.
    • MI has been criticised because it overestimates the effect of indirect correlations, in particular in dense epistatic networks. The situation in the spike protein in Fig. 1B seems very dense.

    Currently the manuscript does not make any effort to disentangle any of these effects.

    Following the comments from this reviewer (and reviewer 1), we have made a number of changes to the manuscript in order to provide more context on how MI can be used to estimate epistatic interactions and on the inherent limitations of this approach. In particular, we have expanded the introduction (lines 93-98), methods (lines 540-543) and discussion (lines 453-457) sections in a way that we believe exposes the limitations of the approach. Despite these limitations, we still believe that a MI-based approach strikes a good balance between speed, ease of implementation, and sensitivity. To further demonstrate this point we have added two additional validations to our results: the first one (in-silico) uses estimated 2nd order epistatic coefficients derived from experimental data (Moulana et al., 2022, 10.1038/s41467-022-34506-z), and the second (in-vitro) uses our own experimental data on three predicted interactions. The results of the new in-silico validation have been described in the reply to comment #2 from reviewer 1; in short, they show how the MI-based method has comparable sensitivity and specificity to the DCA-based method, and most importantly how it allows the recovery of known epistatic interactions across the time period in which they have appeared. The results of the in-vitro validation are discussed in the reply to the next comment from this reviewer, as they directly address the predictive power of our approach: in short, we show how we could also validate these predictions. We think that these new results clearly show how, despite its limitations, the MI-based approach is able to identify bona-fide epistatic interactions, with the advantage of being simple to implement and of being able to run in real time. For a more detailed discussion of the merits of the MI-based approach over DCA, see the reply to comment #3 from this reviewer.

    2. What are the predictive capacities of the approach? Mutual information is bounded from above by the individual site entropies. So high MI can be detected only in highly mutated sites - i.e. in sites for sure already under monitoring. In fact, the sites in Fig. 1B with many links reflect the overall profile of variant frequencies in single sites (i.e. a totally non-epistatic measure) available on Nextstrain, and extracted from the same data sources.

    The discussion of the results is very anecdotal and it is not clear to me in how far there is any real prediction in the paper, which might surprise and trigger observation or further analyses.

    There is an entire line of related research in estimating and exploiting epistatic couplings in HIV evolution (A Chakraborty, M. Kardar, J. Barton, M MacKay and others) - not cited in the manuscript but relevant for the question how to detect epistatic couplings and what they are good for.

    We thank the reviewer for pointing out relevant literature we had not covered in the original manuscript, and which can be used to indicate how epistatic interaction signals can be leveraged when studying viruses. We have added citations to these studies in the introduction (lines 76-78) to provide a better background for our own study. Regarding the broader concern of showing the predictive power of our approach, we had a similar concern after the manuscript was submitted, and we had already planned a "blind" in-vitro validation to put our approach to the test. In order to make this validation as "blind" as possible, we expanded the dataset to include sequences until August 2023. We then selected interactions within the spike RBD with confidence level O4 in at least the last 4 time points and with one position already flagged as either "affinity", "escape" or "other MOI/MOC".

    We then selected the top three interactions (446-460, 446-486 and 452-490) for our validation, as they have an outlier confidence of O4 in at least the last 4 time points, and a lower confidence or no prediction before. We also added the known 498-501 interaction as a control (Figure 5, panel B).

    We then focused on selecting a set of non-synonymous substitutions to test for their potential epistatic interactions. We decided to select 6 substitutions affecting the 3 predicted interactions based on their frequency in the time points after the cutoff of the original manuscript, shown in Figure 5, panel C.

    Of those, L452R/F490S and G446S/F486V are anti-correlated in their frequency and virtually never observed together in our dataset, G446S/F486S is observed at low frequency (87 samples after 2023-05), and G446S/N460H is virtually never observed (5 samples). We chose the anti-correlated pairs to test the potential of the MI method to explain this "avoidance" phenomenon, and the low frequency pairs as a way to test an early warning system for mutation signatures that might rise in the future. We then planned to test the impact of the individual variants and of the double variants, both in the wild-type background and in the Q498R/N501Y background as a crude model for the Omicron variant.

    We then used a pseudovirus assay to test mutated RBDs across two phenotypes: infectivity (i.e. the ability to infect Vero B4 cells) and immune escape (i.e. antibody neutralization curves). We then tested for the presence of epistatic interactions for the double mutants in both backgrounds using a simple linear model (see Methods, lines 711-727). The results of these in-vitro assays are summarized below (Figure 5, panel E for infectivity, F for immune escape).
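
    To make the idea of testing for epistasis with a simple linear model concrete, the sketch below estimates the interaction term of a model of the form y = b0 + bA*A + bB*B + bAB*A*B, where A and B indicate the presence of each substitution. With a complete 2x2 design this coefficient reduces to a difference of group means. The model form is our assumption (the actual model is described in the Methods of the manuscript), the phenotype values are invented, and no significance test is performed.

```python
from statistics import mean


def interaction_term(wild_type, single_a, single_b, double):
    """Interaction coefficient bAB of y = b0 + bA*A + bB*B + bAB*A*B.

    For a full 2x2 design the least-squares estimate equals
    mean(double) - mean(single_a) - mean(single_b) + mean(wild_type).
    A value of zero means the double mutant behaves additively.
    """
    return mean(double) - mean(single_a) - mean(single_b) + mean(wild_type)


# Hypothetical replicate measurements (e.g. log-scaled infectivity
# relative to the background); all values are invented for illustration.
wild_type = [0.0, 0.1, -0.1]
single_a = [0.5, 0.6, 0.4]
single_b = [0.3, 0.2, 0.4]
double = [-0.7, -0.6, -0.8]

# What the double mutant would look like if the effects simply added up.
additive_expectation = mean(single_a) + mean(single_b) - mean(wild_type)
epistasis = interaction_term(wild_type, single_a, single_b, double)
```

    In this invented example the additive expectation is positive while the observed double mutant is strongly negative, so the interaction term is large and negative, the kind of pattern that would be consistent with two substitutions "avoiding" each other.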

    Double mutants with a significant (p-value -10) interaction have been highlighted with an asterisk. We confirmed the epistatic interaction for the Q498R/N501Y pair, both for its effect on infectivity and on immune escape. For both anti-correlated pairs we found a significant interaction in the infectivity assay, and for one of them (G446S/F486V) also in the immune escape assay. In particular, we found that the G446S/F486V pair induced a large drop in infectivity in the Q498R/N501Y background, while the immune escape profile of the double mutant was fairly similar to that of the single G446S variant, thus compensating for the loss of escape shown by the F486V variant alone. We observed the opposite for the L452R/F490S pair in terms of infectivity, with the pair showing a large increase in infectivity in the Q498R/N501Y background, an effect we found to be significant. The double mutant had a slightly better immune escape profile than the single mutants, although the difference was not significant. From these observations we can hypothesize that the G446S/F486V pair is anti-correlated because of its strong defect in infectivity; we cannot apply the same reasoning to the L452R/F490S pair, whose absence from circulating variants could be ascribed to stochasticity in population dynamics or to interactions with other variants. We observed a similar impact on infectivity for the G446S/F486S and G446S/N460H pairs as for G446S/F486V; based on these results we could estimate that variants carrying these pairs might have a fitness disadvantage. The inability of unsupervised methods (MI or DCA based) to predict the direction of the effect of course makes it difficult to decide which of the two pairs should be added to a "watchlist", but such methods would potentially reduce the number of interactions to be tested. We believe that the results of this admittedly small-scale in-vitro validation demonstrate the potential of the MI-based approach to flag emerging interactions worthy of further study. Recent advances in the scalability of molecular assays (e.g. 10.1101/2024.03.08.584176) could then be coupled with a real-time system such as the one we describe in our manuscript to filter out the more relevant interactions. We have added this forward-looking observation to the discussion as well (lines 465-474).

    3. The authors say that more involved methods like the Direct Coupling Analysis with Pseudolikelihood maximisation would be too slow for the analysis, but several papers show the contrary. The paper by Zeng et al. (Ref. [39]) does so very early in the pandemics in 2020, and another uncited paper of the same authors (Physical Review 2022) uses a nearly identical approach to study the time evolution of epistatic couplings (extractions from Gisaid at several times). As one of their results, they show that their approach is not only feasible, but delivers more stable results than simpler correlation measures like MI.

    We thank the reviewer for pointing out a relevant reference we had missed in the initial manuscript. At a general level Zeng et al. take a similar approach to what we have described, namely to divide the data according to the isolation date to look for temporal trends. We however see a few differences that we think are in favor of the approach we describe:

    1- Our manuscript covers the time period after the emergence of the Omicron variant, in which epistatic interactions are known and have been characterized and validated experimentally, a crucial requirement for validation. We have also conducted an in-vitro validation on a selected set of predicted interactions (see the reply to the previous comment), which indicates that the method is sound and predictive.

    2- We have prepared a cumulative time-series dataset, meaning that each month introduces new sequences on top of the ones already selected from the previous time points. To the best of our knowledge the Zeng et al. dataset has "insulated" sequences at each month. We believe our approach has the advantage of allowing for a higher recall, as it includes a representation of extinct lineages, which may increase diversity at key loci and thus boost the signal. As described in the original manuscript and in the reply to this reviewer's comments "iv" and "v", we have added a weighting scheme in order to reduce the influence of older sequences and increase the relevance of smaller lineages.

    3- While we have not tested the DCA implementation used by Zeng et al., and we cannot therefore directly comment on its scalability, we have encountered serious limitations when scaling up the popular plmc C implementation developed by the lab of Deborah Marks. In particular we were unable to successfully run it for datasets with more than ~300k sequences, encountering segmentation faults.

    Regarding the third point, while this meant that we could not test the DCA approach on the full dataset, we could still manage to apply it to the time-series data, focusing exclusively on the spike (S) gene. As shown above in the reply to reviewer 1's comment #2, the two methods have a high correlation and are both able to recover known interactions, although the DCA method has a lower recall. Taken together, we believe that the MI-based approach we describe is robust enough to be considered when a tradeoff between speed, ease of implementation and sensitivity has to be struck, which we believe may be the case for a rapid response during a potential future pandemic. We have added more details to the part of the discussion in which the comparison with the DCA-based methods was made, to point out that those methods are still feasible with very large collections of sequences (lines 444-448).

    It would therefore be essential that the authors strongly revise their manuscript to show the reliability of the results, the predictive value of the predicted couplings, and the originality and robustness of the approach.

    We believe that our responses to both reviewers have addressed these concerns, and as a result we have provided a more nuanced view on the use of MI-based methods in the prediction of epistatic interactions in pandemic viruses. Our wording has been modified to make sure that readers interested in replicating our approach are aware of its strengths (speed, ease of implementation) and limitations.

    Furthermore, there are some minor issues in the formulations, which should be corrected

    i) "the virus has differentiated into a number of lineages, almost all of which have taken over the whole population..." This is wrong. SARS-CoV-2 has always been very heterogeneous, with diverse variants circulating (the authors use millions of non-redundant sequences), and only very few have become VOIs or VOCs at some point. This image of competition between multiple coexisting strains is much closer to clonal interference than what the authors describe (even if clonal interference does not rely on population structure, which has always been an important element in COVID).

    We thank the reviewer for pointing out this error in our observation. We have changed "almost all" to "some", which we agree is more accurate.

    ii) The authors say that pseudolikelihood methods would require "aggressive subsampling". This is not true, in machine learning massive training data are frequently used in the context of batch learning, i.e. in each learning epoch a "batch" is sampled from the full data. This leads to stochasticity in learning, but all data are eventually used.

    We have reformulated this sentence (lines 85-90) to indicate how batch learning could also be used to make certain methods scalable, with the caveat that they would be more complicated to implement.

    iii) The authors say that they download also a phylogenetic tree, but I do not see where it is used.

    As indicated in the methods section, we have used the phylogenetic tree for two purposes:

    1- To single out high quality sequences from the raw MSA (line 515)

    2- To compute the weight of each sequence in the final MSA, as described in lines 540-549

    iv)The authors use sequence weights as implemented in Ref. [31]. There a weighting at sequence similarity threshold of 90% is used. I would expect that there are no SARS-CoV-2 genomes having accumulated more than 10% of nucleotide mutations, i.e. the weighting procedure would be without any effect.

    We realized that the sequence weighting scheme we have used is not described in Pensar et al. (10.1093/nar/gkz656), but rather in the implementation of the spydrpick algorithm used by the panaroo software (Tonkin-Hill et al., 10.1186/s13059-020-02090-4). This weighting scheme is based on the more granular metric that is the patristic distance of each sequence from the root of the tree, divided at each branching point by the number of its terminal leaves. In practical terms this means that sequences belonging to smaller lineages (i.e. with fewer observed samples) will have a larger weight, regardless of a discrete sequence similarity threshold, as was done in the original implementation. We have updated the methods section to clearly indicate that the weighting scheme is that first shown in the panaroo software package (line 543).
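
    One possible reading of this verbal description is sketched below on a toy tree: each branch length on the path from the root to a sequence is divided by the number of terminal leaves below that branch, and the shares are summed. This is our interpretation only, not the panaroo code; the tree encoding, the function names and the branch lengths are all hypothetical.

```python
def count_leaves(node):
    """Number of terminal leaves below (and including) a node."""
    _, content = node
    if isinstance(content, str):   # a leaf stores its sequence name
        return 1
    return sum(count_leaves(child) for child in content)


def leaf_weights(node, inherited=0.0, weights=None):
    """Sum, along the root-to-leaf path, of branch length / leaves below.

    A node is a tuple (branch_length, name) for a leaf or
    (branch_length, [children]) for an internal node.
    count_leaves is recomputed at every node, which is fine for a toy tree.
    """
    if weights is None:
        weights = {}
    length, content = node
    share = inherited + length / count_leaves(node)
    if isinstance(content, str):
        weights[content] = share
    else:
        for child in content:
            leaf_weights(child, share, weights)
    return weights


# Hypothetical tree: a densely sampled clade (s1-s4) and a lone lineage (s5).
tree = (0.0, [
    (1.0, [(0.1, "s1"), (0.1, "s2"), (0.1, "s3"), (0.1, "s4")]),
    (1.0, "s5"),
])
weights = leaf_weights(tree)
```

    In this toy example the lone sequence s5 keeps the whole length of its branch, whereas the four sequences in the dense clade share theirs, so s5 ends up with a weight almost three times larger. No sequence similarity threshold is involved.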

    v)The authors estimate that they need 10,000-100,000 sequences to estimate MI, but find the epistatic coupling in spike residues 498-501 as soon as 6 double mutants are present, which is a frequency of about 1e-4. The corresponding entropies should be low and in consequence the MI, too.

    We thank the reviewer for raising this point, which prompted us to devise a way to better illustrate the sequence weighting scheme we have used. As a side note we also discovered that the number of Omicron sequences at the 2021-11 was actually 7, and not 6 as stated throughout the original manuscript, an error we have now fixed. As described in the methods section we have combined two weights in the time-series analysis: the first one, described in the response to the previous comment, is based on the "density" of the phylogenetic tree, which deflates the contribution of "denser" regions of the tree, and the second reduces the relevance of older sequences. The two weights are then combined multiplicatively. As a result the "real" (i.e. effective) number of sequences harboring a particular double mutation will be different than by just counting their occurrences.

    As shown in Supplementary Figure 3, the combination of both weights (first column) leads to an increased effective number of sequences for "younger" samples and those that come from "sparser" regions of the overall phylogenetic tree. This is particularly evident for the middle row (2021-11); the light orange dot, which indicates sequences belonging to the first Omicron lineage to appear in the dataset (BA.1), has an actual N of 7, but an effective N of ~100 (exact value 86), thanks to its "novelty" both in the tree (middle panel) and in terms of time (right panel). We again thank the reviewer for raising this point, which led us to generate this visualization, which will hopefully clarify the rationale for the weighting strategy we have used for most readers.
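
    The difference between the actual and the effective number of sequences can be illustrated with a short sketch in which the two weights are multiplied per sequence and then summed over the carriers of a double mutation. The record layout and every weight value below are invented for illustration; they are not taken from the manuscript, and the actual weight definitions and normalization are those given in its Methods.

```python
def effective_n(records):
    """Sum of combined (tree x time) weights over double-mutant sequences."""
    return sum(
        r["tree_weight"] * r["time_weight"]
        for r in records
        if r["double_mutant"]
    )


# Hypothetical records; all weight values are invented.
records = (
    # three recent sequences from a sparsely sampled part of the tree
    [{"double_mutant": True, "tree_weight": 4.0, "time_weight": 1.0}] * 3
    # five older sequences from a densely sampled lineage
    + [{"double_mutant": False, "tree_weight": 0.2, "time_weight": 0.5}] * 5
)

raw_n = sum(r["double_mutant"] for r in records)   # plain count of carriers
weighted_n = effective_n(records)                  # weighted count of carriers
```

    Here three carriers count as twelve once the weights are applied, which is the same qualitative effect as the BA.1 example above, where 7 sequences correspond to an effective N of 86.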

    vi)The authors say that the public health toll of COVID has been "balanced" by scientific discovery - I would urge the authors to avoid such formulations, which sound cynical.

    We agree with the reviewer that this comment might sound cynical and tone-deaf, and have reformulated to indicate that the impact of the pandemic has coincided with an accelerated pace of applied scientific discovery.

    Referees cross-commenting

    Both reports bring up very similar points (points 1 of both reports, point 2 of Reviewer #1 vs. my point 3) but add partially complementary questions (point 3 of Reviewer #1, my point 2), both related to the interpretation of the data. My report is more severe, but reading the ms I am convinced that the paper requires serious revision. So the reports seem coherent but with different degrees of recommendations. However, none of the comments of one reviewer is in contradiction to the other reviewer.

    Reviewer #2 (Significance (Required)):

    While the paper asks interesting questions and wants to make use of the quite unique data which have accumulated during the COVID pandemics, the above mentioned problems raise important questions about the manuscript. It would be essential that the authors strongly revise their manuscript to show the reliability of the results, the predictive value of the predicted couplings, and the originality and robustness of the approach.

  2. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.


    Referee #2

    Evidence, reproducibility and clarity

    The manuscript proposes an approach to identify epistatic interactions in the SARS-CoV-2 genome using the large amount of genomic data which accumulated during the COVID pandemics. They argue that due to a relatively low computational cost, this can be done online in any ongoing pandemics nowadays (i.e. in the situation where the viral spreading and evolution are closely monitored by massive sequencing). In principle, this is interesting, but in my opinion the manuscript has some strong problems and will require major rewriting:

    1. In difference to the claims of the manuscript, detected correlation does not necessarily imply epistatic couplings:
    • Even in a totally neutral setting, mutations may occur by chance together, and expand due to genetic drift or when encountering a susceptible population. Equally, two independent mutations may spread in different geographic regions, without the double mutant ever arising. Both cases lead to non-zero mutual information.
    • In evolution, frequently driver and passenger mutations are observed, in particular in settings of relatively high mutation rate. The passenger will rise in frequency with the driver, without any epistatic coupling.
    • The very unequal sequencing across geographic areas will enhance certain variants and leave others undetected. Even if the authors avoid double counting of identical sequences, more small variation is detected when sequencing deeper. The Omicron variant illustrates an extreme case here: it combined a large number of mutations, never detected before, but epistasis is not the most likely explanation, but rather lack of monitoring of the evolutionary path from the ancestral variants to Omicron.
    • MI has been criticised because it overestimates the effect of indirecrt correlations in particular in dense epistatic networks. The situation in the spike protein in Fig. 1B seems very dense.

    Currently the manuscript does not make any effort to disentangle any of these effects.
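
    The first bullet of this point can be illustrated with a minimal sketch. It assumes hypothetical counts and a plain plug-in MI estimator on unweighted sequences, not the SpydrPick pipeline used in the manuscript. Two mutations that each spread in a different region, with no double mutant and no fitness interaction, still give a large MI once the samples are pooled:

```python
import math
from collections import Counter


def mutual_information(pairs):
    """Plug-in MI (in bits) between two alignment columns.

    `pairs` is a list of (state_at_site_x, state_at_site_y) observations.
    """
    n = len(pairs)
    joint = Counter(pairs)
    count_x = Counter(x for x, _ in pairs)
    count_y = Counter(y for _, y in pairs)
    mi = 0.0
    for (x, y), c in joint.items():
        # p(x, y) * log2( p(x, y) / (p(x) * p(y)) ), written with counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi


# Hypothetical pooled sample (1 = mutated, 0 = ancestral).
# Region 1: every sequence carries the mutation at site X only.
# Region 2: every sequence carries the mutation at site Y only.
region_1 = [(1, 0)] * 500
region_2 = [(0, 1)] * 500
mi_structured = mutual_information(region_1 + region_2)

# Same single-site frequencies (50% each), but the sites are independent.
independent = [(1, 1)] * 250 + [(1, 0)] * 250 + [(0, 1)] * 250 + [(0, 0)] * 250
mi_independent = mutual_information(independent)

print(f"structured population: {mi_structured:.3f} bits")
print(f"independent sites:     {mi_independent:.3f} bits")
```

    In this toy case the signal comes entirely from population structure, which is why a correction for geography or phylogeny would be needed before reading MI as epistasis.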

    2. What are the predictive capacities of the approach? Mutual information is bounded from above by the individual site entropies, so high MI can be detected only at highly mutated sites, i.e. at sites that are certainly already under monitoring. In fact, the sites in Fig. 1B with many links reflect the overall profile of single-site variant frequencies (i.e. a totally non-epistatic measure) available on Nextstrain and extracted from the same data sources.

    The discussion of the results is very anecdotal, and it is not clear to me to what extent the paper contains any real prediction that might surprise and trigger observation or further analyses. There is an entire line of related research on estimating and exploiting epistatic couplings in HIV evolution (A. Chakraborty, M. Kardar, J. Barton, M. McKay and others), which is not cited in the manuscript but is relevant to the question of how to detect epistatic couplings and what they are good for.
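
    The entropy bound invoked in this point follows in one line from the definition of mutual information. As a worked sketch, assume each site is reduced to a binary ancestral/mutant variable with mutant frequencies $p_X$ and $p_Y$ (a simplification, since alignment columns can have more than two states):

```latex
\begin{align}
I(X;Y) &= H(X) - H(X \mid Y) = H(Y) - H(Y \mid X) \\
       &\le \min\{H(X),\, H(Y)\}, \\
H(X)   &= -p_X \log_2 p_X - (1 - p_X)\log_2(1 - p_X).
\end{align}
```

    Because $H(X) \to 0$ as $p_X \to 0$, a pair involving a rarely mutated site cannot reach a high MI value, however strongly the two sites are linked.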

    3. The authors say that more involved methods, such as Direct Coupling Analysis with pseudolikelihood maximisation, would be too slow for the analysis, but several papers show the contrary. The paper by Zeng et al. (Ref. [39]) does so very early in the pandemic, in 2020, and another uncited paper by the same authors (Physical Review 2022) uses a nearly identical approach to study the time evolution of epistatic couplings (extractions from GISAID at several times). As one of their results, they show that their approach is not only feasible but also delivers more stable results than simpler correlation measures like MI.

    It is therefore essential that the authors substantially revise their manuscript to show the reliability of the results, the predictive value of the predicted couplings, and the originality and robustness of the approach.

    Furthermore, there are some minor issues in the formulations, which should be corrected:

    i) "the virus has differentiated into a number of lineages, almost all of which have taken over the whole population..." This is wrong. SARS-CoV-2 has always been very heterogeneous, with diverse variants circulating (the authors use millions of non-redundant sequences), and only very few have become VOIs or VOCs at some point. This image of competition between multiple coexisting strains is much closer to clonal interference than what the authors describe (even if clonal interference does not rely on population structure, which has always been an important element in COVID).

    ii) The authors say that pseudolikelihood methods would require "aggressive subsampling". This is not true: in machine learning, massive training datasets are frequently used in the context of batch learning, i.e. in each learning epoch a "batch" is sampled from the full data. This leads to stochasticity in learning, but all data are eventually used.

    iii) The authors say that they also download a phylogenetic tree, but I do not see where it is used.

    iv) The authors use sequence weights as implemented in Ref. [31], where a sequence-similarity threshold of 90% is used for the weighting. I would expect that no SARS-CoV-2 genomes have accumulated more than 10% nucleotide mutations, i.e. the weighting procedure would have no effect.

    v) The authors estimate that they need 10,000-100,000 sequences to estimate MI, but find the epistatic coupling between spike residues 498 and 501 as soon as 6 double mutants are present, which corresponds to a frequency of about 1e-4. The corresponding entropies should be low, and in consequence the MI, too.

    vi) The authors say that the public health toll of COVID has been "balanced" by scientific discovery. I would urge the authors to avoid such formulations, which sound cynical.
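
    The arithmetic behind minor point v) can be checked with a short sketch. The numbers are assumptions: 6 double mutants at a frequency of about 1e-4 implies roughly 60,000 sequences, and the best case for MI is that neither mutation occurs outside those 6 sequences. The estimator is a plain plug-in MI on unweighted counts, so it ignores the date-based weighting described in the manuscript:

```python
import math


def mi_from_counts(n11, n10, n01, n00):
    """Plug-in MI (in bits) from a 2x2 table of counts.

    n11: both sites mutated, n10: only site A, n01: only site B, n00: neither.
    """
    n = n11 + n10 + n01 + n00
    table = {(1, 1): n11, (1, 0): n10, (0, 1): n01, (0, 0): n00}
    row = {1: n11 + n10, 0: n01 + n00}  # marginal counts for site A
    col = {1: n11 + n01, 0: n10 + n00}  # marginal counts for site B
    mi = 0.0
    for (a, b), c in table.items():
        if c:  # empty cells contribute nothing
            mi += (c / n) * math.log2(c * n / (row[a] * col[b]))
    return mi


# Assumed dataset size, inferred from "6 double mutants at about 1e-4".
N = 60_000
DOUBLE_MUTANTS = 6

# Best case: the two mutations are perfectly linked and otherwise absent.
mi_rare = mi_from_counts(DOUBLE_MUTANTS, 0, 0, N - DOUBLE_MUTANTS)

# Reference point: two perfectly linked sites, each mutated in half the data.
mi_max = mi_from_counts(N // 2, 0, 0, N // 2)

print(f"MI with 6 linked double mutants: {mi_rare:.5f} bits")
print(f"MI for linked sites at 50%:      {mi_max:.5f} bits")
```

    Under these assumptions the rare pair yields an MI on the order of 1e-3 bits against a maximum of 1 bit for binary sites, which is the tension the point raises.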

    Referees cross-commenting

    Both reports bring up very similar points (point 1 of both reports, point 2 of Reviewer #1 vs. my point 3) but add partially complementary questions (point 3 of Reviewer #1, my point 2), both related to the interpretation of the data. My report is more severe, but having read the manuscript I am convinced that the paper requires serious revision. So the reports seem coherent but differ in the strength of their recommendations. None of the comments of one reviewer contradicts those of the other.

    Significance

    While the paper asks interesting questions and aims to make use of the unique data that accumulated during the COVID pandemic, the above-mentioned problems raise important questions about the manuscript. It is essential that the authors substantially revise their manuscript to show the reliability of the results, the predictive value of the predicted couplings, and the originality and robustness of the approach.

  3. Note: This preprint has been reviewed by subject experts for Review Commons. Content has not been altered except for formatting.



    Referee #1

    Evidence, reproducibility and clarity

    Summary: The authors inferred pairwise epistasis from the Mutual Information (MI) computed by the spydrpick algorithm. They claim that the MI values could serve for real-time identification of epistatic interactions within SARS-CoV-2 genomes, owing to the fast inference and high sensitivity.

    Major comments:

    1. The authors take a data-driven approach and use the inferred Mutual Information as a measure of the epistatic interactions between mutations at different sites across SARS-CoV-2 genomes. However, it would be better to specify why this metric is a reliable representation of pairwise epistatic interactions, and to give any theoretical explanation that supports this.
    2. The authors claim that the DCA method requires more computational resources and more time to complete. However, with a proper filtering procedure, the computational time can be reduced considerably. An example is Physical Review E 106 (4), 044409, 2022, in which DCA was used to investigate pairwise interactions in real time (month to month). There the DCA results were compared with a correlation analysis. It would be nice to have comparisons of the interactions inferred by MI and by other methods.
    3. In Figure 1C, the authors show that their spydrpick algorithm provides more pairwise MIs for longer distances, where the outliers are denser than those with short distances. How do we explain this phenomenon?

    Minor comments:

    4. Fig. 1E could be explained in more detail. For example, the grey dots in Fig. 1E, which are marked as "other", dominate the plot. Why?
    5. On line 210, the authors mention that the weights of the old sequences are lower "at around six months (120 days)". It would be better to specify why six months is taken as 120 days instead of 180 days.

    Referees cross-commenting

    I agree with what Reviewer #2 presented in the Consults Comments. The authors should present the reasons why MI values can be interpreted as epistatic interactions between sites, as both of us raised this point. I checked the other revision points raised by Reviewer #2. They would definitely help to enhance the quality of the manuscript.

    Significance

    The work in the current manuscript is interesting and nicely presented. However, the theoretical foundations for interpreting MI values as epistatic interactions should be laid out. Otherwise, the tools would be useful for SARS-CoV-2 and for other potential pandemics caused by different viruses.