Accurate and Fast Clade Assignment via Deep Learning and Frequency Chaos Game Representation

Abstract

Background

Since the beginning of the COVID-19 pandemic there has been an explosion of sequencing of the SARS-CoV-2 virus, making it the most widely sequenced virus in history. Several databases and tools have been created to keep track of genome sequences and variants of the virus; most notably, the GISAID platform hosts millions of complete genome sequences and is expanding every day. A challenging task is the development of fast and accurate tools that can distinguish between the different SARS-CoV-2 variants and assign them to a clade.

Results

In this paper, we leverage the Frequency Chaos Game Representation (FCGR) and Convolutional Neural Networks (CNNs) to develop an original method that learns how to classify genome sequences, which we implement in CouGaR-g, a tool for the clade assignment problem on SARS-CoV-2 sequences. On a testing subset of GISAID, CouGaR-g achieves a 96.29% overall accuracy, while a similar tool, Covidex, obtains a 77.12% overall accuracy. As far as we know, our method is the first to use Deep Learning and FCGR for intra-species classification. Furthermore, by using feature importance methods, CouGaR-g makes it possible to identify k-mers that match SARS-CoV-2 marker variants.

Conclusions

By combining FCGR and CNNs, we develop a method that achieves better accuracy than Covidex (which is based on Random Forest) for clade assignment of SARS-CoV-2 genome sequences, thanks in part to training on a much larger dataset, with comparable running times. Our method, implemented in CouGaR-g, is able to detect k-mers that capture relevant biological information distinguishing the clades, known as marker variants.

Availability

The trained models can be tested online by providing a FASTA file (with one or multiple sequences) at https://huggingface.co/spaces/BIASLab/sars-cov-2-classification-fcgr. CouGaR-g is also available at https://github.com/AlgoLab/CouGaR-g under the GPL.

Article activity feed

  1. Background

    This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac119), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    **Reviewer 1: Dominik Heider**

    The paper is well written, and the objectives are clear. The study is a very nice application of CGR in bioinformatics and shows the excellent performance of CGR-encoded data in combination with deep learning. I have a few things that should be addressed in a minor revision:

    1. Some very important studies have not been addressed in the related work part. For example, Touati et al. (pubmed:32645523) and Sengupta et al. (pubmed:32953249) compared SARS-CoV-2 with other coronaviruses based on CGR, and we (pubmed:34613360) used CGR in combination with deep learning for resistance predictions in E. coli.

    2. To me, it is unclear how accuracy was used in the model. Is it one class (i.e., clade) versus all others? If yes, accuracy might be misleading because of the high class imbalance. With such strong class imbalance, the Matthews correlation coefficient (MCC) has been shown to be more suitable.

    3. "The undersampled dataset was randomly split into train...". Why did you undersample? If the goal was to balance the data, that would justify using accuracy as a metric, but it discards a lot of valuable data. What about oversampling?

    4. Comparison with other tools: I wonder whether the good performance of your model is the result of deep learning or the CGR encoding. Please also provide the results for another ML model (besides SVM, e.g., random forests) to compare to, e.g., Covidex.
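The concern in point 2 can be illustrated with a minimal sketch. It assumes a one-clade-versus-rest setting; the labels `"X"` and `"other"`, the 95/5 class ratio, and the always-predict-majority classifier are hypothetical and are not taken from the paper.

```python
import math

def confusion_counts(y_true, y_pred, positive):
    """Count TP, FP, TN, FN for a one-vs-rest view of `positive`."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive and truth == positive:
            tp += 1
        elif pred == positive:
            fp += 1
        elif truth == positive:
            fn += 1
        else:
            tn += 1
    return tp, fp, tn, fn

def accuracy(tp, fp, tn, fn):
    return (tp + tn) / (tp + fp + tn + fn)

def mcc(tp, fp, tn, fn):
    """Matthews correlation coefficient; defined as 0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom

# Hypothetical imbalanced data: 95 sequences from other clades, 5 from clade "X".
y_true = ["other"] * 95 + ["X"] * 5
# A trivial classifier that never predicts the minority clade.
y_pred = ["other"] * 100

counts = confusion_counts(y_true, y_pred, positive="X")
acc = accuracy(*counts)
score = mcc(*counts)
print(f"accuracy={acc:.2f}  MCC={score:.2f}")
```

Here the trivial classifier reaches an accuracy of 0.95 while its MCC is 0, which is the kind of gap the reviewer is pointing at.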

    **Reviewer 2: Riccardo Rizzo**

    The authors propose a classification experiment based on Frequency Chaos Game Representation and deep learning. They exploit the outstanding performance of a ResNet as an image classification tool, together with the FCGR method, which represents a genome sequence as an image.

    The work seems good, although some major points should be clarified.

    First, it is unclear whether the performance index values came from a 5-fold validation procedure (5 because they said the split was 80-10-10) or from a one-shot experiment.
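The distinction can be sketched as follows. This is a minimal illustration, not the authors' procedure: the function name `split_80_10_10`, the seeds, and the placeholder sequence identifiers are all hypothetical. A one-shot experiment uses a single seeded split, whereas repeating the split with several seeds (or rotating folds) would let one report a mean and a spread for each metric.

```python
import random

def split_80_10_10(items, seed):
    """Shuffle with a fixed seed and cut into train/validation/test (80/10/10)."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * 8 // 10
    n_val = len(shuffled) // 10
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    test = shuffled[n_train + n_val:]
    return train, val, test

# Placeholder identifiers standing in for genome sequences.
sequence_ids = [f"seq_{i}" for i in range(100)]

# One-shot experiment: a single split.
train, val, test = split_80_10_10(sequence_ids, seed=0)

# Repeated experiment: one split per seed, each evaluated separately.
repeats = [split_80_10_10(sequence_ids, seed=s) for s in range(5)]
print(len(train), len(val), len(test), len(repeats))
```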

    Second, the part that involves the frequent k-mers and the SVM should be better explained. The authors should clarify what the meaning of this comparison is.

    Another point to clarify is the quality of the sequences used; the authors worked on complete sequences, but, as far as I know, in the real world virus sequences are noisy data, and the authors should discuss this point.

    Minor points:

    • The authors said that a sequence is a string $s \in \{A, C, G, T, N\}^*$, so they should explain the procedure used in Definition 2, where only 4 symbols seem to be used. If they discard the N, or consider 4 k-mers (consider that N means "any symbol"), they should say it clearly.
    • Figures 1 and 2 report two different quantities but say the same thing; maybe one of them can be omitted.
    • The authors should add some details about the training time of the network.
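The first minor point, on the symbol N, can be made concrete with a small sketch of an FCGR count matrix. It assumes one possible policy, namely skipping every k-mer that contains a symbol outside A, C, G, T; the paper's actual handling of N is exactly what the reviewer asks the authors to state. The corner assignment in `CORNER` is also an assumption, since published CGR implementations differ on it.

```python
# Assumed corner convention as (x_bit, y_bit); other conventions exist.
CORNER = {"A": (0, 0), "C": (0, 1), "G": (1, 1), "T": (1, 0)}

def kmer_to_cell(kmer):
    """Map a k-mer to (row, col) in a 2^k x 2^k grid, or None if it has a non-ACGT symbol."""
    row = col = 0
    # In CGR the last base selects the coarsest quadrant, so it is processed first.
    for base in reversed(kmer):
        if base not in CORNER:
            return None
        x_bit, y_bit = CORNER[base]
        col = (col << 1) | x_bit
        row = (row << 1) | y_bit
    return row, col

def fcgr(sequence, k):
    """Count k-mer occurrences into a 2^k x 2^k grid, skipping k-mers with N (assumed policy)."""
    size = 2 ** k
    grid = [[0] * size for _ in range(size)]
    for i in range(len(sequence) - k + 1):
        cell = kmer_to_cell(sequence[i:i + k])
        if cell is None:
            continue  # k-mer contains N or another ambiguous symbol
        row, col = cell
        grid[row][col] += 1
    return grid

# Toy sequence: 7 overlapping 2-mers, of which "TN" and "NA" are skipped.
grid = fcgr("ACGTNACG", k=2)
total = sum(sum(row) for row in grid)
print(total)
```

The alternative mentioned by the reviewer, expanding each N into the four possible bases, would instead spread counts over several cells, which changes the resulting image.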

    A final suggestion: it would probably be interesting to use the same deep network with transfer learning (the whole network or just the first sections) to evaluate the gain from ad-hoc training and the difference in training time.