The Regulatory Mendelian Mutation score for GRCh38

Abstract

Motivation

Various genome sequencing efforts for individuals with rare Mendelian disease have increased the research focus on the non-coding genome and the clinical need for methods that prioritize potentially disease causal non-coding variants. Some methods and annotations are not available for the current human genome build (GRCh38), for which the adoption in databases, software and pipelines was slow.

Results

Here, we present an updated version of the Regulatory Mendelian Mutation (ReMM) score, re-trained on features and variants derived from the GRCh38 genome build. Like its GRCh37 version, it achieves good performance on its highly imbalanced data. To improve accessibility and provide users with a toolbox to score their variant files and look up scores in the genome, we developed a website and API for easy score lookup.

Availability and Implementation

Pre-scored whole genome files of GRCh37 and GRCh38 genome builds are available on Zenodo at https://doi.org/10.5281/zenodo.6576087. The website and API are available at https://remm.bihealth.org.

Article activity feed

  1. Motivation

    Reviewer 3: Wyeth Wasserman

    SYNOPSIS The manuscript describes an updated release of the ReMM regulatory variant mutation scoring system. The paper presents the performance of an updated version of the system and describes how it was applied to the most current release of the reference human genome.

    OVERALL PERSPECTIVE This is a valuable resource for the community of researchers and clinicians working on the interpretation of genetic variants in the human genome. The work appears to be thoughtfully done and appropriate assessments have been provided. The use of the random forest models to weigh the contributions of features was particularly noted for the insights it provided into how features contribute to prediction. My biggest concerns are stylistic, which fall outside the scientific quality of the work. I provide these comments for the authors to consider and do not expect that my stylistic preferences will be uniformly accepted. A fair amount of the justification in the manuscript focuses on the value of having a release for version 38 of the human genome, pointing to the field as not having done so broadly. I think this is misguided, as by the time people are reading the manuscript such points will have lost relevance. I suggest focusing on the science, as there is no need to justify things based on where other resources have progressed in releasing their version 38 updates. The points below include language/text clarifications that can be assessed by the authors. Writing styles differ, so stylistic comments should be optional.

    MAJOR POINTS None. Well done and clearly presented.

    MINOR POINTS

    1. The word "various" is vague and often shows up when people are too busy to provide an accurate statement. Starting the manuscript with it makes a bad impression on this reader. You do not have to change it, but I thought you might appreciate knowing this impression. You could delete it with no harm to the sentence. (Not to get carried away, but the next sentence starting with "some" heightens the impression of 'hand waving'.)
    2. I think I understand ", we apply cytogenic band-aware cross-validation using ten folds" but I encourage the authors to provide clearer wording for this point.
    3. I would allow the reader to make their own judgement of performance. So please remove "excellent" from "we achieve an excellent performance"
    4. "Rather than using ReMM scores for ranking, some users need to specify score thresholds" is confusing. I would change 'need to' to 'choose to'
    5. "with lots of false positives" is a bit informal. I suggest "with a high false positive rate"
    6. I am confused by "from three genomic regions (genic content and not overlapping with assembly gap changes) " as the brackets include two items, not three.
    7. "maybe due to better mapping" - "maybe" should be "may be"
    8. I think language like "seems to be the only tool directly trained on training data and features derived from GRCh38." is not particularly valuable long term. This is a useful contribution, but many tools are being updated to 38 and by the time this appears and is read, such statements decline in relevance. I would focus on providing this valuable resource, and not try to justify it based on a transient perception of where the field stands in updating versions.
    9. "It is worth noting that in the context of extremely unbalanced data…" - you do note it. So I would change the wording to "In the context of extremely unbalanced data…"
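As an illustration of what the band-aware cross-validation in minor point 2 could mean in practice, the sketch below assigns whole cytogenetic bands to folds, so that variants from the same band never appear in both the training and the test split. This is a hypothetical reading, not the authors' implementation: the function name `band_aware_folds`, the band labels and the greedy size balancing are all assumptions made for the example.

```python
from collections import defaultdict


def band_aware_folds(variant_bands, n_folds=10):
    """Assign each variant to a fold so that all variants from the same
    cytogenetic band land in the same fold.

    variant_bands: list of band labels, one per variant (hypothetical input).
    Returns a list of fold indices, one per variant.
    """
    # Count how many variants fall into each band.
    band_sizes = defaultdict(int)
    for band in variant_bands:
        band_sizes[band] += 1

    # Greedy balancing (an assumption): place the largest bands first,
    # each into the fold that currently holds the fewest variants.
    fold_sizes = [0] * n_folds
    band_to_fold = {}
    ordered = sorted(band_sizes.items(), key=lambda kv: (-kv[1], kv[0]))
    for band, size in ordered:
        target = min(range(n_folds), key=lambda f: fold_sizes[f])
        band_to_fold[band] = target
        fold_sizes[target] += size

    return [band_to_fold[band] for band in variant_bands]


# Toy example with made-up band labels.
bands = ["1p36.33", "1p36.33", "2q11.2", "7q31.2", "7q31.2", "Xq28"]
folds = band_aware_folds(bands, n_folds=3)
```

Grouping by band rather than by individual variant keeps nearby, likely correlated positions out of the held-out fold, which is the usual motivation for this kind of split.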
  2. Motivation

    Reviewer 2: Mulin Jun Li

    In this manuscript, the authors update their previous ReMM score to the GRCh38 human genome build and provide a convenient and fast data source. The authors then use several examples to demonstrate the usability of the resource. Pointing out the differences between prioritization tools across genome builds is original. However, we have the following concerns and comments:

    Major:

    1. How are variants with missing values in the test datasets handled when comparing the new ReMM with other tools? The authors mention that ExPecto annotated only half of the million negative variants.
    2. Although CADD used the same negative training dataset, it is not suitable to compare it on the ReMM training dataset. How do these tools perform on independent test datasets?
    3. The authors presume that the new genome build will yield better performance. Is there evidence to support this view, such as the distribution of features or training data in the different genome builds?
    4. Other existing tools can prioritize disease-causal non-coding variants, such as regBase-PAT, NCBoost and ncER. Can the authors compare the new version of ReMM with these tools?
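To make major point 1 concrete: one common way to handle variants that some tools fail to score is to restrict the comparison to the variants scored by every tool. The sketch below shows that option only; whether the authors did this is not stated, and the tool names, variant keys and scores are made up for the example.

```python
def shared_scored_variants(scores_by_tool):
    """Return the sorted variant keys that every tool scored (score is not None)."""
    key_sets = [
        {variant for variant, score in scores.items() if score is not None}
        for scores in scores_by_tool.values()
    ]
    if not key_sets:
        return []
    return sorted(set.intersection(*key_sets))


# Made-up scores keyed by (chromosome, position, ref, alt).
scores_by_tool = {
    "tool_a": {
        ("chr1", 100, "A", "G"): 0.91,
        ("chr1", 200, "C", "T"): 0.12,
        ("chr2", 50, "G", "A"): 0.40,
    },
    "tool_b": {
        ("chr1", 100, "A", "G"): 0.75,
        ("chr1", 200, "C", "T"): None,  # tool_b could not annotate this variant
    },
}

shared = shared_scored_variants(scores_by_tool)

# Number of variants each tool scored, useful to report next to any metric.
coverage = {
    tool: sum(score is not None for score in scores.values())
    for tool, scores in scores_by_tool.items()
}
```

Reporting the per-tool coverage alongside the shared subset shows how much of the test set each comparison actually rests on.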
  3. The Regulatory Mendelian Mutation score for GRCh38

    Reviewer 1: Yan Guo

    1. In the abstract: "Some methods and annotations are not available for the current human genome build (GRCh38), for which the adoption in databases, software and pipelines was slow." I am not sure what the authors are referring to by "some methods"; this could be a grammar problem.
    2. "Restricting variants to non-coding only removes a small proportion of variants": what is the proportion? Also, I don't understand the need to remove coding variants. Shouldn't your model also work with coding variants?
    3. The method the authors used is based on a previous publication. However, there is still a need to give the details of the method in this manuscript. A lot of information is missing. For example, what is the outcome: whether a position is deleterious? How is the probability of deleteriousness calculated?
    4. by a few specific variants. Thus, the overall number of Mendelian disease-related variants should be low. I am guessing that is why 406 hand-curated variants were used in the previous version of ReMM. If my assumption is correct, there should not be many variants for Mendelian disease. How many variants are found to be positive in the entire genome?
    5. In the online application, the results are limited to 500; the rest cannot be seen or downloaded. It would be better to allow the user to download the entire results.
    6. The authors performed a comparison with other tools and generated ROC curves, which depend on knowing the true positives. There is no description of the dataset that was used for the comparison. Did the authors make sure that the training variants were not used for the comparison?
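The check raised in point 6 can be sketched in a few lines: drop any comparison variant that also occurs in the training set, then compute the area under the ROC curve on what remains. This is an illustrative sketch under stated assumptions, not the authors' evaluation code; the function names, variant keys and scores are hypothetical, and the AUROC is computed with the simple pairwise (Mann-Whitney) definition rather than any particular library.

```python
def remove_training_overlap(test_variants, training_variants):
    """Keep only test variants that do not occur in the training set."""
    training = set(training_variants)
    return [variant for variant in test_variants if variant not in training]


def auroc(pos_scores, neg_scores):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Quadratic in the input size, so only
    suitable for small illustrative examples.
    """
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Made-up variant keys: (chromosome, position, ref, alt).
training = [("chr1", 100, "A", "G"), ("chr3", 42, "T", "C")]
test = [("chr1", 100, "A", "G"), ("chr2", 50, "G", "A"), ("chr5", 7, "C", "T")]
independent_test = remove_training_overlap(test, training)

# Made-up scores for positives and negatives in an independent test set.
area = auroc([0.9, 0.8, 0.4], [0.5, 0.3, 0.2, 0.1])
```

In the toy data, one of the three test variants is also a training variant and is removed before scoring, which is the safeguard the question asks about.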