Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other non-standard architectures in genome assemblies

Abstract

Recent technological developments have made genome sequencing and assembly accessible to many groups. However, the presence in sequenced organisms of certain genomic features such as high heterozygosity, polyploidy, aneuploidy, or heterokaryosis can challenge current standard assembly procedures and result in highly fragmented assemblies. Hence, we hypothesized that genome databases must contain a non-negligible fraction of low-quality assemblies that result from such intrinsic genomic factors. Here we present Karyon, a Python-based toolkit that uses raw sequencing data and de novo genome assembly to assess several parameters and generate informative plots to assist in the identification of non-canonical genomic traits. Karyon includes automated de novo genome assembly and variant calling pipelines. We tested Karyon by diagnosing 35 highly fragmented publicly available assemblies from 19 different Mucorales (Fungi) species. Our results show that 6 (17%) of the assemblies presented signs of unusual genomic configurations, suggesting that these are common, at least within the Fungi.

Article activity feed

  1. AbstractRecent technological

    This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac088), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer name: Kamil S. Jaron

    Assembling a genome using short reads quite often produces a mixed bag of scaffolds representing uncollapsed haplotypes, collapsed haplotypes (i.e. the desired haploid genome representation) and collapsed duplicates. While there are individual tools for collapsing uncollapsed haplotypes (e.g. HaploMerger2 or Redundans), there is no established workflow or standard for quality control of finished assemblies. Naranjo-Ortiz et al. describe a pipeline attempting to create one.

    The Karyon pipeline is a workflow for assembling haploid reference genomes while evaluating the ploidy levels of all scaffolds, using GATK for variant calling and nQuire, a statistical method for estimating ploidy from allelic coverage support. I appreciated that the pipeline promotes some good habits, such as comparing k-mer spectra with the genome assembly (using KAT) or treating contamination (using Blobtools). Nearly all components of the pipeline are established tools, but the authors also propose karyon plots, which are diagnostic plots for quality control of assemblies.
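
    To make the allelic-coverage idea concrete, the following is a minimal sketch of the read fractions one would expect at biallelic heterozygous sites for a given ploidy. It is an idealised illustration only: the function name is hypothetical, and nQuire's actual statistical model is more involved than this.

```python
from fractions import Fraction

def expected_allele_fractions(ploidy):
    """Idealised fractions of reads supporting one allele at a biallelic
    heterozygous site, for a region present in `ploidy` copies."""
    if ploidy < 2:
        return []  # a single-copy region cannot be heterozygous
    # One allele can sit on 1, 2, ..., ploidy-1 of the copies.
    return [Fraction(k, ploidy) for k in range(1, ploidy)]

for n in (2, 3, 4):
    print(n, [str(f) for f in expected_allele_fractions(n)])
```

    Under these assumptions a diploid region shows a single peak at 1/2, a triploid region shows peaks at 1/3 and 2/3, and a tetraploid region shows peaks at 1/4, 1/2 and 3/4.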

    The most interesting and novel one I have seen is a plot of SNP density vs coverage. Such a plot might be helpful in identifying changes in ploidy level specific to a subset of chromosomes, as the authors demonstrated on the example of several fungal genomes (Mucorales). I attempted to run the pipeline and ran into several technical issues. The authors helped me overcome the major ones (documented here: https://github.com/Gabaldonlab/karyon/issues/1) and I managed to generate a karyon plot for the genome of a male hexapod with an X0 sex determination system. I did that because we know the karyotype well and I suspected the X chromosome would pop up nicely in the karyon plot.

    To my surprise, although I know the scaffold coverages are very much bimodal, I got only a single coverage peak in the karyon plot, oddly located somewhere between the expected haploid and diploid coverages. It is possible that I have messed something up, but I would like the authors to demonstrate the tool on a genome with a known karyotype. I would propose using a male of a species with an XY or X0 sex determination system. Although it is not aneuploidy sensu stricto, it is most likely the most common within-genome ploidy variation among metazoans. I would also ask the authors to improve the modularity of the pipeline. At my request the authors added a lightweight installation for users interested in the diagnostic plots after the assembly step, but the inputs are expected in a specific yet undocumented format, which makes modular use rather hard. At the very least the documentation of the formats should improve, but in general I think the pipeline could be made friendlier to people interested only in some of its smaller parts (I am happy to provide the authors with the data I used).
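
    The expectation behind this test can be sketched as follows: in an X0 male, X-linked scaffolds are present in one copy and autosomal scaffolds in two, so their coverages should sit near half and full diploid coverage respectively. The function names and coverage values below are hypothetical and chosen for illustration only; they are not the data used in this review.

```python
def expected_coverage(diploid_coverage, copies):
    """Coverage expected for a region present in `copies` copies,
    given the coverage of two-copy (diploid) regions."""
    return diploid_coverage * copies / 2

def nearest_copy_number(scaffold_coverage, diploid_coverage, max_copies=4):
    """Copy number whose expected coverage is closest to the observed one."""
    return min(
        range(1, max_copies + 1),
        key=lambda c: abs(scaffold_coverage - expected_coverage(diploid_coverage, c)),
    )

# Toy values for illustration only.
diploid_cov = 60.0
for name, cov in [("autosome_like", 58.0), ("x_like", 31.0)]:
    print(name, nearest_copy_number(cov, diploid_cov))
```

    A single peak lying midway between the one-copy and two-copy expectations, as described above, cannot be assigned cleanly to either class by this kind of reasoning.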

    Although I quite enjoyed reading the manuscript and the manual afterwards, I do think there is a lot of room for improvement. One major point is that there is no formal description of the only truly innovative part of this pipeline, the karyon plots. There is a nice philosophical overview, but the karyon plots themselves are not explained, which makes the showcase study much harder to read. Perhaps a scheme showing the plot and annotating what is expected where would help. Furthermore, the authors performed a likelihood analysis of ploidy using nQuire, but they did not discuss it at all in the results section. I wonder what fraction of the assembly the analysis found most likely to be aneuploid for the subset of strains suspected to be aneuploid. Is a 1000-base sliding window big enough to carry enough signal to produce reliable assignments? In my experience, it is hard to assign ploidy to windows of this size, but I usually do such analyses using coverage, not SNP support.
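
    A back-of-the-envelope way to frame the window-size question is to ask how many heterozygous sites a window is expected to contain. The sketch below assumes, purely for illustration, that heterozygous sites are spread independently along the genome (a Poisson approximation); the function names and heterozygosity values are hypothetical and not taken from the manuscript.

```python
import math

def expected_het_sites(window_size, heterozygosity):
    """Expected number of heterozygous sites in a window, where
    heterozygosity is the per-base fraction of heterozygous positions."""
    return window_size * heterozygosity

def prob_no_het_sites(window_size, heterozygosity):
    """Poisson probability that a window contains no heterozygous site."""
    return math.exp(-expected_het_sites(window_size, heterozygosity))

for het in (0.001, 0.01):
    print(
        f"heterozygosity={het}: "
        f"expected sites={expected_het_sites(1000, het):.1f}, "
        f"P(no sites)={prob_no_het_sites(1000, het):.4f}"
    )
```

    Under this approximation, a 1000-base window at 0.1% heterozygosity carries about one informative site on average, and more than a third of such windows would carry none, which illustrates why allele-based ploidy assignment for individual windows of this size may be unreliable.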

    However, I would like to praise the authors for the fungal showcases. I do think they are a nice piece of genomics work, investigating and considering both biological and technical aspects appropriately. Finally, a smaller comment is that the introduction could be a bit more to the point. Some of the sections felt a bit out of place, perhaps even unnecessary (see minor comments below). More specific and minor comments are listed below. Kamil S. Jaron

    Minor manuscript comments: I gave this manuscript a lot of thought, so I would like to share with you what I have figured out. However, I recognise that the writing comments listed below are largely a matter of personal preference. I hope they will be useful for you, but they are nothing I would like to insist on as a reviewer.

    l56: An unnecessary book citation. It is not a primary source for that statement, and if the reference was meant as "further reading", it would perhaps be better to cite a recent review available online rather than a book.

    l65 - 66: Is the "lower error rate" still a true statement? I don't think it is; error rates of HiFi reads are similar to or even lower than those of short reads (though I do agree there is still plenty of use for short reads).

    l68 - 72: I don't think you really need the confusing statement "which are mainly influenced by the number of different k-mers"; the problems of short-read assembly are well explained below. However, I actually did not understand why the whole paragraph l76 - 88 was important. I would expect an introduction to cover the approaches people have used so far to overcome problems of ploidy and heterozygosity in assemblies.

    l176 - 177: "Ploidy can be easily estimated with cytogenetic techniques" - I don't think this statement is universally true. There are many groups where cytogenetics is extremely hard (like the notoriously difficult nematodes) and species that cannot be cultivated in the lab. For those it is much easier to do NGS analysis. You actually contradict this "easily" right in the next sentence.

    l191: The first author of nQuire is not Weib, but Weiß. The same typo is in the reference list.

    l222 - 223 and l69 - 70: The text explains what a k-mer is twice.

    l266 - 267: This statement or the list does not contain references to the publications that sequenced the original genomes. I am not sure, but when possible, it is good to credit the original authors for the sequencing efforts.

    l302: REF instead of a reference.

    l303: What is "important fraction"?

    l304: How can you make such a conclusion? Did you try to remove the contamination and redo the assembly step? Did the assembly improve? I am not sure if it is so important for the manuscript, but I would tone down this statement ("could be caused by" sounds more appropriate).

    l310: "B9738 is haploid": are you talking about the genome or the assembly? How could you tell the difference between a homozygous diploid and a haploid genome? If there is a biological reason why a homozygous diploid is unlikely, it should be mentioned.

    l342: How does fig 7 show 3% heterozygosity? How was the heterozygosity measured? Also, the karyon plot actually shows that the majority of the genome is extremely homozygous and all heterozygosity is in windows with spuriously high coverage. What do you think is the haploid / diploid sequencing coverage in this case?

    l343 - 345: I don't think these statements are appropriately justified. The analysis presented did not convincingly show that the genome is triploid or heterozygous diploid.

    l350: I think citing SRA is rather unnecessary.

    l358: What "model"? How could one reproduce the analysis, and where can the model be found?

    l378 - 379: Does Karyon analyse ploidy variation "during" the assembly process? Although the process is integrated in a streamlined pipeline, there are loads of approaches to detect karyotype changes in assemblies, from nQuire, which is used by Karyon, through all the sex-chromosome analyses, such as https://journals.plos.org/plosbiology/article?id=10.1371/journal.pbio.1002078.

    Method/manual comments:

    Scaffold length plots have no label on the x axis. As the plots are called distributions, I would expect frequency or probability on the y axis and scaffold length on the x axis. Furthermore, plotting my own data resulted in a linear plot with a very overscaled y axis. The "Scaffold versus coverage" plot does not have axis labels either. I would also call it scaffold length vs coverage instead. I also found the position of the illustrating picture in the manual a bit confusing (it should probably come before the header of the next plot).

    Variation vs. coverage is the main plot. It looks like a useful visualisation idea. Do I understand correctly that it is just the number of SNPs vs coverage? I am confused, as I thought the SNP calling is done on the reference individual, yet in the description you talk about homozygous variants too. What are those? Mismapped reads? Misassembled references?
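
    If that reading is right, each point in such a plot could be computed roughly as in the sketch below: one (mean coverage, SNP count) pair per window. This is only an illustration of the interpretation questioned above, not Karyon's implementation; the function name, the input layout and the toy data are all assumptions.

```python
from collections import Counter

def window_summary(snp_positions, depth, window=1000):
    """Return one (mean coverage, SNP count) pair per non-overlapping window.

    snp_positions: 0-based positions of called variants on one scaffold.
    depth: per-base read depth for the same scaffold.
    """
    snps_per_window = Counter(pos // window for pos in snp_positions)
    points = []
    for start in range(0, len(depth), window):
        chunk = depth[start:start + window]
        mean_cov = sum(chunk) / len(chunk)
        points.append((mean_cov, snps_per_window[start // window]))
    return points

# Toy scaffold: 2 kb at depth 30, then 1 kb at depth 60 carrying more SNPs.
depth = [30] * 2000 + [60] * 1000
snps = [150, 1200, 2100, 2300, 2500, 2900]
print(window_summary(snps, depth))
# prints [(30.0, 1), (30.0, 1), (60.0, 4)]
```

    In this toy case the high-coverage window also has the highest SNP count, which is the kind of pattern that both a higher ploidy level and a collapsed duplicate could produce.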

    I also wonder about "3. Diffuse cloud across both X and Y axes." I would naturally imagine that collapsed paralogs would have a pattern similar to the plot that was shown as an example: a smear towards both higher coverage and higher SNP density. I guess this is a more general comment: would you expect different signatures for collapsed paralogs and for higher ploidy levels? Should paralogy not be considered more explicitly as a factor?

  2. Recent tec

    Reviewer name: Michael F. Seidl

    The technical note 'Karyon: a computational framework for the diagnosis of hybrids, aneuploids, and other non-standard architectures in genome assemblies' by Naranjo-Ortiz and colleagues reports on the development and application of the Karyon framework. Karyon is a Python-based toolkit that utilizes several software tools developed by the authors and/or others with the overall aim to assess sequencing data and genome assemblies for potential assembly artefacts caused by a plethora of different features intrinsic to the analyzed species/strain. Karyon is publicly available from GitHub and as a Docker image.

    Genome assemblies are nowadays important tools to develop novel biological hypotheses. However, genome assemblies are often not ideal, i.e., they are highly fragmented and/or incomplete, which can significantly hamper their full exploitation. The genome assembly quality is impacted by different biological factors that can be, at least partially, discovered directly from the raw sequencing data and from the genome assembly (e.g., allele frequency, k-mer profiles, coverage depth, etc.). There are already plenty of established computational tools available to perform these types of analyses (to name a few: KAT, GenomeScope, nQuire).

    Karyon will ease these analyses by providing a single computational framework that combines different and complex software tools and generates diagnostic figures to support biological interpretation. Karyon thus represents a valuable contribution to the scientific community. The Karyon toolkit is built around established software tools, and the overall methodology is sound and suitable to assess genome qualities. The interpretation of Karyon's results is left to the user, which still necessitates expert knowledge to correctly interpret signals.

    While examples are provided in the manual, the level of experience required will likely hamper the full exploitation of the pipeline by non-expert users. Furthermore, it can be anticipated that expert users already employ the separate software tools to study genome complexities, and thus might have little need for Karyon. Obviously, this is inherent to the problem at hand and cannot be easily addressed by the authors. However, I would like to encourage the authors to further improve the manual and the examples that guide data interpretation, with the aim of making this software accessible to as many researchers as possible.

    I nevertheless also have some comments related to the data presented in the manuscript that the authors need to address. First, the introduction finishes by asserting that different biological factors are expected to impact published genome assemblies. Furthermore, the manuscript mentions that the quality of fungal genomes is often sub-optimal. However, no evidence for these statements is provided. To strengthen this point and to further highlight the urgency of methods to discover and ultimately address these problems, the authors need to provide a more systematic analysis of publicly available genome assemblies for the occurrence of compromised genome assemblies. For example, an analysis of a random subset of genome sequences from different eukaryotic phyla and / or classes, sampled more systematically throughout the fungi, would

    i) significantly substantiate the manuscript's message and

    ii) confirm the applicability of the authors' framework to most eukaryotes and not only to specific fungal groups (Mucorales).

    Second, the table mentions the diagnosis derived from Karyon but simply states 'unknown' for most entries. Based on the manuscript it seems that these are supposedly haploid with very little heterozygosity (L279), but table 1 nevertheless reports, for most species/strains, strikingly different genome size estimates between the original and the Karyon-derived genome assemblies (the Karyon-derived ones are consistently smaller). The authors need to explain in much more depth the nature of these differences for the reported genomes. For instance, it could be that publicly deposited assemblies have been generated by a combination of different sequencing libraries and technologies that are not fully exploited by Karyon. Third, one additional measure often applied to assess genome quality is genome completeness, as for instance assayed by BUSCO. Karyon should include a strategy such as BUSCO to

    i) assess the occurrence of marker genes in the genome assemblies and

    ii) determine the duplication level of these genes, as this might reveal uncollapsed alleles, etc. The latter is especially important for interpreting genome size differences between original and Karyon-derived genome assemblies.
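
    As a rough illustration of point ii), the duplication level can be read from the one-line notation that BUSCO prints in its short summary. The sketch below assumes that notation (C, S, D, F, M percentages followed by n); the helper name and the example line are hypothetical, and this is not part of Karyon.

```python
import re

# Pattern for the BUSCO one-line summary notation, e.g.
# C:95.0%[S:80.0%,D:15.0%],F:2.0%,M:3.0%,n:1000
BUSCO_LINE = re.compile(
    r"C:(?P<complete>[\d.]+)%\[S:(?P<single>[\d.]+)%,D:(?P<duplicated>[\d.]+)%\],"
    r"F:(?P<fragmented>[\d.]+)%,M:(?P<missing>[\d.]+)%,n:(?P<total>\d+)"
)

def parse_busco_summary(line):
    """Extract BUSCO percentages and the marker count from a summary line."""
    match = BUSCO_LINE.search(line)
    if match is None:
        raise ValueError("not a BUSCO one-line summary")
    values = {key: float(value) for key, value in match.groupdict().items()}
    values["total"] = int(values["total"])
    return values

# Made-up example line, for illustration only.
summary = parse_busco_summary("C:95.0%[S:80.0%,D:15.0%],F:2.0%,M:3.0%,n:1000")
print(summary["duplicated"], summary["total"])
```

    A markedly higher duplicated percentage in one assembly than in another of the same strain would be consistent with uncollapsed alleles contributing to the larger assembly size.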

    Further detailed comments and suggestions to improve the manuscript:

    L21: Could the authors please specify what 'groups' they refer to?

    L22: There seems to be an extra space.

    L59: Could the authors please specify what they mean by a 'poor assembly'? What is poor in terms of genome assembly? Contiguity or completeness, or unresolved haplotypes, or …, or a combination thereof?

    L63-: The authors only once refer explicitly to Fig 1 in this section. The manuscript would be clearer if they referred to specific panels as they describe factors impacting genome assembly quality.

    L66: Could the authors please further substantiate their notion that most publicly available genome assemblies are based on short-read sequencing data? This information should be readily available at NCBI and/or GOLD.

    L119: The manuscript mentions pan-genomics, but the relevance of aneuploidy in these studies is not explained. The manuscript should provide a brief explanation of the importance of aneuploidy (or any form of ploidy shift) for pan-genomics.

    L147: 'From' -> 'from'

    L148: 'Symbiotic' -> 'symbiotic'

    L232: The reference to nQuire should read Weiß et al. 2018.

    L302: The reference to blobtools is missing.

    L349: To initiate the pipeline, was a single sequencing library or a combination of multiple libraries used?

    Table 1: The table formatting, at least in the combined pdf, seems to be broken.

  3. Abstract

    Reviewer name: Zhong Wang

    In this work, Naranjo-Ortiz et al. present a software pipeline that is capable of de novo genome assembly, variant calling, and generating diagnostic plots. Applying this software to 35 publicly available, highly fragmented fungal genome assemblies revealed prevalent inconsistencies between the sequencing data and the assembly. I really appreciate the authors' effort to make their software, Karyon, easy to use by providing multiple ways to install it and a detailed software manual. I especially like the detailed explanation of how to use the diagnostic plots to infer the "nonstandard genome architectures".

    The manuscript is clearly written and very easy to follow. I have the following general comments:

    1. The relationship between the raw sequencing data and the assembly was not clear to me: did they belong to the same isolate? If so, then the inconsistencies may reflect assembly errors in the fungal genome assembly. Have the authors ruled out this possibility? The fact that these genomes are highly fragmented suggests they likely contain many errors. If they were from different isolates, then I agree with the authors that the diagnostic plots could be examined carefully to detect structural variations. For that, have the authors used any alternative method to validate at least some of their findings? To establish the validity of their approach, it would be more convincing to obtain the same findings using independent approaches, including experimental ones.

    2. Given the raw WGS reads and the assembled genome, another software tool, QUAST (http://quast.sourceforge.net/), automatically detects assembly errors and structural variations. It would be interesting to see a comparison between the findings obtained via Karyon and via QUAST.

    3. This is an optional suggestion, as I realize it may not be easy to implement. The biggest limitation of Karyon is that it does not automatically detect these unusual genome organizations. Automatic detection may be possible by comparing the de novo assemblies produced by Karyon to the reference genomes. At least such possibilities should be discussed.