Highly contiguous assemblies of 101 drosophilid genomes

Curation statements for this article:
  • Curated by eLife

    Evaluation Summary:

    Drosophila species have long served as an important model system for genetics and genomics. The authors have developed an important community resource of high standard genomes for many species across the Drosophila clade. This resource will serve to empower the next generation of Drosophila research and provides an important road map for similar efforts in other groups of organisms.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)

This article has been Reviewed by the following groups

Abstract

Over 100 years of studies in Drosophila melanogaster and related species in the genus Drosophila have facilitated key discoveries in genetics, genomics, and evolution. While high-quality genome assemblies exist for several species in this group, they only encompass a small fraction of the genus. Recent advances in long-read sequencing allow high-quality genome assemblies for tens or even hundreds of species to be efficiently generated. Here, we utilize Oxford Nanopore sequencing to build an open community resource of genome assemblies for 101 lines of 93 drosophilid species encompassing 14 species groups and 35 sub-groups. The genomes are highly contiguous and complete, with an average contig N50 of 10.5 Mb and greater than 97% BUSCO completeness in 97/101 assemblies. We show that Nanopore-based assemblies are highly accurate in coding regions, particularly with respect to coding insertions and deletions. These assemblies, along with a detailed laboratory protocol and assembly pipelines, are released as a public resource and will serve as a starting point for addressing broad questions of genetics, ecology, and evolution at the scale of hundreds of species.
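
For readers less familiar with the contiguity statistic quoted above, the following is a minimal sketch of how contig N50 is computed, together with auN, a related metric that Reviewer #2 mentions. The function names and the toy contig lengths are illustrative assumptions and are not taken from the authors' pipeline.

```python
def n50(lengths):
    """Contig N50: the length L such that contigs of length >= L
    together cover at least half of the total assembly length."""
    if not lengths:
        raise ValueError("need at least one contig length")
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length


def aun(lengths):
    """auN (area under the Nx curve): sum of squared contig lengths
    divided by the total assembly length."""
    total = sum(lengths)
    if total <= 0:
        raise ValueError("total assembly length must be positive")
    return sum(length * length for length in lengths) / total


# Hypothetical contig lengths in bp (made up for illustration).
toy_contigs = [40, 25, 20, 10, 5]
print("N50:", n50(toy_contigs))
print("auN:", aun(toy_contigs))
```

Unlike N50, which only changes when the contig that crosses the halfway point changes, auN responds to every contig length, which is one reason the two are often reported together.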

Article activity feed

  1. Evaluation Summary:

    Drosophila species have long served as an important model system for genetics and genomics. The authors have developed an important community resource of high standard genomes for many species across the Drosophila clade. This resource will serve to empower the next generation of Drosophila research and provides an important road map for similar efforts in other groups of organisms.

    (This preprint has been reviewed by eLife. We include the public reviews from the reviewers here; the authors also receive private feedback with suggested changes to the manuscript. Reviewer #2 agreed to share their name with the authors.)

  2. Reviewer #1 (Public Review):

    This manuscript describes extensive new genome assembly resources for the Drosophilidae family. It employs Oxford Nanopore and Illumina sequencing to provide genome data and assemblies for 92 species (93 if D. melanogaster is counted), including 61 for species that previously lacked assemblies of any kind that I could easily discover. When confined to species with at least one genome assembly exhibiting high contiguity (N50 > 1Mb), the manuscript introduces 68 such species, tripling the previous quantity of such highly contiguous genomes from 34 to 102. Moreover, many of their assemblies serve to improve upon already existing highly contiguous assemblies, sometimes doing so dramatically. This is a truly impressive contribution to the genomics of Drosophilidae and will serve as an important source of genomic and genetic resources for biologists in many fields studying many topics across broad phylogenetic and geographic spans in this important clade. The species span genetic models, comparative genomics models, species with interesting ecology, and agricultural pests. Moreover, the authors carefully document their procedures for attaining these assemblies from experiments to reproducible computational resources. It is this provision of reproducibility that I think is probably the most important contribution and should serve as a role model for future work in the field. The authors also perform some preliminary genome analyses, including a neat network analysis to recapitulate the classic observation from Sturtevant and Novitski that the gene content of Muller elements is deeply conserved in the face of extensive rearrangement within the elements. They also report repeat content across their dataset.

    The dataset makes the single largest contribution to sampling over previously available resources in this group. The authors describe their contribution not only as a community resource but also as a blueprint for future work in this sort of genomics, and tout their approach as being high-quality and cost-effective, and I think this is justified. The work is described as a community resource, and justifiably so. Consequently, there are certain quality control reports, refinements to the authors' recommendations, and aspects of the scholarship that should be added or expanded upon to support this resource role.

    * Quality control metrics: sequencing error and sample polymorphism
    One crucial aspect of a community resource is a thorough description, including quantification of the resource's limitations. The manuscript does a great job with descriptions of contiguity and completeness, but there is no quantification of potential errors at the base level or of segregating variation, both of which are of concern to users of genomic resources. Such quantification has been a routine part of resource descriptions going back to the earliest genome assemblies.

    * Guidance for future assembly work
    This work aspires to serve as a blueprint for diverse research groups to extend the approach outlined here. And I think it is well-placed to do this. To this end, the authors offer guidance about current options, accounting for costs that such efforts are likely to face. However, the manuscript does not place itself in the context of common alternative approaches that are certainly of interest for at least some of those using or contributing to the resource. An acknowledgement of alternative approaches, especially in the context of weighing strengths and weaknesses (especially cost, length, and error rates of various long read platforms in realized data), would support the mission of serving as a blueprint for future work.

    * Context and scholarship for the advances made
    Placing this resource into the context of existing high-quality genome assemblies in this clade is crucial to its users. The manuscript is written in such a way that a potential user of the resource might conclude that, prior to this work, high quality genome sequencing has been only cursory. In particular, the authors make sweeping statements like "while high quality genome assemblies exist for several species in this group, they only encompass a small fraction of the genus" and (paraphrased) "the assemblies, protocols, and pipelines described here will serve as a starting point for addressing questions in this group". While the resource does make important and exciting contributions that double or triple (depending on what is counted) the scope/quantity of high quality genome assembly available to date, the abundant resources that already exist are barely mentioned. In fact, prior to this work, there are, by my count, already 34 highly contiguous genomes representing 8 species groups and the Colocasiomyini tribe. This manuscript triples this number to 102 species across 15 species groups, though Colocasiomyini remains the most distant relative to Drosophila melanogaster sampled. The contribution of this manuscript is undiminished by this context, which is useful information for community users of the resource.

  3. Reviewer #2 (Public Review):

    Kim, Wang, et al. present the sequencing and assembly of nearly 100 species in the Drosophila clade, spanning substantially more of the ecological and phylogenetic diversity of this historically important group than ever before. To do this in a cost-effective manner, they use Nanopore long read sequencing, in combination with Illumina short read data for base-level assembly polishing. By the authors' calculations, and under optimal conditions, this has the potential to allow assemblies to be produced for as low as $350 in sequencing costs, at least for organisms such as Drosophila with relatively small genomes. Using a containerized version of the Flye assembler to facilitate production of comparable assemblies across diverse compute environments, the authors are able to generate highly complete and reasonably contiguous assemblies for almost all species. The assemblies produced lack annotations or comparative alignments, and are thus more a starting point than anything else for researchers interested in any particular species. Additionally, while the quality metrics the authors apply show these are high quality assemblies, the lack of measurements of consensus base-level accuracy leaves some question as to the overall accuracy of these new assemblies. Nonetheless, this work immediately increases genomic resources in Drosophila many-fold, and the open nature of this work means the value of these genomes will only grow over time.

    Strengths:

    The sequencing and assembly of nearly 100 species of Drosophila, the large majority of them never before sequenced, will be an immediately valuable resource to many researchers. Based on BUSCO completeness scores, essentially all of these assemblies contain all or nearly all of the genic sequence in these genomes. While most of these genomes have hundreds to thousands of contigs, the contiguity statistics presented show that, for most genomes, a very substantial fraction of the assembly is in a few large contigs. Taken together, these metrics suggest these genomes will be highly useful for many common research questions.

    The authors provide both a detailed protocol via protocols.io for the most complex step of many long-read sequencing experiments (extracting high molecular weight DNA), and a containerized version of their optimized assembly pipeline. This is an important strength for two reasons. First, it will be immediately useful to those in the Drosophila community who wish to sequence new species not currently included in this resource, or additional strains of species with existing assemblies. Second, such a resource provides a starting point for researchers working in other groups, particularly other insects with similar genome sizes, to build upon for replicating this kind of project elsewhere. The relevance to non-Drosophila communities is somewhat limited by the Drosophila-specific nature of some recommendations, however.

    Weaknesses:

    A major focus of this paper is on the presentation of the newly sequenced genome assemblies, and thus providing an accurate assessment of their quality is of the utmost importance for researchers hoping to use this resource. The authors rely heavily on two relatively simple measures of quality: completeness as measured by the fraction of widely conserved single copy orthologs (BUSCOs) recovered, and contiguity as measured by contig N50 and related metrics (auN). However, these are relatively limited descriptions of assembly quality. Measures of base-level accuracy, e.g. from k-mers (Merqury; Rhie et al 2020), are very useful, and can guide expectations for the degree to which problems with protein truncation caused by indel errors may be present (Watson and Warr 2019; Koren et al 2019). While the low level of fragmented BUSCOs (typically under 1%) is encouraging, more robust estimates of consensus quality are an important tool for assessing long read assemblies that are missing here.

  4. Reviewer #3 (Public Review):

    Kim et al. devised protocols for DNA extraction, library preparation, sequencing, and assembly of Drosophila genomes. They then use those protocols to assemble 101 Drosophila genomes and run some preliminary analyses. The major strength of the work is that it will provide a useful resource for researchers interested in comparative analysis of Drosophila species. Results (other than the resource itself) are modest, but well supported by the data. There are a few areas where more detail about methods and/or choice of methods would be useful.

    The likely impact of this research on the field is not any particular scientific result, but the resource itself and the associated protocols. It is more difficult to predict the impact of the major scientific results of the paper (synteny, repeat content). Though the synteny results presented in figure 2 seem sound, they should not come as a surprise since the conservation of Muller element content has been known or predicted for quite some time. The repeat content results are a good cursory investigation, but probably require more careful curation (particularly of unique unannotated repeats) to make strong conclusions about the relationship between repeat content and genome size or assembly contiguity.
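
Reviewers #1 and #2 both ask for an estimate of base-level accuracy, and Reviewer #2 points to k-mer-based tools such as Merqury. As a rough illustration only, the sketch below converts a count of k-mers that appear in an assembly but not in the reads into a Phred-scaled consensus quality value (QV), using the relationship E = 1 - (1 - K_asm_only / K_total)^(1/k) and QV = -10 log10(E) as described for Merqury. The function names and the k-mer counts are hypothetical, the code does not call Merqury itself, and the exact formulation should be checked against that tool's documentation before being relied on.

```python
import math


def kmer_error_rate(asm_only_kmers, total_asm_kmers, k):
    """Per-base error rate implied by the fraction of assembly k-mers
    that are never seen in the read set (assumed to be errors)."""
    if k <= 0 or total_asm_kmers <= 0:
        raise ValueError("k and total_asm_kmers must be positive")
    if not 0 <= asm_only_kmers <= total_asm_kmers:
        raise ValueError("asm_only_kmers must lie in [0, total_asm_kmers]")
    # Probability that a whole k-mer is error-free.
    p_kmer_correct = 1 - asm_only_kmers / total_asm_kmers
    # A k-mer is correct only if all k bases are, so take the k-th root.
    return 1 - p_kmer_correct ** (1 / k)


def phred_qv(error_rate):
    """Phred-scaled quality: QV = -10 * log10(error rate)."""
    if error_rate <= 0:
        return math.inf  # no observed errors
    return -10 * math.log10(error_rate)


# Hypothetical counts (made up): 21-mers unique to the assembly vs. all assembly 21-mers.
error = kmer_error_rate(asm_only_kmers=30_000, total_asm_kmers=150_000_000, k=21)
print(f"estimated error rate: {error:.2e}, QV: {phred_qv(error):.1f}")
```

On this scale, QV 30 corresponds to roughly one error per 1,000 bases and QV 40 to one per 10,000, which makes values comparable across assemblies of different sizes.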