Improved chromosome level genome assembly of the Glanville fritillary butterfly (Melitaea cinxia) based on SMRT Sequencing and linkage map

Abstract

The Glanville fritillary (Melitaea cinxia) butterfly is a long-term model system for metapopulation dynamics research in fragmented landscapes. Here, we provide a chromosome level assembly of the butterfly’s genome produced from Pacific Biosciences sequencing of a pool of males, combined with a linkage map from population crosses. The final assembly size of 484 Mb is an increase of 94 Mb on the previously published genome. Estimation of the completeness of the genome with BUSCO indicates that the genome contains 93-95% of the BUSCO genes in complete and single copies. We predicted 14,830 gene models using the MAKER pipeline and manually curated 1,232 of these gene models. The genome and its annotated gene models are a valuable resource for future comparative genomics, molecular biology, transcriptome and genetics studies on this species.

Article activity feed

  1. This work has been peer reviewed in GigaScience (see paper https://doi.org/10.1093/gigascience/giab097), which carries out open, named peer-review.

    These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 2: Shanlin Liu

    The authors presented us with an improved genome for the Glanville fritillary butterfly. However, there are several issues that need to be addressed before acceptance.

    Major:

    What the current manuscript lacks most is a comparison between the improved genome assembly and its former version. Although the authors showed an improved N50, I failed to find explanations for several critical differences. For example: (1) the authors stated that ca. 90 Mb of additional assembly sequence was obtained, but no further information is given on these new sequences: are they redundancies, or fragments missed in version 1? (2) The improved genome has fewer predicted genes than its former version, decreasing from ca. 16,000 to ca. 14,000, which seems at odds with the longer genome assembly. (3) The former genome version showed unevenly distributed repeat elements across chromosomes, while the improved one does not, which also needs explanation.

    Another important issue of the present manuscript is the confusion introduced by the varied genome assembly sizes. Firstly, the authors did not provide an estimate of the genome size, a critical piece of information that can be obtained with several well-known methods, such as a C-value based on flow cytometry or an estimate based on k-mer frequencies. Secondly, the authors first mentioned that they sampled individuals with low heterozygosity, but FALCON later generated an assembly almost twice the size of the final genome. The authors may want to add extra analyses or text to clarify the genome size uncertainty. Related to this concern, Haplomerge seems to be an important step in obtaining the final assembly and, if I understand correctly, the authors did not use a standardized analysis pipeline. Please consider including a schematic of the procedure to help readers better understand the steps and the principles behind them.
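As an illustration of the k-mer based estimate mentioned above, here is a minimal sketch, not the authors' method. It assumes a k-mer depth histogram has already been produced by some k-mer counter; the function name, the `min_depth` error cut-off and the toy numbers are all hypothetical.

```python
def estimate_genome_size(histogram, min_depth=5):
    """Estimate haploid genome size from a k-mer depth histogram.

    histogram maps depth -> number of distinct k-mers seen at that depth.
    Depths below min_depth are treated as sequencing errors and ignored
    (the cut-off is an assumption and should be chosen per dataset).
    """
    usable = {d: n for d, n in histogram.items() if d >= min_depth}
    if not usable:
        raise ValueError("no k-mers at or above min_depth")
    # The tallest remaining peak is taken as the genome-wide k-mer coverage.
    peak_depth = max(usable, key=usable.get)
    # Total k-mer occurrences divided by coverage approximates genome size.
    total_kmers = sum(d * n for d, n in usable.items())
    return total_kmers / peak_depth


# Toy histogram with made-up numbers (not M. cinxia data).
toy = {1: 900, 2: 300, 18: 50, 19: 120, 20: 200, 21: 110, 22: 40, 40: 10}
size = estimate_genome_size(toy)
print(f"estimated size: {size:.1f} k-mer positions")
```

In a heterozygous or pooled sample the tallest peak may be the heterozygous one, which would roughly double the estimate, so the peak should be inspected before such a number is trusted.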

    In addition, many methods are vaguely described; the authors should provide details to make sure the analyses are repeatable. For example, on page 6 the authors wrote: "This cut-off was experimentally found to give the best contiguity for the assembly, while minimizing (within a small margin of error) the percentage of possibly erroneous contigs", but I failed to find any details of these experiments. On the same page, the authors checked putative chimeras manually and say that the error regions have low coverage or are repeat regions; they should give example cases and statistics for the different kinds of errors. Likewise, where they say the error regions were split, the authors should explain how they determined the split positions, since what they found are error regions rather than error bases. Also, on page 7, the authors stated "The contigs orders and orientations were manually fixed when needed"; please list the situations that met your criteria. The authors may also want to explain why they chose the 1,232 genes for manual annotation. Were they chosen at random?

    Minor:

    Remove "(e.g. Kahilainen et al. unpubl.)"; it provides no useful information.

    Table 1. Is N(%) of the version 2 genome really zero? Does the scaffolding step not introduce any Ns? I doubt that.

    Page 5, please give the location information instead of a citation.

    Page 7, please clarify the assembly version used for raw read mapping: is it the one generated by FALCON with a genome size of ~700 Mb?

    Page 9, "the first two step (bath A1 and bath A2)", please provide biological explanations.

    The Marey map needs a citation and a brief explanation at its first appearance.

    "In M. cinxia the repeats are placed in single chromosomes whereas in H. melpomene they are present in all chromosomes." How does this help to show the power of long-read assembly? This needs explanation.

    Page 10, how does Velvet apply a kmer size of 99 bp when you only have a read length as long as 85 bp?
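The arithmetic behind this question can be sketched as follows. This shows a general property of k-mer decomposition, not Velvet's internals; the function name is hypothetical.

```python
def kmers_per_read(read_length, k):
    """Number of k-mers a single read of read_length bases can contribute."""
    # A read of length L contains L - k + 1 overlapping k-mers, or none if k > L.
    return max(0, read_length - k + 1)


# An 85 bp read cannot contain any 99-mers, but it holds 55 31-mers.
none_at_99 = kmers_per_read(85, 99)
some_at_31 = kmers_per_read(85, 31)
print(none_at_99, some_at_31)
```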

    Table 2 title: the species name should be italicized.

    Please give the full name of BUSCO at its first appearance.

  2. Reviewer 1: Annabel Charlotte Whibley

    In this manuscript, Blande, Smolander and colleagues report an improved chromosome-level genome assembly of the important ecological model lepidopteran species Melitaea cinxia. The manuscript would benefit from further language review by a native English speaker to improve readability, but the intentions of the authors are nevertheless clearly articulated throughout, the workflow is logical, and the assembly quality is a clear improvement on the earlier draft release.

    I would suggest revisiting the title to better reflect the work; as it stands it is a little underwhelming. One suggestion would be "Improved chromosome-level genome assembly of the Glanville fritillary butterfly (Melitaea cinxia) integrating PacBio long reads and a high-density linkage map". I would ideally also like to see more discussion of the more unusual aspects of this project. For example, long-read assemblies are commonplace now, but the linkage map approach (and the extent to which there was manual curation of potential chimeric scaffolds) is less frequently employed these days, and superscaffolding and error correction are often undertaken with Hi-C methods only. Similarly, the extensive manual curation of gene annotations and the impact that this had on the models is likely of more general interest (e.g. how many gene models were corrected, and what types of errors were encountered?). In particular, some mention of the specific challenges of this project (e.g. the need to combine multiple individuals to obtain sufficient quantities of gDNA) might also be interesting for the readership.

    The absence of line numbers is a little cumbersome for reviewing purposes. Below I'll refer to specific parts of the text by page number (as printed on the pdf document) - paragraph - line (within paragraph).

    3-1-6: suggest changing "…. and included both laboratory and natural environmental conditions" to "…and have included…"

    3-2-1: change "The first M. cinxia genome was released in 2014" to "The first M. cinxia draft genome" or "The first M. cinxia genome assembly"

    Table 1: reporting both GC and AT % is unnecessary. There are some discrepancies between the statistics reported for the chromosomal assembly in the Ahola et al (2014) paper and those in this table. This may simply be due to different methods for assessing summary statistics (e.g. whether or not gaps are included by default), but it warrants investigation/clarification. For example, the largest scaffold reported in the Ahola et al (2014) paper is 14,178,551 bp. The description of the generation of a chromosomal build for the previous version indicates that >280 Mb were assigned to chromosomes, whereas the total assembly size in this table is reported to be only 251 Mb.
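To illustrate how the treatment of gaps alone can shift such summary statistics, here is a minimal sketch. The function name, the `include_gaps` switch and the toy sequences are hypothetical and are not taken from either assembly.

```python
def assembly_stats(sequences, include_gaps=True):
    """Total size, largest scaffold and N50, counting or skipping N bases."""
    lengths = []
    for seq in sequences:
        seq = seq.upper()
        # Optionally subtract gap bases (N) from each scaffold length.
        n = len(seq) if include_gaps else len(seq) - seq.count("N")
        lengths.append(n)
    lengths.sort(reverse=True)
    total = sum(lengths)
    # N50: length of the scaffold at which the running sum reaches half the total.
    running = 0
    n50 = 0
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"total": total, "largest": lengths[0] if lengths else 0, "n50": n50}


# Toy scaffolds: the first one contains a 10 bp gap.
toy = ["ACGT" * 5 + "N" * 10 + "ACGT" * 5, "ACGT" * 4, "ACGTN" * 2]
with_gaps = assembly_stats(toy, include_gaps=True)
without_gaps = assembly_stats(toy, include_gaps=False)
print(with_gaps, without_gaps)
```

Stating which convention was used next to each table would let readers reconcile figures across publications.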

    6-2-2: What are the units for the cut-off (read length?)? If available, the data exploring the impact of different cut-offs on the assembly error rate could be of interest to others assembling genomes de novo.

    6-2-6: As a specific example of a more general comment on number reporting, perhaps state 24.4 Gb instead of 24,409,505,551 bp? I am not sure that the precision is always necessary and scaling/rounding can help readability.
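The scaling suggested in 6-2-6 could be applied consistently with a small helper such as the one below. It is only a sketch; the function name and the one-decimal default are assumptions.

```python
def format_bp(n_bp, digits=1):
    """Render a base-pair count with a scaled unit (kb, Mb or Gb)."""
    units = [(10**9, "Gb"), (10**6, "Mb"), (10**3, "kb")]
    for factor, unit in units:
        if n_bp >= factor:
            return f"{n_bp / factor:.{digits}f} {unit}"
    # Counts below 1 kb are reported exactly.
    return f"{n_bp} bp"


total_reads = format_bp(24_409_505_551)
largest_scaffold = format_bp(14_178_551)
print(total_reads, largest_scaffold)
```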

    6-2-10: Are the alternative contigs extracted by default by the FALCON pipeline? Are there any adjustments that need to be made for an input of >1 individual, for example?

    7-2-2: The raw data for the linkage map crosses, and also the RNAseq data for the transcriptome studies (on ) is described as "unpublished", but I believe public sequence accessions are also being released with this manuscript. Is there additional information that would need to be disclosed for this information to be utilised by others or is the intention to highlight that the data will also be presented in upcoming publications?

    7-2-6 "Part of" should be "Some of"

    7-3-3: Specify "relative humidity" instead of RH. Discuss why different approaches were used for the different RNAseq experiments.

    8-1-5: Sequencing was "performed" rather than "made". Can you specify which HiSeq model and which sequencing library kit (or at the very least whether it was PCR-free)?

    9-2-6: Presumably "de novo transcripts" refers to both transcriptomes 1 and 2, in which case I think it would be helpful to state this here. I assume the different analysis approaches for datasets 1 and 2 reflect different histories of the two datasets but it would be interesting to see some assessment of the relative performances of these approaches.

    13-2-4: I think that http://butterflygenome.org would be sufficient for the URL here.

    14-1-4: Are there any flow cytometry (or other) estimates of genome size that can be used to set alongside the v1 and v2 assembly sizes?