MicroFinder: conserved gene-set mapping and assembly ordering for manual curation of bird dot microchromosomes
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (GigaScience)
Abstract
Background
Obtaining chromosomally complete genome assemblies across the tree of life is an important goal of biodiversity genomics. However, some lineages remain recalcitrant to assembly. Birds present a substantial assembly challenge due to the presence of tiny microchromosomes that are often highly fragmented or even missing in draft genome assemblies. Bird genomes therefore require substantial expert manual curation effort via manipulation of genome-wide Hi-C contact maps, and many chromosome-level bird genome assemblies do not resolve the known karyotype.
Findings
Here, using a reference set of expert-curated bird genomes, we have identified a set of conserved proteins for the smallest and hardest to assemble microchromosomes—the dot chromosomes—and developed MicroFinder, a pipeline that uses this protein set to find small dot microchromosome fragments in draft genome assemblies to act as anchors for manual curation. We demonstrate how MicroFinder can be used to improve the speed and accuracy of bird genome curation. Furthermore, we highlight the usefulness of MicroFinder by carrying out MicroFinder-enabled re-curation of 12 previously released chromosome-scale bird genome assemblies, increasing the sequence content of dot microchromosome models.
Conclusions
We present MicroFinder, a pipeline to identify and order putative dot microchromosome scaffolds in draft genome assemblies. MicroFinder is an effective aid for bird genome assembly that dramatically speeds up manual assembly curation and improves the accuracy and sequence content of bird dot microchromosomes, even enabling improvement to genome assemblies that have already undergone expert curation.
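The core idea of identifying and ordering putative dot microchromosome scaffolds can be illustrated with a minimal sketch. This is not the MicroFinder implementation: it assumes that protein-to-scaffold hits have already been produced by some alignment step, and the function name, the length cutoff, and the choice to rank scaffolds by their number of distinct protein hits are all assumptions made for illustration.

```python
from collections import defaultdict


def rank_candidate_scaffolds(hits, scaffold_lengths, max_length=5_000_000):
    """Rank small scaffolds by how many distinct conserved proteins hit them.

    hits: iterable of (protein_id, scaffold_id) pairs (hypothetical input format).
    scaffold_lengths: dict mapping scaffold_id to length in bp.
    max_length: assumed cutoff; longer scaffolds are not treated as candidates.
    """
    proteins_per_scaffold = defaultdict(set)
    for protein_id, scaffold_id in hits:
        if scaffold_lengths[scaffold_id] <= max_length:
            # A set ignores repeated hits of the same protein on one scaffold.
            proteins_per_scaffold[scaffold_id].add(protein_id)
    # Most hits first; ties broken by scaffold name so the order is stable.
    return sorted(
        ((scaffold, len(proteins)) for scaffold, proteins in proteins_per_scaffold.items()),
        key=lambda item: (-item[1], item[0]),
    )
```

Placing the best-supported candidates next to each other in the scaffold order is one way such a ranking could serve as an anchor for curation in a Hi-C contact map.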
Article activity feed
-
Abstract
Obtaining chromosomally complete genome assemblies across the tree of life is a major goal of biodiversity genomics. However, some lineages remain recalcitrant to assembly despite recent advances in sequencing technologies and assembly tools. Birds present a substantial genome assembly challenge due to the presence of tiny, hard-to-assemble microchromosomes that are often highly fragmented or even missing in draft genome assemblies. As such, bird genomes require a large amount of expert manual curation effort via manipulation of genome-wide Hi-C contact maps, and many current chromosome-level bird genome assemblies do not resolve the known karyotype. Microchromosomes have distinct genetic and epigenetic features. They are GC-biased, gene-rich, highly methylated, and have distinct spatial organisation in the centre of the nucleus. Importantly, they are conserved across avian evolution. Here, using a reference set of expert-curated bird genomes, we have identified a set of conserved microchromosome genes and developed MicroFinder, a pipeline that uses this gene set to find small microchromosome fragments in draft genome assemblies to act as anchors for manual curation of microchromosomes. We demonstrate how MicroFinder can be used to improve the speed and accuracy of bird genome curation. Furthermore, we highlight the usefulness of MicroFinder by carrying out MicroFinder-enabled re-curation of 12 previously released chromosome-scale bird genome assemblies, increasing the sequence content of microchromosome models.
This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giag036), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:
Reviewer 2:
I had the privilege of reviewing the manuscript titled "MicroFinder: conserved gene-set mapping and assembly ordering for manual insertion of bird microchromosomes" by Mathers et al. The manuscript presents a conserved gene set linked to bird microchromosomes for identifying putative microchromosome contigs/scaffolds. These contigs/scaffolds can subsequently be assembled into their corresponding chromosome models using orthogonal evidence from Hi-C data. MicroFinder utilises current knowledge of microchromosome conservation across birds. This approach is similar to the assembly evaluation method that uses BUSCO genes.
One of the major limitations of the manuscript is the lack of validation or supporting evidence to show that the manual curation results obtained after applying MicroFinder hints are valid and robust. The authors could perform local synteny or chromosome-scale alignment analyses and an evaluation of conservation properties to demonstrate that the results of assembly curation are valid. The authors could also report metrics of the Hi-C contact maps before and after curation, for inter- and intra-chromosomal contacts, to demonstrate improvements. If this is not done, the authors may have to remove the results and methods corresponding to manual curation so as to focus on genes that are found in "putative" microchromosomes.
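One way to express the kind of before-and-after Hi-C metric suggested here is the fraction of contacts that fall within a chromosome model. The sketch below is a hypothetical illustration, not a metric reported by the authors or the reviewer: the input format (scaffold pairs with contact counts), the function name, and the choice to treat each unplaced scaffold as its own unit are all assumptions.

```python
def intra_chromosomal_fraction(contacts, scaffold_to_chrom):
    """Return the fraction of Hi-C contacts linking scaffolds in the same chromosome model.

    contacts: iterable of (scaffold_a, scaffold_b, count) tuples (assumed format).
    scaffold_to_chrom: dict mapping placed scaffolds to a chromosome name;
        scaffolds missing from the dict are treated as their own unit.
    """
    intra = 0
    inter = 0
    for scaffold_a, scaffold_b, count in contacts:
        # Unplaced scaffolds fall back to their own name, so their contacts
        # with placed scaffolds count as inter-chromosomal.
        chrom_a = scaffold_to_chrom.get(scaffold_a, scaffold_a)
        chrom_b = scaffold_to_chrom.get(scaffold_b, scaffold_b)
        if chrom_a == chrom_b:
            intra += count
        else:
            inter += count
    total = intra + inter
    return intra / total if total else 0.0
```

Computing this value with the scaffold-to-chromosome assignment before curation and again after curation, on the same contact list, would show whether newly placed scaffolds pull more contacts inside chromosome models.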
The manuscript is generally well written, with some minor concerns. The analyses presented are generally robust.
It was confusing to read about the difference between micro and dot chromosomes. I encourage the authors to avoid the term "dot" chromosome. Although it has been used in the literature in the past, we can do without it. There is no strong evidence to suggest that micro and dot chromosomes have any significant functional or system-level differences. It is best to avoid the term.
If the authors insist on using the dot nomenclature, a justification and explanation would be required, with clear definitions for both. The name of the workflow may need to change as well. I leave it up to the authors to make that call.
Similarly, I encourage the authors to avoid using the term "shrapnel" for small unplaced contigs. Just use "small unplaced contigs" instead.
The Findings section contains a lot of information that belongs in the Methods section, for example lines 109-117, 122-125, 135-137, 154-156, 160-164, 167-172 and 187-192. Please revise the text so that the Findings section doesn't contain any description of methods.
A definition of what an orthogroup and a fuzzy orthogroup are is required.
The Results/Findings section needs significant improvement. The authors have relegated many of the results to tables in the supplementary information. I insist that the authors summarise those results in a meaningful, descriptive way and refer to the supplementary information for extra details.
Lines 176-177 mention the manual curation of microchromosomes. I would like to see the rules and decisions that were employed to join, break or reorder contigs/scaffolds into a chromosome model.
The authors mention that 216 kb-4.3 Mb of additional content per assembly was added. This is incorrect, as the sequence content was already present in the assembly. It has just been reorganised into microchromosome scaffolds. Please correct the text to say that unplaced scaffolds are organised into putative microchromosomes.
Lines 108-199 mention errors in the original assembly. A description of the types of errors would be required.
The authors should discuss the properties of eagles, falcons and parrots, which have rearranged/fused microchromosomes. The proposed method may not be effective in such instances.
The authors suggest the use of a 5 Mbp cutoff. However, this may miss instances where a microchromosome is incorrectly joined to a macrochromosome. The authors discuss this as a paralog- or misalignment-related issue. I suggest that the authors provide a metric for the success/failure of identifying genes, similar to BUSCO. The authors can run the software on all available bird genomes to define the properties of such a metric for each gene. The Results section can explain the proportions of the 9400 found on macro- vs microchromosomes, and the proportions of the 14k fuzzy genes on micro- vs macrochromosomes and their copy status. 9400 + 14514 doesn't add up to 16,589 orthogroups. Something is not clearly described about those numbers. Please improve the text to make meaningful assessments of conserved gene sets on microchromosomes so that it is useful for the research community.
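A BUSCO-like summary of the kind requested here could be sketched as follows. This is only an illustration under stated assumptions: the input format, the function name, and the category labels are hypothetical, and the 5 Mbp default simply mirrors the cutoff mentioned above. Each conserved protein is classified by whether its hits fall only on small scaffolds, only on large scaffolds, on both, or nowhere.

```python
from collections import Counter


def summarise_placement(protein_ids, hits, scaffold_lengths, cutoff=5_000_000):
    """Report the proportion of conserved proteins per placement category.

    protein_ids: list of proteins in the conserved set.
    hits: iterable of (protein_id, scaffold_id) pairs (assumed format).
    scaffold_lengths: dict mapping scaffold_id to length in bp.
    cutoff: scaffolds at or below this length count as "small".
    """
    size_classes = {protein_id: set() for protein_id in protein_ids}
    for protein_id, scaffold_id in hits:
        if protein_id in size_classes:
            is_small = scaffold_lengths[scaffold_id] <= cutoff
            size_classes[protein_id].add("small" if is_small else "large")

    counts = Counter()
    for classes in size_classes.values():
        if not classes:
            counts["missing"] += 1
        elif classes == {"small"}:
            counts["small_only"] += 1
        elif classes == {"large"}:
            counts["large_only"] += 1
        else:
            counts["both"] += 1

    total = len(size_classes)
    categories = ("small_only", "large_only", "both", "missing")
    return {name: (counts[name] / total if total else 0.0) for name in categories}
```

A high "large_only" or "both" proportion for an assembly could flag the cases raised above, where a microchromosome has been joined to a macrochromosome or a protein has paralogous hits.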
Methods: Lines 233-234: what is a taxon in this context? Please clarify. There is also a mention of taxa with missing data. What data were missing? Please clarify.
Lines 236-237: do the authors mean the chromosomes identified by the submitter of the primary assembly? Please clarify.
For each species, the authors should also refer to the RefSeq version of the assembly for posterity. Common names of species may be useful too for a broad readership.
Line 254: please modify the section header to remove the assembly versions, as they are not useful.
The methods describing the orthogroup clustering should include details about how alignments were filtered and processed. This is currently missing.
The significance of the phylogenetic analyses in the context of the manuscript is not very clear. Maybe remove that section. Perhaps the authors can use phylogenetic distance as a way to discuss how the conserved gene sets behave between species.
The Results section can include run time and compute resource usage metrics so that others can estimate the resource requirements for such analyses.
The updated assemblies can be submitted to NCBI. The authors should consider this.
-
Reviewer 1:
I am very happy to see that MicroFinder is going to be published! Last year I used it very often to curate bird assemblies. I found no major issues, only minor ones.
The only crucial (but still technical) issue is that your protein dataset is from dot microchromosomes, i.e. not from all microchromosomes. So I highly recommend using "dot microchromosomes" where relevant, including in the title of the manuscript.
Minor issues:
row 19 (Abstract background) change "major goal" to a softer statement. Generation of assemblies is a very important task of biodiversity genomics, but not a major one.
row 54-55 Do you imply that a typical bird genome contains 37-41 chromosome pairs? There are a lot of birds with a lower number of chromosomes, so I am not sure that it is typical. Also, a reference to a publication from 1981 looks outdated.
row 109 - why were only eleven assemblies selected?
row 111 - 112 Please highlight how many orders/families were not covered.
rows 129 - 137 These lines are in some contradiction with the rest of the text, including the abstract. Your dataset is focused on dot chromosomes and not on all microchromosomes. I suggest replacing "microchromosomes" with "dot microchromosomes" nearly everywhere, including the title.
row 173 - 185 I am very skeptical about extending the results obtained on a single genome assembly to the whole family, especially considering that your dataset covers less than half of bird orders. My experience with MicroFinder is that it sometimes selects contigs/scaffolds belonging to macrochromosomes. However, there are not many and they are usually short. Please soften the statements.
row 429 Reference 13 is in French and doesn't have an English translation of the title
