SRY: An Effective Method for Sorting Long Reads of Sex-limited Chromosome

This article has been Reviewed by the following groups

Abstract

Most available reference genomes lack the sequence map of sex-limited chromosomes, which leaves the assemblies incomplete. Recent advances in long-read sequencing and population sequencing raise the opportunity to assemble sex-limited chromosomes without the traditional, complicated experimental efforts. We introduce a computational method that shows high efficiency in sorting and assembling long reads sequenced from sex-limited chromosomes. It will lead to complete reference genomes and facilitate downstream research on sex-limited chromosomes.

Article activity feed

  1. Competing Interest Statement: The authors have declared no competing interest.

    Reviewer 3. Arang Rhie

    Comments to Author:

    1. In the introduction, add recent marker-based graph phasing algorithms for long reads, such as hifiasm trio and Verkko trio mode, after the T2T-Y. They are different from trio-binning, which tries to phase the reads upfront. Graph-based phasing uses markers to determine haplotype-specific paths to traverse.
       a. The T2T-Y chromosome should reference Rhie et al., Nature 2023. Verkko is a successor of the manual efforts taken in T2T-Y, which should also be noted in the introduction.
       b. The reference for the sexPhase program is still missing. Also, some rephrasing of the sentence is needed, as the way it is currently written could easily be misread as saying that sexPhase was part of the methods used in the assembly of the T2T-Y.
    2. There are other approaches for phasing genomes taken in plants, for example the polyploid potato phasing using many siblings of the child by Mari et al., bioRxiv 2022.
    3. The sentence "But only one male and one female could suffer from sampling error" is unclear. Please clarify.
    4. References for the mason_simulator and badread software are missing.
    5. Provide the accession (HG02982) for the "African human Y" in the main text.
    6. I appreciate that the authors compared assemblies to T2T-Y as I requested before. However, fundamentally, mapping to T2T-Y and comparing the length of each sequence class is comparing apples to oranges, particularly in the heterochromatic and ampliconic regions of the Y. These regions are known to have variable copy numbers and size differences between two individuals. Frequent inversions have been reported in the ampliconic regions across different Y haplogroups. The number, size, and distribution of the repeat arrays composing the heterochromatic region have been shown to vary among different Y haplogroups in Hallast et al., Nature 2023. This can also be seen in Fig. 3c; the overall depth of the flow sorting in the heterochromatic region is below 1, indicating the Yqh is shorter than T2T-Y, as it is in Fig. 3b. To make the benchmark legit, the authors should compare SRY and the flow sorting method using samples from the same individual. HG02982 and HX1 presumably have very different sequence compositions given the diverged population history (African vs. Asian). Comparing the total length of the assembled region against a 3rd different Y haplogroup (HG002Y) makes things more complicated, especially in regions that are known to vary a lot. If the authors think the flow-sorting-based method needs to be compared, it should be benchmarked on the same individual to make an apples-to-apples comparison. I do agree that the results from read sorting (i.e. the portion of reads sequenced from non-Y chromosomes in SRY vs. flow sorting) are an important finding. However, I'd still argue that comparing assemblies from two different Y haplogroups is a stretch. The authors could have performed the same assembly length comparison on the T2T-Y using results from their SRY-sorted reads with Verkko of HG002 vs. the Verkko assembly using trio-binned markers.
    7. In the section where assemblies are compared, the authors point to Table 1, which contains results from HG01109. HG01109 has never been mentioned before. I thought the authors were comparing assemblies from SRY-sorted reads of HX1? I am not sure why the authors suddenly added a 3rd PUR genome with no context. Was this a mistake? Add results from HX1 to Table 1.
    8. Please add divider lines in Table 1 between All / Ampliconic / X-degenerate / X-transposed / PAR / Het / Others. It is hard to see which rows belong to which category.
    9. In the last results section, where the authors compare results from Verkko, it is unclear how the Verkko assembly was run. The authors say "default option", and later "in trio mode" in the methods. Did the authors collect parental reads from HG002 (HG003 and HG004)? How was "trio mode" performed? Did the authors use trio binning to sort the reads, then run Verkko? Or did they use the homopolymer-compressed parental k-mers in the Rukki step of Verkko (and this should be benchmarked)? Was the HG002 trio assembly taken from the Rautiainen et al. paper? Please clarify and add the missing parts to the main text and methods.
    10. Related to the above section, it is hard to see in Fig. 4a the "two approximately 1 Mb contigs aligning to the same region of the Y chromosome". An enlarged inset of the dotplot may be helpful. Also, add legends and scales to the X and Y axes of the dotplots.
    11. Note that there is a mis-assembly reported on T2T-Y palindrome P5 (https://github.com/marbl/CHM13-issues/blob/main/v2.0_issues.bed), in which the entire P5 should be inverted. I don't see this in the dotplots of Fig. 4.
    12. In the discussion, the authors mention results from the 10 trios that have been removed from the previous results. Please add the 10 trio results to the main text if it was a mistake, or remove the irrelevant results from the Discussion and Supp. Tables.
    13. The authors discuss that the suboptimal performance of SRY in the PAR is caused by the restricted data types. I thought it was caused by the lower density of the markers? The PAR parental marker density was very similar to that of autosomes, with stretches of runs of homozygosity, presumably to maintain enough homology for recombination. What was the marker density in the PAR? Was it below their 7 k-mers per 1 kb?
    14. The authors mentioned that there are no ZW genomes available to test SRY. There is a Zebra finch trio (ZW, female, bTaeGut2) and a male sample (ZZ, male, bTaeGut1) available with HiFi of the child (bTaeGut2) and Illumina of all the genomes from the Vertebrate Genomes Project (Rhie et al., Nature, 2021). Perhaps the authors could apply SRY on this individual, and compare the W chromosome results to what has been released on https://www.genomeark.org/vgp-all/Taeniopygia_guttata.html.

    Re-review: The authors have addressed most of my concerns. The revised manuscript reads much better than before. Regarding my last comment and response from the authors about the W chromosome, I was hoping to see comparable coverage of the W chromosome to the reference, as a proof of principle that SRY could be applied to non-human, highly diverged genomes. The assembly looks very fragmented though. Was it only the similarity to the Z chromosome that caused the fragmentation? Are there no other factors contributing to the discontinuity of the W chromosome? A few minor comments below to the revised version:

    1. Please indicate which genome was compared in the legend of Supp. Table 5.
    2. When using "et al." notation, please use the last name. "Mari et al." should be "Serra Mari et al."; "Mikko et al." should be "Rautiainen et al.". Also, Serra Mari et al. is now published in Genome Biology: https://doi.org/10.1186/s13059-023-03160-z. Please update the reference.
    3. There are a few grammar corrections to make.
  2. Reviewer 2. Shilpa Garg

    Comments to Author: The SRY method, developed and evaluated for sorting long reads of sex-limited chromosomes, has shown promise in effectively identifying and sorting sequences based on sex-specific markers, particularly the Y chromosome. These sorted long reads are then utilized for genome assembly. Additionally, the SRY method can be used to select Y chromosome contigs from a male individual's whole genome assembly. Overall, the success of SRY in sorting and assembling long reads of sex-limited chromosomes highlights its potential as an alternative to experimental methods for studying sex-specific genomic regions. Here are some comments for further improvement of the manuscript:

    1. The authors may want to consider presenting a table of standard evaluation metrics (k-mer or alignment-based). See Garg 2021 (https://genomebiology.biomedcentral.com/articles/10.1186/s13059-021-02328-9).
    2. Adding a few important genes that are medically relevant and assembled properly may further add value to the work.
  3. This work has been published in GigaScience Journal under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giae015), and the reviews have been published under the same license. They are as follows.

    Reviewer 1: Zuyao Liu, Ph.D

    Comments to Author: The authors have introduced a novel bioinformatic approach for sex chromosome assembly, addressing a persistently challenging problem in genomics. This method harnesses the full potential of whole-genome resequencing data without necessitating supplementary experimental procedures, rendering it applicable to a wide array of non-model species. Notably, the method exhibits robustness when applied to human data, surpassing established techniques such as flow-sorting and trio-binning. While the manuscript exhibits promise, several key aspects warrant refinement and elucidation to bolster its consideration for publication in GigaScience.

    1. Language Polishing: A degree of language refinement is advisable to enhance the overall clarity and professionalism of the manuscript.
    2. Y Chromosome Assembly Discrepancy: The authors should acknowledge and provide an explanation for the substantial difference between the length of the latest Y chromosome assembly from T2T (~62Mb) and the assembly from SRY with Verkko (~23Mb), as detailed in Table 1.
    3. Y Chromosome Completeness: In cases where the Y chromosome assembly is incomplete, the inclusion of a figure or table delineating the proportion that SRY can recover in distinct regions of the Y chromosome would be beneficial. This could facilitate a comparative analysis of the method's efficacy across different regions.
    4. Figure 4 Clarity: It is imperative to label the coordinates on both the X and Y axes in Figure 4 to enhance clarity. While Figure 4 suggests that the assembly from SRY is complete compared to T2T-CHM13, the total length of the SRY assembly (approximately 23Mb) should be clearly reconciled with this observation.
    5. Table 1 Organization: The organization of Table 1 should be improved to enhance readability and comprehensibility.
    6. MSK-Based Read Filtering: Authors should explicitly address the potential exclusion of reads from Y regions with lower than average MSK, especially in species with both young and old parts on Y chromosomes. If possible, provide recommendations or strategies for rescuing such reads.
    7. Simulation for species with young sex chromosomes: It is essential to conduct additional simulations for testing the efficiency of isolating Y reads for species with young sex chromosomes. This analysis should consider the variation between X and Y chromosomes, aiding researchers in evaluating the method's suitability for their specific study organisms.

    Addressing these points will further strengthen the manuscript's scientific rigor and its suitability for publication in GigaScience.

    Re-review: After reading the revised article, the questions I had previously posed were answered. I am very interested in this SRY method and believe it is also an important part of sex chromosome research. From my personal point of view, it is not easy to collect trio data for most species except a few, but it is relatively easy to collect Hi-C data. It would be helpful if the authors could also compare the results of SRY with HiFi reads against those of hifiasm (Hi-C phased) to help people choose the right tool for sex chromosome assembly. However, this is not necessary, because SRY has achieved a very good result in humans. Overall, the data and results are convincing.