TinkerHap - A Novel Read-Based Phasing Algorithm with Integrated Multi-Method Support for Enhanced Accuracy

Abstract

Phasing, the assignment of alleles to their respective parental chromosomes, is fundamental to studying genetic variation and identifying disease-causing variants. Traditional approaches, including statistical, pedigree-based, and read-based phasing, face challenges such as limited accuracy for rare variants, reliance on external reference panels, and constraints in regions with sparse genetic variation.

To address these limitations, we developed TinkerHap, a novel and unique phasing algorithm that integrates a read-based phaser, based on a pairwise distance-based unsupervised classification, with external phased data, such as statistical or pedigree phasing. We evaluated TinkerHap’s performance against other phasing algorithms using 1,040 parent-offspring trios from the UK Biobank (Illumina short-reads) and GIAB Ashkenazi trio (PacBio long-reads). TinkerHap’s read-based phaser alone achieved higher phasing accuracies than all other algorithms with 95.1% for short-reads (second best: 94.8%) and 97.5% for long-reads (second best: 95.5%). Its hybrid approach further enhanced short-read performance to 96.3% accuracy and was able to phase 99.5% of all heterozygous sites. TinkerHap also extended haplotype block sizes to a median of 79,449 base-pairs for long-reads (second best: 68,303 bp) and demonstrated higher accuracy for both SNPs and indels. This combination of a robust read-based algorithm and hybrid strategy makes TinkerHap a uniquely powerful tool for genomic analyses.
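The "pairwise distance-based unsupervised classification" the abstract describes can be illustrated with a minimal sketch. This is not TinkerHap's actual implementation; the function names, read representation, and greedy scoring rule are all illustrative assumptions. Each read is summarized by its alleles at heterozygous sites, reads are compared by allele mismatches over shared sites, and each read is greedily assigned to whichever of two haplotype groups it agrees with more:

```python
def allele_distance(read_a, read_b):
    """Compare two reads over the heterozygous sites both cover.

    Reads are dicts mapping site position -> observed allele (0/1).
    Returns (mismatches, shared_sites)."""
    shared = read_a.keys() & read_b.keys()
    mismatches = sum(read_a[s] != read_b[s] for s in shared)
    return mismatches, len(shared)

def classify_reads(reads):
    """Greedily split reads into two haplotype groups.

    Each read joins the group whose members it agrees with more,
    scoring agreement minus disagreement at shared sites."""
    groups = ([], [])
    for read in reads:
        scores = [0, 0]
        for i, group in enumerate(groups):
            for other in group:
                mism, shared = allele_distance(read, other)
                scores[i] += shared - 2 * mism  # matches minus mismatches
        # Join the better-matching group (ties go to group 0).
        groups[0 if scores[0] >= scores[1] else 1].append(read)
    return groups
```

In this toy form, the two resulting groups correspond to the two parental haplotypes within one phase block; a real implementation must additionally handle sequencing errors, block boundaries, and integration with external phase information.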

Article activity feed

  1. This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giaf138), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

    Reviewer 3: Julia Markowski

    In the presented Technical Note "TinkerHap - A Novel Read-Based Phasing Algorithm with Integrated Multi-Method Support for Enhanced Accuracy" by Hartmann et al., the authors introduce TinkerHap, a new hybrid phasing tool that primarily relies on read-based phasing for both short- and long-read sequencing data, but can additionally incorporate externally phased haplotypes, enabling it to build upon phase information derived from existing statistical or pedigree-based phasing approaches. This hybrid approach addresses an important and timely challenge in the field: integrating the complementary strengths of different phasing strategies to improve the accuracy and span of haplotype blocks, particularly for rare variants, or in variant-sparse genomic regions. The authors clearly articulate the limitations of existing approaches and present their solution in a manner that is both elegant and accessible. Design features such as multiple output formats and compatibility with third-party tools demonstrate a practical awareness of user needs. The authors evaluate TinkerHap using both short-read and long-read state-of-the-art benchmarking datasets, and compare its performance against commonly used phasing tools, demonstrating improvements in both phasing accuracy and haplotype block lengths. Overall, this is a well-conceived and thoughtfully implemented contribution to the phasing community.

    While the manuscript is overall well written, there are a few areas where additional clarification or extension would improve its impact. I recommend the following revisions to help clarify key aspects of the method, enhance the generalizability of the evaluation, and align the manuscript more closely with journal guidelines.

    Major Comments

    • (1) Limited scope of benchmarking: The evaluation on the highly polymorphic MHC class II region is appropriate for highlighting TinkerHap's strengths in phasing rare variants in variable regions. However, the current evaluation of short-read phasing is based on a ∼700 kb region selected for its high variant density, which limits the generalizability of the findings. Since the manuscript emphasizes improved performance in regions with sparse genetic variation, it would strengthen the work to include chromosome-wide or genome-wide benchmarks, particularly on short-read data. This would also provide a more balanced comparison with tools like SHAPEIT5, which predictably underperform in the MHC class II region due to their reliance on population allele frequencies and linkage disequilibrium patterns that are less effective for rare or private variants.
    • (2) Coverage and scalability: The manuscript describes TinkerHap as scalable, but since the algorithm relies on overlapping reads, it is unclear how its performance varies with sequencing depth. Including a figure or supplementary analysis showing phasing accuracy, runtime, and memory usage at different coverage levels (particularly for short-read data) would help support this claim and guide users on appropriate coverage requirements.
    • (3) Clarify algorithmic novelty: It would be helpful to elaborate on how TinkerHap's read-based phasing algorithm differs from existing approaches such as the weighted Minimum Error Correction (wMEC) framework implemented in WhatsHap. For example, what specifically enables TinkerHap's read-based mode to produce longer haplotype blocks than other read-based tools?
    • (4) Data description: A brief characterization of the input datasets, such as the sequencing depth and the number and average genomic distance of heterozygous variants in the MHC class II region and the GIAB trio data, would provide important context for interpreting the reported phasing accuracy and haplotype block lengths.
    • (5) Manuscript structure: Since the algorithm itself is the core novel contribution, it should be part of the results section, as should the description of the evaluation currently placed in the discussion. According to GigaScience's Technical Note guidelines, the method section should be reserved for "any additional methods used in the manuscript, that are not part of the new work being described in the manuscript."

    Minor Comments

    • (a) Novelty of hybrid approach: While TinkerHap's ability to integrate externally phased haplotypes is valuable, similar functionality exists in other tools; for example, SHAPEIT can accept pre-phased scaffolds (including those generated from read-based phasing), and WhatsHap supports trio-based phasing. Consider refining the language to more precisely describe what is uniquely implemented in TinkerHap's hybrid strategy. It would be interesting to see how the presented results of using SHAPEIT's phasing output as input for TinkerHap compare to an approach of feeding TinkerHap's read-based phasing results into SHAPEIT.
    • (b) Reference bias claim: The introduction states that read-based phasing is "independent of reference bias." While this approach is generally less susceptible to reference bias than statistical phasing, bias can still arise during the read alignment stage, potentially affecting downstream phasing. This point should be clarified.
    • (c) GIAB datasets: The abstract mentions only the GIAB Ashkenazi trio, but later the Chinese trio is included in the analysis as well. Please clarify whether results are averaged across the two datasets.
    • (d) Tool version citation: Please clarify in the text that the comparison was made using SHAPEIT5, not an earlier version.

    Recommendation: Minor Revision. With additional clarification on generalizability and coverage sensitivity, this manuscript will make a valuable contribution to the field.

  2. Reviewer 2: Yilei Fu

    TinkerHap is a read-based phasing algorithm designed to accurately assign alleles to parental haplotypes using sequencing reads.

    General comments:

    1. The manuscript would greatly benefit from the inclusion of a flowchart or schematic overview of the TinkerHap algorithm. Given that the method incorporates multiple components—including read-based phasing, pairwise distance-based unsupervised classification, and optional integration with statistical phasing tools like ShapeIT—a visual diagram would help readers grasp the workflow more intuitively.

    Major comments:
    2. The authors are missing experiments for long-read-based phasing. How does TinkerHap perform with ShapeIT on PacBio long-reads? I would suggest the authors use the same phasing method classes as in their short-read analysis: TinkerHap+ShapeIT; TinkerHap; WhatsHap; HapCUT2; ShapeIT. Also, I believe ShapeIT is capable of taking long-read SNV/INDEL calls as a VCF.
    3. Following up on point 1, the experimental design of this study is quite skewed. WhatsHap is not suitable for short-read sequencing data, so it does not make sense to apply it to short-read data.
    4. I would encourage the authors to read and potentially compare with SAPPHIRE (https://doi.org/10.1371/journal.pgen.1011092), a method developed by the ShapeIT team for incorporating long-read sequencing data into ShapeIT.
    5. To better justify the hybrid strategy, I recommend adding an analysis of sites where TinkerHap and ShapeIT disagree. Are these differences due to reference bias, read coverage, variant type, or true ambiguity? Such an evaluation would help users understand when to rely on the read-based output vs. ShapeIT, and would enhance confidence in the merging strategy.

    Minor comments:
    6. I could see the software versions in the supplementary GitHub repository, but I think it is also important to include them in the manuscript. For example, ShapeIT versions 2 through 5 have quite different functionality. The citation for ShapeIT in the manuscript is for ShapeIT 2, but the program that was used is ShapeIT 5.
    7. The benchmarking hardware should be specified for the runtime comparison.
    8. "...a novel and unique phasing algorithm..." -> "...a novel phasing algorithm..."
  3. Reviewer 1: Arang Rhie

    The authors present TinkerHap, a tool that accepts a variant call set and read alignments, and assigns heterozygous variants and reads to a particular haplotype based on a greedy pairwise distance-based classification. It accepts a pre-phased VCF as an option to further extend phased blocks. The results sound neat, with statistics that make TinkerHap look the strongest compared to current state-of-the-art read-alignment-based phasing methods such as HapCUT2 and WhatsHap, and to ShapeIT, which uses statistical inference from reference panel data. However, there are several aspects the authors need to address to make their results more compelling.

    1. The benchmarking was only performed on MHC Class II, which is a relatively small and easy-to-phase region given its high level of heterozygosity. How do the statistics look when applied to the whole genome? After generating the phased read set, what % of reads can be accurately assigned to the original haplotype at the whole-genome scale? To benchmark the latter, I would recommend doing it on HG002 phased variants and reads using the HG002Q100 genome (https://github.com/marbl/hg002) - i.e., map the classified reads and calculate the coverage and accuracy based on where the reads align. I would be curious to see how the MHC Class II phased read alignment looks on the HG002Q100 truth assembly, on each haplotype.
    2. When showing benchmarking results, key features are missing: 1) the number of heterozygous variant sites used for phasing, in addition to the Phased % (what is the denominator here?); 2) the number of phase blocks, the phase block NG50, and the total length; and 3) the NGx length distribution, plotting the cumulative covered genome length as a function of phase blocks ordered from longest to shortest.
    3. After phasing the variants (and reads), are the authors able to accurately type the HLA Class II genes? The goal of MHC phasing is to accurately genotype the HLA genes. It is unclear to me why the authors applied their phasing to the 1,040 parent-offspring trios. I agree that the region is 'phasable'; however, the motivation here is unclear - the MHC Class II region is particularly known to have linked HLA types (e.g., HLA-DRB3 and HLA-DRB5 are inherited together depending on the HLA-DRB1 type, while in some haplotypes HLA-DRB3 is entirely missing), and because the reference incompletely represents this locus, multiple tools have been developed for genotyping it. I would be more convinced if the authors could also show HLA genotyping accuracy based on their phasing method.
    4. Is it possible to use additional data types to further extend the phase blocks, such as low-coverage PacBio data in addition to the short-read WGS? How about phasing with linked-reads or Hi-C? Both WhatsHap and HapCut2 are specifically designed to combine such short- and long-range datasets, which gives an advantage to using such tools.
    5. The authors claim their method is free from reference bias, with which I strongly disagree. Using a BAM file aligned to a reference inherently carries mapping biases, so any such tool is limited by reads that align incorrectly. Repeats, especially copy-number-variable regions with collapses in the reference, are very difficult to phase accurately. Any large structural variant not properly represented in the reference will cause problems due to unmapped reads.
    6. In Methods, 2nd section - I would suggest using 'allele 1' and 'allele 2' instead of 'reference' and 'alternative' in the equation and the code. This will allow inclusion of heterozygous 'phasable' variants that do not carry any reference allele.
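    As a reference for the phase block NG50 requested in Reviewer 1's point 2: NG50 is the block length at which the cumulative length of blocks, sorted longest to shortest, first reaches half of the genome (or region) length. A minimal, tool-independent sketch (function name and return convention are illustrative assumptions):

    ```python
    def ng50(block_lengths, genome_length):
        """Length of the phase block at which the cumulative length of
        blocks (sorted longest to shortest) first reaches half of
        genome_length; 0 if the blocks never cover that much."""
        total = 0
        for length in sorted(block_lengths, reverse=True):
            total += length
            if total >= genome_length / 2:
                return length
        return 0
    ```

    Unlike N50, which divides by the total assembled (here: phased) length, NG50 divides by the genome or region length, so unphased gaps penalize the statistic. The NGx curve in point 3 generalizes this by sweeping the threshold from 0% to 100% of the genome length.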