STRmie-HD enables interruption-aware HTT repeat genotyping and somatic mosaicism profiling across sequencing platforms
This article has been Reviewed by the following groups
Listed in
- Evaluated articles (PREreview)
Abstract
Short tandem repeat expansions in exon 1 of the HTT gene drive Huntington’s disease (HD) pathogenesis, with disease onset and progression heavily influenced by somatic mosaicism and sequence interruptions. While sequencing technologies enable repeat sizing, many computational tools lack the resolution to capture subtle interruption motifs and allele-specific somatic variation. We present STRmie-HD, an alignment-free, de novo framework for interruption-aware genotyping and quantitative profiling of somatic mosaicism at single-read resolution. The tool parses individual reads to quantify uninterrupted CAG tract length, CCG repeat content, and critical interruption variants, including Loss of Interruption (LOI) and Duplication of Interruption (DOI). Validated across Illumina, PacBio SMRT, and Oxford Nanopore platforms, STRmie-HD demonstrates high concordance with reference genotypes and superior sensitivity in identifying rare interruption patterns that conventional tools often overlook. Furthermore, it implements somatic mosaicism metrics to characterize repeat dynamics, successfully distinguishing the higher somatic expansion burden in brain tissues compared to peripheral blood. STRmie-HD offers a comprehensive and extensible solution for high-resolution molecular characterization of HTT variation, providing a robust framework for patient stratification and genetic research in HD.
Graphical Abstract
STRmie-HD flowchart. STRmie-HD is a comprehensive analytical framework that processes sequencing reads to analyze CAG/CCG trinucleotide repeats, interruption variants, and somatic mosaicism in the HTT gene. The workflow begins with sequencing reads (FASTA/FASTQ) that can undergo optional custom processing based on the sequencing design. These reads are then fed into a regular expression-based engine (STRmie-HD) to identify CAG and CCG motifs. The identified motifs lead to the estimation of CAG/CCG alleles, visualized as distinct peaks representing different allele sizes, interruption variant assessment, and somatic mosaicism quantification. STRmie-HD produces an HTML output that wraps this information into a report.
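To make the idea of a regular expression-based, per-read engine concrete, here is a minimal Python sketch. It is not STRmie-HD's implementation: the pattern, the function names `profile_read` and `cag_histogram`, and the label mapping are all hypothetical. It assumes the commonly described HTT exon 1 layout, in which an uninterrupted CAG tract is followed by a CAACAG interruption unit, a CCGCCA linker, and a CCG tract, and it treats zero CAACAG units as LOI-like and two units as DOI-like.

```python
import re
from collections import Counter

# Hypothetical pattern, not taken from STRmie-HD: an uninterrupted CAG tract,
# zero or more CAACAG interruption units, the CCGCCA linker, then a CCG tract.
HTT_PATTERN = re.compile(r"((?:CAG){2,})((?:CAACAG)*)CCGCCA((?:CCG)*)")

# Assumed mapping from the number of CAACAG units to an interruption label.
LABELS = {0: "LOI", 1: "canonical", 2: "DOI"}


def profile_read(read):
    """Return (cag_units, ccg_units, label) for one read, or None if no match."""
    match = HTT_PATTERN.search(read.upper())
    if match is None:
        return None
    cag_units = len(match.group(1)) // 3
    interruption_units = len(match.group(2)) // 6
    ccg_units = len(match.group(3)) // 3
    return cag_units, ccg_units, LABELS.get(interruption_units, "other")


def cag_histogram(reads):
    """Count reads per uninterrupted CAG length, skipping unmatched reads."""
    profiles = (profile_read(r) for r in reads)
    return Counter(p[0] for p in profiles if p is not None)
```

In a histogram built this way, allele sizes would appear as the most populated CAG lengths, and the reads scattered around each peak are the raw material for somatic mosaicism metrics. A real tool would also need to tolerate sequencing errors inside the repeat, which this sketch does not attempt.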
Article activity feed
This Zenodo record is a permanently preserved version of a Structured PREreview. You can view the complete PREreview at https://prereview.org/reviews/20241463.
Does the introduction explain the objective of the research presented in the preprint?
Yes
- Explaining the biological and clinical complexity of Huntington's disease (HD)
- Highlighting limitations in current sequencing and computational methods
- Identifying a specific gap (lack of tools that simultaneously handle repeat size, somatic mosaicism, and interruption variants)
- Explicitly stating the proposed solution: STRmie-HD and its purpose
Are the methods well-suited for this research?
Highly appropriate
The methods are well-aligned with the stated research objective and reflect strong methodological rigor:
- They directly address the identified gap (simultaneous detection of repeat length, interruption variants, and somatic mosaicism).
- The per-read parsing approach is appropriate for capturing heterogeneity and mosaicism.
- The use of a regular expression-based, alignment-free strategy is well-justified given the limitations of reference-based methods for repeat expansions.
- Inclusion of quantitative indices (EI, II) strengthens downstream interpretability.
- The ONT-specific handling demonstrates awareness of platform-specific limitations and best practices.
- The framework is adaptable (ROI filtering, customizable parameters), which enhances robustness.
Overall, the methods are thoughtfully designed, technically sound, and clearly tailored to the biological and computational challenges outlined in the introduction.
Are the conclusions supported by the data?
Highly supported
The conclusions are well supported by the data presented in the manuscript:
- The authors provide extensive benchmarking across four datasets (Illumina, PacBio, ONT, and synthetic), demonstrating consistent performance of STRmie-HD across different sequencing platforms.
- Quantitative metrics such as MAE, RMSE, and correlation coefficients directly support claims of high accuracy and robustness.
- The conclusions about superior or comparable performance vs. other tools are backed by explicit comparative results (e.g., ScaleHD, TRGT, RepeatDetector).
- Claims regarding interruption variant detection are supported by orthogonally validated samples, quantitative read-level percentages, and clear evidence of improved detection over existing tools.
- Biological conclusions (e.g., higher somatic expansion in brain vs blood) are supported by statistical analysis (Kruskal–Wallis test with significant p-values).
- The discussion appropriately includes limitations and caveats (e.g., dependence on sequencing platform, preprocessing), avoiding overstatement.
Overall, the authors do not overreach; their conclusions align closely with the empirical results and are framed appropriately within the scope of the study.
Are the data presentations, including visualizations, well-suited to represent the data?
Somewhat appropriate and clear
The manuscript uses a variety of appropriate visualizations:
- Histograms for repeat distributions
- Scatter plots for correlation with ground truth
- Bar plots and tables for performance metrics
- Boxplots for biological comparisons (e.g., EI across tissues)
Figures are aligned with the type of data. Inclusion of quantitative tables (MAE, RMSE, CI) improves clarity and supports interpretation. Providing raw histograms and outputs as supplementary material supports transparency and reproducibility.
Limitations
- Some visualizations (especially histograms and multi-tool comparisons) may be dense or harder to interpret without domain expertise, and not fully optimized for quick interpretability by broader audiences.
- Accessibility considerations (like simplified summaries, clearer legends, or visual consistency across figures) could be improved.
- Heavy reliance on supplementary materials for full interpretation slightly reduces immediate clarity.
How clearly do the authors discuss, explain, and interpret their findings and potential next steps for the research?
Somewhat clearly
The authors provide a clear, logically structured discussion that (1) restates the unmet need, (2) interprets the benchmark results across platforms, (3) highlights what is novel about STRmie-HD (especially interruption-aware, single-read quantification), and (4) outlines practical implications and extensions.
Is the preprint likely to advance academic knowledge?
Highly likely
The preprint makes meaningful and substantive contributions.
Would it benefit from language editing?
No
Would you recommend this preprint to others?
Yes, it's of high quality
Is it ready for attention from an editor, publisher or broader audience?
Yes, after minor changes
The manuscript does not require major rewriting, but professional polishing (clarity, conciseness, flow) would significantly improve readability and impact.
Competing interests
The author declares that they have no competing interests.
Use of Artificial Intelligence (AI)
The author declares that they did not use generative AI to come up with new ideas for their review.