KaryoScope: rapid, alignment-free sequence annotation for the pangenome era

Abstract

The pangenome era is producing long-read sequencing data and complete genome assemblies (1–3) at a pace that current annotation methods cannot match. Existing tools were each built for a single feature class (repeats, centromeric satellites, or genes) and falter precisely where the genome is most variable and harbours clinically important variation: the centromeres, subtelomeres, and acrocentric short arms. Here we present KaryoScope, an alignment-free method to annotate an assembly at base-pair resolution across any desired feature classes in a single pass, completing in minutes on a standard workstation. Applied to the Human Pangenome Reference Consortium Release 2 assemblies (3), KaryoScope identifies the SST1 macrosatellite as the recurrent sequence at Robertsonian translocation fusion points (4, 5), delivers the first pangenome-wide census of D4Z4 macrosatellite structural diversity at the 4q and 10q subtelomeres relevant to facioscapulohumeral muscular dystrophy (6), and reveals previously uncharacterised centromere structural polymorphism, including chromosome-specific satellite loss and megabase-scale rearrangement validated by fluorescence in situ hybridization. A pre-built KaryoScope database for the human genome is distributed alongside the tool, and additional databases can be built for any reference genome or annotation source. Together, these capabilities bring the most variable regions of the genome within reach for comparative, clinical, and pangenome-scale analysis. KaryoScope is available at https://github.com/barthel-lab/KaryoScope.
