Measuring DNA contents of animal and plant genomes with Gnodes, the long and short of it
Abstract
Measuring the DNA contents of genomes is valuable for understanding genome biology, including the assessment of genome assemblies, but it is not a trivial problem. Measuring the contents of DNA shotgun reads is complicated by several factors: the biological contents of genomes at species, individual, and tissue or cell levels; laboratory methods; sequencing technology; and computational processing for measurement and assembly. These complications are comparable to, and partly shared with, those of cytometric (Cym) and related molecular measurements of genome size and contents.
There is an obvious discrepancy between cytometric measurements and current long-read genome assemblies (Asm): genome assemblies average 12% below Cym-measured sizes, differing in amounts of duplicated content. This report examines five DNA read types to see whether they can be used for more precise and reliable discrimination of major genome contents and sizes. The read types are short, accurate Illumina reads; long Pacific Biosciences reads of low and high accuracy; and long Oxford Nanopore Technology reads of low and high accuracy. Gnodes, the measurement tool used, maps DNA reads to an assembly and measures DNA copy numbers for major genome contents (genes, transposons, repeats, and others), using the single copies of unique conserved genes as the measurement unit. Public data for five well-studied genomes (human, corn, zebrafish, sorghum and rice) provide the primary evidence for this work.
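The measurement idea described above, scaling read depth by the depth observed over unique single-copy genes, can be illustrated with a minimal sketch. This is not the Gnodes implementation: the function names, the use of a median as the unit depth, and all numbers in the usage note are assumptions made purely for illustration.

```python
from statistics import median


def copy_number_estimates(depth_by_class, span_by_class, unique_gene_depths):
    """Estimate copy number and DNA content per genome content class.

    depth_by_class: mean read depth over each class in the assembly
        (e.g. genes, transposons, repeats).
    span_by_class: assembled bases belonging to each class.
    unique_gene_depths: read depths over unique conserved genes, assumed
        to be present at a single copy.

    The unit depth is taken here as the median depth over unique genes;
    that choice is an assumption of this sketch, not a statement about
    how Gnodes computes it.
    """
    unit_depth = median(unique_gene_depths)
    if unit_depth <= 0:
        raise ValueError("unit depth over unique genes must be positive")

    estimates = {}
    for name, depth in depth_by_class.items():
        span = span_by_class[name]
        copies = depth / unit_depth  # depth relative to single-copy genes
        estimates[name] = {
            "copy_number": copies,
            "assembled_bases": span,
            "estimated_bases": copies * span,  # content implied by the reads
        }
    return estimates


def estimated_genome_size(estimates):
    """Sum the read-implied content over all classes."""
    return sum(e["estimated_bases"] for e in estimates.values())
```

With hypothetical toy inputs such as `copy_number_estimates({"genes": 30, "transposons": 45}, {"genes": 100, "transposons": 200}, [29, 30, 31, 30])`, a class whose depth is 1.5 times the unique-gene depth is reported at 1.5 copies, so the reads imply more duplicated DNA than the assembly contains for that class.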
The results are mixed and open to interpretation. In broad terms, all DNA read types measure about the same genome contents, at or below 90% agreement, a level to which the other complications can contribute. For precision above the 90% level, long read types differ: low accuracy reads support the larger cytometric sizes, high accuracy reads support the smaller assembly sizes, and accurate short reads fall roughly between. The weight of evidence suggests that low accuracy long reads are less biased for genome measurement, and that high accuracy long reads carry a bias of reduced duplications introduced by computational averaging or filtering. The several complicating factors noted can produce discrepancies larger than this average Cym - Asm difference, and are difficult to control.