A tailored variant filtering procedure for multi-breed and multi-species unbalanced animal SNP collections

Abstract

Technological advances and the falling cost of whole-genome sequencing have made a large and ever-increasing amount of resequencing data available for many species. It is thus now possible to assemble large datasets encompassing the molecular variation of several species and/or populations or breeds. Nonetheless, these datasets can be extremely variable in geographical provenance and sample size, with taxonomic groups ranging from hundreds of entries to just a few, or even a single one. In such circumstances, standard filtering approaches may introduce biases and lead to the under- or over-representation of some groups or gene pools. Commonly adopted variant filters based on Minor Allele Frequency (MAF) and Linkage Disequilibrium (LD) may not be suitable for datasets representing the broad-scale diversity of multiple species, because LD structure and variant frequencies differ markedly between the local and the global scale. Thus, using the VarGoats 1000 goat genome project data as an optimal case study, we devised a novel approach based on within-population subsampling, Minor Allele Count (MAC) and marker spacing (bp-space), specifically designed to avoid the biases introduced by standard filtering procedures and to adequately represent continental and species-specific variation. Starting from a quality-filtered dataset of >28M SNPs from 1372 animals, we obtained a dataset of <14M markers and 750 individuals that complies with the initial requirements and is more manageable in downstream computational steps. The dataset was validated by PCA, Neighbor Joining and Admixture analyses.
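To make the three filtering ideas named above more concrete, the following is a minimal Python sketch of within-population subsampling, a Minor Allele Count computation and spacing-based marker thinning. It is an illustration only and not the authors' pipeline: the function names, the per-population cap, the greedy thinning rule and the genotype encoding (alternate-allele counts 0, 1, 2 per diploid individual, `None` for missing) are all assumptions introduced here, and the abstract does not state the thresholds actually used.

```python
import random
from collections import defaultdict


def subsample_populations(sample_to_pop, max_per_pop, seed=0):
    """Keep at most `max_per_pop` randomly chosen samples per population.

    `sample_to_pop` maps a sample ID to its population/breed label.
    The cap value is a hypothetical parameter, not taken from the article.
    """
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for sample, pop in sorted(sample_to_pop.items()):
        by_pop[pop].append(sample)
    kept = []
    for pop in sorted(by_pop):
        members = by_pop[pop]
        if len(members) > max_per_pop:
            members = rng.sample(members, max_per_pop)
        kept.extend(sorted(members))
    return kept


def minor_allele_count(genotypes):
    """Return the count of the rarer allele among called diploid genotypes."""
    called = [g for g in genotypes if g is not None]
    alt = sum(called)
    ref = 2 * len(called) - alt
    return min(alt, ref)


def thin_by_spacing(markers, min_gap_bp):
    """Greedily keep (chrom, pos) markers at least `min_gap_bp` apart.

    Markers are scanned in sorted order; a marker is kept when it is the
    first on its chromosome or far enough from the last kept marker.
    """
    last_kept = {}
    kept = []
    for chrom, pos in sorted(markers):
        if chrom not in last_kept or pos - last_kept[chrom] >= min_gap_bp:
            kept.append((chrom, pos))
            last_kept[chrom] = pos
    return kept
```

Unlike a frequency threshold, a count-based threshold does not scale with the number of individuals, so a variant carried by only a few animals of a small group is not discarded merely because the overall dataset is large. Capping the number of samples per population before counting alleles likewise keeps heavily sampled breeds from dominating the filter.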
