A tailored variant filtering procedure for multi-breed and multi-species unbalanced animal SNP collections

Barbara Lazzari
Marco Milanesi
Andrea Talenti
Arianna Bionda
Yefang Li
Lin Jiang
Philippe Bardou
Gwenola Tosser Klopp
Paola Crepaldi
Licia Colli


Abstract

Technological advances and the falling cost of whole-genome sequencing have made available a huge and ever-increasing amount of resequencing data for many species. It is thus now possible to assemble large datasets encompassing the molecular variation of several species and/or populations or breeds. Nonetheless, these datasets can be extremely variable in terms of geographical provenance and sample size, with taxonomic groups ranging from hundreds of entries to just a few or even a single one. In such circumstances, applying standard filtering approaches may introduce biases and lead to the under- or over-representation of some groups or gene pools. Commonly adopted variant filtering approaches relying on Minor Allele Frequency (MAF) and Linkage Disequilibrium (LD) may not be suitable for datasets representing the broad-scale diversity of multiple species, owing to remarkable differences in LD structure and in the frequency of variants at the local vs. global scale. Thus, using the VarGoats 1000 goat genome project data as an optimal case study, we devised a novel approach based on within-population subsampling, Minor Allele Count (MAC) and marker spacing (bp-space), specifically designed to avoid the biases introduced by standard filtering procedures and to adequately represent continental and species-specific variation. Starting from a quality-filtered dataset of >28M SNPs from 1372 animals, we obtained a dataset of <14M markers and 750 individuals that complies with the initial requirements and is more manageable in further computational steps. The dataset was validated by PCA, Neighbor Joining and Admixture analyses.
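The three ingredients named in the abstract (within-population subsampling, a MAC threshold and a minimum marker spacing) can be illustrated with a minimal Python sketch. It is not the authors' pipeline. The function names, the data layout (alternate-allele counts per sample), the order of the steps, the greedy thinning strategy and all threshold values are assumptions made for illustration only; the abstract does not specify them.

```python
import random
from collections import defaultdict


def subsample_populations(sample_pops, max_per_pop, seed=0):
    """Keep at most `max_per_pop` randomly drawn individuals per population.

    sample_pops: dict mapping sample ID -> population label (hypothetical layout).
    """
    rng = random.Random(seed)
    by_pop = defaultdict(list)
    for sample, pop in sample_pops.items():
        by_pop[pop].append(sample)
    kept = []
    for pop in sorted(by_pop):
        members = sorted(by_pop[pop])
        if len(members) > max_per_pop:
            members = sorted(rng.sample(members, max_per_pop))
        kept.extend(members)
    return kept


def minor_allele_count(genotypes):
    """MAC from alternate-allele counts (0, 1, 2); None marks a missing call."""
    called = [g for g in genotypes if g is not None]
    alt = sum(called)
    return min(alt, 2 * len(called) - alt)


def filter_variants(variants, samples, kept_samples, min_mac, min_spacing_bp):
    """Apply a MAC threshold, then greedy spacing-based thinning.

    variants: list of (chrom, pos, genotypes), sorted by chromosome and position,
    with genotypes ordered as in `samples`. A variant is kept if its MAC among
    the retained samples is at least `min_mac` and it lies at least
    `min_spacing_bp` from the previously kept variant on the same chromosome.
    """
    idx = [samples.index(s) for s in kept_samples]
    last_kept_pos = {}  # chromosome -> position of the last retained variant
    retained = []
    for chrom, pos, gts in variants:
        subset = [gts[i] for i in idx]
        if minor_allele_count(subset) < min_mac:
            continue
        if chrom in last_kept_pos and pos - last_kept_pos[chrom] < min_spacing_bp:
            continue
        last_kept_pos[chrom] = pos
        retained.append((chrom, pos))
    return retained
```

Because MAC is an absolute count, a variant carried by only a few animals of a small group can pass the threshold even when its frequency across the whole collection is very low, which is the kind of local variation a global MAF cut-off would tend to discard.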

Version published to 10.1101/2025.10.14.682050 on bioRxiv
Oct 15, 2025

Related articles

Genetic estimates of relatedness: Established practices and new opportunities through low coverage whole genome sequencing

This article has 8 authors:
1. Annika Freudiger
2. Natalie Kestel
3. Vladimir Jovanovic
4. Mariana Madruga de Brito
5. Angelina Ruiz-Lambides
6. Katja Nowick
7. Anja Widdig
8. Harald Ringbauer
This article has no evaluations. Latest version: Jan 23, 2026

Comparison of BLUPF90IOD3 and MiXBLUP implementations of the single-step model applied to the Polish national dairy cattle evaluation

This article has 4 authors:
1. Dawid Słomian
2. Michalina Jakimowicz
3. Tomasz Suchocki
4. Joanna Szyda
This article has no evaluations. Latest version: Dec 22, 2025

Genome-wide prediction and association mapping of potato common scab with historical data

This article has 10 authors:
1. Fatima Latif Azam
2. Matthijs Brouwer
3. David Douches
4. Joseph Coombs
5. Amber Walker
6. Maria Caraza-Harter
7. Dan Milbourne
8. Denis Griffin
9. Herman J. van Eck
10. Jeffrey B. Endelman
This article has no evaluations. Latest version: Jan 12, 2026
