Harmonizing Heterogeneous Datasets: Imputation and quality control for multi genotyping platform and multi-breed LD SNP panel analysis in Cattle

Abstract

Low-density (LD) SNP chips are widely used in cattle genomics for applications such as breed characterization, genomic selection, and breeding value estimation. However, differences in SNP content across chip versions and missing genotypes can limit the utility of these datasets. Imputation against a unified reference panel can address both problems by improving data completeness and enabling cross-platform compatibility. In this study, 30,124 cattle DNA samples representing five breeds, namely Gir, Sahiwal, Kankrej, crossbred Holstein Friesian (CBHF), and crossbred Jersey (CBJY), were genotyped using various versions of INDUSCHIP, a low-density SNP chip, across Illumina and Affymetrix platforms. To ensure data compatibility, genotype data underwent rigorous quality control and were standardized onto a unified reference panel. Missing genotypes were imputed using this panel, and a masked analysis demonstrated an average concordance of 94.56% between genotyped and imputed data. Further evaluation using the dosage R-squared (DR2) metric showed that most imputed SNPs achieved DR2 scores above 0.75, indicating high imputation accuracy and reliability across all breeds. The imputed dataset generated in this study provides a robust and harmonized genomic resource for cattle breeding programs. This resource supports critical applications such as breed purity assessment, genomic selection, and breeding value estimation, enhancing the accuracy and efficiency of genetic improvement initiatives.
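
The masked-concordance and DR2 checks mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the 0/1/2 allele-count coding with `None` for a missing call, and the toy data are all assumptions made for illustration. The idea is to hide a random fraction of the observed genotype calls, impute them with whatever tool is in use, and then report the percentage of hidden calls that the imputation recovered exactly, along with the share of SNPs whose DR2 exceeds a chosen threshold.

```python
import random


def mask_genotypes(genotypes, fraction, seed=0):
    """Hide a random fraction of the observed calls.

    genotypes: list of 0/1/2 allele counts, with None for a missing call
    (an assumed coding, not taken from the paper).
    Returns (masked_copy, hidden_positions).
    """
    rng = random.Random(seed)
    observed = [i for i, g in enumerate(genotypes) if g is not None]
    n_mask = int(round(fraction * len(observed)))
    hidden = sorted(rng.sample(observed, n_mask))
    masked = list(genotypes)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def concordance(truth, imputed, positions):
    """Percentage of masked positions where the imputed call equals the original."""
    if not positions:
        raise ValueError("no masked positions to compare")
    matches = sum(1 for i in positions if truth[i] == imputed[i])
    return 100.0 * matches / len(positions)


def fraction_above_threshold(dr2_values, threshold=0.75):
    """Share of SNPs whose DR2 is strictly greater than the threshold."""
    if not dr2_values:
        raise ValueError("no DR2 values supplied")
    return sum(1 for v in dr2_values if v > threshold) / len(dr2_values)


if __name__ == "__main__":
    # Toy data only; a real run would read calls and DR2 values from the
    # imputation tool's output.
    truth = [0, 1, 2, 1, 0, 2, 1, 0]
    masked, hidden = mask_genotypes(truth, fraction=0.25, seed=1)
    imputed = list(truth)  # stand-in for the imputed genotypes
    print(concordance(truth, imputed, hidden))
    print(fraction_above_threshold([0.9, 0.8, 0.5, 0.76]))
```

In practice the masking would be applied per sample or per SNP before imputation, and the concordance would be averaged across breeds and chip versions. The 0.75 default mirrors the cutoff quoted in the abstract.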
