assessPool: a flexible pipeline for population genomic analyses of pooled sequencing data
Abstract
Despite the dramatic decrease in high-throughput sequencing costs over time, sequencing the ideal number of individuals for population genetic inference remains prohibitively expensive. When research questions require only population-level resolution, pooling individual samples before sequencing (pool-seq) can substantially reduce costs while still providing allele frequencies of single nucleotide polymorphisms (SNPs). However, analyzing pooled data is more difficult and less standardized than individual-based analysis. Although several programs have been developed to handle pool-seq data, most require extensive formatting or programming skills to operate. Here we introduce assessPool, an open-source R and Bash pipeline for pool-seq analyses with a focus on population structure. AssessPool accepts a Variant Call Format (VCF) file and a FASTA-formatted reference, providing a straightforward transition from commonly used pipelines such as Stacks or dDocent. AssessPool handles varying numbers of pools and uses PoPoolation2 to generate locus-by-locus pairwise FST values and associated Fisher's exact test values as measures of population structure. Starting with a VCF file containing all identified SNPs, assessPool facilitates several key functionalities for population genetic analyses: i) filtering SNPs based on adjustable criteria with parameter suggestions for pool-seq data, ii) organizing data structures for analysis based on input pools, iii) creating customizable run scripts for FST calculations using PoPoolation2 and/or the {poolfstat} R package, for all pairwise comparisons, iv) calculating locus-specific FST values using PoPoolation2 and/or {poolfstat}, v) importing FST output into a format compatible with R, vi) producing population genomic summary statistics, and vii) generating interactive plots to visualize and explore data.
A pooled dataset generated from wild populations is used here to showcase the features of the assessPool pipeline for population genomic analyses.
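
To make the idea of "all pairwise comparisons" between pools concrete, the following minimal Python sketch enumerates every pool pair and compares allele frequencies estimated from pooled read counts at a single SNP. It is an illustration only and is not taken from assessPool, which is written in R and Bash: the function names, the pool names, the read counts, and the depth threshold are all hypothetical, and the frequency difference stands in for the FST and Fisher's exact test statistics that the pipeline obtains from PoPoolation2 and {poolfstat}.

```python
from itertools import combinations


def allele_frequency(ref_reads, alt_reads):
    """Estimate the alternate-allele frequency in one pool from read counts."""
    depth = ref_reads + alt_reads
    if depth == 0:
        raise ValueError("no reads cover this site")
    return alt_reads / depth


def pairwise_comparisons(pool_names):
    """Return every unordered pair of pools, one tuple per comparison."""
    return list(combinations(pool_names, 2))


# Hypothetical (ref, alt) read counts at a single SNP for three pools.
counts = {"poolA": (30, 10), "poolB": (20, 20), "poolC": (45, 5)}
min_depth = 30  # illustrative threshold, not an assessPool default

for first, second in pairwise_comparisons(sorted(counts)):
    # Skip the pair if either pool falls below the depth filter.
    if min(sum(counts[first]), sum(counts[second])) < min_depth:
        continue
    diff = abs(allele_frequency(*counts[first]) - allele_frequency(*counts[second]))
    print(f"{first} vs {second}: |delta AF| = {diff:.2f}")
```

With n pools this yields n(n-1)/2 comparisons, which is why a pipeline that generates the run scripts for each pair automatically saves effort as the number of pools grows.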