CBIcall: a configuration-driven framework for variant calling in large sequencing cohorts
Abstract
Motivation
Variant calling for next-generation sequencing (NGS) data relies on a diverse ecosystem of tools and workflows. Large-scale collaborative studies increasingly adopt federated analysis, where each institution processes sensitive data locally using standardized pipelines. Deploying identical pipelines across multiple centers remains challenging because heterogeneous software environments and computing policies can cause workflow divergence and inconsistent results.
Results
We developed CBIcall, a workflow-agnostic, configuration-driven framework that runs standardized variant-calling pipelines from raw FASTQ files to analysis-ready VCFs using a single YAML file. An execution driver validates user parameters, enforces compatibility across pipelines, analysis modes, workflow backends, genome builds, and tool versions, and records structured provenance for each run, ensuring consistent and reproducible pipeline execution across computing environments. CBIcall dispatches validated workflows through Bash or Snakemake backends and provides production-ready pipelines for germline WES, WGS (single-sample or cohort joint genotyping following GATK Best Practices), and mitochondrial DNA analysis. We validated CBIcall on public datasets and deployed it in the EU HEREDITARY project, processing 1,111 samples with both WES and mtDNA pipelines on an institutional HPC system, demonstrating its suitability for reproducible cohort-scale genomic analyses.
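The validate-then-dispatch idea described above can be illustrated with a minimal Python sketch. It is not CBIcall's code: the key names (`pipeline`, `mode`, `backend`, `genome`), the compatibility table, and both function names are hypothetical, chosen only to show how a driver might reject an incompatible configuration and record provenance before anything runs. A plain dictionary stands in for the parsed YAML file.

```python
import hashlib
import json
from datetime import datetime, timezone

# Hypothetical compatibility table: (pipeline, mode) -> allowed backends and genome builds.
# These combinations are illustrative assumptions, not CBIcall's actual rules.
COMPATIBILITY = {
    ("wes", "single"): {"backends": {"bash", "snakemake"}, "genomes": {"b37", "hg38"}},
    ("wes", "cohort"): {"backends": {"bash", "snakemake"}, "genomes": {"b37", "hg38"}},
    ("wgs", "single"): {"backends": {"bash", "snakemake"}, "genomes": {"hg38"}},
    ("mit", "single"): {"backends": {"bash"}, "genomes": {"rsrs"}},
}

# Hypothetical set of keys that every configuration must provide.
REQUIRED_KEYS = ("pipeline", "mode", "backend", "genome")


def validate_config(config):
    """Return a list of error messages; an empty list means the config is accepted."""
    errors = [f"missing required key: {key}" for key in REQUIRED_KEYS if key not in config]
    if errors:
        return errors

    rule = COMPATIBILITY.get((config["pipeline"], config["mode"]))
    if rule is None:
        return [f"unsupported pipeline/mode: {config['pipeline']}/{config['mode']}"]

    if config["backend"] not in rule["backends"]:
        errors.append(f"backend '{config['backend']}' is not allowed for this pipeline/mode")
    if config["genome"] not in rule["genomes"]:
        errors.append(f"genome '{config['genome']}' is not allowed for this pipeline/mode")
    return errors


def provenance_record(config):
    """Build a structured record: the config, a hash of its canonical form, and a UTC timestamp."""
    canonical = json.dumps(config, sort_keys=True)
    return {
        "config": dict(config),
        "config_sha256": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
        "started_at": datetime.now(timezone.utc).isoformat(),
    }


if __name__ == "__main__":
    cfg = {"pipeline": "wes", "mode": "cohort", "backend": "snakemake", "genome": "hg38"}
    problems = validate_config(cfg)
    if problems:
        raise SystemExit("\n".join(problems))
    print(json.dumps(provenance_record(cfg), indent=2))
```

Hashing the canonical (key-sorted) form of the configuration means two centers that ran with the same settings produce the same digest, which is one simple way to check that federated sites did not diverge.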
Availability and implementation
CBIcall is open source (GPLv3) and distributed with ready-to-run pipelines; full dependency and installation documentation is available at https://github.com/CNAG-Biomedical-Informatics/cbicall