An Investigation into Reproducibility and Performance in Bioinformatics Software: A Case Study of BLAST+ and Floating-Point Arithmetic
Abstract
This report examines the reproducibility and performance of bioinformatics software, with a specific focus on versions of the widely used Basic Local Alignment Search Tool (BLAST+) suite. It addresses two core challenges: inconsistencies arising from differences in floating-point arithmetic implementations across C/C++ compilers and hardware architectures, and the identification and analysis of software performance bottlenecks. Any investigation of this problem should focus on three primary areas: first, empirical testing and documentation of the reproducibility of BLAST+ outputs generated under varying compilation and execution environments, with particular attention to floating-point discrepancies; second, identification and characterization of performance bottlenecks within the BLAST+ codebase using established profiling tools; and third, exploration and preliminary evaluation of optimization strategies, including code reordering, the use of alternative mathematical functions, and the application of machine-specific instruction sets (e.g., SIMD). These recommendations are motivated by the need for reliable scientific results from bioinformatics tools and by the computational cost of large-scale biological sequence analysis. The report further discusses the advantages of open-source development paradigms and examines the technical details of floating-point arithmetic that underpin the identified challenges, including considerations for 32-bit versus 64-bit builds and the historical context of legacy software.
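To make the floating-point concern concrete, the following minimal sketch shows why reordering arithmetic can change a result. It is written in Python for brevity rather than in the C/C++ of BLAST+, it is not taken from the BLAST+ codebase, and the function name `sum_in_order` and the input values are hypothetical choices made for illustration. It assumes IEEE 754 double precision, which is what Python's `float` uses on common platforms.

```python
import math


def sum_in_order(values):
    """Accumulate values strictly left to right in double precision."""
    total = 0.0
    for v in values:
        total += v
    return total


# Hypothetical inputs chosen so that rounding is visible: above 2**53,
# adjacent doubles are 2 apart, so adding 1.0 to 1e16 is rounded away.
values = [1e16, 1.0, -1e16, 1.0]

forward = sum_in_order(values)            # original order
reordered = sum_in_order(sorted(values))  # same numbers, different order
reference = math.fsum(values)             # correctly rounded sum

print("forward:  ", forward)
print("reordered:", reordered)
print("reference:", reference)

# Floating-point addition is not associative, so regrouping alone
# can change the last bits of a result.
grouped_left = (0.1 + 0.2) + 0.3
grouped_right = 0.1 + (0.2 + 0.3)
print("grouping differs:", grouped_left != grouped_right)
```

The same numbers yield three different sums depending only on evaluation order. A compiler that reorders or vectorizes a reduction can produce a comparable effect in compiled code, which is one reason optimization strategies such as code reordering need to be checked against reproducibility.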