An open-source, configurable machine learning pipeline for predicting blood culture outcomes from routine haematology parameters
Abstract
Background: Bloodstream infections remain a major cause of morbidity and mortality worldwide, yet blood culture positivity rates are typically low, highlighting a need to optimise test use. Machine learning models trained on routinely available haematology parameters have shown promise for predicting blood culture outcomes. However, there is a lack of open-source software to implement methods described in the peer-reviewed literature.

Results: We present an open-source pipeline for training, evaluation, and reporting of binary classification models that predict blood culture outcomes from complete blood count (CBC), white blood cell differential (DIFF), and cell population data (CPD) generated by Sysmex XN-series haematology analysers. The pipeline implements four classifier types (logistic regression, decision tree, random forest, and XGBoost), two default feature spaces (19-feature CBC/DIFF and 50-feature CBC/DIFF/CPD), and two feature selection methods (Boruta all-relevant selection and recursive feature elimination), along with nested cross-validation (CV) to prevent data leakage during feature selection. Trained logistic regression coefficients and decision tree rules are exported in portable formats suited to deployment in spreadsheet-based or laboratory information management system (LIMS) environments without requiring a Python runtime. Random forest and XGBoost models are also exported. The pipeline is fully configurable via a single JSON file, allowing adaptation to any binary classification problem without source code modification. Automated HTML reports with embedded receiver operating characteristic (ROC) curves, including area under the curve, and confusion matrices are generated for both training and inference runs.

Conclusions: This open-source repository addresses key limitations in existing blood culture outcome prediction workflows by providing a reproducible, transparent, and clinically deployable pipeline. Its configurable architecture, nested CV strategy, multiple feature selection methods, and export of interpretable model artefacts make it suitable for both research and clinical decision support applications.
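
The role of nested CV in preventing leakage during feature selection can be illustrated with a minimal sketch. This is not the pipeline's own code: it assumes scikit-learn, uses synthetic data in place of a 19-feature CBC/DIFF table, and the fold counts, candidate feature counts, and variable names are all hypothetical choices made for illustration. Recursive feature elimination is placed inside a `Pipeline`, so it is refit on each training fold and never sees the held-out samples.

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFE
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in (assumption): 19 features, roughly 10% positive cultures.
X, y = make_classification(
    n_samples=400, n_features=19, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)

# Scaling and feature selection live inside the pipeline, so both are
# fitted on training folds only and applied unchanged to held-out folds.
pipe = Pipeline([
    ("scale", StandardScaler()),
    ("select", RFE(LogisticRegression(max_iter=1000))),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Inner loop: chooses how many features to keep.
inner_cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)
# Outer loop: estimates performance of the whole selection-plus-fit procedure.
outer_cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=1)

search = GridSearchCV(
    pipe,
    param_grid={"select__n_features_to_select": [5, 10, 15]},
    scoring="roc_auc",
    cv=inner_cv,
)

# Each outer fold reruns the inner search from scratch on its training split.
outer_auc = cross_val_score(search, X, y, scoring="roc_auc", cv=outer_cv)
print(f"Outer-fold ROC AUC: {outer_auc.mean():.3f} +/- {outer_auc.std():.3f}")
```

Selecting features once on the full dataset and then cross-validating would let information from the test folds influence the chosen features, which tends to inflate the reported AUC. Wrapping the search in the outer loop avoids that at the cost of more model fits.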