A Concentration-Invariant FTIR Chemometric Workflow with Peak-Sparse Representation and Machine-Learning Classification
Abstract
Fourier-transform infrared (FTIR) spectroscopy is widely used for qualitative identification in chemical, environmental, and industrial contexts. However, variability in sample concentration and operator-dependent preprocessing can compromise the reproducibility of chemometric workflows. This study presents a concentration-invariant FTIR preprocessing and classification framework that combines Savitzky–Golay smoothing, asymmetric least-squares baseline correction, area normalization, and a percentile-based peak-sparse representation. Principal component analysis (PCA) is applied to the sparse spectra to generate a compact vibrational feature space, which is then used to train four supervised classifiers: PLS-DA, Random Forest, XGBoost, and Support Vector Machines. On a library of 89 pure organic compounds measured at four concentration levels, all models achieve macro-F1 scores between 0.97 and 1.00 under replicate-stratified evaluation, indicating strong robustness to concentration-driven spectral variation. The workflow is implemented in a lightweight Python/PyQt5 tool that enables real-time prediction and supports deployment in analytical laboratories and industrial quality-control settings. The study offers a transparent and reproducible chemometric framework that may serve as a basis for future extensions to complex mixtures and real-world sample matrices.
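The preprocessing chain named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no parameter values, so the smoothing window, polynomial order, ALS settings (`lam`, `p`, `n_iter`), and the 90th-percentile cutoff are all assumptions, as is the sparsification rule itself (keep intensities at or above the percentile, zero the rest). The helper names `als_baseline`, `preprocess`, `pca_scores`, and `synthetic_spectrum` are hypothetical, and the spectra are synthetic Gaussian bands rather than measured data. As a simplification, concentration is modeled as a scale factor applied to the whole spectrum, baseline included.

```python
import numpy as np
from scipy import sparse
from scipy.signal import savgol_filter
from scipy.sparse.linalg import spsolve


def als_baseline(y, lam=1e5, p=0.01, n_iter=10):
    """Asymmetric least-squares baseline; lam, p, n_iter are assumed values."""
    n = y.size
    # Second-difference operator penalizes curvature in the estimated baseline.
    D = sparse.diags([1.0, -2.0, 1.0], [0, 1, 2], shape=(n - 2, n))
    penalty = lam * (D.T @ D)
    w = np.ones(n)
    for _ in range(n_iter):
        z = spsolve((sparse.diags(w) + penalty).tocsc(), w * y)
        # Points above the current baseline (peaks) get the small weight p.
        w = np.where(y > z, p, 1.0 - p)
    return z


def preprocess(wavenumbers, absorbance, window=11, polyorder=3, percentile=90.0):
    """Smooth, baseline-correct, area-normalize, then zero sub-threshold points."""
    smoothed = savgol_filter(absorbance, window_length=window, polyorder=polyorder)
    corrected = np.clip(smoothed - als_baseline(smoothed), 0.0, None)
    # Trapezoidal area over the wavenumber axis (the axis may be descending).
    area = np.sum(0.5 * (corrected[1:] + corrected[:-1]) * np.abs(np.diff(wavenumbers)))
    normalized = corrected / area if area > 0 else corrected
    # Assumed sparsification rule: keep intensities at or above the percentile.
    threshold = np.percentile(normalized, percentile)
    return np.where(normalized >= threshold, normalized, 0.0)


def pca_scores(X, n_components=2):
    """PCA scores via SVD of the mean-centered matrix (rows are spectra)."""
    Xc = X - X.mean(axis=0)
    U, S, _ = np.linalg.svd(Xc, full_matrices=False)
    return U[:, :n_components] * S[:n_components]


def synthetic_spectrum(wavenumbers, bands):
    """Hypothetical spectrum: Gaussian bands (center, height) on a sloped baseline."""
    y = 0.05 + 1e-5 * (wavenumbers - wavenumbers.min())
    for center, height in bands:
        y = y + height * np.exp(-0.5 * ((wavenumbers - center) / 15.0) ** 2)
    return y


wn = np.linspace(4000.0, 400.0, 1801)
compound_a = synthetic_spectrum(wn, [(1700.0, 1.0), (2900.0, 0.6), (1100.0, 0.4)])
compound_b = synthetic_spectrum(wn, [(1600.0, 1.0), (3300.0, 0.7), (1250.0, 0.5)])
levels = [0.25, 0.5, 1.0, 2.0]  # illustrative intensity scales, not the study's levels
# Rows 0-3: compound A at each level; rows 4-7: compound B at each level.
X = np.array([preprocess(wn, c * s) for s in (compound_a, compound_b) for c in levels])
scores = pca_scores(X)
print(np.round(scores[:, 0], 4))
```

Under these assumptions every step is either linear in intensity or divides the intensity scale out, so the four scaled copies of each synthetic compound collapse onto the same sparse vector and the same PCA scores. Measured spectra would also carry noise and baseline drift that do not scale with concentration, so the invariance there would be approximate rather than exact. The resulting score matrix is the kind of feature table that would be passed to the classifiers listed in the abstract.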