Evaluating Limits of Machine Learning-Assisted Raman Spectroscopy in Classification of Biological Samples

Abstract

Machine learning (ML)-assisted Raman spectroscopy has become a powerful analytical tool for the classification and identification of analytes; however, technical challenges impacting its detection accuracy have not been investigated. This study explores experimental factors affecting classification performance. Among the evaluated ML models, the choice of algorithm has minimal impact on classification accuracy. Instead, experimental factors, including spectral similarity between tested samples and data quality, dominate detection performance. Increases in spectral noise and spectral similarity significantly reduce classification accuracy. In well-controlled samples with low experimental noise, ML-assisted Raman spectroscopy can discriminate lipid mixtures with a composition difference of 1.85 mol%. To assess the effect of biological heterogeneity, we analyzed single-cell Raman spectra from Saccharomyces cerevisiae strains carrying single, double, or triple gene mutations. Intrinsic cell-to-cell variability introduced substantial spectral differences, severely reducing the accuracy of multiclass classification of these genetically similar strains at the single-cell level. Averaging Raman spectra across multiple cells improved classification accuracy by reducing this spectral variability. We also assess the effectiveness of transfer learning across different Raman spectrometers, specifically by applying an ML model trained on one instrument to spectra acquired on another. Transfer learning can be improved with proper instrument calibration, highlighting the importance of instrument standardization. Overall, our results demonstrate that data quality and spectral similarity are the primary bottlenecks in ML-assisted Raman spectroscopy. Careful attention to sample preparation, data acquisition, measurement conditions, and instrument calibration is critical to achieving robust and reliable classification performance.

Article activity feed

  1. First, principal component analysis (PCA) was applied to reduce data complexity.

    Have you explored using NMF (non-negative matrix factorization) to analyze the Raman spectra? Some recent work compares MCR (multivariate curve resolution) and NMF: because both enforce non-negative component vectors, they may be better suited to Raman spectra, and possibly more interpretable, than PCA. (https://doi.org/10.1016/j.aca.2025.344755)
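
To make the non-negativity point concrete, here is a minimal sketch on synthetic data. It is not the authors' pipeline, and it does not reproduce the method of the linked paper. The spectra are simulated mixtures of two made-up Gaussian "pure component" bands, the function names (`gaussian_peak`, `make_spectra`, `pca_components`, `nmf`) are hypothetical, PCA is computed through a plain SVD of the mean-centered data, and NMF uses the basic multiplicative-update rule of Lee and Seung rather than a tuned library solver.

```python
import numpy as np


def gaussian_peak(x, center, width):
    """Unit-height Gaussian band, standing in for a Raman peak."""
    return np.exp(-0.5 * ((x - center) / width) ** 2)


def make_spectra(n_samples=60, n_points=200, seed=0):
    """Simulate non-negative mixtures of two synthetic pure spectra."""
    rng = np.random.default_rng(seed)
    x = np.linspace(0.0, 1.0, n_points)
    pure = np.vstack([
        gaussian_peak(x, 0.3, 0.03) + 0.5 * gaussian_peak(x, 0.7, 0.05),
        gaussian_peak(x, 0.5, 0.04),
    ])
    weights = rng.uniform(0.1, 1.0, size=(n_samples, 2))
    noise = 0.01 * rng.standard_normal((n_samples, n_points))
    # Clip so the data stay non-negative, as NMF requires.
    spectra = np.clip(weights @ pure + noise, 0.0, None)
    return spectra, pure


def pca_components(data, n_components):
    """Leading principal axes from an SVD of the mean-centered data."""
    centered = data - data.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:n_components]


def nmf(data, n_components, n_iter=1000, seed=0, eps=1e-12):
    """Factor data ~ w @ h with w, h >= 0 via multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_samples, n_points = data.shape
    w = rng.uniform(0.1, 1.0, size=(n_samples, n_components))
    h = rng.uniform(0.1, 1.0, size=(n_components, n_points))
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so signs never flip.
        h *= (w.T @ data) / (w.T @ w @ h + eps)
        w *= (data @ h.T) / (w @ h @ h.T + eps)
    return w, h


spectra, pure = make_spectra()
pca_comp = pca_components(spectra, 2)
w, h = nmf(spectra, 2)

rel_err = np.linalg.norm(spectra - w @ h) / np.linalg.norm(spectra)
print("most negative PCA loading:", pca_comp.min())
print("most negative NMF loading:", h.min())
print("NMF relative reconstruction error:", rel_err)
```

The PCA loadings contain negative values because the axes must be mutually orthogonal, so they cannot be read directly as spectra. The rows of `h` stay non-negative and can be plotted against the simulated `pure` spectra for comparison. Whether that interpretability carries over to real single-cell data would have to be checked on the measured spectra.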