Machine Learning Approach to Integrate and Analyse Multiomics data to Identify Actionable Biomarkers for Head and Neck Squamous Cell Carcinoma (HNSCC)
Abstract
Head and neck squamous cell carcinoma (HNSCC) is the sixth most common cancer worldwide and a major cause of death. A molecular understanding of disease progression can aid timely diagnosis and therapy. This study aims to identify potential HNSCC biomarkers using a machine learning-based approach to integrate and analyse multiomics data: publicly available transcriptomics, methylomics, proteomics, and phosphoproteomics datasets from Human Papillomavirus (HPV)-negative patients in the CPTAC-HNSCC project. A three-step feature selection method was used to identify potential molecular biomarkers. First, the top 1000 features (genes) were filtered using Mutual Information; these were then ranked by random forest-based feature importance; finally, Recursive Feature Elimination with cross-validation coupled with a Support Vector Machine (SVM-RFECV) was applied to obtain a minimal gene set for the tumor-normal classification task. To benchmark the selected features, Logistic Regression (LogR), Random Forest (RF), Multi-layer Perceptron (MLP), and Support Vector Machine (SVC) classifiers were used. The prediction performance of classifiers trained on the selected gene sets was evaluated using accuracy and compared against that of models trained on randomly selected gene sets. The entire workflow was repeated 100 times with different random states to establish statistical confidence in the pipeline and the selected gene set. Our integrative approach identified both omics-specific and cross-omics candidate genes with very high classification accuracy, ranging from approximately 95% to 100%. These genes reveal convergent biological processes central to HNSCC pathogenesis, reinforcing the robustness of the methodology, which can be adopted to analyse similar multiomics datasets for other pathologies and foundational biological questions.
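The three-step selection and the random-gene-set baseline described in the abstract can be sketched roughly as follows. This is an illustrative sketch using scikit-learn, not the authors' implementation: the data are synthetic stand-ins for one omics layer, the feature counts are scaled down (50 after Mutual Information instead of 1000, 20 after random forest ranking), and all hyperparameters, the linear SVM kernel, the single train/test split, and the helper names `select_features` and `benchmark` are assumptions made for the example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV, mutual_info_classif
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def select_features(X, y, n_mi=50, n_rf=20, seed=0):
    """Hypothetical helper: MI filter -> RF importance ranking -> SVM-RFECV."""
    # Step 1: keep the n_mi features with the highest mutual information with y.
    mi = mutual_info_classif(X, y, random_state=seed)
    idx_mi = np.argsort(mi)[::-1][:n_mi]

    # Step 2: rank the survivors by random forest importance, keep the top n_rf.
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(X[:, idx_mi], y)
    idx_rf = idx_mi[np.argsort(rf.feature_importances_)[::-1][:n_rf]]

    # Step 3: recursive feature elimination with cross-validation around a
    # linear SVM (the linear kernel is an assumption; RFE needs coefficients).
    cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=seed)
    rfecv = RFECV(SVC(kernel="linear"), step=1, cv=cv, scoring="accuracy")
    rfecv.fit(X[:, idx_rf], y)
    return idx_rf[rfecv.support_]


def benchmark(X_tr, y_tr, X_te, y_te, cols, seed=0):
    """Hypothetical helper: test accuracy of four classifiers on given columns."""
    models = {
        "LogR": LogisticRegression(max_iter=1000),
        "RF": RandomForestClassifier(n_estimators=200, random_state=seed),
        "MLP": MLPClassifier(max_iter=2000, random_state=seed),
        "SVC": SVC(),
    }
    return {
        name: model.fit(X_tr[:, cols], y_tr).score(X_te[:, cols], y_te)
        for name, model in models.items()
    }


# Synthetic stand-in for a samples x genes matrix with tumor/normal labels.
X, y = make_classification(
    n_samples=120, n_features=300, n_informative=10, n_redundant=0, random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Select genes on the training split only, then compare against a random
# gene set of the same size.
selected = select_features(X_tr, y_tr, seed=0)
rng = np.random.default_rng(0)
random_cols = rng.choice(X.shape[1], size=len(selected), replace=False)

acc_selected = benchmark(X_tr, y_tr, X_te, y_te, selected)
acc_random = benchmark(X_tr, y_tr, X_te, y_te, random_cols)
print("selected gene set:", acc_selected)
print("random gene set:  ", acc_random)
```

Selecting features on the training split only is a design choice made here to keep the held-out accuracy free of selection leakage; the abstract does not state how the original pipeline partitions the data. Repeating the whole procedure over many values of `seed`, as the abstract describes for 100 random states, would show how often each gene is retained.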