Multiblock LASSO Framework for Cancer Gene Selection from RNA-Seq PANCAN Data

Zeeshan Ashraf
Muhammad Aslam
Tahir Mehmood
Laila Abdulaziz Abdulrahman Al-Essa

Abstract

The Cancer RNA-HiSeq PANCAN dataset consists of RNA-Seq gene expression data collected from multiple cancer types. It is high-dimensional: it has thousands of gene expression features (predictors) and relatively few samples (observations), which makes it difficult to identify key biomarkers. Variable selection is therefore essential, both to reduce the data and to interpret the modeled relationship. Variable selection is an important problem in many areas, including modern biology; choosing genetic features for classification (e.g., identifying harmful bacteria or diagnosing diseases) is one example. Least Absolute Shrinkage and Selection Operator (LASSO) regression is one modeling technique suited to such high-throughput data. However, many genes are correlated, which can reduce interpretability, and the data can often be divided into blocks representing different biological pathways or cancer types. Multiblock LASSO, a variant of LASSO regression, is particularly useful when data are structured into such blocks. It selects important features across multiple blocks, improves interpretability by grouping related genes, and reduces overfitting in high-dimensional datasets. In this study, we apply Multiblock LASSO to extract significant gene features for cancer classification. We preprocess the dataset, define block structures using biological pathways, and optimize the regularization parameters using cross-validation. Experimental results demonstrate that Multiblock LASSO effectively reduces dimensionality while maintaining classification accuracy, making it a powerful tool for biomarker discovery in cancer genomics.
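
The block-wise selection idea described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the preprint's actual formulation: the function names (soft_threshold, lasso_cd, multiblock_lasso_select) and block names are hypothetical, an ordinary LASSO is simply fitted separately within each block, a squared-error loss stands in for a classification loss, the penalty lam is fixed rather than tuned by cross-validation, and the data are a toy orthogonal design rather than PANCAN expression values.

```python
def soft_threshold(z, lam):
    """Soft-thresholding operator, the proximal map of the L1 penalty."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=100):
    """Coordinate-descent LASSO for (1/2n)*||y - X b||^2 + lam*||b||_1."""
    n, p = len(X), len(X[0])
    beta = [0.0] * p
    resid = list(y)  # residual y - X b, with b = 0 at the start
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0.0:
                continue  # skip constant-zero columns
            # Covariance of column j with the partial residual
            rho = sum(X[i][j] * (resid[i] + X[i][j] * beta[j])
                      for i in range(n)) / n
            new_b = soft_threshold(rho, lam) / col_sq[j]
            delta = new_b - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new_b
    return beta


def multiblock_lasso_select(X, y, blocks, lam):
    """Fit a LASSO inside each block (assumed scheme) and return the
    original column indices with non-zero coefficients, per block."""
    selected = {}
    for name, cols in blocks.items():
        X_block = [[row[j] for j in cols] for row in X]
        beta = lasso_cd(X_block, y, lam)
        selected[name] = [cols[k] for k, b in enumerate(beta) if b != 0.0]
    return selected


# Toy design: 8 samples, 6 mutually orthogonal +/-1 "genes" taken from
# the non-constant columns of an 8x8 Sylvester Hadamard matrix.
X = [[float((-1) ** bin(i & (j + 1)).count("1")) for j in range(6)]
     for i in range(8)]
# Only gene 0 (block A) and gene 3 (block B) drive the response.
y = [2.0 * row[0] - 1.5 * row[3] for row in X]

# Hypothetical block structure standing in for biological pathways.
blocks = {"pathway_A": [0, 1, 2], "pathway_B": [3, 4, 5]}

selected = multiblock_lasso_select(X, y, blocks, lam=0.5)
print(selected)  # {'pathway_A': [0], 'pathway_B': [3]}
```

In practice lam would be chosen by cross-validation, as the abstract describes, and real expression data would need centering and scaling before the fit.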

Version published to 10.1101/2025.07.05.663323 on bioRxiv
Jul 9, 2025
