Integrative Transcriptomics and Machine Learning Identify Key Predictive Genes and Pathways in Celiac Disease


Abstract

Background: Celiac disease (CD) is a T-cell–mediated disorder triggered by gluten ingestion, characterized by chronic intestinal inflammation and a complex genetic architecture involving HLA and non-HLA loci. Despite extensive genomic studies, the transcriptional dysregulation underlying CD pathogenesis and the molecular signatures that predict disease remain incompletely understood.

Methods: We performed genome-wide RNA sequencing of intestinal biopsies from CD patients and healthy controls to profile transcriptomic alterations. Dimensionality reduction methods, including Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP), were applied to visualize global expression differences. Differential expression analysis identified genes with significant log2 fold changes. Functional enrichment of differentially expressed genes was performed using Gene Ontology (GO) Biological Process and KEGG pathway analyses. To prioritize disease-relevant genes, multiple machine learning classifiers (Random Forest, Logistic Regression, and Support Vector Machine) were trained, and top features were ranked by model-specific importance metrics. Overlapping predictive genes were assessed for concordance with differential expression and pathway enrichment. An ensemble XGBoost model was subsequently trained on the cross-model prioritized genes and evaluated using ROC–AUC analysis.

Results: Dimensionality reduction revealed distinct separation between CD and control transcriptomes, indicating widespread transcriptional dysregulation. Volcano plots showed upregulation of immune-related genes and downregulation of metabolic and epithelial genes. Functional enrichment highlighted perturbation of immune, metabolic, and epithelial pathways. Feature importance analyses across ML models consistently identified immune, epithelial, and metabolic genes as predictive, and a core set of genes shared across models was corroborated by differential expression and pathway analysis. The XGBoost classifier discriminated CD from controls better than the individual models, with high ROC–AUC values.

Conclusion: Integrating transcriptomics with multi-model machine learning reveals key molecular drivers of CD, identifies a robust core set of predictive genes, and establishes a framework for biomarker discovery and risk stratification in celiac disease.
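The cross-model prioritization step described in the Methods can be illustrated with a minimal sketch. It is not the authors' pipeline: the gene names (GENE_A to GENE_D), expression values, importance scores, the top-k cutoff, and the fold-change threshold are all made up for illustration, and the helper functions are hypothetical. A real differential expression analysis would also apply a statistical test, which is omitted here.

```python
from math import log2
from statistics import mean


def log2_fold_change(case_values, control_values, pseudocount=1.0):
    """log2 ratio of mean expression (case vs. control); pseudocount avoids log(0)."""
    return log2((mean(case_values) + pseudocount) / (mean(control_values) + pseudocount))


def top_k(importances, k):
    """Return the k gene names with the largest importance scores."""
    ranked = sorted(importances, key=lambda gene: importances[gene], reverse=True)
    return set(ranked[:k])


def consensus_genes(model_importances, k):
    """Genes that every model ranks within its own top k."""
    return set.intersection(*(top_k(imp, k) for imp in model_importances.values()))


# Toy expression values (illustrative only, not data from the study).
expression = {
    "GENE_A": {"cd": [40.0, 50.0, 60.0], "control": [10.0, 12.0, 14.0]},
    "GENE_B": {"cd": [20.0, 21.0, 22.0], "control": [20.0, 22.0, 21.0]},
    "GENE_C": {"cd": [5.0, 6.0, 7.0], "control": [30.0, 32.0, 34.0]},
    "GENE_D": {"cd": [8.0, 9.0, 10.0], "control": [9.0, 9.0, 9.0]},
}

# Toy per-model importance scores (e.g. impurity-based importance or
# absolute coefficients); the numbers are invented for this example.
model_importances = {
    "random_forest": {"GENE_A": 0.40, "GENE_B": 0.30, "GENE_C": 0.20, "GENE_D": 0.10},
    "logistic_regression": {"GENE_A": 0.90, "GENE_B": 0.10, "GENE_C": 0.70, "GENE_D": 0.20},
    "svm": {"GENE_A": 0.50, "GENE_B": 0.60, "GENE_C": 0.40, "GENE_D": 0.10},
}

# Fold change per gene, then the set passing an assumed |log2FC| >= 1 cutoff.
lfc = {g: log2_fold_change(v["cd"], v["control"]) for g, v in expression.items()}
differentially_expressed = {g for g, value in lfc.items() if abs(value) >= 1.0}

# Genes shared by all models' top-3 lists, then checked against the DE set.
consensus = consensus_genes(model_importances, k=3)
concordant = consensus & differentially_expressed

print(sorted(concordant))
```

Taking the intersection of per-model top-k lists is the strictest way to define overlap; a looser variant would keep genes selected by a majority of models. The resulting gene set would then serve as the input features for a downstream classifier such as XGBoost.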
