Imputing missing RNA-sequencing data from DNA methylation by using a transfer learning–based neural network

Xiang Zhou
Hua Chai
Huiying Zhao
Ching-Hsing Luo
Yuedong Yang

This article has been Reviewed by the following groups

Listed in

Evaluated articles (GigaScience)

Abstract

Background

Gene expression plays a key intermediate role in linking molecular features at the DNA level to phenotype. However, owing to various experimental limitations, RNA-seq data are missing for many samples that do have high-quality DNA methylation data. Because DNA methylation is an important epigenetic modification that regulates gene expression, it can be used to predict RNA-seq data. Many methods have been developed for this purpose. A common limitation of these methods is that they mainly focus on a single cancer dataset and do not fully utilize information from large pan-cancer datasets.

Results

Here, we have developed a novel method, TDimpute, to impute missing gene expression data from DNA methylation data through a transfer learning–based neural network. In this method, the pan-cancer dataset from The Cancer Genome Atlas (TCGA) is used to train a general model, which is then fine-tuned on the specific cancer dataset. By testing on 16 cancer datasets, we found that our method significantly outperforms other state-of-the-art methods in imputation accuracy, with a 7–11% improvement under different missing rates. The imputed gene expression was further shown to be useful for downstream analyses, including the identification of both methylation-driving and prognosis-related genes, clustering analysis, and survival analysis on the TCGA dataset. More importantly, an independent test on the Wilms tumor dataset from the Therapeutically Applicable Research to Generate Effective Treatments (TARGET) project indicated that our method is generally applicable.
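The pretrain-then-fine-tune idea described above can be illustrated with a minimal sketch. This is not the TDimpute implementation: the one-hidden-layer network, the synthetic data, the layer sizes, the learning rates, and every name below (`init_params`, `train`, `X_pan`, `X_tgt`, and so on) are assumptions chosen only to show the workflow of training a general model on a large pooled dataset and then continuing training on a small cancer-specific dataset.

```python
import numpy as np

def init_params(n_in, n_hidden, n_out, rng):
    """Small random weights for a one-hidden-layer network (illustrative sizes)."""
    return {
        "W1": rng.normal(0.0, 0.1, (n_in, n_hidden)),
        "b1": np.zeros(n_hidden),
        "W2": rng.normal(0.0, 0.1, (n_hidden, n_out)),
        "b2": np.zeros(n_out),
    }

def forward(params, X):
    """Return hidden activations and predicted expression values."""
    H = np.maximum(0.0, X @ params["W1"] + params["b1"])  # ReLU hidden layer
    return H, H @ params["W2"] + params["b2"]

def train(params, X, Y, lr, epochs):
    """Full-batch gradient descent on mean squared error.

    Works on a copy, so the parameters passed in are left unchanged.
    """
    p = {k: v.copy() for k, v in params.items()}
    for _ in range(epochs):
        H, pred = forward(p, X)
        d_out = 2.0 * (pred - Y) / Y.size        # gradient of MSE w.r.t. predictions
        d_hid = (d_out @ p["W2"].T) * (H > 0)    # backpropagate through ReLU
        p["W2"] -= lr * (H.T @ d_out)
        p["b2"] -= lr * d_out.sum(axis=0)
        p["W1"] -= lr * (X.T @ d_hid)
        p["b1"] -= lr * d_hid.sum(axis=0)
    return p

rng = np.random.default_rng(0)
n_cpg, n_genes = 20, 5                           # toy feature counts (assumed)

# Synthetic stand-in for a pan-cancer cohort: methylation (X) -> expression (Y).
shared_map = 0.3 * rng.normal(size=(n_cpg, n_genes))
X_pan = rng.uniform(0.0, 1.0, (500, n_cpg))
Y_pan = X_pan @ shared_map + 0.05 * rng.normal(size=(500, n_genes))

# Synthetic stand-in for one cancer type: few paired samples, slightly shifted mapping.
target_map = shared_map + 0.1 * rng.normal(size=(n_cpg, n_genes))
X_tgt = rng.uniform(0.0, 1.0, (30, n_cpg))
Y_tgt = X_tgt @ target_map + 0.05 * rng.normal(size=(30, n_genes))

# Step 1: train a general model on the large pooled dataset.
pretrained = train(init_params(n_cpg, 16, n_genes, rng), X_pan, Y_pan, lr=0.05, epochs=300)

# Step 2: fine-tune on the small target dataset, starting from the pretrained
# weights and using a smaller learning rate.
finetuned = train(pretrained, X_tgt, Y_tgt, lr=0.01, epochs=100)

# Step 3: impute expression for target samples that have methylation data only.
X_missing = rng.uniform(0.0, 1.0, (10, n_cpg))
imputed = forward(finetuned, X_missing)[1]
print(imputed.shape)
```

The design point the sketch tries to convey is that fine-tuning starts from the pretrained weights rather than from a fresh initialization, so the small target dataset only has to adjust a mapping already learned from the pooled data.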

Conclusions

TDimpute is an effective method for RNA-seq imputation with limited training samples.

GigaScience
Jan 24, 2022

Now published in GigaScience doi: 10.1093/gigascience/giaa076

Xiang Zhou (1), Hua Chai (1), Huiying Zhao (2), Ching-Hsing Luo (1), Yuedong Yang (1, 3)

1. School of Data and Computer Science, Sun Yat-sen University, Guangzhou, China
2. Sun Yat-sen Memorial Hospital, Sun Yat-sen University, Guangzhou, China
3. Key Laboratory of Machine Intelligence and Advanced Computing (Sun Yat-sen University), Ministry of Education, China

For correspondence: Yuedong Yang, yangyd25@mail.sysu.edu.cn

A version of this preprint has been published in the Open Access journal GigaScience (see paper https://doi.org/10.1093/gigascience/giaa076), where the paper and peer reviews are published openly under a CC-BY 4.0 license.

These peer reviews were as follows:

Reviewer 1: http://dx.doi.org/10.5524/REVIEW.102310
Reviewer 2: http://dx.doi.org/10.5524/REVIEW.102311

Version published to 10.1093/gigascience/giaa076
Jul 1, 2020
Version published to 10.1101/803692 on bioRxiv
Oct 13, 2019
