MuLan-Methyl - Multiple Transformer-based Language Models for Accurate DNA Methylation Prediction

This article has been Reviewed by the following groups


Abstract

Transformer-based language models have been successfully used to address a wide range of text-related tasks. DNA methylation is an important epigenetic mechanism, and its analysis provides valuable insights into gene regulation and biomarker identification. Several deep learning-based methods have been proposed to identify DNA methylation, and each seeks to strike a balance between computational effort and accuracy. Here, we introduce MuLan-Methyl, a deep-learning framework for predicting DNA methylation sites, which is based on five popular transformer-based language models. The framework identifies methylation sites for three different types of DNA methylation, namely N6-methyladenine, N4-methylcytosine, and 5-hydroxymethylcytosine. Each of the employed language models is adapted to the task using the “pre-train and fine-tune” paradigm. Pre-training is performed on a custom corpus of DNA fragments and taxonomy lineages using self-supervised learning. Fine-tuning aims at predicting the DNA-methylation status for each type. The five models are used to collectively predict the DNA methylation status. We report excellent performance of MuLan-Methyl on a benchmark dataset. Moreover, we argue that the model captures characteristic differences between species that are relevant for methylation. This work demonstrates that language models can be successfully adapted to applications in biological sequence analysis and that the joint utilization of different language models improves model performance. MuLan-Methyl is open source, and we provide a web server that implements the approach.

Key points

  • MuLan-Methyl aims at identifying three types of DNA-methylation sites.

  • It uses an ensemble of five transformer-based language models, which were pre-trained and fine-tuned on a custom corpus.

  • The self-attention mechanism of transformers gives rise to importance scores, which can be used to extract motifs.

  • The method performs favorably in comparison to existing methods.

  • The implementation can be applied to chromosomal sequences to predict methylation sites.
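The key points above do not specify how attention weights are pooled into per-token importance scores; the following is a minimal sketch of one common scheme (averaging attention over heads and query positions), using synthetic weights — the exact pooling used in MuLan-Methyl may differ:

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def token_importance(attn):
    """Average attention received by each token, over heads and query positions.

    attn: array of shape (n_heads, seq_len, seq_len), rows softmax-normalized.
    Returns one importance score per token (length seq_len).
    """
    return attn.mean(axis=0).mean(axis=0)

# Toy example: 2 heads, 5 tokens (e.g. k-mer tokens of a DNA fragment)
rng = np.random.default_rng(0)
attn = softmax(rng.normal(size=(2, 5, 5)))
scores = token_importance(attn)
top = int(scores.argmax())  # candidate motif position
```

Because every attention row sums to one, the pooled scores also sum to one, so they can be read as a distribution of attention mass over sequence positions.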

Article activity feed

  1. **Reviewer 2. Jianxin Wang**

    In this manuscript, the authors present MuLan-Methyl, a deep-learning framework for predicting 6mA, 4mC, and 5hmC sites. They use DNA sequence and taxonomic identity as features, and implement five popular transformer-based language models in MuLan-Methyl. MuLan-Methyl is open-sourced, and a web server is also provided for users to access it. Overall, I think the methodology of MuLan-Methyl is clear and innovative, and the experiments seem comprehensive. However, I do have several concerns that I believe should be addressed before the paper is accepted by GigaScience.

    Major

    1. One major concern is that, in my opinion, DNA methylation is dynamic. Cytosines at the same position of the DNA sequence may have different methylation status in different samples, different cells, or even at different developmental stages of a cell. So, how can we predict the methylation status of a site based only on its sequence (and taxonomic identity)?
       - The authors should clarify, in the Introduction or Discussion section, in what cases MuLan-Methyl (as well as other methods that use only DNA sequence) can be used to study DNA methylation.
       - The authors discuss motifs in Fig. 3, but only for positive samples. What about the motif distribution in the negative samples? Can I understand that this method is actually for discovering motifs (or sequence structures) that are highly correlated with methylation?
       - How does MuLan-Methyl perform without taxonomic identity?
    2. The authors compared MuLan-Methyl against iDNA-ABF and iDNA-ABT, especially on the independent test set (Fig. 2E). The authors should clarify whether they trained the models of the three methods on the same training datasets. If not, they should explain why.
    3. I'm curious about the computational efficiency of MuLan-Methyl. How many parameters does its model have? Does MuLan-Methyl have advantages over other methods in terms of computational efficiency?

    Minor

    1. I don't understand why the references were not ordered from 1 in the main text.
    2. I suggest that the authors re-organize the Introduction section. There are too many small paragraphs in this section.
    3. At the end of Page 2, "The type 4mC type is present in 4 species" should be corrected.

    Re-review:

    The authors have addressed most of my concerns. However, I still have one minor concern about the computational efficiency. The authors' response is not convincing when it only says "The number of models that MuLan-Methyl need to train and test on is less than the others, thus it has better computational efficiency than other models to some extent". If possible, I strongly suggest that the authors show some data comparing how much time and what resources (GPU/CPU/RAM) are needed by each method.
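    The kind of comparison the reviewer asks for could be instrumented along these lines; this is a generic sketch with a hypothetical `predict_fn` stand-in, not MuLan-Methyl's actual benchmarking code:

    ```python
    import time
    import tracemalloc

    def benchmark(predict_fn, inputs, repeats=3):
        """Measure average wall-clock time and peak Python heap usage of predict_fn.

        predict_fn and inputs are hypothetical stand-ins for any method's
        inference call; GPU memory would need framework-specific tools instead,
        e.g. torch.cuda.max_memory_allocated() in PyTorch.
        """
        tracemalloc.start()
        t0 = time.perf_counter()
        for _ in range(repeats):
            predict_fn(inputs)
        seconds_per_run = (time.perf_counter() - t0) / repeats
        _, peak = tracemalloc.get_traced_memory()
        tracemalloc.stop()
        return {"seconds_per_run": seconds_per_run, "peak_heap_bytes": peak}

    # Trivial stand-in predictor: length of each input sequence
    stats = benchmark(lambda seqs: [len(s) for s in seqs], ["ACGT" * 10] * 100)
    ```

    Running the same harness over each method's inference call would yield directly comparable time and memory figures.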

  2. This work has been published in GigaScience under a CC-BY 4.0 license (https://doi.org/10.1093/gigascience/giad054), and the reviews have been published under the same license. They are as follows.

    **Reviewer 1. Yupeng Cun **

    Zeng et al. proposed an ensemble framework for identifying three types of DNA-methylation sites, and performed a benchmark comparison on genomic data from multiple species. This paper gives a valuable study of how ensemble transfer learners work and of predictability across species. My suggestion is that the manuscript is acceptable after the following minor revisions:

    1. Calculate a consensus ranking of the methods in Figure 2-C using Kendall's tau rank distance.
    2. The multi-head self-attention and self-attention head formulas should be restated following this preprint: https://arxiv.org/pdf/1706.03762.pdf
    3. The names MLM and MuLan-Methyl are mixed up in some cases; they should be used consistently.
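    For reference, the formulation in the cited preprint (Vaswani et al., "Attention Is All You Need") is:

    ```latex
    \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right) V
    ```

    ```latex
    \mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)\, W^O,
    \qquad \mathrm{head}_i = \mathrm{Attention}(Q W_i^Q,\; K W_i^K,\; V W_i^V)
    ```

    where $d_k$ is the key dimension and $W_i^Q$, $W_i^K$, $W_i^V$, $W^O$ are learned projection matrices.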