Disease classification for whole blood DNA methylation: meta-analysis, missing values imputation, and XAI

Alena Kalyakulina
Igor Yusipov
Maria Giulia Bacalini
Claudio Franceschi
Maria Vedunova
Mikhail Ivanchenko

Abstract

Background

DNA methylation has a significant effect on gene expression and can be associated with various diseases. Meta-analysis of available DNA methylation datasets requires development of a specific pipeline for joint data processing.

Results

We propose a comprehensive approach to classifying controls and patients from combined DNA methylation datasets. The solution includes data harmonization, construction of machine learning classification models, dimensionality reduction of models, imputation of missing values, and explanation of model predictions by explainable artificial intelligence (XAI) algorithms. We show that harmonization can improve classification accuracy by up to 20% when preprocessing methods of the training and test datasets are different. The best accuracy results were obtained with tree ensembles, reaching above 95% for Parkinson’s disease. Dimensionality reduction can substantially decrease the number of features without detriment to the classification accuracy. The best imputation methods achieve almost the same classification accuracy for data with missing values as for the original data. Explainable artificial intelligence approaches allowed us to explain model predictions from both population-level and individual perspectives.

Conclusions

We propose a methodologically valid and comprehensive approach to the classification of healthy individuals and patients with various diseases based on whole blood DNA methylation data, using Parkinson’s disease and schizophrenia as examples. The proposed algorithm works better for the former pathology, characterized by a complex set of symptoms. It makes it possible to solve data harmonization problems for meta-analysis of many different datasets, impute missing values, and build classification models of small dimensionality.

GigaScience
Feb 17, 2023

This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac097), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

**Reviewer name: Giulia De Riso**

In this study, a workflow is presented to generate classification models from DNA methylation data. Methods to deal with harmonization and missing data imputation are presented and the benefit of adopting them for classification tasks is tested on case-control datasets of schizophrenia and Parkinson disease. The authors support this workflow with source code. Although mostly based on already known methodologies, the present study may help orient studies aimed at building and applying DNA methylation based models. However, some major concerns can be raised:

Majors:

In different points of the manuscript, the authors refer to their approach as a pipeline. Indeed, such an approach should be composed of sequential modules, in which the output of one module becomes the input of the next. Although the modules are clearly distinguishable, their organization in the pipeline is less straightforward (also considering that modules can be adopted both to build a model and to use it on new data). The authors could consider drawing a scheme of the pipeline, or adopting a different term for the presented approach.

From the model performance perspective, the ML models perform poorly for schizophrenia. The authors point to inner characteristics of the disease as a possible reason for this. However, this point should be discussed in more depth in the Discussion section.

Besides this, the impact on classification accuracy of the smaller number of samples included in the training set and of the higher proportion of imputed features, compared to Parkinson disease, should be discussed.

In addition, since the authors provided the code, is there a way to select the samples to include in the training and test sets by random choice (a classical 70-30% split) instead of by source dataset?

"For machine learning models, we used only those CpG sites that have the same distribution of methylation levels in different datasets in the control group (methylation levels in the case group typically have greater variability because of disease heterogeneity).": is this filtering performed only on the datasets included in the training set, or also on the test set? It seems the former, but the authors should clearly state this point.

Accuracy with weighted averaging should be defined with a formula in the Methods section.

Regarding the ML models, the authors chose different types of decision-tree ensembles, along with a deep learning one. They should contextualize this choice (why different models from the same family?).
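To make the random 70-30% split raised above concrete, the following is a minimal sketch that contrasts a class-stratified random split with a split by source dataset. It is not taken from the authors' code: the function names `stratified_split` and `split_by_source`, the labels, and the dataset names are all hypothetical and chosen only for illustration.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.3, seed=0):
    """Return (train_idx, test_idx) with class proportions roughly preserved."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        members = by_class[label]
        rng.shuffle(members)  # random choice of samples within each class
        n_test = round(len(members) * test_fraction)
        test_idx.extend(members[:n_test])
        train_idx.extend(members[n_test:])
    return sorted(train_idx), sorted(test_idx)


def split_by_source(sources, test_sources):
    """Dataset-based split: every sample from a held-out source goes to the test set."""
    train_idx = [i for i, s in enumerate(sources) if s not in test_sources]
    test_idx = [i for i, s in enumerate(sources) if s in test_sources]
    return train_idx, test_idx


# Toy cohort (hypothetical): 10 controls and 10 cases from two datasets.
labels = ["control"] * 10 + ["case"] * 10
sources = (["dataset_A"] * 5 + ["dataset_B"] * 5) * 2

train_idx, test_idx = stratified_split(labels, test_fraction=0.3, seed=42)
src_train, src_test = split_by_source(sources, {"dataset_B"})
```

A random split mixes samples from every source into both sets, so it measures something different from the dataset-based split, which tests generalization to an unseen dataset.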

In addition, ML models built on DNA methylation are often based on elastic net or Support-Vector Machines, which are not accounted for in this work. The authors should comment on this aspect in limitations, and state whether the code they provided for their approach could be customized to adopt different models from the ones they presented.

Regarding the Imputation Method column in Table 2, the meaning is not clear. Are the different imputation methods described in the Imputation of missing values section paired with the ML models presented in Table 2? If yes, some of the methods (like KNN) are missing.

In the harmonization section, models for case-control classification are trained on different numbers and sets of CpGs. To assess the effect of harmonization alone, the number of CpGs should instead be fixed. This is especially critical for schizophrenia, where the number of features is 35,145 for the non-harmonized data but 110,137 for the harmonized data.

Dimensionality reduction section: are the models from imputed and non-imputed data trained only on harmonized data? And how are the sets of 50,911 CpG sites for Parkinson disease and 110,137 CpG sites for schizophrenia selected?
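One way to fix the set of CpGs when comparing harmonized and non-harmonized data, as suggested above, is to restrict both versions to the CpG sites they share before training. The sketch below assumes each data version is held as a mapping from CpG ID to a list of beta values; the helper names, CpG IDs, and values are hypothetical and do not come from the manuscript or its code.

```python
def common_cpgs(*feature_lists):
    """Return CpG IDs present in every list, in the order of the first list."""
    shared = set(feature_lists[0]).intersection(*feature_lists[1:])
    return [cpg for cpg in feature_lists[0] if cpg in shared]


def restrict(table, cpgs):
    """Keep only the requested CpG columns of a {cpg_id: [beta values]} table."""
    return {cpg: table[cpg] for cpg in cpgs}


# Hypothetical CpG IDs and beta values, for illustration only.
non_harmonized = {
    "cg0001": [0.10, 0.80],
    "cg0002": [0.40, 0.50],
    "cg0003": [0.90, 0.20],
}
harmonized = {
    "cg0002": [0.42, 0.51],
    "cg0003": [0.88, 0.22],
    "cg0004": [0.30, 0.70],
}

shared = common_cpgs(list(non_harmonized), list(harmonized))
raw_fixed = restrict(non_harmonized, shared)
harmonized_fixed = restrict(harmonized, shared)
```

With both tables reduced to the same columns, any difference in accuracy between models trained on them can be attributed to harmonization rather than to the feature set.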

Imputation of missing values section: it is not clear on which CpGs and on which samples imputation is performed. Also, it is not clear whether the imputation has been tested on the best-performing model.
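To illustrate why it matters which samples and CpGs the imputation draws on, here is a minimal nearest-neighbour imputation sketch in which missing values in a test sample are filled only from training samples. It is an assumption-laden illustration, not the authors' procedure: the function `knn_impute`, the choice of Euclidean distance over the observed CpGs, and all values are hypothetical.

```python
import math


def knn_impute(sample, reference, k=2):
    """Fill None entries in `sample` using the k nearest complete reference rows."""
    observed = [j for j, v in enumerate(sample) if v is not None]

    def distance(row):
        # Euclidean distance over the CpGs that are observed in `sample`.
        return math.sqrt(sum((sample[j] - row[j]) ** 2 for j in observed))

    neighbours = sorted(reference, key=distance)[:k]
    return [
        v if v is not None else sum(row[j] for row in neighbours) / len(neighbours)
        for j, v in enumerate(sample)
    ]


# Hypothetical beta values: rows are training samples, columns are CpG sites.
train = [
    [0.10, 0.20, 0.30],
    [0.12, 0.22, 0.34],
    [0.90, 0.80, 0.70],
]
test_sample = [0.11, None, 0.32]

imputed = knn_impute(test_sample, train, k=2)
```

Because the reference rows come from the training set alone, no information from other test samples leaks into the imputed values.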

Minors: Page 1, line 2: "DNA methylation is associated with epigenetic modification". DNA methylation is an epigenetic mark itself. Do the authors mean histone marks?

Page 1, from line 7: "DNA methylation consists of binding a methyl group to cytosine in the cytosine-guanine dinucleotides (CpG sites). Hypermethylation of CpG sites near the gene promoter is known to repress transcription, while hypermethylation in the gene body appears to have an opposite, also less pronounced effect.": references should be added.

Page 2, from line 2: "Current epigenome-wide association studies (EWAS) test DNAm associations with human phenotypes, health conditions and diseases.": references should be added.

Page 3: "In most cases, an increase in dimensionality does not provide significant benefits, since lower dimensionality data may contain more relevant information". This point could be presented in a reverse way (higher dimensionality data may contain redundant information), introducing the collinearity issue. In addition, this issue could be introduced before the missing values and imputation section.

Page 3: references for "Modern machine-learning-based artificial intelligence systems are powerful and promising tools" could be more specific for the field of epigenetics and DNA methylation.

GigaScience
Feb 17, 2023

This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giac097), which carries out open, named peer-review. These reviews are published under a CC-BY 4.0 license and were as follows:

**Reviewer name: Liang Yu**

Comments to Author: The paper by Kalyakulina et al. describes disease classification from whole blood DNA methylation. The authors propose a comprehensive approach to combining DNA methylation datasets to classify controls and patients. The solution includes data harmonization, construction of machine learning classification models, dimensionality reduction of models, imputation of missing values, and explanation of model predictions by explainable artificial intelligence algorithms. For Parkinson's disease and schizophrenia, the authors also demonstrate that their method for classifying healthy individuals and patients with various disorders based on whole blood DNA methylation data is an efficient and comprehensive approach.

Overall, the manuscript is well organized. I have some suggestions for the authors to improve their work:

The manuscript constructs different CpG-based prediction models for different types of data. I suggest adding a flowchart of the whole model construction process to the manuscript so that readers can understand the study more clearly.

In Figure 4, the authors show only the top 10 important features and mark the highest accuracy and number of features with black lines in the figure. It is recommended to show the relevant data (optimal accuracy and number of features) in the figure. Please also label the three subplots in the figure separately, e.g., A, B, and C.

A remark concerning model performance evaluation: the authors should provide standard deviations of the obtained values.
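As a small illustration of reporting a metric together with its spread, the sketch below summarizes per-fold accuracies as mean and sample standard deviation. It assumes the values come from repeated runs or cross-validation folds, which the manuscript may or may not use; the function `summarize` and the fold accuracies are hypothetical.

```python
import statistics


def summarize(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)


# Hypothetical accuracies from five cross-validation folds.
fold_accuracy = [0.94, 0.96, 0.95, 0.97, 0.93]

mean_acc, std_acc = summarize(fold_accuracy)
print(f"accuracy = {mean_acc:.3f} ± {std_acc:.3f}")
```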

In this manuscript, the authors use graphs to present the results. I suggest adding a table summarizing the performance results of the models, which would be more intuitive.

I could not find how the authors optimized the hyper-parameters; grid search is the usual choice.
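For reference, an exhaustive grid search of the kind mentioned above can be sketched in a few lines. This is a generic illustration, not the authors' tuning procedure: `grid_search` is a hypothetical helper, the parameter names are only examples, and `toy_validation_score` stands in for training a model and measuring validation accuracy.

```python
import itertools


def grid_search(param_grid, evaluate):
    """Evaluate every parameter combination and return (best_params, best_score)."""
    names = sorted(param_grid)
    best_params, best_score = None, float("-inf")
    for values in itertools.product(*(param_grid[n] for n in names)):
        params = dict(zip(names, values))
        score = evaluate(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_validation_score(params):
    """Stand-in for 'train a model and return validation accuracy' (hypothetical)."""
    return 1.0 - abs(params["learning_rate"] - 0.1) - 0.01 * abs(params["max_depth"] - 6)


grid = {"learning_rate": [0.01, 0.1, 0.3], "max_depth": [4, 6, 8]}
best_params, best_score = grid_search(grid, toy_validation_score)
```

In practice the score should come from a validation set or inner cross-validation loop, never from the test set used to report final accuracy.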

The authors do not adequately address how their method outperforms existing methods in the discussion section.

The "Dimensionality reduction" section: I think this section would be more appropriately called "feature selection", since it is a sequential forward search method: first sort the features according to their importance values, then add or remove features from a candidate subset while evaluating the criterion.
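The sequential forward search described above can be sketched as follows, in its add-only variant. This is an illustrative reading of the comment rather than the procedure in the manuscript: `forward_selection`, the importance values, and the toy criterion (a gain per informative CpG minus a small penalty per selected feature) are all hypothetical.

```python
def forward_selection(ranked_features, evaluate, tolerance=0.0):
    """Add features in importance order; keep each one only if the score improves."""
    selected, best_score = [], float("-inf")
    for feature in ranked_features:
        score = evaluate(selected + [feature])
        if score > best_score + tolerance:
            selected.append(feature)
            best_score = score
    return selected, best_score


# Hypothetical importances and per-feature accuracy gains, for illustration only.
importance = {"cg_a": 0.50, "cg_b": 0.30, "cg_c": 0.01, "cg_d": 0.15}
gain = {"cg_a": 0.20, "cg_b": 0.10, "cg_c": 0.00, "cg_d": 0.05}


def toy_criterion(features):
    """Stand-in for model accuracy on a validation set (hypothetical)."""
    return 0.6 + sum(gain[f] for f in features) - 0.01 * len(features)


# Step 1: sort features by importance. Step 2: grow the subset greedily.
ranked = sorted(importance, key=importance.get, reverse=True)
selected, score = forward_selection(ranked, toy_criterion)
```

Here the least important feature is rejected because adding it lowers the criterion, which is the behaviour that distinguishes feature selection from a projection-based dimensionality reduction.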
Version published to 10.1101/2022.05.10.491404v2 on bioRxiv
May 23, 2022
Version published to 10.1101/2022.05.10.491404v1 on bioRxiv
May 11, 2022
Version published to 10.1093/gigascience/giac097
Jan 1, 2022
