Strategies and techniques for quality control and semantic enrichment with multimodal data: a case study in colorectal cancer with eHDPrep

Tom M Toner
Rashi Pancholi
Paul Miller
Thorsten Forster
Helen G Coleman
Ian M Overton

Listed in

Evaluated articles (GigaScience)

Abstract

Background

Integration of data from multiple domains can greatly enhance the quality and applicability of knowledge generated in analysis workflows. However, working with health data is challenging, requiring careful preparation in order to support meaningful interpretation and robust results. Ontologies encapsulate relationships between variables that can enrich the semantic content of health datasets to enhance interpretability and inform downstream analyses.

Findings

We developed an R package for electronic health data preparation, “eHDPrep,” demonstrated upon a multimodal colorectal cancer dataset (661 patients, 155 variables; Colo-661); a further demonstrator is taken from The Cancer Genome Atlas (459 patients, 94 variables; TCGA-COAD). eHDPrep offers user-friendly methods for quality control, including internal consistency checking and redundancy removal with information-theoretic variable merging. Semantic enrichment functionality is provided, enabling generation of new informative “meta-variables” according to ontological common ancestry between variables, demonstrated with SNOMED CT and the Gene Ontology in the current study. eHDPrep also facilitates numerical encoding, variable extraction from free text, completeness analysis, and user review of modifications to the dataset.
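As a rough illustration of what information-theoretic variable merging can look like, the following is a minimal Python sketch; it is not eHDPrep's R implementation. It assumes two discrete variables that record the same fact in different encodings and a hypothetical merge rule, then checks that the merged variable carries as much Shannon entropy as the pair it replaces. All names and the toy data are assumptions made for this example.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    counts = Counter(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def mutual_information(xs, ys):
    """I(X;Y) = H(X) + H(Y) - H(X,Y) for two paired discrete sequences."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


# Toy data (hypothetical): the same fact recorded under two encodings.
smoker_form_a = ["yes", "no", "no", "yes", "no", "yes"]
smoker_form_b = ["Y", "N", "N", "Y", "N", "Y"]

# Hypothetical merge rule: collapse both columns into one boolean variable.
merged = [a == "yes" for a in smoker_form_a]

# Joint entropy of the two source variables versus entropy of the merge.
h_joint = entropy(list(zip(smoker_form_a, smoker_form_b)))
h_merged = entropy(merged)

# Mutual information between the merged variable and one of its sources.
mi_a = mutual_information(merged, smoker_form_a)

print(f"H(A,B) = {h_joint:.3f} bits, H(merged) = {h_merged:.3f} bits")
print(f"I(merged; A) = {mi_a:.3f} bits")
```

In this toy case the two source columns are fully redundant, so the merged variable retains all of their joint information. A variable with a single repeated value has zero entropy under the same function and so carries no information.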

Conclusions

eHDPrep provides effective tools to assess and enhance data quality, laying the foundation for robust performance and interpretability in downstream analyses. Application to multimodal colorectal cancer datasets resulted in improved data quality, structuring, and robust encoding, as well as enhanced semantic information. We make eHDPrep available as an R package from CRAN (https://cran.r-project.org/package=eHDPrep) and GitHub (https://github.com/overton-group/eHDPrep).

GigaScience
Jun 19, 2023

This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad030), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer Hugo Leroux

This well-written paper describes techniques for semantically enriching clinical data pertaining to colorectal cancer diagnosis. It describes an R-based tool, eHDPrep, to extract the data, which is subsequently cleaned, actioned for missing and erroneous values, encoded and enriched semantically using SNOMED CT and the GO, and ultimately exported after having undergone some QC. The paper is well-written and the methods really well-explained, for which the authors should be commended. I only have a few comments for the authors:

It is not clear to me how, in the discussion on page 14, the authors have dealt with the issue of representing negative findings and missing values, as described within their enrichment outcomes section.

In the "Ontology Preparation" section, the authors describe how they took the SNOMED CT terminology, transformed it to OWL and converted it to CSV format before mapping the Colo-661 variables to it. They do not, however, discuss the challenges that such an approach entails. The authors might consider perusing this article (https://doi.org/10.1186/s13326-018-0191-z), which addresses many of the challenges relating to ontology matching.

Please insert an additional ")" when stating the "Equations", e.g. page 6: "... zero entropy [27] (Equation (1)) ...", and also on page 13.
GigaScience
Jun 19, 2023

This work has been peer reviewed in GigaScience (see https://doi.org/10.1093/gigascience/giad030), which carries out open, named peer review. These reviews are published under a CC-BY 4.0 license and were as follows:

Reviewer Janna Hastings

The manuscript describes a toolkit for the automated semantic enrichment and quality control of electronic health data using ontologies. This is a much needed utility that will add value to electronic data sharing and re-use for many different purposes including the development of machine learning for medical applications and personalised medicine. Overall the manuscript is well written and the functionality offered by the toolkit is well thought out and motivated. The internal consistency checks and the use of ontology-based information content to semantically aggregate variables into more informative meta-variables are particularly welcome functions.

However, I recommend that the description of the tool functionality be clarified in some points, and the evaluation could be strengthened.

Page 6-7, internal consistency:

How should the user specify semantic dependencies between variable pairs? Would it not be helpful to use a standard format for this specification to enable interoperability and re-use of such specifications?

Should the specification of semantic relationships between variables not be linked to the knowledge from the ontologies? Ontologies are able to represent many different types of logical relationships between classes, which make them ideal for then serving as a standard and interoperable format for specifying this type of constraint. Rules are another promising standard approach for logic-based knowledge representation.
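For readers unfamiliar with pairwise internal consistency checks, the following is a minimal Python sketch of one way such dependencies between variable pairs could be declared and applied. It is an illustration only and does not reproduce eHDPrep's specification format or its R interface; the `PairRule` structure, the variable names, and the example rules are all hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class PairRule:
    """A declared dependency between two variables (hypothetical format)."""
    var_a: str
    var_b: str
    is_consistent: Callable[[Any, Any], bool]
    description: str


def find_inconsistencies(records, rules):
    """Return (row index, rule description) for every violated rule.

    Rows where either variable is missing (None) are skipped for that rule,
    because a missing value cannot contradict anything.
    """
    problems = []
    for i, row in enumerate(records):
        for rule in rules:
            a, b = row.get(rule.var_a), row.get(rule.var_b)
            if a is None or b is None:
                continue
            if not rule.is_consistent(a, b):
                problems.append((i, rule.description))
    return problems


# Hypothetical rules over hypothetical variable names.
rules = [
    PairRule("age_at_diagnosis", "age_at_death",
             lambda a, b: a <= b,
             "diagnosis must not follow death"),
    PairRule("n_stage", "positive_lymph_nodes",
             lambda n, k: not (n == "N0" and k > 0),
             "N0 implies no positive nodes"),
]

# Toy records: row 1 breaks the first rule, row 2 breaks the second.
records = [
    {"age_at_diagnosis": 61, "age_at_death": 70,
     "n_stage": "N0", "positive_lymph_nodes": 0},
    {"age_at_diagnosis": 75, "age_at_death": 72,
     "n_stage": "N1", "positive_lymph_nodes": 2},
    {"age_at_diagnosis": 58, "age_at_death": None,
     "n_stage": "N0", "positive_lymph_nodes": 3},
]

problems = find_inconsistencies(records, rules)
for row_index, description in problems:
    print(f"row {row_index}: {description}")
```

Keeping each rule as data (two variable names plus a predicate) is one design choice among several; the same constraints could in principle be expressed in an ontology or rule language, as the reviewer suggests.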

Page 11, figure 4 a: I think it would be informative for evaluating the operation of the tool if the heatmap of variable missingness after application of the tool could also be illustrated beside the current Fig 4a.

Page 13, ontology preparation: The paragraph describes what the authors have done to prepare ontologies for use with the tool. Is this preparation procedure also necessary for users to follow when they use the eHDPrep tool? How can alternative ontologies be incorporated (which may be useful for other domains)?

Evaluation: The biggest shortcoming of the presented manuscript is that the evaluation is limited to the application of the tool to one dataset and subsequent manual evaluation of the outcome by one group, the study authors.

The results as presented are positive, but there is a significant risk that the tool performs well on this task, as assessed by these study authors, but then fails to generalise to other tasks and datasets that future users might wish to use it with. To mitigate this risk, it would be optimal if somewhat more independent methods could be found for evaluating the performance of the different aspects of the tool. One approach could be a rigorous comparison of this tool's performance against the performance of other tools that have similar functionality, e.g. comparison of the semantic aggregation function with other tools that find and recommend MICAs. An alternative approach might be to apply the tool to an additional dataset for which a group outside of the study authors would be prepared to provide an independent evaluation.
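As background on the MICAs mentioned above, a most informative common ancestor can be sketched in Python as follows. This is a generic illustration on a toy hierarchy, not eHDPrep's implementation: the term names are hypothetical, and information content is assumed here to be the negative log of the fraction of terms at or below a given term, which may differ from the measure used in the manuscript.

```python
import math

# Toy is-a hierarchy, child -> list of parents (hypothetical terms).
PARENTS = {
    "finding": [],
    "neoplasm": ["finding"],
    "inflammation": ["finding"],
    "colon_carcinoma": ["neoplasm"],
    "rectal_carcinoma": ["neoplasm"],
    "colitis": ["inflammation"],
}


def ancestors(term):
    """All ancestors of `term`, including the term itself."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def information_content(term):
    """Assumed IC: -log2 of the fraction of terms at or below `term`."""
    n_at_or_below = sum(1 for t in PARENTS if term in ancestors(t))
    return -math.log2(n_at_or_below / len(PARENTS))


def mica(term_a, term_b):
    """Common ancestor of the two terms with the highest information content."""
    common = ancestors(term_a) & ancestors(term_b)
    return max(common, key=information_content)


print(mica("colon_carcinoma", "rectal_carcinoma"))
print(mica("colon_carcinoma", "colitis"))
```

Two variables mapped to sibling terms share a specific ancestor, which makes a more informative meta-variable than the root term shared by unrelated variables.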
Version published to 10.1093/gigascience/giad030
Dec 28, 2022
Version published to 10.1101/2022.09.07.506953v1 on bioRxiv
Sep 9, 2022

