Leveraging the largest harmonized epigenomic data collection for metadata prediction validated and augmented over 350,000 public epigenomic datasets

Joanny Raby
Gabriella Frosi
Frédérique White
Jonathan Laperle
Pierre-Étienne Jacques

Abstract

Epigenomic datasets in public databases often have non-standardized and incomplete metadata. There are currently no automated approaches to validate or correct missing or inaccurate information listed in databases. To tackle this challenge, we harnessed the extensive harmonized data and metadata provided by the EpiATLAS project of the International Human Epigenome Consortium (IHEC) to train EpiClass, a suite of machine learning classifiers that can predict key metadata (∼98% accuracy), including experimental assay, donor sex, biospecimen and sample cancer status. The development of these classifiers enabled the identification of a few mislabeled and low-quality datasets in the EpiATLAS project, while also completing most of the missing metadata with high confidence. These classifiers were also validated on ENCODE datasets absent from the initial training, then applied to assess more than 350,000 human ChIP-Seq and RNA-Seq datasets from public repositories. Overall, this effort not only validated the accuracy of the vast majority of assays reported by the original authors, but also unveiled ∼500 datasets with discrepancies, in particular through data swaps within series of experiments. More importantly, EpiClass also supplied high-confidence predictions for over 320,000 metadata attributes of the biological sample, such as sex, cancer status and biomaterial type, which had been originally omitted in the majority of cases. Our work introduces the first systematic approach for metadata correction and augmentation, enhancing the quality and reliability of publicly available epigenomic data.
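
The validation and augmentation logic summarized in the abstract can be illustrated with a minimal sketch. This is not the EpiClass implementation: the `Prediction` type, the `triage` function, the outcome names and the 0.9 confidence cutoff are all hypothetical, chosen only to show how a classifier prediction might be compared with a reported metadata value.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical cutoff; the abstract does not state the threshold actually used.
HIGH_CONFIDENCE = 0.9


@dataclass
class Prediction:
    label: str         # predicted metadata value, e.g. "female"
    confidence: float  # classifier probability for that label, in [0, 1]


def triage(reported: Optional[str], pred: Prediction,
           threshold: float = HIGH_CONFIDENCE) -> str:
    """Compare one reported metadata value with a classifier prediction."""
    if pred.confidence < threshold:
        return "unresolved"    # prediction too uncertain to act on
    if reported is None or not reported.strip():
        return "augmented"     # missing value filled in by the prediction
    if reported.strip().lower() == pred.label.strip().lower():
        return "validated"     # reported value agrees with the prediction
    return "discrepancy"       # confident prediction contradicts the report


if __name__ == "__main__":
    print(triage(None, Prediction("female", 0.97)))
    print(triage("male", Prediction("female", 0.97)))
```

In this sketch, a confident prediction that contradicts the reported value is only flagged, not overwritten, so a curator can check for issues such as swapped samples within a series.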

Version published to 10.1101/2025.09.04.670545 on bioRxiv
Sep 4, 2025

Related articles

MiCoReCa (Microbiome Community Resource Catalogue) - Towards Centralized Curation And Integration Of Microbiome Bioinformatics Resources

This article has 8 authors:
1. Vivek Ashokan
2. Clara Emery
3. Agnès Barnabé
4. Valentin Loux
5. Christina Pavloudi
6. Paul Zierep
7. Nikolaos Strepis
8. Bérénice Batut
This article has no evaluations
Latest version: Jan 6, 2026

META-DIFF: a k-mer-based pipeline that detects differentially abundant sequences in metagenomics whole genome sequencing

This article has 8 authors:
1. Louis-Maël Guéguen
2. Alban Mathieu
3. Simon Pelletier
4. Anthony Woo
5. Namita Misra
6. Magali Moreau
7. Olivier Perin
8. Arnaud Droit
This article has no evaluations
Latest version: Jan 29, 2026