Leveraging the largest harmonized epigenomic data collection for metadata prediction validated and augmented over 350,000 public epigenomic datasets

Abstract

Epigenomic datasets in public databases often have non-standardized and incomplete metadata. There are currently no automated approaches to validate or correct missing or inaccurate information listed in these databases. To tackle this challenge, we harnessed the extensive harmonized data and metadata provided by the EpiATLAS project of the International Human Epigenome Consortium (IHEC) to train EpiClass, a suite of machine learning classifiers that predict key metadata (∼98% accuracy), including experimental assay, donor sex, biospecimen and sample cancer status. These classifiers enabled the identification of a few mislabeled and low-quality datasets in the EpiATLAS project, and also filled in most of the missing metadata with high confidence. The classifiers were then validated on ENCODE datasets absent from the initial training and applied to assess more than 350,000 human ChIP-Seq and RNA-Seq datasets from public repositories. Overall, this effort not only confirmed the vast majority of assays reported by the original authors, but also revealed ∼500 datasets with discrepancies, in particular through data swaps within series of experiments. More importantly, EpiClass also supplied high-confidence predictions for over 320,000 metadata attributes of the biological sample, such as sex, cancer status and biomaterial type, which had originally been omitted in the majority of cases. Our work introduces the first systematic approach for metadata correction and augmentation, enhancing the quality and reliability of publicly available epigenomic data.
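
The validation and augmentation workflow summarized in the abstract can be pictured as a simple triage step applied to each classifier prediction. The following is a minimal illustrative sketch only, not the EpiClass implementation: the `Prediction` record, the `triage` function, the label strings and the 0.9 confidence threshold are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Prediction:
    """One classifier output for one dataset (hypothetical record layout)."""
    dataset_id: str
    reported: Optional[str]  # label supplied by the submitter, None if missing
    predicted: str           # label predicted by the classifier
    confidence: float        # classifier confidence in [0, 1]


def triage(pred: Prediction, threshold: float = 0.9) -> str:
    """Sort a prediction into one of four outcomes.

    The threshold is an assumed value for illustration, not one taken
    from the article.
    """
    if pred.confidence < threshold:
        return "low_confidence"   # prediction not trusted either way
    if pred.reported is None:
        return "augmented"        # missing metadata filled in
    if pred.reported == pred.predicted:
        return "validated"        # reported label confirmed
    return "discrepancy"          # possible mislabel or data swap


# Toy inputs, invented for this example.
examples = [
    Prediction("d1", "h3k27ac", "h3k27ac", 0.99),
    Prediction("d2", None, "female", 0.97),
    Prediction("d3", "h3k4me3", "h3k27me3", 0.95),
    Prediction("d4", "rna_seq", "h3k4me1", 0.55),
]

summary = {p.dataset_id: triage(p) for p in examples}
```

In this sketch, only predictions above the confidence threshold are used to confirm, fill in or question reported metadata; everything else is set aside rather than treated as evidence of an error.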
