MethylCurate: Tool for Dataset Curation and Epigenetic Aging Clock Evaluation

Abstract

Summary

DNA methylation datasets from public repositories such as NCBI Gene Expression Omnibus are central to the development and evaluation of epigenetic aging clocks, yet existing resources and tools do not fully resolve the bottlenecks of dataset retrieval and metadata harmonization. Current benchmarking frameworks often rely on static curated collections, support only a subset of available Gene Expression Omnibus studies, focus on specific tissues, or require substantial manual intervention when metadata fields and supplementary files are inconsistently structured across studies. We developed MethylCurate, an agentic AI framework that addresses these limitations by automating the retrieval of DNA methylation datasets from the Gene Expression Omnibus, harmonizing heterogeneous metadata, mapping datasets to a unified format, and enabling scalable evaluation of epigenetic aging clocks through an integrated, dialogue-driven workflow.
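To make the harmonization problem concrete, the following is a minimal sketch of mapping inconsistently labeled GEO sample characteristics onto a unified record. It is a rule-based stand-in written for illustration, not the LLM-assisted approach MethylCurate uses. The synonym table, the unified field names (`age_years`, `sex`, `tissue`), and the function names are all assumptions introduced here.

```python
import re

# Hypothetical synonym table: raw GEO characteristic keys -> unified field names.
# Real studies use far more variants; this is illustrative only.
FIELD_SYNONYMS = {
    "age": "age_years",
    "age (y)": "age_years",
    "age (years)": "age_years",
    "sex": "sex",
    "gender": "sex",
    "tissue": "tissue",
    "source tissue": "tissue",
}

# Hypothetical normalization of sex labels.
SEX_VALUES = {"m": "male", "male": "male", "f": "female", "female": "female"}


def parse_age(raw):
    """Return the first number found in `raw` as a float, or None.

    Assumes the value is already expressed in years; unit conversion
    (for example months or weeks) is deliberately out of scope here.
    """
    match = re.search(r"\d+(?:\.\d+)?", raw)
    return float(match.group()) if match else None


def harmonize_sample(characteristics):
    """Map a list of 'key: value' strings onto the unified fields."""
    record = {"age_years": None, "sex": None, "tissue": None}
    for entry in characteristics:
        key, sep, value = entry.partition(":")
        if not sep:
            continue  # not a 'key: value' pair
        field = FIELD_SYNONYMS.get(key.strip().lower())
        if field is None:
            continue  # unrecognized key, left for manual or LLM review
        value = value.strip()
        if field == "age_years":
            record[field] = parse_age(value)
        elif field == "sex":
            record[field] = SEX_VALUES.get(value.lower())
        else:
            record[field] = value.lower()
    return record


# Two samples whose studies label the same attributes differently.
sample_a = harmonize_sample(["Age (y): 63", "gender: F", "tissue: Whole Blood"])
sample_b = harmonize_sample(["age: 41 years", "Sex: male", "source tissue: Saliva"])
```

Both samples end up with the same three keys, which is the property a unified format needs. A fixed synonym table breaks down as soon as a study uses an unseen label, which is the kind of gap the abstract attributes to inconsistently structured metadata.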

Availability and Implementation

MethylCurate is implemented in Python and combines deterministic modules for Gene Expression Omnibus dataset retrieval, quality control, and clock evaluation with large language model–assisted agents for metadata extraction, metadata harmonization, and DNA methylation data parsing. Source code, documentation, and example workflows are available at: https://github.com/Travyse/methylcurate
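As an illustration of what a deterministic clock evaluation step can look like, the sketch below scores a toy linear clock against known chronological ages using mean absolute error and Pearson correlation. It does not reproduce MethylCurate's code or API: the function names, the CpG identifiers (`cg_a`, `cg_b`), the weights, and the sample values are all invented for the example. Some published clocks apply a further transformation to the linear score, which is omitted here.

```python
import math


def predict_age(betas, weights, intercept):
    """Linear clock: intercept plus the weighted sum of beta values.

    `betas` maps CpG ID -> beta value for one sample. Every CpG in
    `weights` must be present; a missing one raises KeyError.
    """
    return intercept + sum(w * betas[cpg] for cpg, w in weights.items())


def mean_absolute_error(predicted, observed):
    """Average absolute difference between predicted and observed ages."""
    return sum(abs(p - o) for p, o in zip(predicted, observed)) / len(observed)


def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Toy clock and toy data (hypothetical values, not a published clock).
weights = {"cg_a": 40.0, "cg_b": -20.0}
intercept = 30.0
samples = [
    {"cg_a": 0.50, "cg_b": 0.50},
    {"cg_a": 0.75, "cg_b": 0.25},
    {"cg_a": 0.25, "cg_b": 0.75},
]
chronological_ages = [42.0, 52.0, 27.0]

predicted_ages = [predict_age(s, weights, intercept) for s in samples]
mae = mean_absolute_error(predicted_ages, chronological_ages)
r = pearson_r(predicted_ages, chronological_ages)
print(f"MAE = {mae:.2f} years, Pearson r = {r:.3f}")
```

Keeping this step free of any language model means the same beta matrix and the same clock always yield the same metrics, which is one reason to separate evaluation from the LLM-assisted metadata steps.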

Contact

travyse.edwards@pennmedicine.upenn.edu

Supplementary Information

Supplementary data are available at Bioinformatics online.

Graphical Abstract

MethylCurate is an agentic AI framework that converts user-specified NCBI Gene Expression Omnibus DNA methylation datasets into standardized metadata, beta matrices, artifacts, logs, and aging clock benchmarking outputs through automated retrieval, quality control, metadata extraction, harmonization, and evaluation workflows. Figure generated with BioRender.

Key Messages

  • Automated curation of DNA methylation datasets from the Gene Expression Omnibus.

  • Standardized preprocessing and metadata harmonization.

  • Integrated benchmarking of epigenetic aging clocks.
