A Generalizable Machine Learning Framework for cfDNA based Early Detection of Hepatocellular Carcinoma: a Feasibility Study with Preclinical Validation

Abstract

PURPOSE

Early detection of hepatocellular carcinoma (HCC) is critical for improving patient outcomes, yet current screening tools lack sensitivity and specificity. We demonstrate a flexible machine learning framework for HCC detection using methylation profiles from bisulfite sequencing across multiple assay platforms and sample types. The framework supports a “split-and-filter” approach that routes each sequenced sample to an assay-matched classifier without requiring cross-assay feature compatibility.

PATIENTS AND METHODS

We constructed assay-specific classifiers using four independent public methylation datasets (∼2,500 total samples) representing distinct bisulfite-sequencing technologies: GSE93203 (MCB-targeted hypermethylation), GSE63775 (MCTA-Seq tandem-repeat hypermethylation), PRJCA001372 (HBV-integration–associated hypomethylation), and the HCC subset of GSE149438 (EpiPanGIDX; DMR-level hyper- and hypomethylation). Separate models were trained using the biologically relevant features for each dataset and evaluated on two independent blind validation datasets: a published tissue WGBS dataset (24 samples: 12 early-stage HCC, 12 matched controls; PRJNA984754) and a new preclinical plasma cfDNA WGBS dataset (12 samples) generated at a commercial sequencing laboratory.

Limited feature overlap among assays precluded a single unified model. Instead, overlapping features enabled construction of a proof-of-concept meta-classifier for sample routing across assay-specific models.
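The routing idea can be illustrated with a minimal sketch. It is not the meta-classifier built in this study: it swaps in a simple coverage rule that sends a sample to the assay whose feature panel it covers best. The panel names, CpG identifiers, `route_sample` function, and `min_coverage` threshold are all hypothetical placeholders.

```python
from typing import Dict, Mapping, Optional, Set

# Hypothetical feature panels. Real panels would hold the CpG sites or
# regions used by each assay-specific model.
ASSAY_PANELS: Dict[str, Set[str]] = {
    "assay_a": {"cg01", "cg02", "cg03", "cg04"},
    "assay_b": {"cg03", "cg05", "cg06"},
}


def panel_coverage(sample: Mapping[str, float], panel: Set[str]) -> float:
    """Return the fraction of a panel's features measured in the sample."""
    if not panel:
        return 0.0
    measured = sum(1 for feature in panel if feature in sample)
    return measured / len(panel)


def route_sample(
    sample: Mapping[str, float],
    panels: Mapping[str, Set[str]] = ASSAY_PANELS,
    min_coverage: float = 0.75,  # assumed threshold, not from the study
) -> Optional[str]:
    """Pick the assay whose panel is best covered by the sample.

    Returns None when no panel reaches min_coverage, so the sample is
    filtered out instead of being scored by a mismatched model.
    """
    best_name: Optional[str] = None
    best_cov = 0.0
    for name in sorted(panels):  # sorted for deterministic tie-breaking
        cov = panel_coverage(sample, panels[name])
        if cov > best_cov:
            best_name, best_cov = name, cov
    return best_name if best_cov >= min_coverage else None


# Usage: methylation fractions keyed by hypothetical CpG identifiers.
routed = route_sample({"cg03": 0.10, "cg05": 0.80, "cg06": 0.40})
unrouted = route_sample({"cg01": 0.20})
```

A learned meta-classifier would replace the coverage rule with a model trained on the overlapping features, but the contract stays the same: each sample is either handed to one assay-matched model or rejected.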

RESULTS

Assay-specific cfDNA models, trained independently on the CpG sites reported in the original publications, were evaluated on the tissue (biopsy) dataset and the new plasma dataset as blind validation. All four assay-specific models generalized well to the validation data, with accuracies of 83.5–100%. On the plasma cfDNA samples, the best-performing classifier for each public dataset (selected among XGBoost, Random Forest, and Logistic Regression) achieved 80–100% sensitivity and 86–100% specificity, and all Stage 2 cases were correctly detected across models. The single Stage 1A case showed methylation levels overlapping with those of cirrhotic controls, consistent with biological expectations. Even so, a couple of the models classified this case correctly, indicating greater sensitivity to Stage 1 cancer.
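For reference, sensitivity and specificity as reported above are the true-positive rate among cancer cases and the true-negative rate among controls. The sketch below computes both from binary labels; the function name and the example labels are hypothetical and do not reproduce the study's data.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Compute (sensitivity, specificity) for labels coded 1 = HCC, 0 = control."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # cancers detected
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # cancers missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # controls cleared
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # controls flagged

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one case and one control")

    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical labels for illustration only.
example_true = [1, 1, 1, 1, 0, 0, 0, 0]
example_pred = [1, 1, 1, 0, 0, 0, 0, 1]
sens, spec = sensitivity_specificity(example_true, example_pred)
```

With validation sets as small as those described here, a single reclassified sample shifts either metric by several percentage points, which is worth keeping in mind when comparing models.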

CONCLUSION

We describe a generalizable framework for early detection of HCC, composed of assay-specific classifiers and a meta-classifier. This architecture readily accommodates the addition of new assays via feature-matched models and meta-classification. Larger, prospectively collected studies are necessary to confirm performance and enable clinical translation.
