Ultra-low coverage fragmentomic model of cell-free DNA for cancer detection based on whole-exome regions

Curation statements for this article:
  • Curated by eLife

    eLife assessment

    This study provides useful insights for anyone focusing on exonic regions when looking into the investigation of DNA fragmentation patterns (fragmentomics) for circulating tumor DNA (ctDNA) data for cancer detection. The method expands the DELFI method of Cristiano and colleagues (2019), but the datasets chosen are not ideal and the analysis remains incomplete.

Abstract

Cell-free DNA (cfDNA) has shown promise as a non-invasive biomarker for cancer screening and monitoring. The current advanced machine learning (ML) model, known as DNA evaluation of fragments for early interception (DELFI), utilizes the short and long fragmentation pattern of cfDNA and has demonstrated exceptional performance. However, the application of cfDNA-based models can be limited by the high cost of whole-genome sequencing (WGS). In this study, we present a novel ML model for cancer detection that utilizes cfDNA profiles generated from all protein-coding genes in the genome (exome) with only 0.08X of WGS coverage. Our model was trained on a dataset of 721 cfDNA profiles, comprising 426 cancer patients and 295 healthy individuals. Performance evaluation using a ten-fold cross-validation approach demonstrated that the new ML model using whole-exome regions, called xDELFI, can achieve high accuracy in cancer detection (area under the ROC curve, AUC = 0.896, 95% CI = 0.878–0.916), comparable to the model using WGS (AUC = 0.920, 95% CI = 0.901–0.936). Notably, we observed distinct fragmentation patterns between exonic regions and the whole genome, suggesting unique genomic features within exonic regions. Furthermore, we demonstrate the potential benefits of combining mutation detection in cfDNA with xDELFI, which enhances the model's sensitivity. Our proof-of-principle study indicates that the fragmentomic ML model based solely on whole-exome regions retains its predictive capability. With its ultra-low sequencing coverage, the new model could potentially improve the accessibility of cfDNA-based cancer diagnosis and aid in early detection and treatment of cancer.

Article activity feed

  1. eLife assessment

    This study provides useful insights for anyone focusing on exonic regions when looking into the investigation of DNA fragmentation patterns (fragmentomics) for circulating tumor DNA (ctDNA) data for cancer detection. The method expands the DELFI method of Cristiano and colleagues (2019), but the datasets chosen are not ideal and the analysis remains incomplete.

  2. Reviewer #1 (Public Review):

    Summary:

    The authors assess fragmentomic effects using the DELFI method in exonic regions (exome sequencing). They argue that extracting this information from exome sequencing would make the test more cost-effective.

    Strengths:

    Well written and explained. Different ML approaches tried.

    Weaknesses:

    To assess fragmentomics in WES, it does not seem valid to downsample WGS. WES data are generated with a different library preparation, so answering this question would require testing the method on actual WES samples. WES is also generally sequenced at much higher coverage, because this is necessary for mutation calling. The approach of combining the two therefore seems flawed, because the data were not generated by the same experiment.

    The authors do not really show why their model includes longer fragment sizes that were excluded in the original DELFI publication.

    As a proof of concept this is a good idea, but the utility and impact need a rethink.

  3. Reviewer #2 (Public Review):

    Apiwat Sangphukieo et al. have developed two machine learning models, exomeDELFI and xDELFI, trained on 4 public datasets comprising 721 cfDNA samples. They demonstrate that the exomeDELFI model, which uses DNA from the whole exome, exhibits higher AUC values than the original DELFI model at equal whole-genome sequencing depth for distinguishing patients with and without cancer. Additionally, the xDELFI model, which integrates coverage of overall fragments, fragments within 3 fragment size thresholds (short, medium, long), and fragment size distribution (FSD) for a total of 2,952 features, shows improved prediction performance. Furthermore, the authors have devised a multiclass machine learning model capable of classifying the tissue of origin for eight cancer types, using distinct tissue-specific fragmentomic patterns in cfDNA from whole-exome regions.

    However, the conclusions drawn in this paper rely heavily on cross-validation of machine learning models constructed from hundreds of samples but employing thousands of features, posing a risk of overfitting. Thus, more rigorous validation is warranted.
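The overfitting concern can be made concrete with a small simulation. The sketch below is not the authors' pipeline: it uses pure noise features, random-by-construction labels, a simple mean-difference feature ranking, and a nearest-centroid classifier, all of which are assumptions chosen for illustration. It contrasts selecting features once on all samples (which lets held-out samples influence the selection) with selecting them inside each training fold.

```python
import random

def top_features(X, y, k):
    """Rank features by absolute difference in class means; return the top-k indices."""
    pos = [row for row, label in zip(X, y) if label == 1]
    neg = [row for row, label in zip(X, y) if label == 0]
    scores = []
    for j in range(len(X[0])):
        diff = sum(r[j] for r in pos) / len(pos) - sum(r[j] for r in neg) / len(neg)
        scores.append((abs(diff), j))
    scores.sort(reverse=True)
    return [j for _, j in scores[:k]]

def centroid(rows, cols):
    """Mean vector of the given rows, restricted to the selected columns."""
    return [sum(r[j] for r in rows) / len(rows) for j in cols]

def predict(row, cols, c0, c1):
    """Assign the class whose centroid is closer in squared Euclidean distance."""
    d0 = sum((row[j] - a) ** 2 for j, a in zip(cols, c0))
    d1 = sum((row[j] - b) ** 2 for j, b in zip(cols, c1))
    return 1 if d1 < d0 else 0

def cv_accuracy(X, y, k, n_folds, select_inside_fold):
    """k-fold CV accuracy; feature selection is done per fold or once on all data."""
    n = len(X)
    global_cols = top_features(X, y, k)  # leaky: sees the held-out samples too
    correct = 0
    for fold in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == fold]
        train_idx = [i for i in range(n) if i % n_folds != fold]
        X_tr = [X[i] for i in train_idx]
        y_tr = [y[i] for i in train_idx]
        cols = top_features(X_tr, y_tr, k) if select_inside_fold else global_cols
        c0 = centroid([x for x, l in zip(X_tr, y_tr) if l == 0], cols)
        c1 = centroid([x for x, l in zip(X_tr, y_tr) if l == 1], cols)
        correct += sum(predict(X[i], cols, c0, c1) == y[i] for i in test_idx)
    return correct / n

# Pure-noise toy data: 60 samples, 1000 features, labels unrelated to the features.
rng = random.Random(0)
X = [[rng.gauss(0.0, 1.0) for _ in range(1000)] for _ in range(60)]
y = [i % 2 for i in range(60)]

leaky = cv_accuracy(X, y, k=20, n_folds=5, select_inside_fold=False)
honest = cv_accuracy(X, y, k=20, n_folds=5, select_inside_fold=True)
print(f"selection on all data: {leaky:.2f}  selection inside folds: {honest:.2f}")
```

Because the features carry no signal, any accuracy well above chance in the first setting would come from selection leakage alone. Whether and how the manuscript's feature selection was nested inside the cross-validation is the kind of detail a more rigorous validation would need to report.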

    (1) The claim in line 18 is misleading. The authors assert that the high cost of whole-genome sequencing (WGS) has limited the clinical application of cfDNA, and they imply that their model is more cost-efficient because it uses fewer DNA molecules, originating only from exonic regions. However, WGS is essential to their analysis. Instead of using whole-exome sequencing data, they extracted the DNA molecules that fall within exonic regions from WGS data for feature extraction and downstream analysis, so the sequencing cost is the same. In this regard, xDELFI, which selectively uses DNA from exonic regions, shows inferior performance compared to the DELFI model using all WGS data (AUC: 0.896 vs. 0.920) at the same cost and with the same WGS data.

    (2) The use of WGS data from 4 distinct datasets (Jiang et al., 2015, Snyder et al., 2016, Cristiano et al., 2019 and Sun et al., 2019) raises concerns about potential batch effects arising from different DNA library preparation kits (e.g., Kapa Library Preparation Kit (Kapa Biosystems); ThruPLEX DNA-seq kits (Rubicon Genomics); NEBNext DNA Library Prep Kit for Illumina (New England Biolabs); and KAPA HTP Library Preparation Kit (Kapa Biosystems), respectively). Each kit may induce different pre-analytical effects on cfDNA fragmentomic features, as evidenced by differing size distribution profiles (e.g., in Fig. 4 of Jiang et al., 2015, the cfDNA size distribution profiles show the major peak at ~166 bp with a frequency of ~3%, whereas in Fig. 1B of Snyder et al., 2016, the major peak at ~166 bp has a frequency of ~2%). To enhance the robustness of their models, the authors should develop a sophisticated normalization pipeline to mitigate batch effects and split the training and testing sets so that no dataset contributes to both. The authors should demonstrate that their model performs equally well between training and testing sets and across different datasets.

    (3) The uneven distribution of cancer patients across different datasets introduces another layer of complexity, potentially confounding the analysis of tissue of origin. In line 300, the authors find that liver, colorectal, and lung cancers had the highest prediction accuracy in their models. However, the cancer patient distribution is not even across datasets (e.g., liver cancer patients are all from Jiang et al., 2015; colorectal cancer patients are mostly from Sun et al., 2019, and Cristiano et al., 2019; and lung cancer patients are mainly from Cristiano et al., 2019). The potential pre-analytical differences between datasets, coupled with the dominance of particular cancer types in each dataset, underscore the importance of addressing these discrepancies to ensure the validity of tissue-of-origin predictions.

    (4) In line 145, the authors mention the selection of features used in the xDELFI model but do not specify the number of features remaining in each fragmentomic category after selection. Providing this information would enhance the transparency and reproducibility of their methodology.
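
The dataset-wise split requested in point (2), and the confounding check implied by point (3), could look roughly like the following. This is a minimal sketch, not the authors' code: the function names are hypothetical, and the toy sample sheet is invented for illustration (only the four dataset names come from the review; the per-dataset counts and class labels do not reflect the real cohorts).

```python
from collections import Counter, defaultdict

def leave_one_dataset_out(dataset_labels):
    """Yield (held_out, train_idx, test_idx), holding out one source dataset at a time."""
    for held_out in sorted(set(dataset_labels)):
        train_idx = [i for i, d in enumerate(dataset_labels) if d != held_out]
        test_idx = [i for i, d in enumerate(dataset_labels) if d == held_out]
        yield held_out, train_idx, test_idx

def class_counts_by_dataset(dataset_labels, class_labels):
    """Tabulate class counts per dataset to expose dataset/class confounding."""
    counts = defaultdict(Counter)
    for d, c in zip(dataset_labels, class_labels):
        counts[d][c] += 1
    return {d: dict(c) for d, c in counts.items()}

# Toy sample sheet (illustrative only; NOT the real cohort composition).
datasets = (["Jiang2015"] * 3 + ["Snyder2016"] * 2
            + ["Cristiano2019"] * 4 + ["Sun2019"] * 3)
classes = ["cancer", "cancer", "healthy",
           "cancer", "healthy",
           "cancer", "cancer", "healthy", "healthy",
           "cancer", "cancer", "healthy"]

splits = list(leave_one_dataset_out(datasets))
for held_out, train_idx, test_idx in splits:
    # No sample from the held-out dataset may appear in the training set.
    assert not set(train_idx) & set(test_idx)
    print(held_out, "train:", len(train_idx), "test:", len(test_idx))

print(class_counts_by_dataset(datasets, classes))
```

Reporting performance per held-out dataset, next to the per-dataset class table, would show whether the model generalizes across library preparation protocols or mainly learns dataset identity.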