AUGMENT: a framework for robust assessment of the clinical utility of segmentation algorithms
Abstract
Background
Evaluation of AI-based segmentation models relies primarily on quantitative metrics, but it remains unclear whether this approach leads to practical, clinically applicable tools.
Purpose
To create a systematic framework for evaluating the performance of segmentation models using clinically relevant criteria.
Materials and Methods
We developed the AUGMENT framework (Assessing Utility of seGMENtation Tools), based on a structured classification of the main categories of error in segmentation tasks. To evaluate the framework, we assembled a team of 20 clinicians covering a broad range of radiological expertise and analysed the challenging task of segmenting metastatic ovarian cancer using AI. We used three evaluation methods: (i) the Dice Similarity Coefficient (DSC); (ii) a visual Turing test, assessing 429 segmented disease sites on 80 CT scans from the Cancer Imaging Atlas; and (iii) the AUGMENT framework, in which three radiologists and the AI model created segmentations of 784 separate disease sites on 27 CT scans from a multi-institution dataset.
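For reference, the DSC used in (i) is the standard voxel-overlap measure between a predicted mask and a reference mask, DSC = 2|A∩B| / (|A| + |B|). A minimal NumPy sketch (the function name and toy masks are illustrative, not taken from the study):

```python
import numpy as np

def dice_similarity(pred: np.ndarray, truth: np.ndarray) -> float:
    """Dice Similarity Coefficient between two binary masks: 2|A∩B| / (|A| + |B|)."""
    pred = pred.astype(bool)
    truth = truth.astype(bool)
    denom = pred.sum() + truth.sum()
    if denom == 0:
        return 1.0  # convention: two empty masks agree perfectly
    return 2.0 * np.logical_and(pred, truth).sum() / denom

# Toy 1-D example: 3 overlapping voxels, 4 predicted, 4 in the reference.
pred = np.array([1, 1, 1, 1, 0, 0])
truth = np.array([0, 1, 1, 1, 1, 0])
print(dice_similarity(pred, truth))  # 2*3 / (4+4) = 0.75
```

The same formula applies voxel-wise to 3-D CT masks; values reported in this abstract appear to be expressed as percentages.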
Results
The AI model had modest technical performance (DSC = 72 ± 19 for pelvic and ovarian disease, and 64 ± 24 for omental disease), and it failed the visual Turing test. However, the AUGMENT framework revealed that (i) the AI model produced segmentations of the same quality as radiologists (p = .46), and (ii) it enabled radiologists to produce human+AI collaborative segmentations of significantly higher quality (p < .001) in significantly less time (p < .001).
Conclusion
Quantitative performance metrics of segmentation algorithms can mask their clinical utility. The AUGMENT framework enables the systematic identification of clinically usable AI models and highlights the importance of assessing the interaction between AI tools and radiologists.
Summary statement
Our framework, called AUGMENT, provides an objective assessment of the clinical utility of segmentation algorithms based on well-established error categories.
Key results
- Combining quantitative metrics with qualitative information on performance from domain experts whose work is impacted by an algorithm's use is a more accurate, transparent, and trustworthy way of appraising an algorithm than using quantitative metrics alone.
- The AUGMENT framework captures clinical utility in terms of segmentation quality and human+AI complementarity, even in algorithms with modest technical segmentation performance.
- AUGMENT might have utility during the development and validation process, including in segmentation challenges, for those seeking clinical translation, and to audit model performance after integration into clinical practice.