Automated L3 Skeletal Muscle Segmentation for Evaluation of Sarcopenia: Development and Independent Validation of an Ensemble-Based 2D nnU-Net Pipeline in a Complex Liver Disease Cohort
Abstract
Purpose: To develop a fully automated 2D nnU-Net pipeline for multi-class skeletal muscle segmentation (psoas, paraspinal, and abdominal wall) at the third lumbar (L3) vertebral level, and to quantitatively evaluate its diagnostic performance and reliability compared to manual segmentation.

Materials and Methods: A 2D nnU-Net was trained on 164 axial L3 CT slices from the multi-institutional AMOS22 dataset, spanning diverse abdominal pathologies and multivendor imaging. To assess generalizability under severe anatomical distortion, independent external validation was performed in 50 consecutive patients with advanced liver disease from a single institution (January–December 2025; mean age, 63 ± 15 years; 32 women, 18 men), of whom 88% had moderate-to-severe ascites. Model stability was examined by comparing a five-fold ensemble with the best-performing single-fold model. Intra-observer reliability of the manual reference standard was evaluated in a random subset of 30 cases. Performance metrics included the Dice Similarity Coefficient (DSC), Pearson correlation coefficient (r), and Bland–Altman analysis for cross-sectional areas and mean attenuation. The inference workflow was deployed via a custom Streamlit-based graphical user interface (GUI).

Results: In this anatomically complex external validation cohort, the five-fold ensemble 2D nnU-Net achieved an overall mean DSC of 0.937 ± 0.043, with 80% of cases achieving a mean DSC ≥ 0.90. While the mean DSC was statistically comparable to that of the best single-fold model (0.937, p = 0.736), the ensemble strategy increased the minimum observed DSC (worst-case performance) from 0.720 to 0.822. Comparison between the ensemble model and manual segmentation yielded a Pearson correlation of r = 0.955 (p < 0.001) for total skeletal muscle area, with a mean bias of +7.17 cm². Intra-observer agreement for the manual reference standard demonstrated a correlation of r = 0.995 for total area. The automated pipeline required 3–5 seconds per case for inference and quantitative reporting, compared to 3–5 minutes for manual segmentation.

Conclusion: In patients with advanced liver disease and substantial anatomical distortion from ascites, an ensemble-based 2D nnU-Net provides quantitative accuracy and measurement agreement comparable to manual L3 skeletal muscle segmentation, while mitigating lower-bound (worst-case) errors relative to single-fold models. Integration with a dedicated GUI enables substantial time savings and supports scalable clinical body composition analysis.
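
For readers unfamiliar with the agreement metrics reported in the Results, the following is a minimal sketch of how a per-class DSC and a Bland–Altman bias with 95% limits of agreement are commonly computed. It is not the authors' code: the label numbering in `LABELS`, the helper names `dice` and `bland_altman`, and all numeric values are hypothetical and chosen only for illustration. Label maps are represented as flattened lists of integers to keep the example dependency-free.

```python
from statistics import mean, stdev

# Hypothetical label convention for illustration only (not taken from the paper).
LABELS = {"psoas": 1, "paraspinal": 2, "abdominal_wall": 3}


def dice(pred, ref, label):
    """Dice Similarity Coefficient for one label over two flattened label maps."""
    if len(pred) != len(ref):
        raise ValueError("label maps must contain the same number of pixels")
    pred_n = sum(1 for p in pred if p == label)
    ref_n = sum(1 for r in ref if r == label)
    overlap = sum(1 for p, r in zip(pred, ref) if p == label and r == label)
    if pred_n + ref_n == 0:
        # Convention assumed here: a label absent from both maps counts as agreement.
        return 1.0
    return 2.0 * overlap / (pred_n + ref_n)


def bland_altman(auto, manual):
    """Return (bias, lower limit, upper limit) for paired measurements.

    Bias is the mean of (auto - manual); the limits are bias +/- 1.96 times
    the sample standard deviation of the differences.
    """
    diffs = [a - m for a, m in zip(auto, manual)]
    bias = mean(diffs)
    spread = 1.96 * stdev(diffs)
    return bias, bias - spread, bias + spread


# Toy 2x4 slice, flattened row by row (made-up labels).
ref = [0, 1, 1, 2, 0, 3, 3, 2]
pred = [0, 1, 0, 2, 0, 3, 3, 3]
per_class = {name: dice(pred, ref, lab) for name, lab in LABELS.items()}
mean_dsc = mean(per_class.values())

# Made-up total muscle areas in cm^2 for three cases (automated vs. manual).
bias, lower, upper = bland_altman([110.0, 95.0, 130.0], [104.0, 90.0, 121.0])
```

A positive bias in this convention means the automated areas are larger on average than the manual ones, which is the sign convention one would assume for the reported +7.17 cm²; the abstract itself does not state the direction of subtraction.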