Modeling strategies for a flexible estimation of the crude cumulative incidence in the context of long follow-ups: model choice and predictive ability evaluation
Abstract
Background Advances in the treatment of chronic diseases such as breast cancer have made it possible to observe patient outcomes other than disease-related mortality, including events such as distant recurrences. Competing events, however, can cloud the interpretation of the primary outcome, leaving the crude cumulative incidence function as the only reliable measure for follow-up analysis. Long-term studies call for flexible modeling that can accommodate complex time-dependent effects and interactions among covariates, which traditional models such as the proportional sub-distribution hazards model often struggle to capture. Although more adaptable methods have been proposed, model complexity still needs to be assessed systematically, especially for exploratory purposes. This article presents a statistical learning workflow for evaluating model complexity when modeling the crude cumulative incidence, and it introduces a time-dependent measure of predictive accuracy. The framework gives researchers a more robust toolkit for handling the complexities of long-term outcome modeling.
Methods Our approach is illustrated using data on time to distant breast cancer recurrence from the Milan 1 and Milan 3 trials, both of which have extensive follow-up. We employ two flexible modeling frameworks, pseudo-observations and sub-distribution hazard models, each enhanced with spline functions to capture the baseline hazard and risk. The proposed workflow uses graphical representations of Aalen-Johansen estimates of the crude cumulative incidence to form visual hypotheses about model complexity and to adjust it to the studied phenomenon. Information criteria guide model selection to approximate the underlying data structure. Using bootstrap perturbations of the data and time-dependent measures of predictive accuracy, adjusted with Harrell’s optimism correction, we identify the optimal model structure, balancing explainability, predictive ability, and generalizability.
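To make the Aalen-Johansen estimate mentioned in the Methods concrete, the following is a minimal sketch of the nonparametric crude cumulative incidence estimator for one cause in the presence of competing events. It is an illustration only, not the authors' implementation: the function name `aalen_johansen_cif`, the event coding (0 = censored, positive integers = event types), and the toy data are assumptions introduced here.

```python
import numpy as np

def aalen_johansen_cif(times, events, cause=1):
    """Crude cumulative incidence for `cause` (hypothetical helper).

    times  : observed follow-up times
    events : 0 = censored, 1..K = type of the first observed event (assumed coding)
    Returns the distinct event times and the estimated incidence at each of them.
    """
    times = np.asarray(times, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(times[events > 0])

    surv = 1.0  # probability of being free of any event just before t
    cif = 0.0   # running cumulative incidence for the cause of interest
    out = []
    for t in event_times:
        at_risk = np.sum(times >= t)
        d_cause = np.sum((times == t) & (events == cause))
        d_any = np.sum((times == t) & (events > 0))
        # Increment: P(event-free before t) * cause-specific hazard at t
        cif += surv * d_cause / at_risk
        # Update overall event-free survival using events of every type
        surv *= 1.0 - d_any / at_risk
        out.append(cif)
    return event_times, np.array(out)

# Toy data (invented for illustration): 1 = event of interest, 2 = competing event
t, cif_1 = aalen_johansen_cif([1, 2, 3, 4, 5], [1, 2, 0, 1, 0], cause=1)
for ti, ci in zip(t, cif_1):
    print(f"t={ti:g}  crude cumulative incidence={ci:.3f}")
```

Plotting such estimates by covariate stratum is one way to look for non-proportional or time-dependent patterns before choosing the complexity of a regression model.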
Results Our findings emphasize the importance of following the original data analysis with data perturbation and validation through optimism-corrected predictive measures. The initial model structure may differ from the most robust model identified through iterative perturbation. The ideal model combines high robustness (it is the one most frequently selected across perturbations) with strong explainability and predictive capacity. When perturbation results are inconsistent, evaluating several time-dependent predictive measures offers additional insight, particularly into any trade-off between model complexity and predictive gain. Where the predictive improvement is minimal, simpler and more explainable model structures are preferable.
Conclusions The proposed statistical learning workflow, informed by domain expertise, allows clinically relevant complexities to be incorporated into prognostic modeling. Our results suggest that, in many cases, a nuanced, flexible model structure may serve future predictions better than a simpler model. The approach demonstrates the value of balancing model simplicity and complexity to achieve meaningful, clinically useful insights.
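The bootstrap optimism correction referred to above can be sketched generically as follows. This is a simplified illustration under stated assumptions, not the workflow used in the article: the helper `optimism_corrected`, the `fit` and `score` callables, and the least-squares example are all hypothetical, and the score is assumed to be "higher is better". In the article's setting, `score` would be a time-dependent predictive accuracy measure evaluated on a fitted cumulative incidence model.

```python
import numpy as np

def optimism_corrected(X, y, fit, score, n_boot=200, seed=0):
    """Bootstrap optimism correction in the style of Harrell (hypothetical helper).

    fit(X, y)          -> fitted model
    score(model, X, y) -> performance measure, assumed higher = better
    """
    rng = np.random.default_rng(seed)
    X, y = np.asarray(X), np.asarray(y)
    n = len(y)

    # Apparent performance: model fitted and evaluated on the same data
    apparent = score(fit(X, y), X, y)

    optimism = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample rows with replacement
        model = fit(X[idx], y[idx])
        boot_perf = score(model, X[idx], y[idx])  # evaluated on the bootstrap sample
        orig_perf = score(model, X, y)            # evaluated on the original data
        optimism.append(boot_perf - orig_perf)

    mean_optimism = float(np.mean(optimism))
    return {
        "apparent": float(apparent),
        "optimism": mean_optimism,
        "corrected": float(apparent - mean_optimism),
    }

# Illustrative stand-ins for a model and a score (not from the article)
def fit_ols(X, y):
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

def neg_mse(coef, X, y):
    return -float(np.mean((y - X @ coef) ** 2))

rng = np.random.default_rng(1)
X_demo = rng.normal(size=(40, 3))
y_demo = X_demo @ np.array([1.0, 0.0, -1.0]) + rng.normal(size=40)
print(optimism_corrected(X_demo, y_demo, fit_ols, neg_mse, n_boot=100))
```

Repeating the whole model selection step inside each bootstrap replicate, rather than only refitting a fixed structure, would also show how often each candidate structure is selected, which is the notion of robustness described in the Results.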