Retrospective cohort study extracting coexisting background breast-lesion features from stage I-III invasive breast cancer
Abstract
Background
Background breast features are frequently noted in pathology reports alongside invasive breast cancer but rarely factor into prognosis or treatment decisions. Their relationship to tumor characteristics and patient outcomes remains incompletely characterized.
Methods
We conducted a retrospective cohort study of 7,603 patients with stage I–III invasive breast cancer (diagnosed 1991–2022, age <80 years) from the Joint Breast Cancer Registry in Singapore. Natural language processing (NLP) was applied to 9,754 free-text pathology reports to extract coexisting background breast features, and extraction accuracy was validated by dual-reviewer assessment of 200 reports. Unsupervised hierarchical clustering grouped the extracted features into three categories. Associations with tumor characteristics were assessed by multinomial logistic regression, and ten-year overall survival was assessed with Cox proportional hazards models (median follow-up 9.6 years; 620 deaths).
Results
Here we show that NLP-based extraction of background breast features from routine pathology reports achieves an accuracy of over 90% across features. Lobular neoplasia and benign proliferative changes are associated with less aggressive tumor characteristics, whereas early neoplastic and papillary lesions are more prevalent in HER2-enriched and luminal B tumor subtypes. Benign proliferative changes are associated with better survival in age- and year-adjusted models (hazard ratio 0.91, 95% CI 0.86–0.97), but this association is attenuated after adjustment for stage and subtype.
Conclusions
NLP-enabled extraction of background breast features from pathology text is feasible at scale. These features reflect tumor biology but do not independently add prognostic information beyond established clinical variables.
Plain language summary
When a patient is diagnosed with breast cancer, the pathologist examining the tissue sample may also observe other changes in the surrounding breast, such as benign growths, cysts, or early precancerous lesions. These incidental findings are often recorded in free-text notes within pathology reports but are rarely studied in large numbers because collecting them by hand is impractical. We used a computer program that reads medical text (natural language processing) to extract these background findings automatically from over 9,700 pathology reports collected across 30 years. We found that certain background changes are linked to specific tumor types, but they do not independently predict which patients will do better or worse once tumor stage and type are considered. This work shows that valuable information in pathology reports can be automatically extracted at scale, opening new opportunities to study the tissue environment in which breast cancers develop.
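To make the extraction and validation steps more concrete, the following is a minimal sketch of how background features might be pulled from free-text reports and scored against reviewer labels. The abstract does not specify the NLP method used, so this is a simple rule-based stand-in, not the authors' pipeline: the pattern lexicon, the feature keys, the negation cues, and the function names `extract_features` and `per_feature_accuracy` are all assumptions made for illustration.

```python
import re

# Hypothetical lexicon: the study's actual vocabulary is not given in the abstract.
FEATURE_PATTERNS = {
    "lobular_neoplasia": r"lobular (?:neoplasia|carcinoma in situ)|\blcis\b",
    "papillary_lesion": r"papillom\w*|papillary lesion",
    "cyst": r"\bcysts?\b|cystic change",
    "benign_proliferative": r"usual ductal hyperplasia|sclerosing adenosis|fibroadenoma",
}

# Assumed negation cues; only cues that appear BEFORE the match are checked,
# so trailing negations ("papilloma not identified") are missed by this sketch.
NEGATION = re.compile(r"\b(?:no|not|without|negative for|absent)\b")


def extract_features(report: str) -> set:
    """Return the set of feature keys mentioned (and not negated) in a report."""
    found = set()
    # Split into rough sentences so a negation only affects its own sentence.
    for sentence in re.split(r"[.;\n]+", report.lower()):
        for name, pattern in FEATURE_PATTERNS.items():
            match = re.search(pattern, sentence)
            if match and not NEGATION.search(sentence[: match.start()]):
                found.add(name)
    return found


def per_feature_accuracy(predicted: list, reference: list) -> dict:
    """Fraction of reports where predicted presence/absence matches the reference."""
    n = len(reference)
    return {
        name: sum((name in p) == (name in r) for p, r in zip(predicted, reference)) / n
        for name in FEATURE_PATTERNS
    }


if __name__ == "__main__":
    # Invented example text, not taken from the registry.
    report = (
        "Invasive ductal carcinoma, grade 2. "
        "Background breast shows sclerosing adenosis and cysts. "
        "No lobular neoplasia seen."
    )
    predicted = [extract_features(report)]
    reviewer_labels = [{"benign_proliferative", "cyst"}]  # hypothetical reviewer annotation
    print(predicted[0])
    print(per_feature_accuracy(predicted, reviewer_labels))
```

In a validation of the kind described in Methods, `reviewer_labels` would hold the dual-reviewer annotations for the sampled reports, and the per-feature agreement would be the quantity compared against the reported accuracy threshold.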