Retrospective cohort study extracting coexisting background breast-lesion features from stage I-III invasive breast cancer

Ryan Jak Yang Lim
Phyu Nitar
Kah Weng Lau
Lester Chee Hao Leong
Geok Hoon Lim
Veronique Kiak Mien Tan
Benita Kiat Tee Tan
Ern Yu Tan
Serene Si Ning Goh
Mikael Hartman
Fuh Yong Wong
Jingmei Li
Joint Breast Cancer Registry

Abstract

Background

Background breast features are frequently noted in pathology reports alongside invasive breast cancer but rarely factor into prognosis or treatment decisions. Their relationship to tumor characteristics and patient outcomes remains incompletely characterised.

Methods

We conducted a retrospective cohort study of 7,603 patients with Stage I–III invasive breast cancer (diagnosed 1991–2022, age <80 years) from the Joint Breast Cancer Registry in Singapore. Natural language processing (NLP) was applied to 9,754 free-text pathology reports to extract co-existing background breast features, with accuracy validated by dual-reviewer assessment of 200 reports. Unsupervised hierarchical clustering grouped extracted features into three categories. Associations with tumor characteristics were assessed by multinomial logistic regression, and ten-year overall survival by Cox proportional hazards models (median follow-up 9.6 years; 620 deaths).
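The abstract does not say which NLP technique was used, so the following is only a minimal sketch of one possible approach: keyword matching with a simple same-sentence negation check, followed by per-feature agreement against reviewer labels. The feature names, patterns, and the functions `extract_features` and `per_feature_accuracy` are hypothetical illustrations, not the study's lexicon or pipeline.

```python
import re

# Hypothetical feature lexicon: names and patterns are illustrative
# assumptions, not the vocabulary used in the study.
FEATURE_PATTERNS = {
    "fibrocystic_change": r"fibrocystic",
    "usual_ductal_hyperplasia": r"usual (?:type )?ductal hyperplasia",
    "lobular_neoplasia": r"lobular neoplasia|atypical lobular hyperplasia",
    "papilloma": r"papilloma",
}

# Cue words that, when they appear before a mention in the same sentence,
# are treated as negating it (a deliberately crude rule).
NEGATION = re.compile(r"\b(?:no|not|without|negative for)\b")


def extract_features(report: str) -> set:
    """Return features mentioned in a report, skipping negated mentions."""
    found = set()
    for sentence in re.split(r"[.;\n]+", report.lower()):
        for name, pattern in FEATURE_PATTERNS.items():
            match = re.search(pattern, sentence)
            # Keep the mention only if no negation cue precedes it.
            if match and not NEGATION.search(sentence[:match.start()]):
                found.add(name)
    return found


def per_feature_accuracy(reports, reference):
    """Per feature, the share of reports where extraction matches the labels."""
    predicted = [extract_features(r) for r in reports]
    return {
        name: sum((name in p) == (name in ref)
                  for p, ref in zip(predicted, reference)) / len(reports)
        for name in FEATURE_PATTERNS
    }
```

In a validation of the kind described above, `reference` would hold the reviewer-assigned feature sets for the sampled reports, and the returned dictionary would give one accuracy value per feature.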

Results

Here we show that NLP-based extraction of background breast features from routine pathology reports achieves an accuracy of over 90% across features. Lobular neoplasia and benign proliferative changes are associated with less aggressive tumor characteristics, whereas early neoplastic and papillary lesions are more prevalent in HER2-enriched and luminal B tumor subtypes. Benign proliferative changes are associated with better survival in age- and year-adjusted models (hazard ratio 0.91, 95% CI 0.86–0.97), but this association is attenuated after adjustment for stage and subtype.
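As a reading aid for the reported hazard ratio, the sketch below backs out the log-scale standard error and an approximate Wald z statistic from a hazard ratio and its 95% confidence interval. It assumes the interval is a Wald-type interval that is symmetric on the log scale, and it works from the rounded values quoted above, so the outputs are approximations rather than the study's own statistics. The function name `log_hr_summary` is hypothetical.

```python
import math


def log_hr_summary(hr, lower, upper, z_crit=1.959964):
    """Approximate log-scale SE, Wald z, and two-sided p from an HR and 95% CI.

    Assumes a Wald-type interval: log(HR) +/- z_crit * SE.
    """
    # The CI width on the log scale spans 2 * z_crit standard errors.
    se = (math.log(upper) - math.log(lower)) / (2 * z_crit)
    z = math.log(hr) / se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2))
    return se, z, p


# Rounded values from the age- and year-adjusted model quoted above.
se, z, p = log_hr_summary(0.91, 0.86, 0.97)
print(f"SE(log HR) ~ {se:.3f}, z ~ {z:.2f}, two-sided p ~ {p:.4f}")
```

Because the inputs are rounded to two decimals, small differences from any exact values in the full article are expected.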

Conclusions

NLP-enabled extraction of background breast features from pathology text is feasible at scale. These features reflect tumor biology but do not independently add prognostic information beyond established clinical variables.

PLAIN LANGUAGE SUMMARY

When a patient is diagnosed with breast cancer, the pathologist examining the tissue sample may also observe other changes in the surrounding breast, such as benign growths, cysts, or early precancerous lesions. These incidental findings are often recorded in free-text notes within pathology reports but are rarely studied in large numbers because collecting them by hand is impractical. We used a computer program that reads medical text (natural language processing) to extract these background findings automatically from over 9,700 pathology reports collected across 30 years. We found that certain background changes are linked to specific tumor types, but they do not independently predict which patients will do better or worse once tumor stage and type are considered. This work shows that valuable information in pathology reports can be automatically extracted at scale, opening new opportunities to study the tissue environment in which breast cancers develop.

Version published to 10.64898/2026.05.19.26353633 on medRxiv
May 22, 2026