LSD600: the first corpus of biomedical abstracts annotated with lifestyle–disease relations

Abstract

Lifestyle factors (LSFs) are increasingly recognized as instrumental in both the development and control of diseases. Despite their importance, there is a lack of methods to extract relations between LSFs and diseases from the literature, a step necessary to consolidate the currently available knowledge into a structured form. As simple co-occurrence-based relation extraction (RE) approaches are unable to distinguish between the different types of LSF-disease relations, context-aware models such as transformers are required to extract and classify these relations into specific relation types. However, no comprehensive LSF–disease RE system existed, nor a corpus suitable for developing one. We present LSD600 (available at https://zenodo.org/records/13952449), the first corpus specifically designed for LSF–disease RE, comprising 600 abstracts with 1900 relations of eight distinct types between 5027 diseases and 6930 LSF entities. We evaluated LSD600’s quality by training a RoBERTa model on the corpus, achieving an F-score of 68.5% for the multilabel RE task on the held-out test set. We further validated LSD600 by using the trained model on the two Nutrition-Disease and FoodDisease datasets, where it achieved F-scores of 70.7% and 80.7%, respectively. Building on these performance results, LSD600 and the RE system trained on it can be valuable resources to fill the existing gap in this area and pave the way for downstream applications.

Database URL: https://zenodo.org/records/13952449

This Zenodo record is a permanently preserved version of a PREreview. You can view the complete PREreview at https://prereview.org/reviews/13869488.

Overview

The authors assemble the LSD600 corpus of 600 abstracts, of which 324 describe relations between lifestyle factors and diseases (lifestyle–disease, hence LSD, relations). The remaining 276 abstracts appeared at initial selection to potentially contain LSD relations, based on their presence in LSF200 and their selection by an automated named entity recognition "Tagger". The LSD mentions and relations were manually annotated according to 8 predefined relation types.

The major contributions of the work include the manual annotation of these 600 abstracts, the definition of an LSD relation type hierarchy, and the training of a RoBERTa-based language model for relation extraction.

The LSD dataset will likely be most useful as a resource to train and evaluate more scalable approaches. The authors share their annotations and relation extraction model under permissive open licenses. This work is a timely contribution to a burgeoning field. The obvious next steps are applying the model to all relevant abstracts or accessible full texts, as well as grounding diseases and lifestyle factors to controlled vocabularies.

Suggestions

I opened GitHub issues for any suggestions that could involve code or data revisions and additions. The authors have begun addressing some of these requests. I note them below for completeness.

EsmaeilNourani/lifestylefactors-annotation-docs#2 requests a table of the 600 abstracts included in the corpus and several metadata fields. This table allows viewers to easily browse which abstracts are included along with the number of annotated lifestyle factors, diseases, and relations.
EsmaeilNourani/lifestylefactors-annotation-docs#1 requests a table of the 1900 manually annotated relations. This table is the best resource for a reader to easily familiarize themselves with the relationship set comprising the resource.
EsmaeilNourani/lifestylefactors-annotation-docs#3 notes some small but glaring inconsistencies in relation type labels and capitalization.
The "Manual annotation process and corpus evaluation" section discusses some details of the manual curation, including the inter-annotator agreement experiment. Since the curation is a major part of the study, further details on the entire curation task would be helpful. For example, which authors performed the annotations, and how many did each do? Were annotators assigned at the abstract level?
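To make the first table request concrete, here is a minimal sketch of how per-abstract counts could be derived from the annotations. The record layout, the document identifiers, and the names `annotations` and `summarize` are assumptions for illustration only and do not reflect the actual format of the LSD600 files.

```python
from collections import Counter

# Hypothetical annotation records: (document id, annotation kind).
# The identifiers and the three kind labels are invented for this sketch.
annotations = [
    ("doc_A", "lifestyle_factor"),
    ("doc_A", "lifestyle_factor"),
    ("doc_A", "disease"),
    ("doc_A", "relation"),
    ("doc_B", "disease"),
]

KINDS = ("lifestyle_factor", "disease", "relation")


def summarize(records):
    """Return one row per document with a count for each annotation kind."""
    per_doc = {}
    for doc_id, kind in records:
        per_doc.setdefault(doc_id, Counter())[kind] += 1
    # Sort by document id so the table is stable across runs.
    return [
        {"doc_id": doc_id, **{kind: counts[kind] for kind in KINDS}}
        for doc_id, counts in sorted(per_doc.items())
    ]


rows = summarize(annotations)
for row in rows:
    print(row)
```

A table like this, with one row per abstract, would let a reader see at a glance which abstracts contain no relations at all.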

If I close the above-referenced GitHub issues, the editorial staff can consider that an acknowledgment that the suggestions have been adequately addressed.

Comment

For pubmed:32004098, cocaine has 3 different relations with liver fibrosis: Statistically_associated, positive_statistical_association, and NO_statistical_association. I believe these entity mentions are coming from the following snippet:

> No significant association was noted among HIV seronegative participants for liver fibrosis by sex differences or cocaine use. Among African Americans living with HIV, cocaine users were 1.68 times more likely to have liver fibrosis than cocaine nonusers (p = 0.044). Conclusions: Sex differences and cocaine use appear to affect liver disease among African Americans living with HIV pointing to the importance of identifying at-risk individuals to improve outcomes of liver disease.

I believe the annotation is correct, and no action is needed here. I point this out just as an interesting occurrence that highlights the challenge of aggregating textual relations into knowledge/facts.
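The aggregation challenge can be illustrated with a minimal sketch that groups relations by entity pair and flags pairs carrying more than one relation type. The tuple layout and the function names are assumptions for illustration; only the document identifier, the two entities, and the three relation labels come from the example above.

```python
from collections import defaultdict

# Hypothetical relation records:
# (document id, lifestyle factor, disease, relation type).
# The tuple layout is an assumption, not the corpus file format.
relations = [
    ("pubmed:32004098", "cocaine", "liver fibrosis", "Statistically_associated"),
    ("pubmed:32004098", "cocaine", "liver fibrosis", "positive_statistical_association"),
    ("pubmed:32004098", "cocaine", "liver fibrosis", "NO_statistical_association"),
]


def relation_types_by_pair(records):
    """Collect the set of relation types seen for each (doc, factor, disease)."""
    grouped = defaultdict(set)
    for doc_id, factor, disease, rel_type in records:
        grouped[(doc_id, factor.lower(), disease.lower())].add(rel_type)
    return dict(grouped)


def pairs_with_multiple_types(grouped):
    """Keep only the entity pairs annotated with more than one relation type."""
    return {pair: types for pair, types in grouped.items() if len(types) > 1}


grouped = relation_types_by_pair(relations)
flagged = pairs_with_multiple_types(grouped)
for pair, types in flagged.items():
    print(pair, sorted(types))
```

Flagged pairs are not errors in themselves. As in this abstract, they can reflect findings that hold in different subpopulations, which a single aggregated fact would fail to capture.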

Competing interests

The author declares that they have no competing interests.
