SigMine and OPathDb: A Literature-Mining Pipeline and Opportunistic Pathogen Database

Abstract

Recognizing cross-domain associations between biological entities in the vast biomedical literature is a challenging task. SigMine, an automated pipeline, was constructed to systematically mine biomedical literature and identify significantly associated biological entities. SigMine performs biomedical entity recognition on PMC articles using machine-learning and deep-learning–based entity recognition through the Europe PMC Annotation API. Advanced entity recognition was then performed using Python scripting, NCBI E-Utilities, and an n-gram algorithm, followed by extensive data cleaning and mapping against standard databases. Statistical evaluation identified significant associations between entities. The entire workflow was automated through a modular framework developed in Python v3.13 with a Tkinter-based graphical user interface. SigMine enhances usability while retaining the flexibility to use new dictionaries for annotation. SigMine was used to construct a human Opportunistic Pathogens Database (OPathDb), housing 5,626 novel opportunistic pathogens significantly associated with 1,440 diseases and 7,121 genes mined from 25,000 PMC articles. Additional annotation of 598 significantly associated metabolites and 30 affected tissues is available for 3,204 and 227 pathogens, respectively. OPathDb has a user-friendly query interface, searchable by organism, disease, tissue, gene, protein, and metabolite, available at https://www.opathdb.cbsblab-nsut.in. Organism–entity associations can be visualized as weighted networks with color-coded nodes and significance-scaled edges. Significant associations of opportunistic pathogens, such as Akkermansia muciniphila with colorectal cancer and Segatella copri with glucose intolerance, can be identified through OPathDb. The SigMine framework demonstrates efficient recognition and prioritization of relationships in a vast and heterogeneous corpus.
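
The abstract does not state which statistical test SigMine applies, so the following is only a minimal sketch of one common way to score literature co-occurrence: a one-sided Fisher exact test on a 2x2 table of article counts. The function names, the input layout (one set of recognized entities per article), and the placeholder entity names are all assumptions made for illustration; none of them come from SigMine itself.

```python
from itertools import product
from math import comb


def fisher_right_tail(a, b, c, d):
    """One-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]].

    Returns the probability of seeing at least `a` co-occurrences
    if the two entities were mentioned independently of each other.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    upper = min(row1, col1)
    # Sum hypergeometric probabilities for k = a .. upper.
    tail = sum(comb(row1, k) * comb(n - row1, col1 - k)
               for k in range(a, upper + 1))
    return tail / comb(n, col1)


def cooccurrence_pvalues(articles, group_x, group_y):
    """Score every (x, y) pair by article-level co-occurrence.

    `articles` is a list of sets, one set of entity names per article
    (a hypothetical input format, not SigMine's actual data model).
    """
    n = len(articles)
    results = {}
    for x, y in product(group_x, group_y):
        a = sum(1 for ents in articles if x in ents and y in ents)
        b = sum(1 for ents in articles if x in ents and y not in ents)
        c = sum(1 for ents in articles if x not in ents and y in ents)
        d = n - a - b - c  # articles mentioning neither entity
        results[(x, y)] = fisher_right_tail(a, b, c, d)
    return results


# Toy corpus with placeholder names; not data from the study.
toy_articles = [
    {"organism_A", "disease_D"},
    {"organism_A", "disease_D"},
    {"organism_A", "disease_D"},
    {"organism_B"},
    {"organism_B"},
    {"organism_B"},
]
pvalues = cooccurrence_pvalues(toy_articles, ["organism_A"], ["disease_D"])
```

In a corpus of realistic size, p-values computed this way for many pairs would normally be corrected for multiple testing before any association is called significant; whether and how SigMine does this is not described in the abstract.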
