CausalKnowledgeTrace: A Novel Computational Framework for Automated Literature-Based Causal Graph Construction and Evidence-Based Variable Selection in Biomedical Research

Rajesh Upadhayaya
Manjil Pradhan
Vincent Metzger
Scott Alexander Malec


Abstract

Background

Variable selection for causal inference from observational biomedical data is challenging, as overlooking confounders or conditioning on colliders leads to biased estimates. While vast causal knowledge exists in biomedical literature, manually extracting this information for principled variable selection is impractical at scale.

Methods

We developed CausalKnowledgeTrace, a Python-based computational framework with a Django web interface that systematically leverages structured causal knowledge from the Semantic MEDLINE Database (SemMedDB) to inform variable selection in causal studies. The system uses NetworkX for graph operations and implements a six-stage analysis pipeline: graph parsing, basic analysis, comprehensive cycle detection, systematic generic node removal, post-removal analysis, and formal causal inference with bias detection.
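To make the pipeline stages concrete, the following is a minimal sketch of basic analysis, cycle detection, generic node removal, and post-removal analysis with NetworkX. It is an illustration under stated assumptions, not the framework's own code: the edge list, the `GENERIC_NODES` set, and the helper names `summarize` and `remove_generic_nodes` are hypothetical, and the toy edges are not taken from SemMedDB.

```python
import networkx as nx

# Hypothetical toy edges (cause, effect); NOT extracted from SemMedDB.
EDGES = [
    ("Hypertension", "Ischemia"),
    ("Ischemia", "Alzheimers"),
    ("Inflammation", "Hypertension"),
    ("Inflammation", "Alzheimers"),
    ("Disease", "Inflammation"),
    ("Alzheimers", "Disease"),
]

# Hypothetical set of overly generic concepts to drop (an assumption).
GENERIC_NODES = {"Disease"}


def summarize(graph):
    """Basic analysis: node and edge counts, density, number of simple cycles."""
    return {
        "nodes": graph.number_of_nodes(),
        "edges": graph.number_of_edges(),
        "density": nx.density(graph),
        "cycles": sum(1 for _ in nx.simple_cycles(graph)),
    }


def remove_generic_nodes(graph, generic):
    """Return a copy of the graph without the generic nodes and their edges."""
    pruned = graph.copy()
    pruned.remove_nodes_from(generic)
    return pruned


graph = nx.DiGraph(EDGES)                            # graph parsing
before = summarize(graph)                            # basic analysis + cycle detection
pruned = remove_generic_nodes(graph, GENERIC_NODES)  # generic node removal
after = summarize(pruned)                            # post-removal analysis

print("before:", before)
print("after: ", after)
print("acyclic after removal:", nx.is_directed_acyclic_graph(pruned))
```

In this toy graph both cycles pass through the generic node, so removing it leaves a directed acyclic graph. Real literature-derived graphs may retain cycles after this step and need further handling.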

Results

Analysis of the hypertension-Alzheimer’s relationship across three degree neighborhoods (degrees 1 to 3) demonstrated systematic scaling of causal complexity: variables increased from 361 to 866 and relationships from 429 to 1,442, while graph density decreased from 0.0033 to 0.0019. The analysis revealed complex cyclic structures, with 54 to 606 baseline cycles across degree levels. Processing times ranged from 0.3 to 1.0 seconds across all three degrees, demonstrating computational efficiency for complex biomedical networks. Key confounders identified across all degrees included inflammation, diabetes, insulin resistance, obesity, and ischemia. In the third-degree graph, the pipeline structurally identified 39 confounders, 11 mediators, and 3 colliders. All of the key identified confounders and mediators, including obesity, oxidative stress, ischemia, and vascular diseases, were found to have strong supporting evidence in the established epidemiological and pathophysiological literature.
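As an illustration of what structural identification of confounders, mediators, and colliders can look like, the sketch below classifies nodes in a small acyclic graph relative to an exposure and an outcome. The role definitions used here (common cause, node on a directed exposure-to-outcome path, common effect) are a simplifying assumption and may differ from the paper's criteria. The toy edges and the helper names `classify_roles` and `directed_density` are hypothetical. The density helper assumes the standard directed-graph formula m / (n(n - 1)) and assumes that 361 variables pair with 429 relationships and 866 with 1,442.

```python
import networkx as nx


def classify_roles(dag, exposure, outcome):
    """Classify nodes by structural role relative to exposure and outcome.

    Simplified definitions (assumptions for this sketch):
      confounder: causes the exposure, and causes the outcome by a path
                  that does not pass through the exposure
      mediator:   lies on a directed path from exposure to outcome
      collider:   is caused by the outcome, and by the exposure via a path
                  that does not pass through the outcome
    """
    without_exposure = dag.copy()
    without_exposure.remove_node(exposure)
    without_outcome = dag.copy()
    without_outcome.remove_node(outcome)

    confounders = nx.ancestors(dag, exposure) & nx.ancestors(without_exposure, outcome)
    mediators = nx.descendants(dag, exposure) & nx.ancestors(dag, outcome)
    colliders = nx.descendants(without_outcome, exposure) & nx.descendants(dag, outcome)
    return {"confounders": confounders, "mediators": mediators, "colliders": colliders}


def directed_density(n_nodes, n_edges):
    """Density of a directed graph without self-loops: m / (n * (n - 1))."""
    return n_edges / (n_nodes * (n_nodes - 1))


# Hypothetical toy DAG; edges are illustrative, not results from the paper.
dag = nx.DiGraph([
    ("Obesity", "Hypertension"),
    ("Obesity", "Alzheimers"),
    ("Hypertension", "Ischemia"),
    ("Ischemia", "Alzheimers"),
    ("Hypertension", "Hospitalization"),
    ("Alzheimers", "Hospitalization"),
])

roles = classify_roles(dag, "Hypertension", "Alzheimers")
print(roles)

# Check that the reported densities match the directed-graph formula.
print(round(directed_density(361, 429), 4))
print(round(directed_density(866, 1442), 4))
```

Under these assumptions the reported densities of 0.0033 and 0.0019 are reproduced by the directed-graph formula, which suggests, but does not confirm, that this is the definition used.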

Conclusions

CausalKnowledgeTrace provides a scalable, evidence-based approach to causal graph construction that systematically identifies confounders and bias structures often missed by conventional approaches. The Python-Django architecture enables both standalone analysis and integration into larger computational workflows, representing a significant advance in computational support for causal inference in biomedical research.

Statement of Significance

Problem or Issue

Selecting proper confounders and variables for causal inference from observational biomedical datasets is challenging and often biased by limited expertise or manual review.

What is Already Known

Existing approaches rely on domain experts, statistical variable screening, or manual construction of causal graphs, but these often overlook literature-documented confounders and complex biases.

What this Paper Adds

This paper introduces an automated, literature-based framework for synthesizing and validating causal graphs, identifying critical variables and complex bias structures, such as M-bias and butterfly bias, with full evidentiary traceability.
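For readers unfamiliar with M-bias, the sketch below shows one simple way such a structure could be detected in a directed graph. M-bias arises when a variable M is a common effect of two causes, one of which also causes the exposure and the other the outcome; adjusting for M then opens a non-causal path between exposure and outcome. The function `find_m_bias`, the node names, and the restriction to direct edges are assumptions made for illustration and are not the paper's detection method. Butterfly bias, where M additionally causes both exposure and outcome, is not covered here.

```python
import networkx as nx


def find_m_bias(dag, exposure, outcome):
    """Return (a, m, b) triples forming exposure <- a -> m <- b -> outcome.

    Simplification (assumption): only direct edges are checked, and a and b
    must not themselves be direct common causes of exposure and outcome.
    """
    endpoints = {exposure, outcome}
    hits = []
    for m in dag.nodes:
        if m in endpoints:
            continue
        parents = set(dag.predecessors(m)) - endpoints
        for a in parents:
            for b in parents:
                if a == b:
                    continue
                if (
                    dag.has_edge(a, exposure)
                    and dag.has_edge(b, outcome)
                    and not dag.has_edge(a, outcome)
                    and not dag.has_edge(b, exposure)
                ):
                    hits.append((a, m, b))
    return sorted(hits)


# Hypothetical toy graph with abstract node names.
dag = nx.DiGraph([
    ("U1", "X"),
    ("U1", "M"),
    ("U2", "M"),
    ("U2", "Y"),
    ("X", "Y"),
])

m_structures = find_m_bias(dag, "X", "Y")
print(m_structures)
```

In this toy graph, M is flagged as a variable that should not be adjusted for when estimating the effect of X on Y.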

Who would benefit from the new knowledge in this paper?

Epidemiologists, biomedical researchers, informaticians, and clinical investigators seeking reliable and transparent causal modeling for observational studies.

Version published to 10.64898/2026.05.07.723601 on bioRxiv
May 12, 2026
