InterFeat: A Pipeline for Finding Interesting Scientific Features
Abstract
Finding interesting phenomena is the core of scientific discovery, but the notion of interestingness is vaguely defined and relies heavily on manual judgment. We present InterFeat, an integrative pipeline for automating the discovery and ranking of interesting features in structured biomedical data. The pipeline combines machine learning, knowledge graphs, literature search, and Large Language Models. We formalize “interestingness” as a combination of novelty, utility, and plausibility. In a time-split evaluation, InterFeat was trained only on data available before a cutoff date and surfaced risk factors years ahead of their eventual discovery: across eight major diseases, up to 21% of its suggested factors appeared in the literature after the cutoff. In a human evaluation, four senior physicians annotated InterFeat’s suggestions and deemed 28% of them interesting. Among highly ranked candidates, 40–53% were interesting, versus 0–7% for a SHAP baseline. InterFeat addresses the challenge of operationalizing “interestingness” at scale for any target with existing literature. Code and data: https://github.com/LinialLab/InterFeat
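The time-split evaluation described in the abstract can be illustrated with a minimal sketch. It is not taken from the InterFeat repository: the `Candidate` structure, the `first_reported_year` field, and the `post_cutoff_hit_rate` function are hypothetical names introduced here. The choice to keep every suggestion in the denominator, including ones with no supporting publication, is also an assumption, as is the use of whole years for the cutoff.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Candidate:
    """A suggested feature for a target disease (hypothetical structure)."""
    feature: str
    # Year the feature-disease link first appears in the literature;
    # None means no supporting publication was found.
    first_reported_year: Optional[int]


def post_cutoff_hit_rate(candidates: Sequence[Candidate], cutoff_year: int) -> float:
    """Fraction of candidates first reported strictly after the cutoff year.

    Assumption: all suggestions count in the denominator, whether or not
    any publication was ever found for them.
    """
    if not candidates:
        return 0.0
    hits = sum(
        1
        for c in candidates
        if c.first_reported_year is not None and c.first_reported_year > cutoff_year
    )
    return hits / len(candidates)


# Toy, made-up inputs purely for illustration (not data from the paper).
suggestions = [
    Candidate("feature_a", 2015),
    Candidate("feature_b", None),
    Candidate("feature_c", 2009),
    Candidate("feature_d", 2018),
]
rate = post_cutoff_hit_rate(suggestions, cutoff_year=2011)
print(f"{rate:.0%} of suggestions were first reported after the cutoff")
```

In this toy setup, a suggestion counts as a hit only if its first supporting publication is dated after the cutoff. This mirrors the idea of surfacing factors before their eventual discovery, although the paper's actual matching procedure may differ.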