Performance of Information Theory Derived Semantic Similarity Algorithms for Differential Diagnosis and Clustering
Abstract
Semantic similarity analysis with the Human Phenotype Ontology (HPO) enables fuzzy, specificity-weighted comparisons of the clinical manifestations of individuals and diseases, and can be used to support differential diagnostics or to stratify cohorts. Many methods have been proposed to calculate semantic similarity for various applications. These include the Phenomizer, which calculates the average best match over all terms in the query and disease, and set-based methods ranging from the Jaccard Intersection to methods that leverage conditional information content. However, these methods have not been described under a single mathematical model or robustly compared using a comprehensive data set. Here, we describe several semantic similarity algorithms using derivations based on information theory, propose three of our own variations on these models, and compare the performance of each approach for differential diagnostic ranking and phenotypic clustering.
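To make the two families of methods mentioned above concrete, the following is a minimal sketch on a toy ontology. It is not the SetSim API and not the authors' implementation. The term names, the disease annotations, and all function names are hypothetical. It assumes information content is defined as the negative log of the fraction of diseases annotated to a term or any of its descendants, that term-to-term similarity is the information content of the most informative common ancestor (Resnik-style), and that the average best match is symmetrized by averaging the two directions.

```python
import math

# Hypothetical toy ontology (not real HPO terms): term -> direct parents.
PARENTS = {
    "root": set(),
    "A": {"root"},
    "B": {"root"},
    "A1": {"A"},
    "A2": {"A"},
    "B1": {"B"},
}

# Hypothetical disease annotations (most specific terms only).
DISEASES = {
    "D1": {"A1", "B1"},
    "D2": {"A2"},
    "D3": {"B1"},
    "D4": {"A1"},
}


def ancestors(term):
    """Return the term itself plus all of its ancestors."""
    result = {term}
    for parent in PARENTS[term]:
        result |= ancestors(parent)
    return result


def closure(terms):
    """Expand a set of terms with every implied ancestor term."""
    expanded = set()
    for term in terms:
        expanded |= ancestors(term)
    return expanded


def information_content(diseases):
    """IC(t) = -log(fraction of diseases annotated to t or a descendant)."""
    total = len(diseases)
    ic = {}
    for term in PARENTS:
        count = sum(1 for terms in diseases.values() if term in closure(terms))
        if count:  # terms that are never used get no IC value in this sketch
            ic[term] = -math.log(count / total)
    return ic


def resnik(t1, t2, ic):
    """IC of the most informative common ancestor of two terms."""
    shared = ancestors(t1) & ancestors(t2)
    return max(ic[t] for t in shared if t in ic)


def best_match_average(query, disease, ic):
    """Symmetric average of the best match for each term (assumed form)."""
    def one_way(source, target):
        return sum(max(resnik(s, t, ic) for t in target) for s in source) / len(source)
    return 0.5 * (one_way(query, disease) + one_way(disease, query))


def jaccard(query, disease):
    """Set-based similarity on ancestor-expanded term sets."""
    a, b = closure(query), closure(disease)
    return len(a & b) / len(a | b)


ic = information_content(DISEASES)
bma_score = best_match_average({"A1"}, DISEASES["D1"], ic)
jaccard_score = jaccard({"A1"}, {"A2"})
```

In this sketch the best-match score depends on how specific the shared ancestor is, whereas the Jaccard score only counts overlapping terms. That difference is what "specificity-weighted" refers to.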
We find that Phenomizer performs better when diseases are ranked by similarity alone, without generating p-values. Additionally, non-normalized algorithms that use conditional information perform similarly to Phenomizer for differential diagnosis. In contrast, normalized algorithms perform best when clustering cohorts.
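Ranking by similarity alone can be illustrated with a short sketch. It scores every candidate disease against a query, sorts by score, and reports the position of the known diagnosis. The disease names, term labels, and function names are hypothetical, and a plain Jaccard score stands in for whichever similarity measure is being evaluated. Breaking ties alphabetically is an assumption made only to keep the output deterministic.

```python
def jaccard(a, b):
    """Plain set overlap, used here only as a stand-in similarity score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_diseases(query, diseases, score=jaccard):
    """Return (disease, score) pairs sorted from most to least similar."""
    scored = [(name, score(query, terms)) for name, terms in diseases.items()]
    # Sort by descending score; ties are broken by name (an assumption).
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


def rank_of(true_disease, ranking):
    """1-based position of the known diagnosis in the ranked list."""
    for position, (name, _) in enumerate(ranking, start=1):
        if name == true_disease:
            return position
    raise KeyError(true_disease)


# Hypothetical candidate diseases and a hypothetical patient query.
toy_diseases = {
    "D1": {"T1", "T2", "T3"},
    "D2": {"T1", "T4"},
    "D3": {"T5"},
}
toy_query = {"T1", "T2"}

ranking = rank_diseases(toy_query, toy_diseases)
true_rank = rank_of("D2", ranking)
```

Aggregating such ranks over many cases, for example as the fraction of cases whose true diagnosis falls in the top k, is one common way to compare algorithms. The metric actually used is not stated in this abstract.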
Availability
Data is available through the Phenopacket-Store (https://github.com/monarch-initiative/phenopacket-store). Algorithms are implemented in the Python package SetSim (https://github.com/P2GX/setsim).