Performance of Information Theory Derived Semantic Similarity Algorithms for Differential Diagnosis and Clustering

Ben Coleman
Daniel Danis
Justin Reese
Peter Robinson

Abstract

Semantic similarity analysis with the Human Phenotype Ontology (HPO) enables fuzzy, specificity-weighted comparisons of the clinical manifestations of individuals and diseases, and can be used to support differential diagnostics or to stratify cohorts. Many methods have been proposed to calculate semantic similarity for various applications. These include the Phenomizer, which calculates the average best match over all terms in the query and disease, and set-based methods ranging from the Jaccard Intersection to methods that leverage the conditional information content to calculate similarity. However, these methods have not been described under a single mathematical model or robustly compared using a comprehensive data set. Here, we describe several semantic similarity algorithms using derivations based on information theory, propose three of our own variations to these models, and compare the performance of each approach for differential diagnostic ranking and phenotypic clustering.
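To make the two families of methods named above concrete, the following is a minimal sketch and not the SetSim implementation. It assumes a hypothetical toy ontology (`PARENTS`) and made-up annotation frequencies (`FREQ`) rather than real HPO terms, defines information content as the negative log of a term's frequency, uses the information content of the most informative common ancestor as the term-to-term score, and takes a symmetric average of best matches as one common reading of "average best match". The set-based function is a plain Jaccard index over ancestor-closed term sets.

```python
import math

# Hypothetical toy ontology (child -> parents); these are not real HPO terms.
PARENTS = {
    "root": [],
    "A": ["root"],
    "B": ["root"],
    "A1": ["A"],
    "A2": ["A"],
    "B1": ["B"],
}

# Made-up frequencies: fraction of diseases annotated to a term or its descendants.
FREQ = {"root": 1.0, "A": 0.5, "B": 0.5, "A1": 0.25, "A2": 0.125, "B1": 0.25}


def ancestors(term):
    """Return the term together with all of its ancestors."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in PARENTS[stack.pop()]:
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


def ic(term):
    """Information content in bits: rarer terms are more specific."""
    return -math.log2(FREQ[term])


def term_similarity(t1, t2):
    """Information content of the most informative common ancestor."""
    return max(ic(t) for t in ancestors(t1) & ancestors(t2))


def best_match_average(query, disease):
    """Symmetric average of each term's best match in the other set (assumed variant)."""
    def one_sided(src, dst):
        return sum(max(term_similarity(s, d) for d in dst) for s in src) / len(src)
    return 0.5 * (one_sided(query, disease) + one_sided(disease, query))


def jaccard(query, disease):
    """Jaccard index over the ancestor-closed term sets."""
    q = set().union(*(ancestors(t) for t in query))
    d = set().union(*(ancestors(t) for t in disease))
    return len(q & d) / len(q | d)


if __name__ == "__main__":
    query, disease = ["A1", "B1"], ["A2"]
    print(best_match_average(query, disease))
    print(jaccard(query, disease))
```

The best-match score is unbounded and grows with term specificity, whereas the Jaccard index is normalized to the interval from 0 to 1. That is the kind of distinction the comparison between non-normalized and normalized algorithms turns on.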

We find that Phenomizer performs better when diseases are ranked by similarity alone, without generating p-values. Additionally, non-normalized algorithms that use conditional information perform similarly to Phenomizer for differential diagnosis. In contrast, normalized algorithms perform best when clustering cohorts.
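Ranking by similarity alone, as described above, amounts to scoring every candidate disease against the query and sorting by the raw score, with no p-value step. The sketch below illustrates this under stated assumptions: the disease names, term identifiers, and the `overlap` placeholder similarity are all hypothetical, and any similarity function with the same signature could be passed in instead.

```python
def rank_diseases(query, diseases, similarity):
    """Return (disease, score) pairs sorted from most to least similar."""
    scored = [(name, similarity(query, terms)) for name, terms in diseases.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


def overlap(query, terms):
    """Placeholder similarity: Jaccard index over the raw term sets."""
    q, t = set(query), set(terms)
    return len(q & t) / len(q | t)


# Hypothetical disease annotations; the identifiers are illustrative only.
DISEASES = {
    "disease_1": ["T1", "T2", "T3"],
    "disease_2": ["T2", "T4"],
    "disease_3": ["T5"],
}

ranking = rank_diseases(["T1", "T2"], DISEASES, overlap)

if __name__ == "__main__":
    for position, (name, score) in enumerate(ranking, start=1):
        print(position, name, round(score, 3))
```

The position of the known diagnosis in such a list is one natural way to compare algorithms for differential diagnosis.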

Availability

Data are available through the Phenopacket-Store (https://github.com/monarch-initiative/phenopacket-store). Algorithms are implemented in the Python package SetSim (https://github.com/P2GX/setsim).

Version published to 10.1101/2025.11.17.688933 on bioRxiv
Nov 18, 2025
