Tuning Knowledge Graph Embeddings in Clustering with LISE
Abstract
Background: Knowledge Graph Embeddings are increasingly used in biomedical informatics to support similarity assessment, clustering, and knowledge discovery. Despite strong performance in link prediction, recent studies show that numerical proximity in embedding spaces does not always reflect meaningful semantic similarity. LISE, a logic-based interactive similarity explainer, was introduced to expose shared semantic properties among clustered RDF resources and incorporate user feedback when evaluating cluster coherence. This work extends LISE by integrating Large Language Models for natural-language explanation and investigating whether user-derived relevance signals can actively influence embedding generation, improving the semantic adequacy of similarity-based clustering.
Results: We replaced LISE’s template-based verbalization component with a Gemini 2.5 Flash module capable of generating human-readable, path-level explanations of logical Common Subsumers. This resolves previous LISE limitations related to granularity and anaphora resolution, enabling reliable sentence-level user evaluations. To assess whether user preferences can guide the embedding process, we simulated feedback on 1,280 DrugBank-derived triples and evaluated two custom pyRDF2Vec sampling strategies: Predicate Relevance Weight and Predicate-Object Relevance Weight. Relevance weights were learned via a random forest regressor trained on simulated user scores. Predicate-level weighting increased the presence of user-preferred predicates in the most cohesive clusters, with all predicates showing positive or neutral deviation under learned weights. By contrast, predicate-object weighting exhibited limited sensitivity, with most pairs showing unchanged frequency regardless of weight assignment. Average deviation metrics confirm that predicate-level adjustments redirect clustering more effectively toward semantically meaningful biomedical information.
Conclusions: User-informed predicate weighting can successfully influence embedding-based clustering, improving alignment with semantically relevant biomedical properties. Predicate-object adjustments provide minimal benefit. Part of this research has been published in the proceedings of the 8th Workshop on Semantic Web Solutions for Large-scale Biomedical Data Analytics (SeWeBMeDA 2025).
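To make the idea of predicate-level weighting more concrete, the following is a minimal, standard-library-only sketch of a random walk in which the next edge is chosen with probability proportional to a per-predicate relevance weight. It is an illustration under stated assumptions, not the authors' pyRDF2Vec sampler: the toy triples, the weight values, and the helper names `build_index` and `weighted_walk` are all hypothetical, and the weights are set by hand here rather than learned by a random forest regressor.

```python
import random

# Hypothetical toy triples (subject, predicate, object); not taken from DrugBank.
TRIPLES = [
    ("drug:A", "hasTarget", "protein:P1"),
    ("drug:A", "hasBrand", "brand:X"),
    ("protein:P1", "inPathway", "pathway:W"),
    ("brand:X", "soldBy", "company:C"),
]


def build_index(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    index = {}
    for s, p, o in triples:
        index.setdefault(s, []).append((p, o))
    return index


def weighted_walk(index, start, weights, max_hops, rng, default=1.0):
    """Walk from `start`, picking each edge in proportion to its predicate weight.

    `weights` maps predicate -> relevance weight (assumed non-negative);
    predicates missing from `weights` fall back to `default`.
    """
    walk = [start]
    node = start
    for _ in range(max_hops):
        edges = index.get(node)
        if not edges:
            break  # dead end: no outgoing edges
        edge_weights = [max(weights.get(p, default), 0.0) for p, _ in edges]
        if sum(edge_weights) == 0:
            break  # every outgoing predicate is weighted out
        p, o = rng.choices(edges, weights=edge_weights, k=1)[0]
        walk.extend([p, o])
        node = o
    return walk


if __name__ == "__main__":
    rng = random.Random(42)
    index = build_index(TRIPLES)
    # Hand-picked weights standing in for learned relevance scores.
    weights = {"hasTarget": 0.9, "hasBrand": 0.1}
    for _ in range(3):
        print(weighted_walk(index, "drug:A", weights, max_hops=2, rng=rng))
```

With a high weight on `hasTarget` and a low weight on `hasBrand`, most walks starting at `drug:A` pass through the target edge, so a skip-gram model trained on those walks would see the preferred predicate more often. A predicate-object variant would key the weights on `(predicate, object)` pairs instead of on the predicate alone.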