Integrating Clustering and Semantic Similarity for MAUDE Database Dimensionality Reduction

Lei Hua


Abstract

Objective

To develop and evaluate an automated methodology for dimensionality reduction of the FDA’s MAUDE database through schema matching and merging.

Methods

We conducted 96 trials that integrated clustering algorithms with semantic similarity evaluations performed through the DeepSeek V2.5 API, in order to identify and merge semantically similar tables. Features were extracted using Term Frequency-Inverse Document Frequency (TF-IDF) vectorization and Sentence Transformer embeddings. The methodology was assessed against manual groupings provided by domain experts using the Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), precision, recall, and F1 score. Three similarity thresholds (0.7, 0.8, 0.9) were applied to evaluate their impact on table merging performance.
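The cluster-then-compare idea described above can be illustrated with a minimal sketch. It is not the authors' implementation: the table names, column sets, and cluster labels are hypothetical, a simple Jaccard overlap of column names stands in for the DeepSeek V2.5 similarity score, and the pairwise definition of F1 is an assumption, since the abstract does not state how F1 was computed.

```python
from itertools import combinations
from typing import Callable, Dict, List, Set, Tuple

Schema = Set[str]


def jaccard(a: Schema, b: Schema) -> float:
    """Stand-in similarity (column-name overlap); NOT the LLM-based score."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def merge_within_clusters(
    schemas: Dict[str, Schema],
    cluster_of: Dict[str, int],
    similarity: Callable[[Schema, Schema], float],
    threshold: float,
) -> Tuple[List[Set[str]], int]:
    """Merge tables whose within-cluster similarity meets the threshold.

    Returns the merged groups and the number of similarity evaluations made.
    """
    parent = {name: name for name in schemas}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    evaluations = 0
    for a, b in combinations(sorted(schemas), 2):
        if cluster_of[a] != cluster_of[b]:
            continue  # cross-cluster pairs are never sent for evaluation
        evaluations += 1
        if similarity(schemas[a], schemas[b]) >= threshold:
            parent[find(a)] = find(b)  # union the two groups

    groups: Dict[str, Set[str]] = {}
    for name in schemas:
        groups.setdefault(find(name), set()).add(name)
    return sorted(groups.values(), key=min), evaluations


def pairwise_f1(predicted: List[Set[str]], expert: List[Set[str]]) -> float:
    """F1 over table pairs placed in the same group (assumed definition)."""
    def pairs(groups: List[Set[str]]) -> Set[frozenset]:
        return {frozenset(p) for g in groups for p in combinations(sorted(g), 2)}

    pred, gold = pairs(predicted), pairs(expert)
    tp = len(pred & gold)
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Hypothetical schemas and cluster labels, for illustration only.
schemas = {
    "device_2019": {"report_id", "brand_name", "model_number"},
    "device_2020": {"report_id", "brand_name", "model_number"},
    "patient_2019": {"report_id", "patient_age", "outcome"},
    "patient_2020": {"report_id", "patient_age", "outcome"},
    "text_2020": {"report_id", "narrative"},
}
cluster_of = {
    "device_2019": 0, "device_2020": 0,
    "patient_2019": 1, "patient_2020": 1, "text_2020": 1,
}
expert_groups = [
    {"device_2019", "device_2020"},
    {"patient_2019", "patient_2020"},
    {"text_2020"},
]

groups, evaluations = merge_within_clusters(schemas, cluster_of, jaccard, threshold=0.8)
f1 = pairwise_f1(groups, expert_groups)
print(len(groups), evaluations, f1)
```

In this toy setting only 4 of the 10 possible table pairs are evaluated, because pairs that fall in different clusters are skipped. Raising or lowering `threshold` changes which evaluated pairs are merged, which is the effect the 0.7, 0.8, and 0.9 settings are meant to probe.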

Results

Integrating clustering with semantic similarity evaluations raised the F1 score from 0.51 (clustering alone) to 1.00 while using fewer than 1,425 API similarity evaluations. As a result, the 113 tables were compressed into 13–16 table groups, a reduction of 86% to 89%. Clustering also decreased the number of table pair comparisons by 77% to 83%. Sentence Transformer embeddings outperformed TF-IDF vectorization in clustering performance: in clustering-only scenarios, F1 scores ranged from approximately 0.51 to 0.87 with TF-IDF and from 0.51 to 0.95 with Sentence Transformer embeddings. DeepSeek V2.5 demonstrated the potential to match and quantify subtle semantic differences across the similarity thresholds tested, maintaining high merging accuracy with F1 scores of up to 1.00.
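The reported percentages can be checked with a short calculation. The sketch below assumes that the baseline for pair comparisons is every unordered pair among the 113 tables, which the abstract does not state explicitly; the helper name `fraction_removed` is introduced here only for illustration.

```python
from math import comb


def fraction_removed(before: int, after: int) -> float:
    """Share of the original count that was eliminated."""
    return (before - after) / before


n_tables = 113

# Table reduction for the two ends of the reported 13-16 group range.
reduction_low = fraction_removed(n_tables, 16)
reduction_high = fraction_removed(n_tables, 13)

# Assumed baseline: all unordered table pairs, i.e. C(113, 2).
all_pairs = comb(n_tables, 2)

# Share of pair comparisons avoided if 1,425 evaluations are made.
avoided = fraction_removed(all_pairs, 1425)

print(f"table reduction: {reduction_low:.1%} to {reduction_high:.1%}")
print(f"all pairs: {all_pairs}, comparisons avoided: {avoided:.1%}")
```

Under that assumption there are 6,328 candidate pairs, so staying below 1,425 evaluations corresponds to avoiding roughly 77.5% of them, close to the lower end of the reported 77% to 83% range. The table reduction works out to about 85.8% and 88.5%, close to the reported 86% to 89%.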

Conclusion

The proposed automated dimensionality reduction methodology effectively enhances data quality and analysis efficiency within the MAUDE database. By reducing the number of tables to manageable groups, optimizing context lengths, and leveraging DeepSeek V2.5’s semantic matching capabilities, the framework streamlines data processing and ensures compatibility with advanced analytical tools such as large language models (LLMs). This makes the methodology applicable across various industries, facilitating more efficient and accurate data analysis workflows.

Version published to 10.1101/2024.12.03.24318439v1 on medRxiv
Dec 5, 2024

Related articles

Using semantic search to find publicly available gene-expression datasets

This article has 11 authors:
1. Grace S. Brown
2. James Wengler
3. Aaron Joyce S. Fabelico
4. Abigail Muir
5. Anna Tubbs
6. Amanda Warren
7. Alexandra N. Millett
8. Xinrui Xiang Yu
9. Paul Pavlidis
10. Sanja Rogic
11. Stephen R. Piccolo
This article has no evaluations. Latest version: Mar 15, 2025

Extraction of biological terms using large language models enhances the usability of metadata in the BioSample database

This article has 8 authors:
1. Shuya Ikeda
2. Zhaonan Zou
3. Hidemasa Bono
4. Yuki Moriya
5. Shuichi Kawashima
6. Toshiaki Katayama
7. Shinya Oki
8. Tazro Ohta
This article has no evaluations. Latest version: Feb 22, 2025

Transformer-based Ranking Approaches for Keyword Queries over Relational Databases

This article has 4 authors:
1. Paulo Martins
2. Altigran da Silva
3. Johny Moreira
4. Edleno de Moura
This article has no evaluations. Latest version: Mar 25, 2025
