A Methodology to Extract Knowledge from Datasets Using ML

Ricardo Sánchez de Madariaga
Mario Pascual Carrasco
Adolfo Muñoz Carrero

Abstract

This study examines whether the classification outputs produced by different ML algorithms are related to the relevance of the data they classify, with a view to the problem of knowledge extraction (KE) from datasets. If such a relationship exists, the main objective of this research is to exploit it to improve KE from datasets. A new dataset generation methodology and a new ML classification measurement methodology were developed to check whether the feature subsets (FSs) best classified by a specific ML algorithm correspond to the most KE-relevant combinations of features. Medical expertise was obtained from two LLMs, ChatGPT and Google Gemini, to check knowledge relevance. For a working dataset drawn from a given probability distribution, some ML algorithms fit much better than others, and they best classify the FSs that contain particularly knowledge-relevant combinations of features. This implies that a specific ML algorithm can indeed be used to extract useful scientific knowledge. The best-fitting ML algorithm is not known a priori; however, its identity can be bootstrapped using a small amount of medical expertise, which yields a powerful tool for extracting (medical) knowledge from datasets using ML.
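
The idea of ranking feature subsets by how well each algorithm classifies them can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the authors' methodology: the synthetic data, the two toy classifiers (`nearest_centroid`, `one_nn`), the fixed subset size, the single hold-out split, and the helper name `rank_feature_subsets` are all hypothetical.

```python
import itertools
import random


def make_data(n=200, seed=0):
    """Hypothetical data: features 0 and 1 carry signal, 2 and 3 are noise."""
    rng = random.Random(seed)
    X, y = [], []
    for _ in range(n):
        label = rng.randint(0, 1)
        X.append([
            rng.gauss(2.0 * label, 1.0),   # informative
            rng.gauss(-2.0 * label, 1.0),  # informative
            rng.gauss(0.0, 1.0),           # noise
            rng.gauss(0.0, 1.0),           # noise
        ])
        y.append(label)
    return X, y


def project(X, cols):
    """Keep only the columns that belong to one feature subset."""
    return [[row[c] for c in cols] for row in X]


def sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def nearest_centroid(X_train, y_train, X_test):
    """Toy classifier 1: assign each point to the closest class mean."""
    centroids = {}
    for label in set(y_train):
        rows = [x for x, t in zip(X_train, y_train) if t == label]
        centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return [min(centroids, key=lambda c: sq_dist(x, centroids[c]))
            for x in X_test]


def one_nn(X_train, y_train, X_test):
    """Toy classifier 2: copy the label of the single nearest training point."""
    preds = []
    for x in X_test:
        i = min(range(len(X_train)), key=lambda j: sq_dist(x, X_train[j]))
        preds.append(y_train[i])
    return preds


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def rank_feature_subsets(X, y, algorithms, subset_size=2, split=0.7):
    """For each algorithm, score every feature subset on a hold-out split
    and return the subsets sorted from best to worst accuracy."""
    cut = int(len(X) * split)
    n_features = len(X[0])
    results = {}
    for name, algo in algorithms.items():
        scores = []
        for cols in itertools.combinations(range(n_features), subset_size):
            Xs = project(X, cols)
            preds = algo(Xs[:cut], y[:cut], Xs[cut:])
            scores.append((accuracy(y[cut:], preds), cols))
        results[name] = sorted(scores, reverse=True)
    return results


X, y = make_data()
ranking = rank_feature_subsets(
    X, y, {"nearest_centroid": nearest_centroid, "one_nn": one_nn}
)
for name, scores in ranking.items():
    print(name, scores[:3])  # top three (accuracy, feature subset) pairs
```

In this sketch, the per-algorithm rankings are what one would then compare against an external judgment of which feature combinations are knowledge relevant; how that comparison is made in the study is not specified in the abstract.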

Version published to 10.20944/preprints202504.0870.v1
Apr 10, 2025

Related articles

Optimizing Seminal Quality Prediction Using Machine Learning with Data Preprocessing and Feature Selection

This article has 6 authors:
1. Aamir Farooq
2. Zhengrong Xiang
3. Musaed Alhussein
4. Muhammad Shahzad
5. Muhammad Farhan
6. Khursheed Aurangzeb
This article has no evaluations. Latest version: Apr 9, 2025

Improving the Robustness of Large Language Models in Extracting Social Determinants of Health

This article has 2 authors:
1. Jiashu Chen
2. Chase Simmons
This article has no evaluations. Latest version: Mar 24, 2025

Active learning pipeline to automatically identify candidate terms for a CDSS ontology—measures, experiments, and performance

This article has 17 authors:
1. Shailesh Alluri
2. Keerthana Komatineni
3. Rohan Goli
4. Nina Hubig
5. Hua Min
6. Yang Gong
7. Dean F. Sittig
8. David Robinson
9. Paul Biondich
10. Adam Wright
11. Christian Nøhr
12. Timothy Law
13. Arild Faxvaag
14. Richard D. Boyce
15. Ronald Gimbel
16. Lior Rennert
17. Xia Jing
This article has no evaluations. Latest version: Apr 17, 2025
