Dataset Evaluation and Validation for Trustworthy AI

Abstract

This paper presents a thorough approach to dataset evaluation and validation as the initial step in developing Trustworthy Artificial Intelligence (AI) systems. It emphasizes that data quality is key to the integrity and reliability of any AI model. The work offers a practical, step-by-step tutorial on the technical methods needed for rigorous data validation, using a real-world Air Quality dataset. It covers preliminary data checks; treatment of structural and content-based anomalies, including wrong delimiters and non-standard missing-value indicators; thorough Exploratory Data Analysis (EDA); and careful outlier detection and mitigation. To improve the quality of the dataset, preprocessing methods including missing-value imputation, normalization, encoding, and data-type correction are applied. These measures help ensure uniformity and precision, which form a sound basis for downstream machine learning tasks. Moreover, the paper shows that a well-validated dataset can make a substantial difference in model performance by training and testing regression models (Linear Regression, Ridge Regression, and Lasso Regression) to predict important chemical pollutants. The model that predicts Benzene (C6H6) concentration achieves a high R² score of about 0.85, whereas the model for Carbon Monoxide (CO) achieves a lower R² of 0.63, which highlights the importance of target-variable correlation and data quality in model success. The paper argues that the validation step should not be skipped or underestimated, since doing so may result in biased findings, untrustworthy AI behavior, and ethical issues in deployment. The systematic process, from data ingestion to modeling, is replicable and supports the development of fair, transparent, and reliable AI systems. Ultimately, dataset validation is not merely a preliminary stage but a milestone on the way to building Trustworthy AI.
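
The validation steps named in the abstract (handling an unexpected delimiter, converting a non-standard missing-value indicator, imputing, and normalizing) can be illustrated with a minimal sketch. This is not the authors' code. The sample rows, the semicolon delimiter, the decimal commas, the `-200` sentinel, and all function names below are assumptions chosen for illustration, since the abstract does not state which conventions the dataset uses.

```python
import csv
import io
import statistics

# Hypothetical sample rows (not taken from the paper's dataset).
# Assumed quirks: ';' as delimiter, ',' as decimal mark, -200 as "missing".
RAW = """CO;C6H6;T
2,6;11,9;13,6
-200;9,4;13,3
2,2;-200;11,9
1,6;6,5;11,0
"""

MISSING_SENTINEL = -200.0  # assumed non-standard missing-value indicator


def load_columns(text, delimiter=";"):
    """Parse delimited text into {column: [float or None, ...]}."""
    reader = csv.DictReader(io.StringIO(text), delimiter=delimiter)
    columns = {name: [] for name in reader.fieldnames}
    for row in reader:
        for name, cell in row.items():
            value = float(cell.replace(",", "."))  # decimal comma -> point
            columns[name].append(None if value == MISSING_SENTINEL else value)
    return columns


def impute_median(values):
    """Replace None entries with the median of the observed values."""
    observed = [v for v in values if v is not None]
    median = statistics.median(observed)
    return [median if v is None else v for v in values]


def min_max_scale(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


columns = load_columns(RAW)
# Count missing entries per column before any imputation.
missing_counts = {
    name: sum(v is None for v in vals) for name, vals in columns.items()
}
imputed = {name: impute_median(vals) for name, vals in columns.items()}
cleaned = {name: min_max_scale(vals) for name, vals in imputed.items()}
```

Counting missing values before imputing keeps a record of how much of each column was filled in, which is the kind of check a validation report would normally include. Median imputation and min-max scaling are only one possible choice here; the abstract does not say which imputation or normalization methods were used.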
