A Comparative Evaluation of Outlier Detection in Categorical and Mixed Data

Abstract

Outlier detection is essential in domains such as cybersecurity and fraud detection. However, identifying the best way to detect outliers is often a challenge. Although most algorithms are designed for numerical data, many real-world datasets contain categorical attributes or a mixture of categorical and numerical ones. Given a dataset with one or more categorical attributes, how should the outliers be detected? This survey evaluates three potential solutions: (1) applying algorithms that can process categorical data directly, (2) converting categorical attributes into numerical ones before detection, and (3) removing categorical attributes so that only the numerical ones are considered. We performed experiments using 47 datasets and 14 detection algorithms, and found that Solution (1) is usually preferred, especially with the detection algorithm CBRW. However, Solution (2) with detection algorithms such as iForest and KNN-outlier achieves better results in certain contexts, depending on the data characteristics. Based on these findings, we also introduce a predictive model that achieves 80% accuracy in identifying which of the three solutions is best for a new dataset. Additionally, we compared approaches for converting categorical attributes into numerical ones, and showed that the Correspondence Analysis data-conversion method often yields the best results. This survey provides comparative insights, methodological guidance, and predictive support for outlier detection in categorical and mixed data.
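To make the three solutions concrete, here is a minimal, standard-library-only sketch on a toy mixed dataset. Everything in it is an illustrative assumption rather than the survey's setup: the data are invented, the value-frequency score is a simple stand-in for a categorical detector (it is not CBRW), one-hot encoding stands in for the conversion step (it is not Correspondence Analysis), and the k-th-nearest-neighbour distance is a basic distance-based score that may differ from the KNN-outlier variant evaluated in the survey.

```python
import math
from collections import Counter

# Hypothetical toy data: (protocol, service, normalised_bytes).
# The last row has a rare protocol but an ordinary numerical value.
rows = [
    ("tcp", "web", 1.0), ("tcp", "web", 1.1), ("tcp", "web", 0.9),
    ("tcp", "web", 1.2), ("tcp", "web", 1.05),
    ("udp", "dns", 1.0), ("udp", "dns", 1.1), ("udp", "dns", 0.9),
    ("icmp", "dns", 1.0),
]
CAT_IDX = [0, 1]  # positions of categorical attributes
NUM_IDX = [2]     # positions of numerical attributes


def frequency_scores(rows, cat_idx):
    """Solution (1) stand-in: mean negative log relative frequency
    of each categorical value (rarer values give higher scores)."""
    n = len(rows)
    counts = {j: Counter(r[j] for r in rows) for j in cat_idx}
    return [
        sum(-math.log(counts[j][r[j]] / n) for j in cat_idx) / len(cat_idx)
        for r in rows
    ]


def one_hot(rows, cat_idx, num_idx):
    """Solution (2): turn each categorical attribute into 0/1 columns
    and keep the numerical attributes alongside them."""
    levels = {j: sorted({r[j] for r in rows}) for j in cat_idx}
    vectors = []
    for r in rows:
        vec = [float(r[j]) for j in num_idx]
        for j in cat_idx:
            vec.extend(1.0 if r[j] == v else 0.0 for v in levels[j])
        vectors.append(vec)
    return vectors


def numeric_only(rows, num_idx):
    """Solution (3): drop the categorical attributes entirely."""
    return [[float(r[j]) for j in num_idx] for r in rows]


def knn_scores(vectors, k=2):
    """Outlier score = Euclidean distance to the k-th nearest neighbour."""
    scores = []
    for i, a in enumerate(vectors):
        dists = sorted(math.dist(a, b) for j, b in enumerate(vectors) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outlier(scores):
    """Index of the row with the highest outlier score."""
    return max(range(len(scores)), key=scores.__getitem__)


direct = frequency_scores(rows, CAT_IDX)                 # Solution (1)
converted = knn_scores(one_hot(rows, CAT_IDX, NUM_IDX))  # Solution (2)
dropped = knn_scores(numeric_only(rows, NUM_IDX))        # Solution (3)

for name, scores in [("direct", direct), ("converted", converted), ("dropped", dropped)]:
    print(name, "-> top outlier row:", rows[top_outlier(scores)])
```

In this toy case the rare-protocol row is ranked first under Solutions (1) and (2), while Solution (3) cannot see it because its numerical value is unremarkable. That is only an illustration of why discarding categorical attributes can lose information; it says nothing about how the strategies compare on real datasets.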
