A Comparative Evaluation of Outlier Detection in Categorical and Mixed Data

Felippe Pires Ferreira
Robson Leonardo Ferreira Cordeiro

Abstract

Outlier detection is essential in domains such as cybersecurity and fraud detection. However, identifying the best way to detect outliers is often a challenge. Although most algorithms are designed for numerical data, many real-world datasets contain categorical attributes or a mixture of categorical and numerical ones. Given a dataset with one or more categorical attributes, how should the outliers be detected? This survey evaluates three potential solutions: (1) applying algorithms that can process categorical data directly, (2) converting categorical attributes into numerical ones before detection, and (3) removing categorical attributes so that only the numerical ones are considered during detection. We performed experiments using 47 datasets and 14 detection algorithms, and demonstrated that Solution (1) is usually preferred, especially when employing the detection algorithm CBRW. However, Solution (2) with detection algorithms such as iForest and KNN-outlier achieves better results in certain contexts, depending on the data characteristics. Based on these findings, we also introduce a predictive model that achieves 80% accuracy in identifying, among the three solutions studied, the best strategy for processing a new dataset. Additionally, we compared approaches for converting categorical attributes into numerical ones, and showed that the Correspondence Analysis data-conversion method often yields the best results. This survey provides comparative insights, methodological guidance, and predictive support for outlier detection in categorical and mixed data.
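
To make the three solutions concrete, the following is a minimal sketch on a toy mixed dataset. Everything in it is an assumption made for illustration, not the authors' method: the data and column meanings are hypothetical, Solution (1) is stood in for by a simple value-frequency score (not CBRW), Solution (2) uses one-hot encoding (not Correspondence Analysis), and the numerical detector is a plain k-th-nearest-neighbor distance score.

```python
from collections import Counter
import math

# Hypothetical toy data: (protocol, port_class, bytes_kb). Not from the paper.
rows = [
    ("tcp", "web", 1.0),
    ("tcp", "web", 1.2),
    ("tcp", "web", 0.9),
    ("udp", "dns", 1.1),
    ("udp", "dns", 1.0),
    ("tcp", "web", 1.1),
    ("icmp", "other", 9.5),  # the row intended to look anomalous
]
CAT_IDX = [0, 1]  # positions of the categorical attributes
NUM_IDX = [2]     # positions of the numerical attributes


def frequency_scores(rows, cat_idx):
    """Solution (1) stand-in: rare categorical values give a high score.

    Score = mean negative log relative frequency of the row's categorical
    values. This is a simple illustrative scorer, not CBRW.
    """
    n = len(rows)
    counts = {j: Counter(r[j] for r in rows) for j in cat_idx}
    return [
        sum(-math.log(counts[j][r[j]] / n) for j in cat_idx) / len(cat_idx)
        for r in rows
    ]


def one_hot(rows, cat_idx, num_idx):
    """Solution (2): convert categorical attributes to 0/1 indicator columns."""
    levels = {j: sorted({r[j] for r in rows}) for j in cat_idx}
    vectors = []
    for r in rows:
        vec = [float(r[j]) for j in num_idx]
        for j in cat_idx:
            vec.extend(1.0 if r[j] == level else 0.0 for level in levels[j])
        vectors.append(vec)
    return vectors


def drop_categorical(rows, num_idx):
    """Solution (3): keep only the numerical attributes."""
    return [[float(r[j]) for j in num_idx] for r in rows]


def knn_scores(vectors, k=2):
    """Numerical detector: distance to the k-th nearest neighbor."""
    scores = []
    for i, v in enumerate(vectors):
        dists = sorted(math.dist(v, w) for j, w in enumerate(vectors) if j != i)
        scores.append(dists[k - 1])
    return scores


def top_outlier(scores):
    """Index of the row with the highest outlier score."""
    return max(range(len(scores)), key=scores.__getitem__)


s1 = frequency_scores(rows, CAT_IDX)                    # Solution (1)
s2 = knn_scores(one_hot(rows, CAT_IDX, NUM_IDX))        # Solution (2)
s3 = knn_scores(drop_categorical(rows, NUM_IDX))        # Solution (3)
print(top_outlier(s1), top_outlier(s2), top_outlier(s3))
```

On this toy data all three strategies rank the same row highest because it is unusual in both its categorical and numerical attributes. The abstract's point is that on real datasets the strategies can disagree, which is why the choice among them matters.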

Version published to 10.21203/rs.3.rs-8875243/v1 on Research Square
Mar 2, 2026

Related articles

Classification of Abnormal Patterns in Traffic Analysis Based on a Fusion Approach

This article has 5 authors:
1. Fathi E. Abd El-Samie
2. Nabil A. Ismail
3. Adel S. El-Fishawy
4. Khalil F. Ramadan
5. Hesham M. AbdelZaher
This article has no evaluations. Latest version: Feb 24, 2026

Comparative Evaluation of Machine Learning Models with Different Data Balancing Techniques for DDoS Attack Detection

This article has 3 authors:
1. Dipok Deb
2. Hansapani Rodrigo
3. Sanjeev Kumar
This article has no evaluations. Latest version: Feb 18, 2026

A Comprehensive Survey on Clustering Algorithms: Concepts, Taxonomy with Nature-Inspired Meta-Heuristic Approaches and Performance Metrics

This article has 2 authors:
1. Yuvaraj M
2. Sivaprakash S
This article has no evaluations. Latest version: Mar 6, 2026
