Evaluation of Datasets for Outlier Detection


Abstract

Outlier detection is a common task in various fields of knowledge. When developing and validating new detection algorithms, one must use suitable datasets containing known anomalous instances. Despite the large number of datasets available, they are usually derived from classification problems that were later converted into outlier detection problems through a process that is prone to generating errors. The objective of our work is not to evaluate detection algorithms, but to evaluate the datasets used to develop and validate them. This research proposes a new methodology to assess the quality of datasets used in anomaly detection, which made it possible to identify problems in several well-known datasets. The proposed methodology allowed us to evaluate 59 datasets with distinct characteristics, assess the labels assigned to instances, and compare inliers and outliers. During the methodology's evaluation, we used 22 detection algorithms on the selected datasets. By leveraging our new methodology, we identified datasets often used in outlier detection tasks that present questionable ground truth, with instances that do not behave as expected, and a potential negative impact on any algorithm validated on this data.
