Effective Hybrid Sampling Approach for Evaluating Classification Performance
Abstract
To evaluate the classification performance of an algorithm, the original dataset must be partitioned into training and test subsets. A classification model is constructed on the training set, and the test set is then used to estimate its accuracy. However, a reliable estimate typically requires many rounds of train/test sampling, model construction, and accuracy evaluation, with the results averaged, a process that is computationally expensive and time-consuming. To address this issue, we propose an effective sampling approach that selects training and test sets whose evaluation outcome closely approximates the result of this repeated sampling and evaluation process. Our approach ensures that the sampled data closely reflect the classification performance on the original dataset. Specifically, we introduce several techniques for measuring the similarity of data distributions and incorporate feature weighting into the similarity computation, allowing us to select the training and test sets that best preserve the distributional characteristics of the original dataset.
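The abstract does not specify which similarity measures or feature weights are used, so the following is only a minimal sketch of the general idea: draw several candidate splits, score how well each split's feature distributions match the full dataset (here, a weighted per-feature Kolmogorov–Smirnov similarity, an assumption on our part), and keep the best-scoring split. The function names, the KS-based measure, and the `weights` parameter are illustrative, not the authors' actual method.

```python
import numpy as np
from scipy.stats import ks_2samp
from sklearn.model_selection import train_test_split

def split_similarity(X_full, X_subset, weights):
    """Weighted mean of per-feature similarities, where each feature's
    similarity is 1 minus the two-sample KS statistic (assumed measure)."""
    sims = [1.0 - ks_2samp(X_full[:, j], X_subset[:, j]).statistic
            for j in range(X_full.shape[1])]
    return float(np.average(sims, weights=weights))

def select_best_split(X, y, weights, n_candidates=50, test_size=0.3, seed=0):
    """Draw several candidate train/test splits and keep the one whose
    training and test parts jointly best preserve the original feature
    distributions (scored by the weaker of the two similarities)."""
    rng = np.random.RandomState(seed)
    best, best_score = None, -np.inf
    for _ in range(n_candidates):
        X_tr, X_te, y_tr, y_te = train_test_split(
            X, y, test_size=test_size, stratify=y,
            random_state=rng.randint(2**31 - 1))
        score = min(split_similarity(X, X_tr, weights),
                    split_similarity(X, X_te, weights))
        if score > best_score:
            best, best_score = (X_tr, X_te, y_tr, y_te), score
    return best, best_score
```

Evaluating a classifier once on a split selected this way is intended to approximate the average accuracy obtained from many random splits, at a fraction of the cost; the feature weights let more influential features count more heavily in the distribution comparison.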