Semi-supervised attribute selection for partially labeled multiset-valued data

Abstract

In machine learning, learning from data in which only part of the samples are labeled is termed semi-supervised learning. A dataset with missing attribute values or labels is referred to as an incomplete information system. Addressing incomplete information within a system poses a significant challenge, which can be effectively tackled through the application of rough set theory (R-theory). However, R-theory has its limits: it fails to consider the frequency of an attribute value and therefore cannot fit the distribution of attribute values well. If we consider partially labeled data and replace each missing attribute value with the multiset of all possible attribute values under the same attribute, we obtain partially labeled multiset-valued data. In semi-supervised learning, a large number of redundant features need to be deleted to save time and cost. This paper studies semi-supervised attribute selection (ss-attribute selection) for partially labeled multiset-valued data. Initially, a partially labeled multiset-valued decision information system (p-MSVDIS) is partitioned into two distinct systems: a labeled multiset-valued decision information system (l-MSVDIS) and an unlabeled multiset-valued decision information system (u-MSVDIS). Subsequently, using the indistinguishable relation, the distinguishable relation, and the dependence function, two types of attribute subset importance in a p-MSVDIS are defined. Each is a weighted sum of the importance in the l-MSVDIS and in the u-MSVDIS, with weights determined by the missing rate of labels, and can be regarded as an uncertainty measurement (UM) of a p-MSVDIS. Next, an adaptive ss-attribute selection algorithm for a p-MSVDIS is introduced; it leverages these degrees of importance and adapts automatically to diverse missing rates. Finally, experiments and statistical analysis on 10 datasets show that the proposed algorithm has advantages over some existing algorithms.
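
The data preparation steps described in the abstract can be illustrated with a minimal Python sketch. It is not the paper's algorithm: the function names are hypothetical, the multiset for a missing value is assumed to be the bag of values observed in that column (with their frequencies), and the weighting `(1 - rate) * labeled + rate * unlabeled` is only one plausible reading of "determined by the missing rate of labels". The importance scores themselves (based on the indistinguishable relation, distinguishable relation, and dependence function) are not reproduced here and are passed in as plain numbers.

```python
from collections import Counter


def to_multiset_table(rows, missing=None):
    """Turn an incomplete table into a multiset-valued one.

    Assumption: a missing cell is replaced by the multiset (Counter) of all
    values observed in the same column; a known cell becomes a singleton.
    """
    n_attrs = len(rows[0])
    column_bags = [
        Counter(r[j] for r in rows if r[j] is not missing)
        for j in range(n_attrs)
    ]
    return [
        [column_bags[j] if r[j] is missing else Counter([r[j]])
         for j in range(n_attrs)]
        for r in rows
    ]


def split_by_label(table, labels, missing=None):
    """Partition the samples into a labeled part and an unlabeled part."""
    labeled = [(x, y) for x, y in zip(table, labels) if y is not missing]
    unlabeled = [x for x, y in zip(table, labels) if y is missing]
    return labeled, unlabeled


def label_missing_rate(labels, missing=None):
    """Fraction of samples whose label is missing."""
    return sum(y is missing for y in labels) / len(labels)


def combined_importance(imp_labeled, imp_unlabeled, rate):
    """Hypothetical weighting: trust the unlabeled score more as more
    labels are missing. The paper's exact weights may differ."""
    return (1 - rate) * imp_labeled + rate * imp_unlabeled
```

For example, with rows `[["a", 1], [None, 2], ["a", None], ["b", 1]]` and labels `["yes", None, "no", None]`, the missing cell in the first column becomes the multiset with "a" twice and "b" once, and the label missing rate is 0.5, so both importance scores would be weighted equally under this assumed scheme.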
