Modeling the structure of high-dimensional data using a multivariate Bernoulli distribution
Abstract
Recent experiments have shown that several high-dimensional datasets have a high probability of forming binary clusters when the data are projected onto a randomly chosen one-dimensional subspace. This paper presents a probability model for high-dimensional data that could explain this phenomenon. The model consists of several Bernoulli random variables that describe a parallelotope frame. The parallelotope forms a skeleton to which noise can be added. While clusters in high dimensions are difficult to observe or may not exist at all, the groupings of sample points drawn from such a distribution can be easily identified. Such groupings can be used in place of clustering, especially when the dataset is small, and the statistical significance of these groups can be tested. The existence of such structure in the underlying probability model of a dataset allows for semantic grouping of data points based on different criteria. More generally, it provides a simple binary representation for data points, in which each bit represents the group membership for one criterion, and which could potentially be used for data compression.
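As an illustration only, the following is a minimal sketch of the kind of model the abstract describes, not the paper's own implementation. It assumes independent Bernoulli variables, Gaussian noise, and a parallelotope frame (origin and edge vectors) that is already known; the function names `sample_parallelotope` and `decode_bits` and all parameter values are hypothetical. Each point is placed at a vertex of the parallelotope selected by its bits, noise is added, and the bits are then read back as a binary code.

```python
import numpy as np


def sample_parallelotope(n, origin, edges, probs, noise_std, rng):
    """Draw n points of the form origin + sum_k b_k * edges[k] + noise.

    Each b_k is an independent Bernoulli(probs[k]) variable (an assumption of
    this sketch), so the noiseless points sit on parallelotope vertices.
    """
    bits = (rng.random((n, len(probs))) < probs).astype(int)
    noise = noise_std * rng.standard_normal((n, edges.shape[1]))
    points = origin + bits @ edges + noise
    return points, bits


def decode_bits(points, origin, edges):
    """Recover the binary code for each point, assuming the frame is known.

    Least-squares coordinates in the edge basis are thresholded at 0.5.
    """
    coords, *_ = np.linalg.lstsq(edges.T, (points - origin).T, rcond=None)
    return (coords.T > 0.5).astype(int)


rng = np.random.default_rng(0)
dim, k, n = 50, 3, 500                       # illustrative sizes
edges = 5.0 * rng.standard_normal((k, dim))  # hypothetical edge vectors
origin = rng.standard_normal(dim)            # hypothetical base vertex
probs = np.array([0.5, 0.3, 0.7])            # hypothetical Bernoulli parameters

points, bits = sample_parallelotope(n, origin, edges, probs, 0.1, rng)
recovered = decode_bits(points, origin, edges)

# One-dimensional projection onto a random unit direction, as in the
# experiments the abstract mentions.
direction = rng.standard_normal(dim)
direction /= np.linalg.norm(direction)
projection = points @ direction
```

Here each row of `recovered` is a k-bit code, one bit per grouping criterion, which is the binary representation the abstract refers to. In practice the frame would have to be estimated from data; this sketch sidesteps that step by reusing the true `origin` and `edges`.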