Modeling the structure of high-dimensional data using a multivariate Bernoulli distribution

Mireille Boutin
Evzenie Coupkova
Marco Morosin

Abstract

Recent experiments have shown that several high-dimensional datasets have a high probability of forming binary clusters when the data are projected onto a randomly chosen one-dimensional subspace. This paper presents a probability model for high-dimensional data that could explain this phenomenon. The model consists of several Bernoulli random variables that describe a parallelotope frame. The parallelotope forms a skeleton to which noise can be added. While clusters in high dimension are difficult to observe, or may not exist at all, the groupings of sample points drawn from such a distribution can be easily identified. Such groupings can be used in place of clustering, especially when the dataset is small, and their statistical significance can be tested. The existence of such structure in the underlying probability model of a dataset allows datapoints to be grouped semantically according to different criteria. More generally, it provides a simple binary representation of data points in which each bit encodes group membership under one criterion, and which could potentially be used for data compression.
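
As a rough illustration of the kind of model the abstract describes, the sketch below draws points from the vertices of a parallelotope selected by independent Bernoulli bits, adds noise, and projects the sample onto a random direction. It is not the authors' implementation. The specific form x = origin + sum_i b_i * edge_i + noise, the isotropic Gaussian noise, the independence of the bits, and all names (`sample_parallelotope_points`, `project`, `edges`, `probs`, `noise_sd`) are assumptions made for illustration only.

```python
import random


def sample_parallelotope_points(n, origin, edges, probs, noise_sd, rng):
    """Draw n points x = origin + sum_i b_i * edges[i] + Gaussian noise.

    Assumed model (not taken from the paper): each bit b_i is an
    independent Bernoulli(probs[i]) variable that switches edge i on or off.
    Returns the points and the bit vector behind each point.
    """
    dim = len(origin)
    points, bits = [], []
    for _ in range(n):
        # One Bernoulli draw per edge of the parallelotope.
        b = [1 if rng.random() < p else 0 for p in probs]
        x = list(origin)
        for b_i, edge in zip(b, edges):
            if b_i:
                for j in range(dim):
                    x[j] += edge[j]
        # Isotropic Gaussian noise around the selected vertex (an assumption).
        x = [x_j + rng.gauss(0.0, noise_sd) for x_j in x]
        points.append(x)
        bits.append(b)
    return points, bits


def project(points, direction):
    """Project each point onto the one-dimensional subspace spanned by direction."""
    norm = sum(d * d for d in direction) ** 0.5
    return [sum(x_j * d_j for x_j, d_j in zip(x, direction)) / norm for x in points]


# Hypothetical settings: 50 dimensions, 2 Bernoulli bits, small noise.
rng = random.Random(0)
dim = 50
origin = [0.0] * dim
edges = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(2)]
points, bits = sample_parallelotope_points(200, origin, edges, [0.5, 0.5], 0.05, rng)

# Random one-dimensional projection of the sample.
direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
projected = project(points, direction)
```

In this sketch each row of `bits` is the binary representation mentioned in the abstract, with one bit per grouping criterion. A histogram of `projected` is one way to inspect how the sample groups along the chosen direction.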

Version published to 10.21203/rs.3.rs-8424340/v1 on Research Square
Feb 18, 2026

Related articles

A Stochastic Block Prior for Clustering in Graphical Models

This article has 6 authors:
1. Nikola Sekulovski
2. Giuseppe Arena
3. Jonas M B Haslbeck
4. Karoline Huth
5. Nial Friel
6. Maarten Marsman
This article has no evaluations
Latest version Feb 26, 2026

A Generalized Geometric Theoretical Framework of Centroid Discriminant Analysis for Linear Classification of Multi-Dimensional Data

This article has 3 authors:
1. Yue Wu
2. Jialin Zhao
3. Carlo Vittorio Cannistraci
This article has no evaluations
Latest version Apr 2, 2026

Using Classification Trees to Identify the Best Method in Monte Carlo Simulations: From Population Parameters to Observed Features

This article has 2 authors:
1. Jeongwon Choi
2. Hao Wu
This article has no evaluations
Latest version Mar 27, 2026
