A Generalized Geometric Theory of Centroid Discriminant Analysis for Linear Classification of Multi-dimensional Data

Abstract

With the advent of the neural network era, traditional machine learning methods have increasingly been overshadowed. Nevertheless, continued research on the role of geometry in learning from data is crucial for envisioning and understanding new principles behind the design of efficient machine learning. Nonlinear classifiers often build upon linear ones by leveraging shared underlying properties, with their performance largely dependent on the effectiveness of the foundational linear model. Linear classifiers are favored in certain tasks due to their reduced susceptibility to overfitting and their ability to provide interpretable decision boundaries. In biomedical data science, employing an efficient linear classifier is often the first step in assessing the intrinsic complexity of a dataset. However, achieving both scalability and high predictive performance in linear classification remains a persistent challenge. Here, we propose a theoretical framework named geometric discriminant analysis (GDA). GDA comprises the family of linear classifiers that can be expressed as a function of a centroid discriminant basis (CDB0), the line connecting the two class centroids, adjusted by geometric corrections under different constraints. We demonstrate that linear discriminant analysis (LDA) is a subcase of the GDA theory, and we show its convergence to CDB0 under certain conditions. Then, based on the GDA framework, we propose an efficient linear classifier named centroid discriminant analysis (CDA), defined as a special case of GDA under a two-dimensional (2D) plane geometric constraint. CDA training is initialized from CDB0 and iteratively computes new adjusted centroid discriminant lines whose optimal rotations on the associated 2D planes are found via Bayesian optimization. CDA scales well, with quadratic time complexity, lower than that of LDA and support vector machine (SVM), which are cubic. Results on 27 real datasets, covering classification tasks on standard images, medical images and chemical properties, offer empirical evidence that CDA outperforms other linear methods such as LDA, SVM and fast SVM in terms of scalability, performance and stability. Furthermore, we show that linear CDA can be generalized to nonlinear CDA via the kernel method, demonstrating improvements over the linear version on two challenging datasets involving image and chemical data classification. GDA theory may inspire the design of new linear and nonlinear classifiers under different geometric constraints. The general validity of GDA as a new theory for designing machine learning methods can pave the way towards a deeper understanding of the role of geometry in learning from data.
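To make the geometric idea behind CDB0 concrete, the following minimal Python sketch computes the centroid discriminant basis (the unit vector along the line connecting the two class centroids) and classifies points by projecting them onto it; a helper illustrates rotating that direction within a 2D plane, the kind of geometric adjustment CDA searches via Bayesian optimization. This is an illustrative assumption-based sketch, not the authors' implementation: the decision threshold (the centroid midpoint), the choice of rotation plane, and all names are hypothetical, and the Bayesian search over rotation angles is not reproduced here.

import numpy as np

def cdb0_direction(X, y):
    # Unit vector from the class-0 centroid to the class-1 centroid,
    # plus the midpoint between the two centroids (used as a threshold here).
    c0 = X[y == 0].mean(axis=0)
    c1 = X[y == 1].mean(axis=0)
    w = c1 - c0
    return w / np.linalg.norm(w), (c0 + c1) / 2.0

def cdb0_predict(X, w, midpoint):
    # Assign class 1 to points whose projection onto w lies past the midpoint.
    return ((X - midpoint) @ w > 0).astype(int)

def rotate_in_plane(w, v, theta):
    # Rotate the unit direction w by angle theta inside the 2D plane spanned
    # by w and v (v is orthonormalized against w first). In CDA, the abstract
    # describes searching such rotation angles with Bayesian optimization.
    v = v - (v @ w) * w
    v = v / np.linalg.norm(v)
    return np.cos(theta) * w + np.sin(theta) * v

# Usage on toy Gaussian data (hypothetical example).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 5)), rng.normal(1.0, 1.0, (50, 5))])
y = np.array([0] * 50 + [1] * 50)
w, m = cdb0_direction(X, y)
print("CDB0 training accuracy:", (cdb0_predict(X, w, m) == y).mean())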
