Adaptive Cluster-Count Autoencoders with Dirichlet Process Priors for Geometry-Aware Single-Cell Representation Learning

Zeyu Fu


Abstract

Standard autoencoders for single-cell transcriptomics learn latent spaces whose cluster structure emerges only post hoc through K-means or community detection, leaving cluster count and boundary quality uncontrolled during training. Here we ask whether imposing an adaptive nonparametric prior can shift this balance. We equip a feedforward autoencoder with an online Dirichlet Process Mixture Model (DPMM) prior that refits cluster assignments throughout training and directly regularizes latent compactness and separation. Across 56 scRNA-seq datasets the DPMM prior produces a pronounced geometry–concordance trade-off: cluster compactness (ASW) improves by 127% and Davies–Bouldin overlap drops by 47%, but label-recovery metrics decline (NMI −17%, ARI −21%) and downstream kNN accuracy falls from 0.784 to 0.725. Wilcoxon signed-rank tests confirm that the geometry gains are significant with large Cliff’s δ effects while concordance losses remain bounded and non-significant. A second-stage conditional-flow refinement (DPMM-FM) further improves projection fidelity (DRE 0.751, LSE 0.695, DREX 0.873) at additional concordance cost, revealing a three-tier operating regime: prior-free for label recovery, DPMM for manifold geometry, and DPMM-FM for visualization fidelity. Against 18 external baselines DPMM-Base wins 70.5% of core-metric comparisons (p < 0.05). Gene Ontology enrichment confirms that geometry-improved latent components recover coherent biological programs. Rather than claiming universal superiority, this study characterizes the operating envelope of nonparametric mixture priors and identifies the task contexts (trajectory analysis, manifold visualization, and program-level annotation) where adaptive geometric structure outweighs label-counting accuracy.
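The two metric families contrasted in the abstract, geometry (ASW, Davies–Bouldin) and concordance (NMI, ARI), can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and not the authors' method: scikit-learn's `BayesianGaussianMixture` (a truncated variational Dirichlet-process mixture) stands in for the online DPMM prior, synthetic blobs stand in for a learned latent space, ASW is taken to mean the average silhouette width, and the helper name `evaluate_latent` is hypothetical.

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.metrics import (
    adjusted_rand_score,
    davies_bouldin_score,
    normalized_mutual_info_score,
    silhouette_score,
)
from sklearn.mixture import BayesianGaussianMixture


def evaluate_latent(z, true_labels, max_components=10, seed=0):
    """Cluster latent codes with a truncated DP mixture and score the result.

    Hypothetical helper: `max_components` is only an upper bound; the
    Dirichlet-process prior lets unused components shrink toward zero weight,
    so the effective cluster count adapts to the data.
    """
    dpmm = BayesianGaussianMixture(
        n_components=max_components,
        weight_concentration_prior_type="dirichlet_process",
        max_iter=500,
        random_state=seed,
    )
    pred = dpmm.fit_predict(z)
    return {
        # Number of components that actually received cells.
        "n_clusters": int(np.unique(pred).size),
        # Geometry metrics: depend only on the latent coordinates and the
        # predicted partition (higher ASW is better, lower DBI is better).
        "asw": float(silhouette_score(z, pred)),
        "dbi": float(davies_bouldin_score(z, pred)),
        # Concordance metrics: agreement with reference labels.
        "nmi": float(normalized_mutual_info_score(true_labels, pred)),
        "ari": float(adjusted_rand_score(true_labels, pred)),
    }


# Synthetic stand-in for an 8-dimensional latent space with 3 cell types.
z, y = make_blobs(
    n_samples=300, centers=3, n_features=8, cluster_std=0.5, random_state=0
)
scores = evaluate_latent(z, y)
for name, value in scores.items():
    print(f"{name}: {value}")
```

Because the geometry metrics never see the reference labels, a prior that tightens and separates clusters can raise ASW and lower Davies–Bouldin even when the resulting partition splits or merges annotated cell types, which is the kind of trade-off the abstract reports.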

Version published to 10.64898/2026.03.26.714611 on bioRxiv
Mar 30, 2026

Tsallis-Gated Autoencoder: A Nonextensive Physics-Informed Approach for Unsupervised Anomaly Detection in Glioblastoma Multiforme RNA-seq Data

This article has 2 authors:
1. Sergio Assuncao Monteiro
2. Fabricio Alves Barbosa da Silva
This article has no evaluations
Latest version May 15, 2026

Revisiting Reconstruction Likelihood: Variational Autoencoders for Biological and Biomedical Data Clustering

This article has 3 authors:
1. Andrej Korenić
2. Ufuk Özkaya
3. Abdulkerim Çapar
This article has no evaluations
Latest version Apr 12, 2026

MuseDrift: Navigating Protein Evolutionary Manifolds with Conditional Discrete Diffusion

This article has 2 authors:
1. Chaoyang Wang
2. Yiquan Wang
This article has no evaluations
Latest version May 12, 2026
