Statistical Order in Representation Learning: Sufficiency, Architectural Blindness, and Generalization Bounds via Maximum Entropy

Abstract

We establish three formal results about the role of statistical order in representation learning. (i) Sufficiency: under a maximum-entropy (Gibbs) data model, a representation that preserves all order-K multipoint statistics is a sufficient statistic for the model's natural parameters, a condition equivalent to saturating the mutual information I(Z; Φ_K(X)) = H(Φ_K(X)). (ii) Architectural blindness: for any r < K, there exist distributions that agree on all order-r statistics yet differ at order K; consequently, no function whose interaction order is at most r can distinguish them. This provides a formal explanation for the texture bias of convolutional networks and their failure on shape-based tasks. (iii) Generalization bounds: the out-of-distribution error of any predictor that depends on order-K statistics is bounded by the order-K statistical discrepancy D_K between the training and test distributions, an interpretable, computable quantity related to a polynomial-kernel MMD. We develop these results within a unified maximum-entropy framework that models data distributions as exponential families constrained by multipoint statistics, connecting statistical physics, biological vision, and machine learning. Primate visual cortex provides independent motivation: V2, but not V1, is selectively sensitive to higher-order multipoint correlations (Yu et al., 2015), suggesting that the brain incrementally constructs sufficient statistics of increasing order K. We validate all three predictions on controlled binary-texture stimuli that isolate statistical structure at each order while holding lower-order statistics fixed. At every target order K ∈ {2, 3, 4}, classifiers using order-r features with r < K perform at chance, while order-K features yield perfect accuracy, an exact empirical realization of the architectural blindness theorem. D_K tracks the generalization gap of a diffusion model (ρ = 0.94), and transformer layer depth monotonically increases K-sufficiency, consistent with the information-theoretic characterization. These results unify texture bias, adversarial vulnerability, and distribution shift under a single statistical framework, and suggest D_K as a principled pre-deployment diagnostic for generalization gaps.
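
To illustrate how an order-K discrepancy of the kind described above might be estimated in practice, the sketch below computes a degree-K polynomial-kernel MMD between two samples of ±1-valued texture patches. This is a minimal sketch under assumptions: the function names (poly_kernel, mmd2_order_K), the homogeneous-kernel choice, and the toy data are illustrative and are not the paper's exact definition of D_K.

```python
# Illustrative sketch only: an order-K discrepancy estimated as a degree-K
# polynomial-kernel MMD. Names and the kernel choice are assumptions, not the
# paper's exact D_K.
import numpy as np

def poly_kernel(X, Y, K):
    """Homogeneous degree-K polynomial kernel on flattened patches (rows)."""
    d = X.shape[1]
    # The feature map of this kernel consists of K-fold products of coordinates,
    # so it compares order-K multipoint statistics of the two samples.
    return (X @ Y.T / d) ** K

def mmd2_order_K(X, Y, K):
    """Unbiased squared-MMD estimate with a degree-K polynomial kernel."""
    Kxx, Kyy, Kxy = poly_kernel(X, X, K), poly_kernel(Y, Y, K), poly_kernel(X, Y, K)
    n, m = X.shape[0], Y.shape[0]
    # Drop diagonal terms in the within-sample averages for an unbiased estimate.
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

# Toy usage: two samples of 8x8 binary (+/-1) texture patches, flattened.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(500, 64))  # e.g. patches from a training distribution
Y = rng.choice([-1.0, 1.0], size=(500, 64))  # e.g. patches from a test distribution
for K in (2, 3, 4):
    print(f"K={K}: MMD^2 ~ {mmd2_order_K(X, Y, K):.3e}")
```

Because the homogeneous degree-K kernel's feature map contains exactly the K-fold pixel products, the resulting estimate is sensitive only to differences in order-K multipoint statistics, which is the property the architectural-blindness result turns on.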
