Statistical Order in Representation Learning: Sufficiency, Architectural Blindness, and Generalization Bounds via Maximum Entropy

Abstract

We establish three formal results about the role of statistical order in representation learning. (i) Sufficiency: under a maximum-entropy (Gibbs) data model, a representation that preserves all order-K multipoint statistics is a sufficient statistic for the model's natural parameters, a condition equivalent to saturating the mutual information I(Z; Φ_K(X)) = H(Φ_K(X)). (ii) Architectural blindness: for any r < K, there exist distributions that agree on all order-r statistics yet differ at order K; consequently, no function whose interaction order is at most r can distinguish them. This provides a formal explanation for the texture bias of convolutional networks and their failure on shape-based tasks. (iii) Generalization bounds: the out-of-distribution error of any predictor that depends on order-K statistics is bounded by the order-K statistical discrepancy D_K between the training and test distributions, an interpretable, computable quantity related to a polynomial-kernel MMD. We develop these results within a unified maximum-entropy framework that models data distributions as exponential families constrained by multipoint statistics, connecting statistical physics, biological vision, and machine learning. Primate visual cortex provides independent motivation: V2, but not V1, is selectively sensitive to higher-order multipoint correlations (Yu et al., 2015), suggesting that the brain incrementally constructs sufficient statistics of increasing order K. We validate all three predictions on controlled binary-texture stimuli that isolate statistical structure at each order while holding lower-order statistics fixed. At every target order K ∈ {2, 3, 4}, classifiers using order-r features with r < K perform at chance, while order-K features yield perfect accuracy, an exact empirical realization of the architectural blindness theorem. D_K tracks the generalization gap of a diffusion model (ρ = 0.94), and transformer layer depth monotonically increases K-sufficiency, consistent with the information-theoretic characterization. These results unify texture bias, adversarial vulnerability, and distribution shift under a single statistical framework, and suggest D_K as a principled pre-deployment diagnostic for generalization gaps.
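
To illustrate how an order-K discrepancy of the kind described above might be estimated in practice, the sketch below computes a degree-K polynomial-kernel MMD between two samples of ±1-valued texture patches. This is a minimal sketch under assumptions: the function names (poly_kernel, mmd2_order_K), the homogeneous-kernel choice, and the toy data are illustrative and are not the paper's exact definition of D_K.

```python
# Illustrative sketch only: an order-K discrepancy estimated as a degree-K
# polynomial-kernel MMD. Names and the kernel choice are assumptions, not the
# paper's exact D_K.
import numpy as np

def poly_kernel(X, Y, K):
    """Homogeneous degree-K polynomial kernel on flattened patches (rows)."""
    d = X.shape[1]
    # The feature map of this kernel consists of K-fold products of coordinates,
    # so it compares order-K multipoint statistics of the two samples.
    return (X @ Y.T / d) ** K

def mmd2_order_K(X, Y, K):
    """Unbiased squared-MMD estimate with a degree-K polynomial kernel."""
    Kxx, Kyy, Kxy = poly_kernel(X, X, K), poly_kernel(Y, Y, K), poly_kernel(X, Y, K)
    n, m = X.shape[0], Y.shape[0]
    # Drop diagonal terms in the within-sample averages for an unbiased estimate.
    return ((Kxx.sum() - np.trace(Kxx)) / (n * (n - 1))
            + (Kyy.sum() - np.trace(Kyy)) / (m * (m - 1))
            - 2.0 * Kxy.mean())

# Toy usage: two samples of 8x8 binary (+/-1) texture patches, flattened.
rng = np.random.default_rng(0)
X = rng.choice([-1.0, 1.0], size=(500, 64))  # e.g. patches from a training distribution
Y = rng.choice([-1.0, 1.0], size=(500, 64))  # e.g. patches from a test distribution
for K in (2, 3, 4):
    print(f"K={K}: MMD^2 ~ {mmd2_order_K(X, Y, K):.3e}")
```

Because the homogeneous degree-K kernel's feature map contains exactly the K-fold pixel products, the resulting estimate is sensitive only to differences in order-K multipoint statistics, which is the property the architectural-blindness result turns on.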
