MDL-AE: Investigating the Trade-Off Between Compressive Fidelity and Discriminative Utility in Self-Supervised Learning

Abstract

Current paradigms in Self-Supervised Learning (SSL) achieve state-of-the-art results through complex, heuristic-driven pretext tasks such as contrastive learning or masked image modeling. This work departs from these heuristics by reframing SSL around the fundamental principle of Minimum Description Length (MDL). We introduce the MDL-Autoencoder (MDL-AE), a framework that learns visual representations by optimizing a VQ-VAE-based objective to find the most efficient discrete compression of visual data. We conduct a rigorous series of experiments on CIFAR-10, demonstrating that this compression-driven objective learns a rich vocabulary of local visual concepts. However, our investigation uncovers a critical and non-obvious architectural insight: although a more powerful tokenizer learns a visibly higher-fidelity vocabulary of visual concepts, it fails to improve downstream performance, revealing that the nature of the learned representation dictates the optimal downstream architecture. We show that the MDL-AE learns a vocabulary of holistic object parts rather than generic, composable primitives. Consequently, a sophisticated Vision Transformer (ViT) head, a state-of-the-art tool for modeling token relationships, consistently fails to outperform a simple linear probe on the flattened feature map: the most powerful downstream aggregator is not always the most effective. To validate this explanation, we demonstrate that a dedicated self-supervised alignment task, based on Masked Autoencoding of the discrete tokens, resolves the mismatch and dramatically improves performance, bridging the gap between generative fidelity and discriminative utility. Our work provides a compelling end-to-end case study on the importance of co-designing objectives and their downstream architectures, showing that token-specific pre-training is crucial for unlocking the potential of powerful aggregators.
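As a rough illustration of the components described above, the following is a minimal PyTorch sketch of a VQ-VAE-style objective (reconstruction error plus codebook and commitment costs, read here as an MDL-style two-part description length) followed by a linear probe on the flattened quantized feature map. All module names, layer sizes, and hyperparameters are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch; names and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class VectorQuantizer(nn.Module):
    """Maps encoder features to the nearest entry of a discrete codebook."""
    def __init__(self, codebook_size=512, dim=64, beta=0.25):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)
        self.codebook.weight.data.uniform_(-1.0 / codebook_size, 1.0 / codebook_size)
        self.beta = beta  # commitment cost

    def forward(self, z):                                        # z: (B, C, H, W)
        z_flat = z.permute(0, 2, 3, 1).reshape(-1, z.shape[1])
        d = torch.cdist(z_flat, self.codebook.weight)            # distances to codes
        idx = d.argmin(dim=1)                                    # discrete token ids
        z_q = self.codebook(idx).view(z.shape[0], z.shape[2], z.shape[3], -1)
        z_q = z_q.permute(0, 3, 1, 2)
        # codebook + commitment terms: the "cost of the code" part of the description length
        vq_loss = F.mse_loss(z_q, z.detach()) + self.beta * F.mse_loss(z, z_q.detach())
        z_q = z + (z_q - z).detach()                             # straight-through estimator
        return z_q, idx.view(z.shape[0], z.shape[2], z.shape[3]), vq_loss

class MDLAutoencoder(nn.Module):
    """Illustrative MDL-AE: encode, quantize, decode; loss = reconstruction + code cost."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                            # 3x32x32 -> 64x8x8
            nn.Conv2d(3, 64, 4, 2, 1), nn.ReLU(),
            nn.Conv2d(64, 64, 4, 2, 1), nn.ReLU())
        self.quantizer = VectorQuantizer()
        self.decoder = nn.Sequential(                            # 64x8x8 -> 3x32x32
            nn.ConvTranspose2d(64, 64, 4, 2, 1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, 2, 1))

    def forward(self, x):
        z_q, tokens, vq_loss = self.quantizer(self.encoder(x))
        recon = self.decoder(z_q)
        loss = F.mse_loss(recon, x) + vq_loss                    # two-part description length
        return loss, tokens

# Downstream evaluation: a linear probe on the flattened, frozen quantized feature map.
model = MDLAutoencoder().eval()
probe = nn.Linear(64 * 8 * 8, 10)                                # CIFAR-10 classes
with torch.no_grad():
    z_q, _, _ = model.quantizer(model.encoder(torch.randn(16, 3, 32, 32)))
logits = probe(z_q.flatten(1))
```

The ViT head and the masked-token alignment stage discussed in the abstract would operate on the discrete token grid returned by the quantizer rather than on the flattened features; they are omitted here for brevity.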
