MDL-AE: Investigating the Trade-Off Between Compressive Fidelity and Discriminative Utility in Self-Supervised Learning
Abstract
Current paradigms in Self-Supervised Learning (SSL) achieve state-of-the-art results through complex, heuristic-driven pretext tasks such as contrastive learning or masked image modeling. This work proposes a departure from these heuristics by reframing SSL through the fundamental principle of Minimum Description Length (MDL). We introduce the MDL-Autoencoder (MDL-AE), a framework that learns visual representations by optimizing a VQ-VAE-based objective to find the most efficient discrete compression of visual data. We conduct a rigorous series of experiments on CIFAR-10, demonstrating that this compression-driven objective successfully learns a rich vocabulary of local visual concepts. However, our investigation uncovers a critical and non-obvious architectural insight: although a more powerful tokenizer learns a visibly higher-fidelity vocabulary of visual concepts, it fails to improve downstream performance, revealing that the nature of the learned representation dictates the optimal downstream architecture. We show that our MDL-AE learns a vocabulary of holistic object parts rather than generic, composable primitives. Consequently, we find that a sophisticated Vision Transformer (ViT) head, a state-of-the-art tool for modeling token relationships, consistently fails to outperform a simple linear probe on the flattened feature map. This architectural mismatch shows that the most powerful downstream aggregator is not always the most effective. To validate this, we demonstrate that a dedicated self-supervised alignment task, based on Masked Autoencoding of the discrete tokens, resolves the mismatch and dramatically improves performance, bridging the gap between generative fidelity and discriminative utility. Our work provides a compelling end-to-end case study on the importance of co-designing objectives and their downstream architectures, showing that token-specific pre-training is crucial for unlocking the potential of powerful aggregators.
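
As context for the objective named above, the following is a minimal sketch of how a VQ-VAE-based compression objective is commonly written and how it admits an MDL reading; the notation (encoder E, decoder D, nearest codebook vector e, quantized code z_q, commitment weight \beta, N tokens drawn from a codebook of size K) follows the standard VQ-VAE formulation and is our assumption, not necessarily the paper's exact loss.

% MDL reading: the description length of an image decomposes into the cost of
% its discrete token sequence plus the cost of the reconstruction residual
% (assuming a uniform code over the K codebook entries).
\[
  L(x) \;\approx\; \underbrace{N \log_2 K}_{\text{token cost}} \;+\; \underbrace{-\log_2 p\!\left(x \mid z_q(x)\right)}_{\text{residual cost}}
\]
% Standard VQ-VAE training loss (sg = stop-gradient), as in van den Oord et al. (2017):
\[
  \mathcal{L}_{\mathrm{VQ\text{-}VAE}}
  \;=\; \bigl\| x - D\!\left(z_q(x)\right) \bigr\|_2^2
  \;+\; \bigl\| \mathrm{sg}[E(x)] - e \bigr\|_2^2
  \;+\; \beta \, \bigl\| E(x) - \mathrm{sg}[e] \bigr\|_2^2
\]

Under this reading, minimizing the reconstruction term shrinks the residual cost, while the fixed-size discrete codebook bounds the token cost, so training drives the model toward an efficient two-part description of the data.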