CAT: Content-Adaptive Image Tokenization

Junhong Shen
Kushal Tirumala
Michihiro Yasunaga
Ishan Misra
Luke Zettlemoyer
Lili Yu
Chunting Zhou

Abstract

Most existing image tokenizers encode images into a fixed number of tokens or patches, overlooking the inherent variability in image complexity. To address this, we introduce the Content-Adaptive Tokenizer (CAT), which dynamically adjusts representation capacity based on the image content and encodes simpler images into fewer tokens. We design a caption-based evaluation system that leverages large language models (LLMs) to predict content complexity and determine the optimal compression ratio for a given image, taking into account factors critical to human perception. Trained on images with diverse compression ratios, CAT demonstrates robust performance in image reconstruction. We also use its variable-length latent representations to train Diffusion Transformers (DiTs) for ImageNet generation. By optimizing token allocation, CAT improves the FID score over fixed-ratio baselines trained with the same FLOPs and boosts inference throughput by 18.5%.
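To make the idea of content-adaptive token allocation concrete, here is a minimal Python sketch of how a predicted complexity score might select a compression ratio, and how that ratio determines the token count. The function names, the score range, the thresholds, and the candidate ratios (8, 16, 32) are all illustrative assumptions; the abstract does not specify them, and this is not the authors' implementation.

```python
# Illustrative sketch only: the thresholds, candidate ratios, and function
# names below are assumptions, not values taken from the CAT paper.

def choose_compression_ratio(complexity: float) -> int:
    """Map a complexity score in [0, 1] to a per-side downsampling factor."""
    if not 0.0 <= complexity <= 1.0:
        raise ValueError("complexity must be in [0, 1]")
    if complexity < 0.33:
        return 32  # simple image: compress aggressively, fewer tokens
    if complexity < 0.66:
        return 16
    return 8       # complex image: compress less, more tokens


def token_count(height: int, width: int, ratio: int) -> int:
    """Number of latent tokens when each side is downsampled by `ratio`."""
    if height % ratio or width % ratio:
        raise ValueError("image sides must be divisible by the ratio")
    return (height // ratio) * (width // ratio)


for score in (0.1, 0.5, 0.9):
    ratio = choose_compression_ratio(score)
    print(score, ratio, token_count(256, 256, ratio))
```

Under these assumed settings, a 256x256 image judged simple is represented by 64 tokens, while one judged complex uses 1024, which is where savings in compute for simpler images would come from.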

Version published to 10.32388/wcbnq2
Jan 17, 2025

Related articles

An Efficient and Training-Free Approach for Subject-Driven Text-to-Image Generation

This article has 3 authors:
1. Gregory Yu
2. Ian Butler
3. Aaron Collins
This article has no evaluations. Latest version: Jan 9, 2026

A Convolutional Autoencoder-Based Method for Vector Curve Data Compression

This article has 4 authors:
1. Shuo Zhang
2. Pengcheng Liu
3. Hongran Ma
4. Mingwu Guo
This article has no evaluations. Latest version: Dec 24, 2025

Multimodal Model Based on Contrastive Language-Image Pretraining for Micro-Expression Recognition

This article has 5 authors:
1. Peng Yang
2. Xiaoguang Wu
3. Yanyang Zhou
4. Qilin Wei
5. Zhifeng Zeng
This article has no evaluations. Latest version: Dec 17, 2025
