Pretraining Objective Shapes Cross-Category Generalization in Affective Image Prediction: A Geometric Comparison of Vision Transformer Encoders

Shohei Tsuchimoto
Yuka O Okazaki
Kenichi Yuasa
Sakura Nishijima
Mebuki Izumiya
Makoto Hagihara
Ryo Fujihira
Keiichi Kitajo

Abstract

The geometry of representations learned by deep neural networks is shaped jointly by architecture and pretraining objective, yet disentangling these two factors remains difficult. Here we isolate the contribution of pretraining objective by comparing two Vision Transformers from the same backbone family but trained under different objectives: language–image contrastive learning (CLIP) and ImageNet-21k classification. Using continuous Valence–Arousal prediction on the OASIS dataset as a probe of representational quality, we evaluated frozen features under Leave-One-Theme-Out and Leave-One-Category-Out cross-validation, the latter requiring extrapolation to entirely unseen semantic categories. The contrastively pretrained encoder generalized substantially better than the classification-pretrained encoder under both protocols, with the gap widening sharply when held-out categories required cross-category generalization. To characterize why the two representations differ, we developed a geometric analysis of prediction errors, treating per-image errors as vectors in the affective plane and quantifying their spatial structure via weighted phase-locking, trajectory-based occupancy entropy, and effective dimensionality. The classification-pretrained representation collapsed errors into a small number of attractor regions with a strong center-ward pull, whereas the language-aligned representation distributed errors broadly across the affective space. Layer-wise linear probing further revealed that affective information was distributed across depth in the contrastive encoder but increasingly concentrated in deeper layers of the classification encoder, mirroring the texture-bias and category-anchored statistics characteristic of ImageNet-trained representations. 
These results provide a representation-geometric account of how the choice of pretraining objective, holding architecture constant, determines whether learned features generalize across semantic boundaries or remain confined to category-bound visual regularities.
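The Leave-One-Category-Out protocol described above can be illustrated with a minimal sketch. It is not the authors' pipeline: the study probes frozen Vision Transformer features, whereas this example uses a single scalar feature and an ordinary least-squares line so that it runs with the Python standard library alone. The function names (`leave_one_group_out`, `fit_line`, `loco_rmse`), the category labels, and the toy data are all hypothetical.

```python
from collections import defaultdict


def leave_one_group_out(groups):
    """Yield (held_out_group, train_idx, test_idx) once per distinct group."""
    by_group = defaultdict(list)
    for i, g in enumerate(groups):
        by_group[g].append(i)
    for g in sorted(by_group):
        test_idx = by_group[g]
        held = set(test_idx)
        train_idx = [i for i in range(len(groups)) if i not in held]
        yield g, train_idx, test_idx


def fit_line(xs, ys):
    """Ordinary least squares for y = a * x + b with one scalar feature."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx if sxx > 0 else 0.0
    return a, my - a * mx


def loco_rmse(features, targets, categories):
    """Held-out RMSE per category: fit on all other categories, test on one."""
    scores = {}
    for g, train_idx, test_idx in leave_one_group_out(categories):
        a, b = fit_line([features[i] for i in train_idx],
                        [targets[i] for i in train_idx])
        sq_err = [(a * features[i] + b - targets[i]) ** 2 for i in test_idx]
        scores[g] = (sum(sq_err) / len(sq_err)) ** 0.5
    return scores


# Toy data (hypothetical): one feature, one affective target, three categories.
features = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
targets = [2.0 * x + 1.0 for x in features]
categories = ["animal", "animal", "object", "object", "scene", "scene"]
scores = loco_rmse(features, targets, categories)
```

Because every image of the held-out category is excluded from fitting, the score for that category measures extrapolation rather than interpolation, which is the property the abstract relies on when contrasting the two encoders.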

Highlights

Isolate the effect of pretraining objective by holding the Vision Transformer backbone constant.
Contrastively pretrained features generalize across unseen semantic categories where classification-pretrained features fail.
Introduce a geometric analysis of prediction errors based on phase-locking and occupancy entropy.
Classification pretraining produces concentrated error attractors and a rigid center-ward bias.
Affective information is distributed across depth in CLIP but localized in late layers of the classification ViT.
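The three error-geometry quantities named above are not defined on this page, so the following is only one plausible reading, sketched with the Python standard library. It assumes each error is a two-dimensional vector (predicted minus rated valence, predicted minus rated arousal), takes the error magnitude as the phase-locking weight, uses the participation ratio of the error covariance as effective dimensionality, and computes occupancy entropy on a fixed grid over a 1 to 7 rating range. It does not reproduce the trajectory-based construction the authors mention. All function names and parameters are hypothetical.

```python
import math
from collections import Counter


def weighted_plv(errors):
    """Magnitude-weighted phase-locking of 2-D error vectors.

    With weights w_i = |e_i|, the sum of w_i * exp(i * theta_i) is the plain
    vector sum, so the value is |sum e_i| / sum |e_i|, which lies in [0, 1].
    """
    total_w = sum(math.hypot(dv, da) for dv, da in errors)
    if total_w == 0:
        return 0.0
    sx = sum(dv for dv, _ in errors)
    sy = sum(da for _, da in errors)
    return math.hypot(sx, sy) / total_w


def effective_dim(errors):
    """Participation ratio trace(C)^2 / trace(C^2) of the 2x2 covariance C."""
    n = len(errors)
    mv = sum(dv for dv, _ in errors) / n
    ma = sum(da for _, da in errors) / n
    cvv = sum((dv - mv) ** 2 for dv, _ in errors) / n
    caa = sum((da - ma) ** 2 for _, da in errors) / n
    cva = sum((dv - mv) * (da - ma) for dv, da in errors) / n
    trace = cvv + caa
    trace_sq = cvv ** 2 + 2 * cva ** 2 + caa ** 2
    return trace * trace / trace_sq if trace_sq > 0 else 0.0


def occupancy_entropy(points, bins=4, lo=1.0, hi=7.0):
    """Shannon entropy of grid-cell occupancy, normalized to [0, 1]."""
    def cell(x):
        k = int((x - lo) / (hi - lo) * bins)
        return min(max(k, 0), bins - 1)  # clamp points on or past the edges

    counts = Counter((cell(v), cell(a)) for v, a in points)
    n = len(points)
    h = -sum((c / n) * math.log(c / n) for c in counts.values())
    return h / math.log(bins * bins)


# Toy inputs (hypothetical): errors that all point the same way,
# and predictions that all land in the same grid cell.
aligned_errors = [(1.0, 0.0), (1.0, 0.0)]
collapsed_predictions = [(4.0, 4.0), (4.1, 4.2), (4.2, 4.1)]
plv = weighted_plv(aligned_errors)
ent = occupancy_entropy(collapsed_predictions)
```

Under these definitions, a representation that pulls every prediction toward a few regions would show high phase-locking and low occupancy entropy, while errors spread evenly across directions and locations would show the opposite pattern.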

Version published to 10.64898/2026.05.11.724194 on bioRxiv
May 13, 2026

Related articles

Attention Heatmap Drift in a Contrastively Pretrained Vision–Language Model: A Controlled Matched-Learning-Rate Comparison of Full Fine-Tuning and Low-Rank Adaptation

This article has 1 author:
1. Ruize Xia
This article has no evaluations. Latest version: Apr 6, 2026

Multimodal large language models converge on the human-like geometry of abstract emotion

This article has 7 authors:
1. Huiguang He
2. Changde Du
3. Yizhuo Lu
4. Zhongyu Huang
5. Yi Sun
6. Zisen Zhou
7. Shaozheng Qin
This article has no evaluations. Latest version: Apr 2, 2026

Facial Expression Recognition in Anime and Manga Characters: A Comparative Study of Vision Transformers and Convolutional Neural Networks

This article has 4 authors:
1. Elia Santoro
2. Luigi Laura
3. Marco Parrillo
4. Valerio Rughetti
This article has no evaluations. Latest version: Apr 20, 2026
