Pretraining Objective Shapes Cross-Category Generalization in Affective Image Prediction: A Geometric Comparison of Vision Transformer Encoders

Abstract

The geometry of representations learned by deep neural networks is shaped jointly by architecture and pretraining objective, yet disentangling these two factors remains difficult. Here we isolate the contribution of pretraining objective by comparing two Vision Transformers from the same backbone family but trained under different objectives: language–image contrastive learning (CLIP) and ImageNet-21k classification. Using continuous Valence–Arousal prediction on the OASIS dataset as a probe of representational quality, we evaluated frozen features under Leave-One-Theme-Out and Leave-One-Category-Out cross-validation, the latter requiring extrapolation to entirely unseen semantic categories. The contrastively pretrained encoder generalized substantially better than the classification-pretrained encoder under both protocols, with the gap widening sharply under Leave-One-Category-Out, where every held-out image belongs to a semantic category absent from training. To characterize why the two representations differ, we developed a geometric analysis of prediction errors, treating per-image errors as vectors in the affective plane and quantifying their spatial structure via weighted phase-locking, trajectory-based occupancy entropy, and effective dimensionality. The classification-pretrained representation collapsed errors into a small number of attractor regions with a strong centerward pull, whereas the language-aligned representation distributed errors broadly across the affective space. Layer-wise linear probing further revealed that affective information was distributed across depth in the contrastive encoder but increasingly concentrated in deeper layers of the classification encoder, mirroring the texture-bias and category-anchored statistics characteristic of ImageNet-trained representations. These results provide a representation-geometric account of how the choice of pretraining objective, holding architecture constant, determines whether learned features generalize across semantic boundaries or remain confined to category-bound visual regularities.

Highlights

  • Isolate the effect of pretraining objective by holding the Vision Transformer backbone constant.

  • Contrastively pretrained features generalize across unseen semantic categories where classification-pretrained features fail.

  • Introduce a geometric analysis of prediction errors based on phase-locking and occupancy entropy.

  • Classification pretraining produces concentrated error attractors and a rigid centerward bias.

  • Affective information is distributed across depth in CLIP but localized in late layers of the classification ViT.
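
The error-geometry idea summarized above can be illustrated with a small sketch. The abstract does not give the exact formulas, so the definitions here are assumptions: weighted phase-locking is taken to be the magnitude-weighted resultant length of the error directions, and effective dimensionality is taken to be the participation ratio of the error covariance eigenvalues. The function names and the toy error vectors are hypothetical. Trajectory-based occupancy entropy is omitted because the trajectory construction is not described here.

```python
import math

def weighted_phase_locking(errors):
    """Assumed definition: |sum of error vectors| / sum of error magnitudes.

    errors: list of (d_valence, d_arousal) pairs (prediction minus target).
    Returns a value in [0, 1]; 1 means all errors point the same way.
    """
    total_magnitude = sum(math.hypot(dv, da) for dv, da in errors)
    if total_magnitude == 0:
        return 0.0
    sum_v = sum(dv for dv, _ in errors)
    sum_a = sum(da for _, da in errors)
    return math.hypot(sum_v, sum_a) / total_magnitude

def effective_dimensionality(errors):
    """Assumed definition: participation ratio (sum lambda)^2 / sum lambda^2
    of the 2x2 error covariance, computed via trace(C)^2 / trace(C^2).

    Ranges from 1 (errors on a line) to 2 (isotropic errors).
    """
    n = len(errors)
    mean_v = sum(dv for dv, _ in errors) / n
    mean_a = sum(da for _, da in errors) / n
    sxx = sum((dv - mean_v) ** 2 for dv, _ in errors) / n
    syy = sum((da - mean_a) ** 2 for _, da in errors) / n
    sxy = sum((dv - mean_v) * (da - mean_a) for dv, da in errors) / n
    trace_c_squared = sxx ** 2 + 2 * sxy ** 2 + syy ** 2
    if trace_c_squared == 0:
        return 0.0
    return (sxx + syy) ** 2 / trace_c_squared

# Toy data (hypothetical): errors that all point along +valence,
# and errors spread evenly over four opposing directions.
aligned = [(1.0, 0.0), (2.0, 0.0), (0.5, 0.0)]
spread = [(1.0, 0.0), (-1.0, 0.0), (0.0, 1.0), (0.0, -1.0)]

plv_aligned = weighted_phase_locking(aligned)
plv_spread = weighted_phase_locking(spread)
dim_aligned = effective_dimensionality(aligned)
dim_spread = effective_dimensionality(spread)
```

Under these assumed definitions, high phase-locking together with low effective dimensionality would correspond to errors collapsing toward a shared direction, while low phase-locking and dimensionality near 2 would correspond to errors distributed broadly across the affective plane.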
