Attention Heatmap Drift in a Contrastively Pretrained Vision–Language Model: A Controlled Matched-Learning-Rate Comparison of Full Fine-Tuning and Low-Rank Adaptation


Abstract

Downstream adaptation of a contrastively pretrained vision–language model can improve in-domain accuracy while degrading performance on unseen transfer tasks. This study examines how full fine-tuning and low-rank adaptation alter attention heatmaps under a controlled design that matches learning rate across adaptation methods. The completed matched-learning-rate matrix contains 80 runs using the OpenAI Contrastive Language–Image Pretraining model with a base 32-patch vision transformer image encoder (CLIP ViT-B/32), two datasets (EuroSAT and Oxford-IIIT Pets), four shared learning rates (1e-6, 5e-6, 1e-5, and 5e-5), and five random seeds. We measure classification-token-to-patch attention entropy, the fraction of patches required to capture 95% of attention mass, attention concentration, head diversity, in-domain validation accuracy, and adapter-aware zero-shot accuracy on CIFAR-100. Three findings emerge. First, learning rate is a primary determinant of structural drift: on EuroSAT, full fine-tuning moves from entropy broadening at 1e-6 (+1.83%) to marked contraction at 5e-5 (-3.99%), whereas low-rank adaptation remains entropy-positive across the full matched grid (+0.68% to +1.50%). Second, low-rank adaptation preserves out-of-domain transfer substantially better than full fine-tuning at matched learning rates: averaged across the EuroSAT grid, zero-shot accuracy on CIFAR-100 is 45.13% for low-rank adaptation versus 11.28% for full fine-tuning; on Oxford-IIIT Pets, the corresponding averages are 58.01% and 8.54%. Third, Oxford-IIIT Pets exhibits a clear interaction with optimization scale: low-learning-rate low-rank adaptation underfits the in-domain task, so method-only averages can obscure the regime in which it becomes competitive. Additional rollout, patch-to-patch, centered-kernel-alignment, and backbone analyses are directionally consistent with these controlled results.
Across both controlled datasets, runs that retain broader attention support also preserve more zero-shot performance. Taken together, these findings support attention heatmap drift as an informative descriptive lens on model adaptation while arguing against interpreting the observed behavior as a single, universal collapse phenomenon.
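The two attention-geometry metrics named above have direct definitions: entropy of the normalized classification-token-to-patch attention distribution, and the smallest fraction of patches whose sorted attention weights sum to 95% of the total mass. The sketch below is an illustrative reconstruction from those definitions only, not the authors' released code; the function names and the 7×7 patch-grid example are assumptions.

```python
import numpy as np

def attention_entropy(attn):
    """Shannon entropy (nats) of a CLS-to-patch attention distribution."""
    attn = np.asarray(attn, dtype=np.float64)
    attn = attn / attn.sum()          # normalize to a probability distribution
    p = attn[attn > 0]                # drop zeros so log is defined
    return float(-(p * np.log(p)).sum())

def support_fraction(attn, mass=0.95):
    """Fraction of patches needed to capture `mass` of the attention."""
    attn = np.asarray(attn, dtype=np.float64)
    attn = attn / attn.sum()
    sorted_attn = np.sort(attn)[::-1]         # largest weights first
    cum = np.cumsum(sorted_attn)
    k = int(np.searchsorted(cum, mass)) + 1   # smallest prefix reaching `mass`
    return k / attn.size

# Toy example on a 49-patch (7x7) grid:
uniform = np.ones(49)                 # maximally broad attention
peaked = np.full(49, 0.001)           # nearly all mass on one patch
peaked[0] = 1.0

print(attention_entropy(uniform))     # log(49): maximal entropy
print(support_fraction(uniform))      # near 1: almost all patches needed
print(support_fraction(peaked))       # near 0: one patch suffices
```

Under this reading, the "entropy contraction" reported for high-learning-rate full fine-tuning corresponds to both numbers falling: attention mass piles onto fewer patches, so entropy and the 95%-mass support fraction shrink together.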
