Describing Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA
Abstract
Describing land cover changes from multi-temporal remote sensing imagery requires capturing both visual transformations and their semantic meaning in natural language. Existing methods often struggle to balance visual accuracy with descriptive coherence. We propose MVLT-LoRA-CC (Multi-modal Vision Language Transformer with Low-Rank Adaptation for Change Captioning), a framework that integrates a Vision Transformer (ViT), a Large Language Model (LLM), and Low-Rank Adaptation (LoRA) for efficient multi-modal learning. The model processes paired temporal images through patch embeddings and transformer blocks, aligning visual and textual representations via a multi-modal adapter. To improve efficiency and avoid unnecessary parameter growth, LoRA modules are selectively inserted only into the attention projection layers and cross-modal adapter blocks rather than being uniformly applied to all linear layers. This targeted design preserves general linguistic knowledge while enabling effective adaptation to remote sensing change description. To assess performance, we introduce the Complementary Consistency Score (CCS) framework, which evaluates both descriptive fidelity for change instances and classification accuracy for no-change cases. Experiments on the LEVIR-CC test set demonstrate that MVLT-LoRA-CC generates semantically accurate captions, surpassing prior methods in both descriptive richness and temporal change recognition. The approach provides a scalable solution for multi-modal land cover change description in remote sensing applications.
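To ground the architectural description, the following is a minimal PyTorch sketch of how a before/after image pair could be turned into a joint patch-token sequence for the transformer blocks. It is illustrative only: the class name, image and patch sizes, embedding dimension, and the learned temporal embedding are assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class BiTemporalPatchEmbed(nn.Module):
    """Patchify a before/after image pair into one token sequence
    (hypothetical sizes; the paper's ViT configuration is not reproduced here)."""
    def __init__(self, img_size: int = 224, patch: int = 16, dim: int = 768):
        super().__init__()
        # Non-overlapping patch projection, as in a standard ViT stem.
        self.proj = nn.Conv2d(3, dim, kernel_size=patch, stride=patch)
        n = (img_size // patch) ** 2
        self.pos = nn.Parameter(torch.zeros(1, n, dim))        # positional embedding per patch
        self.time_embed = nn.Parameter(torch.zeros(2, 1, dim)) # marks which epoch a token is from

    def forward(self, img_t1: torch.Tensor, img_t2: torch.Tensor) -> torch.Tensor:
        # (B, 3, H, W) -> (B, N, dim) for each temporal image.
        tok1 = self.proj(img_t1).flatten(2).transpose(1, 2) + self.pos + self.time_embed[0]
        tok2 = self.proj(img_t2).flatten(2).transpose(1, 2) + self.pos + self.time_embed[1]
        return torch.cat([tok1, tok2], dim=1)  # (B, 2*N, dim) joint sequence
```

Concatenating both epochs into one sequence lets the transformer's attention compare tokens across time directly, which is one plausible reading of how the paired images feed the shared blocks.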
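To make the selective-adaptation idea concrete, the sketch below shows one way targeted LoRA injection could look in PyTorch: only layers whose names match the attention/adapter projections are wrapped, while every other linear layer stays frozen without LoRA. The module names (`q_proj`, `k_proj`, `v_proj`, `adapter_proj`), rank, and scaling are hypothetical placeholders rather than the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear plus a trainable low-rank update: y = Wx + (alpha/r) * B A x."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # pretrained weights stay frozen
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

def inject_lora(model: nn.Module,
                targets: tuple = ("q_proj", "k_proj", "v_proj", "adapter_proj")) -> nn.Module:
    """Wrap only the named projection layers, leaving all other
    linear layers (e.g., the MLP blocks) untouched."""
    for module in model.modules():
        for name, child in list(module.named_children()):
            if isinstance(child, nn.Linear) and name.endswith(targets):
                setattr(module, name, LoRALinear(child))
    return model
```

Under this scheme only the A and B matrices receive gradients, so the trainable parameter count grows with the rank r rather than with the full projection matrices, which is the efficiency argument the abstract makes for selective rather than uniform insertion.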
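Similarly, the two-part evaluation behind CCS can be sketched as a split over change and no-change test pairs. This is an assumption-laden illustration: the paper defines the actual CCS formulation and how the two components are combined; here BLEU merely stands in for the descriptive-fidelity term, exact match against a fixed "no change" caption stands in for the classification term, and all inputs are hypothetical.

```python
from nltk.translate.bleu_score import sentence_bleu

NO_CHANGE = "the two scenes look identical"  # hypothetical no-change caption

def split_evaluate(predictions, references):
    """Score change pairs on descriptive fidelity and no-change pairs on
    classification accuracy, mirroring the two CCS components."""
    fidelity, correct, n_change, n_static = 0.0, 0, 0, 0
    for pred, ref in zip(predictions, references):
        if ref == NO_CHANGE:                  # no-change case: was it recognized as such?
            n_static += 1
            correct += int(pred == NO_CHANGE)
        else:                                 # change case: how faithful is the caption?
            n_change += 1
            # Bigram BLEU as a stand-in fidelity metric for short captions.
            fidelity += sentence_bleu([ref.split()], pred.split(), weights=(0.5, 0.5))
    return fidelity / max(n_change, 1), correct / max(n_static, 1)

change_score, no_change_acc = split_evaluate(
    ["a road was built across the field", "the two scenes look identical"],
    ["a new road appears in the field", "the two scenes look identical"],
)
print(f"descriptive fidelity: {change_score:.3f}, no-change accuracy: {no_change_acc:.3f}")
```

Separating the two populations like this is what makes the score "complementary": a model cannot inflate its result by emitting generic no-change captions, because change and no-change cases are judged by different criteria.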