Monitoring Land Cover Changes via Multi-Temporal Remote Sensing Image Captioning Using LLM, ViT, and LoRA
Abstract
Monitoring land cover changes from multi-temporal remote sensing imagery requires detecting visual transformations and describing them in natural language. Existing methods often struggle to balance visual accuracy with linguistic coherence. We propose MVLT-LoRA-CC (Multi-modal Vision-Language Transformer with Low-Rank Adaptation for Change Captioning), a framework that integrates a Vision Transformer (ViT), a Large Language Model (LLM), and Low-Rank Adaptation (LoRA) for efficient multi-modal learning. The model processes image pairs through convolutional patch embeddings and transformer blocks with self-attention and rotary positional encodings, aligning visual and textual representations via a multi-modal adapter. LoRA improves fine-tuning efficiency by introducing low-rank trainable matrices, reducing computational cost while preserving the LLM's linguistic knowledge. We also propose the Complementary Consistency Score (CCS) framework, comprising CCS-BMRC, CCS-MC, and CCS-MCS, to jointly evaluate descriptive accuracy on change samples and classification precision on no-change cases. Experiments on the LEVIR-CC dataset show that MVLT-LoRA-CC surpasses state-of-the-art methods on semantic and consistency metrics. By integrating vision and language pretraining, the model improves generalization, interpretability, and robustness, establishing a scalable approach to multi-modal Earth observation and environmental monitoring.
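To make the positional-encoding step concrete, the sketch below shows one common formulation of rotary positional encodings applied to query/key features. It is a minimal illustration under standard RoPE conventions, not the authors' implementation; the function name, base frequency, and tensor shapes are assumptions.

```python
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Apply rotary positional encoding to features of shape (..., seq, dim).

    Pairs of channels are rotated by a position-dependent angle, so relative
    positions appear in attention dot products. dim must be even. (Illustrative
    sketch; base=10000 follows the common RoPE convention, not the paper.)
    """
    seq, dim = x.shape[-2], x.shape[-1]
    pos = torch.arange(seq, dtype=x.dtype).unsqueeze(-1)              # (seq, 1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=x.dtype) / dim)   # (dim/2,)
    angles = pos * freqs                                              # (seq, dim/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]                               # channel pairs
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin   # 2D rotation of each channel pair
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Usage: rotate queries/keys of a ViT patch sequence before attention.
q = rotary_embed(torch.randn(2, 196, 64))  # (batch, patches, head_dim)
```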
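Similarly, the low-rank adaptation mechanism can be sketched as a frozen pretrained linear layer plus a trainable rank-r update, so that only the small factor matrices are learned during fine-tuning. The wrapper below is a hypothetical illustration; the class name, rank r=8, and scaling alpha=16 are assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer augmented with a trainable low-rank update.

    Effective weight: W + (alpha / r) * B @ A, where only A and B are trained.
    B is zero-initialized, so the adapted layer initially matches the frozen
    pretrained layer. (Illustrative sketch, not the authors' implementation.)
    """
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():   # freeze pretrained weights
            p.requires_grad = False
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Base path plus scaled low-rank correction; only A and B get gradients.
        return self.base(x) + self.scale * (x @ self.lora_A.T @ self.lora_B.T)

# Usage: wrap an attention projection of a pretrained backbone (dims assumed).
proj = LoRALinear(nn.Linear(768, 768))
out = proj(torch.randn(2, 197, 768))   # (batch, tokens, dim)
```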