Two-Stage Fine-Tuning of Large Vision-Language Models with Hierarchical Prompting for Few-Shot Object Detection in Remote Sensing Images

Abstract

Few-shot object detection (FSOD) in high-resolution remote sensing (RS) imagery remains challenging due to scarce annotations, large intra-class variability, and high visual similarity between categories, which together limit the generalization ability of convolutional neural network (CNN)-based detectors. To address these challenges, we explore leveraging large vision-language models (LVLMs) for FSOD in RS. We propose a two-stage, parameter-efficient fine-tuning framework with hierarchical prompting that adapts Qwen3-VL for object detection. In the first stage, low-rank adaptation (LoRA) modules are inserted into the vision and text encoders and trained jointly with a Detection Transformer (DETR)-style detection head on fully annotated base classes under three-level hierarchical prompts. In the second stage, the vision LoRA parameters are frozen, the text encoder is updated using K-shot novel-class samples, and the detection head is partially frozen, with selected components refined under the same three-level hierarchical prompting scheme. To preserve base-class performance and reduce class confusion, we further introduce knowledge distillation and semantic consistency losses. Experiments on the DIOR and NWPU VHR-10.v2 datasets show that the proposed method consistently improves novel-class performance while maintaining competitive base-class accuracy and surpasses existing baselines, demonstrating the effectiveness of integrating hierarchical semantic reasoning into LVLM-based FSOD for RS imagery.
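The sketch below illustrates the two-stage freezing schedule and the stage-two loss composition described in the abstract. It is a minimal PyTorch illustration, not the authors' implementation: the module names (vision_lora, text_lora, detection_head), the choice of which head component stays trainable, and the loss weights and temperature are all assumptions made for the example.

```python
# Illustrative sketch of the two-stage fine-tuning schedule.
# Module names and hyperparameters are assumptions, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class LVLMDetector(nn.Module):
    """Stand-in for an LVLM (e.g., Qwen3-VL) with LoRA adapters and a DETR-style head."""
    def __init__(self, dim: int = 256, num_classes: int = 20):
        super().__init__()
        self.vision_lora = nn.Linear(dim, dim)   # placeholder for LoRA params in the vision encoder
        self.text_lora = nn.Linear(dim, dim)     # placeholder for LoRA params in the text encoder
        self.detection_head = nn.ModuleDict({
            "query_embed": nn.Embedding(100, dim),
            "class_head": nn.Linear(dim, num_classes),
            "box_head": nn.Linear(dim, 4),
        })


def set_stage(model: LVLMDetector, stage: int) -> None:
    """Freeze or unfreeze parameter groups according to the training stage."""
    if stage == 1:
        # Stage 1: vision LoRA, text LoRA, and the full detection head train on base classes.
        for p in model.parameters():
            p.requires_grad = True
    else:
        # Stage 2: freeze vision LoRA, keep text LoRA trainable on K-shot novel samples,
        # and refine only selected detection-head components (assumed here: the class head).
        for p in model.vision_lora.parameters():
            p.requires_grad = False
        for p in model.text_lora.parameters():
            p.requires_grad = True
        for name, module in model.detection_head.items():
            trainable = (name == "class_head")
            for p in module.parameters():
                p.requires_grad = trainable


def stage2_loss(det_loss, student_logits, teacher_logits, text_emb, proto_emb,
                lambda_kd: float = 1.0, lambda_sem: float = 0.5, tau: float = 2.0):
    """Detection loss + knowledge distillation + semantic consistency (illustrative weights)."""
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  F.softmax(teacher_logits / tau, dim=-1),
                  reduction="batchmean") * tau ** 2
    sem = 1.0 - F.cosine_similarity(text_emb, proto_emb, dim=-1).mean()
    return det_loss + lambda_kd * kd + lambda_sem * sem
```

In this reading, the distillation term keeps the stage-two model close to the stage-one (base-class) predictions, while the semantic consistency term pulls text embeddings of each class toward its visual prototype to reduce confusion between similar categories; the actual loss forms and weights used in the paper may differ.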
