Cross-Modal Adaptive Reasoning for Long-Tailed Visual-Linguistic Understanding

Abstract

Vision-language models often struggle in practical applications because real-world data follow long-tailed distributions, leaving many rare visual concepts poorly represented. This imbalance leads to biased feature learning and weak semantic alignment for infrequent categories. To address these challenges, we introduce Cross-Modal Adaptive Reasoning for Long-Tailed Visual-Linguistic Understanding (CARL-VU), a framework that strengthens model robustness and generalization, particularly for rare concepts and complex semantic relations. CARL-VU combines a transformer-based encoder–decoder design with a semantic-guided expert routing mechanism that dynamically selects specialized experts based on input content. It further incorporates contrastive distillation to enhance the distinctiveness of tail-class features and adaptive feature augmentation to enrich data diversity. Through a two-stage training scheme, the model learns to handle a wide range of visual–linguistic inputs more effectively. Experiments on long-tailed benchmarks demonstrate clear improvements over existing approaches, and ablation analyses verify the complementary contributions of each component in alleviating long-tail issues.
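The abstract describes CARL-VU only at a high level. As a reading aid, the sketch below is not the authors' implementation but a minimal illustration, under assumed shapes and hyperparameters, of how two of the named ingredients could look in PyTorch: a semantic-guided expert routing layer that gates specialized experts from a pooled summary of the fused visual-linguistic tokens, and a batch-wise contrastive (InfoNCE-style) distillation loss that aligns student features with teacher features. The class SemanticGuidedExpertRouter, the mean-pooled gating signal, the top-k expert selection, and contrastive_distillation_loss are hypothetical names and design choices introduced here for illustration only.

# Hypothetical sketch of two CARL-VU ingredients mentioned in the abstract.
# All module names, dimensions, and hyperparameters are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class SemanticGuidedExpertRouter(nn.Module):
    """Mixture-of-experts block whose gate is driven by a pooled semantic summary."""

    def __init__(self, d_model: int = 512, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                          nn.Linear(4 * d_model, d_model))
            for _ in range(num_experts)
        ])
        self.gate = nn.Linear(d_model, num_experts)
        self.top_k = top_k

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (batch, seq_len, d_model) fused visual-linguistic features
        semantic_summary = tokens.mean(dim=1)               # (batch, d_model)
        gate_logits = self.gate(semantic_summary)           # (batch, num_experts)
        topk_vals, topk_idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(topk_vals, dim=-1)              # (batch, top_k)
        out = torch.zeros_like(tokens)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e               # samples routed to expert e
                if mask.any():
                    w = weights[mask, slot].view(-1, 1, 1)  # per-sample gate weight
                    out[mask] += w * expert(tokens[mask])
        return out


def contrastive_distillation_loss(student: torch.Tensor,
                                  teacher: torch.Tensor,
                                  temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss: each student feature should match its own teacher feature
    and be pushed apart from the other teacher features in the batch."""
    student = F.normalize(student, dim=-1)                  # (batch, d_model)
    teacher = F.normalize(teacher, dim=-1)
    logits = student @ teacher.t() / temperature            # (batch, batch) similarities
    targets = torch.arange(student.size(0), device=student.device)
    return F.cross_entropy(logits, targets)


if __name__ == "__main__":
    x = torch.randn(8, 32, 512)                             # toy batch of fused tokens
    router = SemanticGuidedExpertRouter()
    routed = router(x)
    loss = contrastive_distillation_loss(routed.mean(dim=1), torch.randn(8, 512))
    print(routed.shape, loss.item())

In such a setup, the distillation loss would typically be weighted more heavily for tail-class samples, and the routing layer would sit inside the transformer decoder blocks; both placements are assumptions here, as the abstract does not specify them.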
