Dual-Modality Feature Blending: Channel-Aware Modeling for Multimodal Integration

Abstract

In this study, we propose CrossFusionTokens (XFT), a novel channel-aware method for integrating visual and linguistic information in multimodal representation learning. Our work is motivated by the growing demand for robust systems that can interpret and reason over both visual and textual data. Tasks such as Visual Question Answering (VQA) and Visual Entailment require precise alignment and fusion of language semantics and visual perception, where traditional approaches such as unimodal concatenation and symmetric cross-attention fall short in maintaining coherence across modalities. Our method introduces a dual cross-attention mechanism that enables bidirectional querying between modalities: visual tokens first query text features, and text tokens then query visual features in the reverse direction. The paired outputs are fused along the channel dimension to form compound representations that encapsulate rich, contextualized information from both inputs. Unlike prior methods that concatenate tokens along the sequence axis, fusion along the channel dimension keeps the token count compact while enriching feature semantics. We validate XFT on three widely used benchmarks, GQA, VQA2.0, and SNLI-VE, where it outperforms several state-of-the-art fusion approaches. Notably, XFT provides a unified pipeline that combines the advantages of co-attention and merged-attention mechanisms without incurring excessive computational cost. This work contributes a scalable and effective solution for vision-language reasoning, paving the way for more general-purpose multimodal understanding systems.
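The abstract does not include an implementation, but the dual cross-attention and channel-dimension fusion it describes can be sketched roughly as follows. This is a minimal, illustrative sketch assuming PyTorch and a shared hidden size for both modalities; the module name DualCrossAttentionFusion, the linear projections, and the choice to pair each modality's original tokens with its cross-attended counterpart are assumptions for illustration, not details taken from the paper.

```python
import torch
import torch.nn as nn


class DualCrossAttentionFusion(nn.Module):
    """Sketch of a dual cross-attention block with channel-wise fusion.

    Visual tokens query the text sequence and text tokens query the visual
    sequence; each modality's original tokens are then concatenated with the
    cross-attended features along the channel (feature) dimension and
    projected back, so the number of tokens per modality is unchanged.
    """

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        # Vision-queries-text and text-queries-vision attention branches.
        self.v2t_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.t2v_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Project the channel-concatenated features (2 * dim) back to dim.
        self.visual_proj = nn.Linear(2 * dim, dim)
        self.text_proj = nn.Linear(2 * dim, dim)

    def forward(self, visual_tokens: torch.Tensor, text_tokens: torch.Tensor):
        # visual_tokens: (B, Nv, dim), text_tokens: (B, Nt, dim)
        attended_text, _ = self.v2t_attn(visual_tokens, text_tokens, text_tokens)
        attended_visual, _ = self.t2v_attn(text_tokens, visual_tokens, visual_tokens)

        # Channel-wise fusion: concatenate along the feature axis rather than
        # the sequence axis, so the token count per modality is preserved.
        fused_visual = self.visual_proj(
            torch.cat([visual_tokens, attended_text], dim=-1))
        fused_text = self.text_proj(
            torch.cat([text_tokens, attended_visual], dim=-1))
        return fused_visual, fused_text


# Example usage with toy shapes.
if __name__ == "__main__":
    block = DualCrossAttentionFusion(dim=768, num_heads=8)
    v = torch.randn(2, 50, 768)  # e.g. 50 visual patch tokens
    t = torch.randn(2, 20, 768)  # e.g. 20 text tokens
    fv, ft = block(v, t)
    print(fv.shape, ft.shape)    # (2, 50, 768) and (2, 20, 768)
```

Compared with sequence-axis concatenation, which would yield Nv + Nt tokens of width dim stacked into a longer sequence fed jointly to later layers, this kind of channel-axis fusion keeps the original token counts while widening (and then re-projecting) each token, which is the compactness argument made in the abstract.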
