SensorySync: Multimodal Integration Framework for Unified Perceptual Understanding
Abstract
Generic text embeddings have achieved considerable success across a wide range of applications. However, they are typically learned from co-occurrence patterns in text-only corpora, which can limit how well they generalize across diverse contexts. In this work, we investigate methods that incorporate visual information into textual representations to overcome this limitation. Guided by extensive ablation studies, we introduce a simple yet effective architecture, the VisualText Fusion Network (VTFN). VTFN not only surpasses existing multimodal approaches on a range of established benchmarks but also achieves state-of-the-art performance on image-related textual datasets while using significantly less training data. Our findings underscore the potential of integrating visual information to substantially improve the robustness and applicability of text embeddings, paving the way for more nuanced and contextually rich semantic representations.
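The abstract does not specify VTFN's internals, so the sketch below is purely illustrative of the general idea of injecting visual information into a text embedding: it assumes pre-computed text and image embeddings and uses a hypothetical gated late-fusion layer (the names SimpleFusion, text_dim, and image_dim are assumptions of ours, not details from the paper).

```python
import torch
import torch.nn as nn


class SimpleFusion(nn.Module):
    """Illustrative late-fusion sketch: project a visual embedding into the
    text embedding space and mix the two with a learned per-dimension gate."""

    def __init__(self, text_dim: int = 768, image_dim: int = 512):
        super().__init__()
        self.image_proj = nn.Linear(image_dim, text_dim)  # align modalities
        self.gate = nn.Sequential(
            nn.Linear(2 * text_dim, text_dim),
            nn.Sigmoid(),
        )

    def forward(self, text_emb: torch.Tensor, image_emb: torch.Tensor) -> torch.Tensor:
        img = self.image_proj(image_emb)
        # Gate decides, per dimension, how much visual signal to blend in.
        g = self.gate(torch.cat([text_emb, img], dim=-1))
        return g * text_emb + (1 - g) * img  # fused, visually grounded embedding


# Toy usage with random stand-ins for encoder outputs.
text_emb = torch.randn(4, 768)   # e.g., output of a sentence encoder
image_emb = torch.randn(4, 512)  # e.g., output of an image encoder
fused = SimpleFusion()(text_emb, image_emb)
print(fused.shape)  # torch.Size([4, 768])
```

A gated sum is only one of several plausible fusion choices (concatenation, cross-attention, or contrastive alignment are common alternatives); it is shown here because it keeps the output in the original text-embedding space, which matches the paper's framing of enhancing text embeddings rather than replacing them.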