Unified Transformer Framework for Integrated Language-Vision Understanding and Content Generation
Abstract
The integration of visual and linguistic reasoning within a unified computational framework remains a fundamental challenge in multimodal artificial intelligence. This work presents Multi-Modal UNIT (MMU), a transformer-based architecture designed to jointly learn from image and text modalities through a single-stream attention mechanism. Unlike traditional dual-encoder or late-fusion approaches, MMU employs lightweight modality adapters that enable fine-grained cross-modal interaction while maintaining architectural efficiency. The model is optimized with a hybrid objective that combines contrastive learning for cross-modal understanding with generative learning for language-conditioned reasoning and caption synthesis. Comprehensive evaluations on standard benchmarks, including COCO, Flickr30k, VQAv2, and NLVR2, demonstrate that MMU achieves 92.4% accuracy and an F1-score of 0.93, while maintaining a compact 210-million-parameter design and an average inference time of 70 milliseconds per sample. These results indicate that MMU provides a scalable and efficient pathway toward general-purpose multimodal intelligence, unifying perception and generation within a single transformer backbone. The complete implementation of this work is publicly available on Zenodo at https://doi.org/10.5281/zenodo.17499887.
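The abstract describes the hybrid objective only at a high level. The sketch below illustrates one plausible way such a loss could be composed, assuming a symmetric InfoNCE contrastive term over pooled image and text embeddings and a token-level cross-entropy generative term for caption synthesis; the function name `hybrid_loss`, the temperature, and the weighting factor `alpha` are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(image_emb, text_emb, caption_logits, caption_targets,
                temperature=0.07, alpha=0.5):
    """Hypothetical combination of contrastive and generative objectives.

    image_emb:       (B, D) pooled image embeddings from the shared backbone
    text_emb:        (B, D) pooled text embeddings from the shared backbone
    caption_logits:  (B, T, V) decoder logits for caption tokens
    caption_targets: (B, T) ground-truth caption token ids (padding = -100)
    """
    # Normalize embeddings and compute the image-text similarity matrix.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature

    # Symmetric InfoNCE: matched image-text pairs lie on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))

    # Token-level cross-entropy for language-conditioned generation.
    generative = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=-100,
    )

    # Weighted sum of the two terms; alpha balances understanding vs. generation.
    return alpha * contrastive + (1 - alpha) * generative
```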