Unified Transformer Framework for Integrated Language-Vision Understanding and Content Generation
Abstract
The integration of visual and linguistic reasoning within a unified computational framework remains a fundamental challenge in multimodal artificial intelligence. This work presents Multi-Modal UNIT (MMU), a transformer-based architecture designed to jointly learn from image and text modalities through a single-stream attention mechanism. Unlike traditional dual-encoder or late-fusion approaches, MMU employs lightweight modality adapters that enable fine-grained cross-modal interaction while maintaining architectural efficiency. The model is optimized with a hybrid objective that combines contrastive learning for cross-modal understanding with generative learning for language-conditioned reasoning and caption synthesis. Comprehensive evaluations on standard benchmarks, including COCO, Flickr30k, VQAv2, and NLVR2, demonstrate that MMU achieves 92.4% accuracy and an F1-score of 0.93, while maintaining a compact 210-million-parameter design and an average inference time of 70 milliseconds per sample. These results indicate that MMU provides a scalable and efficient pathway toward general-purpose multimodal intelligence, unifying perception and generation within a single transformer backbone. The complete implementation of this work is publicly available on Zenodo at https://doi.org/10.5281/zenodo.17499887.
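The abstract describes the hybrid objective only at a high level. The sketch below illustrates one plausible way such a loss could be composed, assuming a symmetric InfoNCE contrastive term over pooled image and text embeddings and a token-level cross-entropy generative term for caption synthesis; the function name `hybrid_loss`, the temperature, and the weighting factor `alpha` are illustrative assumptions and are not taken from the paper.

```python
import torch
import torch.nn.functional as F

def hybrid_loss(image_emb, text_emb, caption_logits, caption_targets,
                temperature=0.07, alpha=0.5):
    """Hypothetical combination of contrastive and generative objectives.

    image_emb:       (B, D) pooled image embeddings from the shared backbone
    text_emb:        (B, D) pooled text embeddings from the shared backbone
    caption_logits:  (B, T, V) decoder logits for caption tokens
    caption_targets: (B, T) ground-truth caption token ids (padding = -100)
    """
    # Normalize embeddings and compute the image-text similarity matrix.
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = image_emb @ text_emb.t() / temperature

    # Symmetric InfoNCE: matched image-text pairs lie on the diagonal.
    labels = torch.arange(logits.size(0), device=logits.device)
    contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                         F.cross_entropy(logits.t(), labels))

    # Token-level cross-entropy for language-conditioned generation.
    generative = F.cross_entropy(
        caption_logits.reshape(-1, caption_logits.size(-1)),
        caption_targets.reshape(-1),
        ignore_index=-100,
    )

    # Weighted sum of the two terms; alpha balances understanding vs. generation.
    return alpha * contrastive + (1 - alpha) * generative
```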