FiT: Feature Integration Transformer with Universal Language Interface for Multi-Task Vision
Abstract
Background: Multi-task learning in computer vision aims to exploit shared representations to handle several tasks effectively. This paper presents FiT, a hybrid CNN-Transformer model equipped with a Universal Language Interface that performs classification, segmentation, image captioning, and grounding simultaneously.
Methods: The design uses a ResNet-50 backbone with a Feature Pyramid Network for multi-scale feature extraction, a cross-task attention mechanism for task interaction, and an LSTM decoder for language-driven outputs. A progressive training strategy with dynamic loss weighting ensures balanced learning across all tasks.
Results: FiT achieves 98.0% accuracy on CIFAR-10, 85.0% mIoU on PASCAL VOC-2012, 82.0% BLEU-4 on Flickr30k, and 89.0% IoU@0.5 on COCO-2017 grounding. The model also reduces parameter count by 65% and inference FLOPs by a factor of 4, while matching or surpassing the performance of baseline models.
Conclusions: FiT is a proof of concept for a unified vision-language framework that combines efficiency, scalability, and strong performance across diverse tasks. The paper thus represents a step towards general-purpose, resource-efficient, multi-modal AI in computer vision.
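The abstract does not specify how the dynamic loss weights are computed, so the following is only a minimal sketch of one plausible scheme: weighting each task inversely to the running average of its recent loss so that no single task dominates the combined objective. The class name DynamicLossWeighter, the combine method, and the window parameter are hypothetical and not taken from the paper.

```python
from collections import deque


class DynamicLossWeighter:
    """Tracks a running average of each task's loss and weights tasks
    inversely to their recent magnitude (assumed scheme, not the paper's)."""

    def __init__(self, tasks, window=100):
        # One bounded history buffer per task.
        self.history = {t: deque(maxlen=window) for t in tasks}

    def combine(self, losses):
        """losses: dict mapping task name -> scalar loss for this step.
        Returns the weighted total loss and the per-task weights used."""
        for task, value in losses.items():
            self.history[task].append(value)
        averages = {t: sum(h) / len(h) for t, h in self.history.items()}
        # Larger recent loss -> smaller weight; normalize weights to sum to 1.
        inverse = {t: 1.0 / max(avg, 1e-8) for t, avg in averages.items()}
        norm = sum(inverse.values())
        weights = {t: v / norm for t, v in inverse.items()}
        total = sum(weights[t] * losses[t] for t in losses)
        return total, weights


# Example step with the four tasks named in the abstract.
weighter = DynamicLossWeighter(["classification", "segmentation", "captioning", "grounding"])
total, weights = weighter.combine(
    {"classification": 0.4, "segmentation": 1.2, "captioning": 2.1, "grounding": 0.9}
)
print(total, weights)
```

In an actual training loop, the returned total would be back-propagated each step; alternatives such as uncertainty-based weighting would fit the same interface.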