FiT: Feature Integration Transformer with Universal Language Interface for Multi-Task Vision
Abstract
Background: Multi-task learning in computer vision aims to exploit shared representations to handle several tasks effectively. This paper presents FiT, a hybrid CNN-Transformer model equipped with a Universal Language Interface that performs classification, segmentation, image captioning, and grounding simultaneously.
Methods: The design uses a ResNet-50 backbone with a Feature Pyramid Network for multi-scale feature extraction, a cross-task attention mechanism for task interaction, and an LSTM decoder for language-driven outputs. A progressive training strategy with dynamic loss weighting ensures balanced learning across all tasks.
Results: FiT achieves 98.0% accuracy on CIFAR-10, 85.0% mIoU on PASCAL VOC-2012, 82.0% BLEU-4 on Flickr30k, and 89.0% IoU@0.5 on COCO-2017 grounding. The model also reduces parameter count by 65% and inference FLOPs by a factor of 4, while matching or surpassing the performance of baseline models.
Conclusions: FiT is a proof of concept for a unified vision-language framework that combines efficiency, scalability, and strong performance across diverse tasks. The paper thus represents a step towards general-purpose, resource-efficient, multi-modal AI in computer vision.
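The abstract does not specify how the dynamic loss weights are computed, so the following is only a minimal sketch of one plausible scheme: weighting each task inversely to the running average of its recent loss so that no single task dominates the combined objective. The class name DynamicLossWeighter, the combine method, and the window parameter are hypothetical and not taken from the paper.

```python
from collections import deque


class DynamicLossWeighter:
    """Tracks a running average of each task's loss and weights tasks
    inversely to their recent magnitude (assumed scheme, not the paper's)."""

    def __init__(self, tasks, window=100):
        # One bounded history buffer per task.
        self.history = {t: deque(maxlen=window) for t in tasks}

    def combine(self, losses):
        """losses: dict mapping task name -> scalar loss for this step.
        Returns the weighted total loss and the per-task weights used."""
        for task, value in losses.items():
            self.history[task].append(value)
        averages = {t: sum(h) / len(h) for t, h in self.history.items()}
        # Larger recent loss -> smaller weight; normalize weights to sum to 1.
        inverse = {t: 1.0 / max(avg, 1e-8) for t, avg in averages.items()}
        norm = sum(inverse.values())
        weights = {t: v / norm for t, v in inverse.items()}
        total = sum(weights[t] * losses[t] for t in losses)
        return total, weights


# Example step with the four tasks named in the abstract.
weighter = DynamicLossWeighter(["classification", "segmentation", "captioning", "grounding"])
total, weights = weighter.combine(
    {"classification": 0.4, "segmentation": 1.2, "captioning": 2.1, "grounding": 0.9}
)
print(total, weights)
```

In an actual training loop, the returned total would be back-propagated each step; alternatives such as uncertainty-based weighting would fit the same interface.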