FiT: Feature Integration Transformer with Universal Language Interface for Multi-Task Vision

Abstract

Background: Multi-task learning in computer vision aims to leverage commonalities in representation to handle different tasks effectively. The paper presents FiT, a combined CNN-Transformer model equipped with a Universal Language Interface that can perform classification, segmentation, image captioning, and grounding simultaneously. Methods: The design utilizes a ResNet-50 backbone along with a Feature Pyramid Network for multi-scale feature extraction, a cross-task attention mechanism for task interaction, and an LSTM decoder for language-driven outputs. A progressive training strategy with dynamic loss weighting was implemented to ensure balanced learning across all tasks. Results: FiT achieved 98.0% accuracy on CIFAR-10, 85.0% mIoU on PASCAL VOC-2012, 82.0% BLEU-4 on Flickr30k, and 89.0% IoU@0.5 on COCO-2017 grounding. In addition, the model reduced the number of parameters by 65% and inference FLOPs by a factor of 4 while matching or surpassing the performance of baseline models. Conclusions: FiT serves as a proof of concept for a single vision-language framework that offers efficiency, scalability, and strong performance across diverse tasks, and thus constitutes a step towards general-purpose, resource-efficient, multi-modal AI in computer vision.
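To make the abstract's architecture concrete, the following is a minimal, hypothetical sketch of the core idea: per-task query tokens attend to shared image features via cross-task attention, and an LSTM decoder produces language outputs. The abstract does not specify implementation details, so everything here is an illustrative assumption: `TinyBackbone` is a small convolutional stand-in for the ResNet-50 + FPN backbone, the task-query design, teacher-forced decoding, and all hyperparameters are chosen for readability, and the segmentation and grounding heads are omitted.

```python
import torch
import torch.nn as nn


class TinyBackbone(nn.Module):
    """Stand-in for the ResNet-50 + FPN backbone; returns a flattened token sequence."""

    def __init__(self, d_model=256):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=4, padding=1), nn.ReLU(),
            nn.Conv2d(64, d_model, 3, stride=2, padding=1), nn.ReLU(),
        )

    def forward(self, x):
        feats = self.convs(x)                      # (B, C, H', W')
        return feats.flatten(2).transpose(1, 2)    # (B, N, C) image tokens


class FiTSketch(nn.Module):
    """Cross-task attention over image tokens plus an LSTM decoder for language outputs."""

    def __init__(self, vocab_size=1000, d_model=256, num_classes=10, num_tasks=4):
        super().__init__()
        self.backbone = TinyBackbone(d_model)
        # One learnable query per task: classification, segmentation, captioning, grounding
        self.task_queries = nn.Parameter(torch.randn(num_tasks, d_model))
        self.cross_attn = nn.MultiheadAttention(d_model, num_heads=8, batch_first=True)
        self.cls_head = nn.Linear(d_model, num_classes)       # classification head
        self.embed = nn.Embedding(vocab_size, d_model)         # language token embedding
        self.decoder = nn.LSTM(d_model, d_model, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)           # next-token logits

    def forward(self, images, caption_tokens):
        tokens = self.backbone(images)                                 # (B, N, d)
        queries = self.task_queries.unsqueeze(0).expand(images.size(0), -1, -1)
        task_feats, _ = self.cross_attn(queries, tokens, tokens)       # (B, T, d)
        cls_logits = self.cls_head(task_feats[:, 0])                   # task 0: classification
        # Teacher-forced caption decoding, conditioned on the captioning task feature
        emb = self.embed(caption_tokens) + task_feats[:, 2:3]
        dec_out, _ = self.decoder(emb)
        caption_logits = self.lm_head(dec_out)
        return cls_logits, caption_logits


# Example forward pass on dummy data
model = FiTSketch()
images = torch.randn(2, 3, 64, 64)
captions = torch.randint(0, 1000, (2, 12))
cls_logits, caption_logits = model(images, captions)
print(cls_logits.shape, caption_logits.shape)   # (2, 10) (2, 12, 1000)
```

In this reading, the per-task losses produced by the different heads would then be combined with the dynamic loss weights mentioned in the abstract; the exact weighting rule is not given there, so it is left out of the sketch.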
