Zero-Shot Image Super-Resolution Using Prompt-Driven Vision-Language Foundation Models Without Task-Specific Fine-Tuning
Abstract
This paper proposes a new direction for image super-resolution (SR): a prompt-guided, zero-shot framework that combines the semantic understanding of Vision-Language Foundation Models (VLFMs) with generative diffusion backbones. Conventional SR models typically require supervised training on paired low-resolution and high-resolution images, which limits their adaptability to real-world degradations and unseen image distributions. The proposed approach removes the need for paired data by casting enhancement as a generation process conditioned on descriptive natural-language prompts. A VLFM such as BLIP produces rich cross-modal representations that relate the low-resolution input image to arbitrary text; these embeddings guide a diffusion model such as Stable Diffusion to reconstruct high-quality images through a sequence of denoising steps that preserve semantic alignment and structural integrity. The framework supports both static and dynamic prompt-engineering strategies to accommodate diverse image contexts and user intentions. Generalization to synthetic and real-world degradations is evaluated on benchmark datasets such as DIV2K and RealSR, using quantitative metrics (PSNR, SSIM, LPIPS, FID, and NIQE) to measure reconstruction fidelity and human-centered assessments to gauge perceptual realism. The results indicate that this prompt-based, zero-shot pipeline is competitive with, or better than, conventional supervised and unsupervised baselines, particularly when no task-specific training data are available. The study points toward user-controllable, fine-tuning-free super-resolution built on foundation models.
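As a concrete illustration of this kind of pipeline, the sketch below chains an off-the-shelf BLIP captioner with a pretrained text-guided Stable Diffusion upscaler using the Hugging Face transformers and diffusers libraries. This is a minimal sketch of the general prompt-guided, zero-shot idea, not the paper's exact method; the specific checkpoints, the file paths, and the prompt template are illustrative assumptions.

```python
# Minimal sketch of a prompt-guided, zero-shot SR pipeline (not the paper's exact method).
# Assumption: a BLIP caption of the low-resolution image serves as the semantic prompt
# that conditions a pretrained text-guided diffusion upscaler.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration
from diffusers import StableDiffusionUpscalePipeline

device = "cuda" if torch.cuda.is_available() else "cpu"

# 1. Caption the low-resolution input with a VLFM (BLIP) to obtain a semantic description.
blip_processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip_model = BlipForConditionalGeneration.from_pretrained(
    "Salesforce/blip-image-captioning-base"
).to(device)

low_res = Image.open("input_lr.png").convert("RGB")  # hypothetical input path
inputs = blip_processor(images=low_res, return_tensors="pt").to(device)
caption_ids = blip_model.generate(**inputs, max_new_tokens=40)
caption = blip_processor.decode(caption_ids[0], skip_special_tokens=True)

# 2. Build a prompt from the caption (a simple static template; dynamic templates
#    could additionally encode user intent or scene-specific cues).
prompt = f"a sharp, highly detailed photograph of {caption}"

# 3. Condition a pretrained diffusion upscaler on the prompt and the LR image.
upscaler = StableDiffusionUpscalePipeline.from_pretrained(
    "stabilityai/stable-diffusion-x4-upscaler",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

result = upscaler(prompt=prompt, image=low_res, num_inference_steps=50).images[0]
result.save("output_sr.png")
```

In practice the low-resolution image may need to be resized to the working resolution the upscaler expects, and richer prompt engineering or iterative refinement would be layered on top of this skeleton.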