Towards Sustainable Image Synthesis: A Comprehensive Review of Text-to-Image Generation Models
Abstract
Text-to-image generation is a rapidly evolving frontier in artificial intelligence, enabling the transformation of natural language descriptions into visually coherent and semantically rich images. This paper presents a comprehensive review of state-of-the-art generative models, including Generative Adversarial Networks (GANs), Variational Autoencoders (VAEs), and Diffusion Models, focusing on their capacity to produce high-fidelity, contextually accurate images from textual inputs. We also analyse leading text-to-image frameworks such as DALL-E 2, Stable Diffusion, Imagen, and MidJourney, assessing their advances in image quality, semantic alignment, diversity, and computational efficiency. Our systematic evaluation highlights significant progress in generating realistic, high-resolution images while identifying persistent challenges in semantic consistency, fine-grained control, ethical considerations, and substantial computational demands. We further discuss the trade-offs between model performance and sustainability, and outline future research directions aimed at developing more efficient, fair, and environmentally responsible text-to-image generation systems. This survey is intended as a guiding resource for the next generation of sustainable, AI-driven text-to-image synthesis technologies.