LexiAlign: A Diffusion Model Text Alignment and Refinement Method Based on Local Regeneration

Abstract

Despite recent advances in diffusion models for high-quality image synthesis, generating visually accurate textual content remains a persistent bottleneck, commonly manifesting as misspellings, glyph distortions, and semantic drift. We introduce \textbf{LexiAlign}, a \emph{language-guided local diffusion refinement} framework that directly targets these failures through three tightly coupled components: robust \emph{optical character recognition} (OCR)-based text extraction, \emph{language model}-driven semantic correction, and high-fidelity local inpainting via masked diffusion. Unlike prior approaches that retrain large diffusion backbones or overwrite entire regions, LexiAlign performs \emph{character-level targeted repair} while preserving the surrounding visual context and style. To support systematic training and evaluation, we construct \textbf{SynOCRText}, a 120k-sample benchmark covering 8 languages, more than 20 fonts, diverse layouts, and fine-grained error masks. On SynOCRText, LexiAlign achieves \textbf{88.4\% OCR accuracy} (+6.3\% over the best baseline), a Contrastive Language-Image Pretraining (CLIP) score of 0.852 (+0.023), a Peak Signal-to-Noise Ratio (PSNR) of 30.92\,dB (+2.51\,dB), and a Structural Similarity Index Measure (SSIM) of 0.893 (+0.020). These results establish LexiAlign as a \emph{plug-and-play, domain-agnostic} solution for reliable visual text alignment, offering both \emph{quantitative superiority} and \emph{practical deployability} in creative design, advertising, and multimodal content generation.
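
The abstract describes a three-stage pipeline: OCR-based extraction of rendered text, language-model correction of the extracted text, and masked diffusion inpainting restricted to the erroneous regions. The sketch below illustrates how such a pipeline could be wired together; the library choices (pytesseract for OCR, diffusers' StableDiffusionInpaintPipeline for inpainting) and all function names are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of an OCR -> correction -> masked-inpainting loop.
# Library choices and helper names are assumptions, not LexiAlign's code.
from PIL import Image, ImageDraw
import pytesseract
from diffusers import StableDiffusionInpaintPipeline


def extract_text_regions(image: Image.Image):
    """Stage 1: OCR extraction of rendered text with bounding boxes."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    regions = []
    for text, x, y, w, h, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    ):
        if text.strip() and float(conf) > 0:
            regions.append({"text": text, "box": (x, y, x + w, y + h)})
    return regions


def correct_text(ocr_text: str, intended_text: str) -> str:
    """Stage 2: placeholder for language-model-driven semantic correction.
    A real system would reconcile the OCR output with the intended wording
    and return a character-level corrected string."""
    return intended_text  # assumption: the intended text is known from the prompt


def refine(image: Image.Image, intended_text: str,
           pipe: StableDiffusionInpaintPipeline) -> Image.Image:
    """Stage 3: masked diffusion inpainting applied only where text is wrong."""
    for region in extract_text_regions(image):
        corrected = correct_text(region["text"], intended_text)
        if corrected == region["text"]:
            continue  # glyphs already match; leave the region untouched
        mask = Image.new("L", image.size, 0)
        ImageDraw.Draw(mask).rectangle(region["box"], fill=255)
        image = pipe(
            prompt=f'clean, legible text reading "{corrected}", matching the surrounding style',
            image=image,
            mask_image=mask,
        ).images[0]
    return image
```

The key design point mirrored here is locality: only the masked text regions are regenerated, so the surrounding visual context is passed through unchanged rather than being re-synthesized.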
