LexiAlign: A Diffusion Model Text Alignment and Refinement Method Based on Local Regeneration

Abstract

Despite recent advances in diffusion models for high-quality image synthesis, generating visually accurate textual content remains a persistent bottleneck, commonly manifesting as misspellings, glyph distortions, and semantic drift. We introduce \textbf{LexiAlign}, a \emph{language-guided local diffusion refinement} framework that directly targets these failures through three tightly coupled components: robust \emph{optical character recognition} (OCR)-based text extraction, \emph{language model}-driven semantic correction, and high-fidelity local inpainting via masked diffusion. Unlike prior approaches that retrain large diffusion backbones or overwrite entire regions, LexiAlign performs \emph{character-level targeted repair} while preserving the surrounding visual context and style. To support systematic training and evaluation, we construct \textbf{SynOCRText}, a 120k-sample benchmark covering 8 languages, more than 20 fonts, diverse layouts, and fine-grained error masks. On SynOCRText, LexiAlign achieves \textbf{88.4\% OCR accuracy} (+6.3\% over the best baseline), a Contrastive Language-Image Pretraining (CLIP) score of 0.852 (+0.023), a Peak Signal-to-Noise Ratio (PSNR) of 30.92\,dB (+2.51\,dB), and a Structural Similarity Index Measure (SSIM) of 0.893 (+0.020). These results establish LexiAlign as a \emph{plug-and-play, domain-agnostic} solution for reliable visual text alignment, offering both \emph{quantitative superiority} and \emph{practical deployability} in creative design, advertising, and multimodal content generation.
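
The abstract describes a three-stage pipeline: OCR-based extraction of rendered text, language-model correction of the extracted text, and masked diffusion inpainting restricted to the erroneous regions. The sketch below illustrates how such a pipeline could be wired together; the library choices (pytesseract for OCR, diffusers' StableDiffusionInpaintPipeline for inpainting) and all function names are assumptions for illustration, not the paper's actual implementation.

```python
# Illustrative sketch of an OCR -> correction -> masked-inpainting loop.
# Library choices and helper names are assumptions, not LexiAlign's code.
from PIL import Image, ImageDraw
import pytesseract
from diffusers import StableDiffusionInpaintPipeline


def extract_text_regions(image: Image.Image):
    """Stage 1: OCR extraction of rendered text with bounding boxes."""
    data = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)
    regions = []
    for text, x, y, w, h, conf in zip(
        data["text"], data["left"], data["top"],
        data["width"], data["height"], data["conf"],
    ):
        if text.strip() and float(conf) > 0:
            regions.append({"text": text, "box": (x, y, x + w, y + h)})
    return regions


def correct_text(ocr_text: str, intended_text: str) -> str:
    """Stage 2: placeholder for language-model-driven semantic correction.
    A real system would reconcile the OCR output with the intended wording
    and return a character-level corrected string."""
    return intended_text  # assumption: the intended text is known from the prompt


def refine(image: Image.Image, intended_text: str,
           pipe: StableDiffusionInpaintPipeline) -> Image.Image:
    """Stage 3: masked diffusion inpainting applied only where text is wrong."""
    for region in extract_text_regions(image):
        corrected = correct_text(region["text"], intended_text)
        if corrected == region["text"]:
            continue  # glyphs already match; leave the region untouched
        mask = Image.new("L", image.size, 0)
        ImageDraw.Draw(mask).rectangle(region["box"], fill=255)
        image = pipe(
            prompt=f'clean, legible text reading "{corrected}", matching the surrounding style',
            image=image,
            mask_image=mask,
        ).images[0]
    return image
```

The key design point mirrored here is locality: only the masked text regions are regenerated, so the surrounding visual context is passed through unchanged rather than being re-synthesized.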
