Multi-Scale Mixture-of-Experts ControlNet for Real-World Movie Scene Image Super-Resolution
Abstract
Image super-resolution (SR) plays a critical role in enhancing visual quality and improving the performance of downstream vision tasks. However, existing SR methods are predominantly trained and evaluated on small-scale, standardized datasets, which limits their generalization and robustness in complex real-world scenarios. As a representative application domain, movie scenes exhibit high structural complexity and visual diversity, often containing special effects, filters, and other non-natural elements that pose additional challenges for SR models. With the rapid development of the film industry and computer vision, a vast amount of high-quality imagery has become available on the web, offering rich external priors that can potentially enhance SR performance. Motivated by this, we propose a novel reference-based SR framework for movie scenes, termed Multi-Scale Mixture-of-Experts ControlNet (MMoEControl). Our approach first retrieves semantically or structurally similar high-quality images from web-scale data based on features extracted from the low-resolution (LR) input, forming a reference image set. We then design a Multi-Scale Mixture-of-Experts (MMoE) framework built upon an improved ControlNet architecture, which injects multi-scale reference information into a frozen pre-trained diffusion model to guide the generation of high-resolution (HR) outputs. The core contributions of our method are an SR-guided reference image retrieval module and a multi-scale conditional ControlNet, which jointly integrate structural and textural cues from the references while leveraging diffusion priors to mitigate the limitations of standard training datasets. Compared to conventional “blind” SR methods that operate without external guidance, MMoEControl explicitly “copies” beneficial features from relevant reference images, significantly improving structural fidelity and detail reconstruction.
Experimental results demonstrate that our approach consistently outperforms existing methods on various real-world movie scene datasets, highlighting its strong generalization ability and practical value.
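To make the two-stage idea concrete — retrieve similar high-quality references for an LR input, then fuse multi-scale reference cues with softmax-gated experts — here is a minimal sketch in NumPy. Everything in it is an illustrative assumption: the random-projection encoder stands in for a learned feature extractor, and the function names (`retrieve_references`, `moe_fuse`) are hypothetical, not the paper's actual modules.

```python
import numpy as np

def extract_features(img, dim=64, seed=0):
    # Stand-in for a learned encoder: random projection of flattened pixels,
    # L2-normalized so dot products behave like cosine similarity.
    rng = np.random.default_rng(seed)
    proj = rng.standard_normal((img.size, dim))
    f = img.ravel() @ proj
    return f / (np.linalg.norm(f) + 1e-8)

def retrieve_references(lr_img, database, k=3):
    """Top-k cosine-similarity retrieval of reference images for an LR input."""
    q = extract_features(lr_img)
    sims = [float(q @ extract_features(ref)) for ref in database]
    order = np.argsort(sims)[::-1][:k]
    return [database[i] for i in order], [sims[i] for i in order]

def moe_fuse(scale_features, gate_logits):
    """Softmax-gated mixture over per-scale expert features (all same shape),
    analogous to weighting reference information at multiple scales."""
    w = np.exp(gate_logits - np.max(gate_logits))
    w = w / w.sum()
    return sum(wi * f for wi, f in zip(w, scale_features))
```

In the actual framework, the fused reference features would be injected through ControlNet-style conditioning branches into a frozen diffusion backbone; this sketch only illustrates the retrieval-then-gated-fusion control flow.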