SCAT: The Self-Correcting Aesthetic Transformer for Explainable Facial Beauty Prediction

Abstract

Modeling human aesthetic perception is a fundamental challenge in computer vision. While deep learning has significantly advanced Facial Beauty Prediction (FBP), state-of-the-art models suffer from two critical, interlinked limitations: a performance plateau, with Pearson Correlation (PC) coefficients seldom exceeding 0.90, and a "black box" nature that offers no insight into their reasoning. We posit that these limitations stem from a failure to emulate the hierarchical, part-based reasoning inherent to human aesthetic judgment. In this work, we propose the Self-Correcting Aesthetic Transformer (SCAT), a novel, explainable-by-design framework that overcomes these challenges. SCAT introduces a two-stage architecture featuring a Semantic Parser to disentangle the face into explicit part embeddings (e.g., eyes, mouth) and a Corrector Aggregator to reason about their harmonious interplay. The model is trained with a novel self-correcting loss that enforces internal consistency between its part-based and holistic evaluations. To facilitate this, we present FBP5500-Subscores, a large-scale dataset with granular part-level aesthetic annotations. Extensive experiments demonstrate that SCAT achieves a new state-of-the-art Pearson Correlation of 0.935, thereby breaking the long-standing performance barrier, while simultaneously providing transparent, human-intelligible predictions. Our work bridges the critical gap between predictive power and interpretability in FBP and suggests a structured reasoning paradigm for other subjective visual assessment tasks.
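To make the self-correcting objective described in the abstract concrete, the following PyTorch-style snippet is a minimal, hypothetical sketch of a loss that enforces consistency between part-based and holistic evaluations. The class name SelfCorrectingLoss, the mean aggregation of part sub-scores, and the weighting coefficients are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn


class SelfCorrectingLoss(nn.Module):
    """Hypothetical sketch of a self-correcting training objective.

    Combines (i) a supervised regression term on the holistic beauty
    score, (ii) a supervised term on per-part sub-scores (as provided by
    a dataset like FBP5500-Subscores), and (iii) a consistency term that
    penalizes disagreement between the holistic prediction and an
    aggregate of the part-level predictions. The lambda weights and the
    mean aggregation rule are assumptions for illustration only.
    """

    def __init__(self, lambda_part: float = 1.0, lambda_consist: float = 0.5):
        super().__init__()
        self.lambda_part = lambda_part
        self.lambda_consist = lambda_consist
        self.mse = nn.MSELoss()

    def forward(
        self,
        holistic_pred: torch.Tensor,  # (B,)   predicted overall score
        part_preds: torch.Tensor,     # (B, P) predicted sub-scores (eyes, mouth, ...)
        holistic_gt: torch.Tensor,    # (B,)   ground-truth overall score
        part_gt: torch.Tensor,        # (B, P) ground-truth sub-scores
    ) -> torch.Tensor:
        holistic_loss = self.mse(holistic_pred, holistic_gt)
        part_loss = self.mse(part_preds, part_gt)
        # Internal consistency: the holistic score should agree with the
        # (here, mean-)aggregated part scores.
        consist_loss = self.mse(holistic_pred, part_preds.mean(dim=1))
        return (
            holistic_loss
            + self.lambda_part * part_loss
            + self.lambda_consist * consist_loss
        )
```

In the paper's framework the Corrector Aggregator presumably learns a richer aggregation than the simple mean used here; the sketch is only meant to convey the structure of a consistency term linking part-level and holistic predictions.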
