Facial Expression Recognition in Anime and Manga Characters: A Comparative Study of Vision Transformers and Convolutional Neural Networks

Read the full article

Discuss this preprint

Start a discussion What are Sciety discussions?

Listed in

This article is not in any list yet, why not save it to one of your lists.
Log in to save this article

Abstract

Facial expression recognition (FER) is a well-established task in computer vision, yet its application to non-photorealistic domains, such as anime and manga, remains largely underexplored. The stylized, exaggerated, and often non-proportional facial features of illustrated characters present unique challenges for deep learning models trained predominantly on realistic imagery. In this work, we construct a balanced dataset of 3,000 manga and anime face images spanning six emotion categories (Angry, Embarrassed, Happy, Psycho-Crazy, Sad, Scared) and conduct a systematic comparison of two major deep learning paradigms: Convolutional Neural Networks (CNNs) and Vision Transformers (ViTs). Specifically, we evaluate ResNet-18, ResNet-50, ViT-B/16, and ViT-S/16 under four fine-tuning strategies: linear probing, partial fine-tuning, full fine-tuning, and progressive unfreezing; enabling a controlled comparison of both architectural families and transfer learning depth. Our results show that fine-tuning strategy significantly impacts performance: the best configuration (ViT-B/16 with progressive unfreezing) achieves 80.89% test accuracy, compared to 61.33% for the weakest linear probe baseline (ViT-S/16), a gap of 19.56 percentage points. Vision Transformers benefit disproportionately from fine-tuning, and the relative ranking of architectures changes across fine-tuning regimes. Confusion matrix analysis reveals persistent cross-class confusion between visually similar emotions (e.g., Happy vs. Embarrassed), while highly distinctive categories such as Psycho-Crazy are consistently well recognized across all architectures.

Article activity feed