DNet: A depression recognition network combining residual network and vision transformer
Abstract
Depression is a prevalent and severe global mental disorder, yet its diagnosis and treatment face numerous challenges. This study introduces an innovative depression recognition network, termed DNet. Our approach uses full facial images and local facial images as its key sources of data. Because facial expressions of individuals with depression at varying severity levels share similar latent facial features while differing subtly across multiple facial regions, achieving higher recognition accuracy requires a method that fuses high-level semantic features from local and global information. Therefore, we propose DNet, which comprises two key components: the Feature Extraction Module (FEM) and the Vision Transformer (ViT) Block. Specifically, the FEM introduces an attention mechanism that considers both the channel and positional information of the feature map. Two FEMs separately process the facial and local facial images, extracting critical features to produce feature maps rich in semantic information. The feature maps of both images are then concatenated along the channel dimension, and the ViT Block comprehensively learns high-level semantic features that relate local and global information across different facial expression regions. Finally, a 1×1 convolution layer and a fully connected layer adjust the feature channels, yielding more robust predictions and ultimately outputting the depression prediction score. We experimentally validate DNet on the AVEC2014 dataset and our self-constructed CZ2023 dataset, obtaining MAE = 6.27, RMSE = 7.96 and MAE = 7.46, RMSE = 9.15, respectively. These results confirm the effectiveness of the proposed method.
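The abstract describes a two-branch pipeline: FEM feature extraction with channel and positional attention for each input, channel-wise fusion, a ViT Block over the fused features, and a 1×1 convolution plus fully connected layer for score regression. The PyTorch sketch below illustrates this data flow only; the attention design, layer sizes, patch/token formation, and pooling are assumptions for illustration, not the authors' exact implementation.

```python
# Minimal sketch of the DNet pipeline described in the abstract (assumed details).
import torch
import torch.nn as nn


class ChannelPositionAttention(nn.Module):
    """Assumed attention weighting both channels and spatial positions."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        # Channel attention: squeeze-and-excitation style gating.
        self.channel_gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )
        # Positional (spatial) attention: single-channel gating map.
        self.spatial_gate = nn.Sequential(
            nn.Conv2d(channels, 1, kernel_size=7, padding=3),
            nn.Sigmoid(),
        )

    def forward(self, x):
        x = x * self.channel_gate(x)
        return x * self.spatial_gate(x)


class FEM(nn.Module):
    """Feature Extraction Module: convolutional stem + attention (assumed layout)."""
    def __init__(self, in_ch=3, out_ch=256):
        super().__init__()
        self.stem = nn.Sequential(
            nn.Conv2d(in_ch, 64, 7, stride=2, padding=3),
            nn.BatchNorm2d(64), nn.ReLU(inplace=True),
            nn.MaxPool2d(3, stride=2, padding=1),
            nn.Conv2d(64, out_ch, 3, stride=2, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
        )
        self.attn = ChannelPositionAttention(out_ch)

    def forward(self, x):
        return self.attn(self.stem(x))


class DNet(nn.Module):
    def __init__(self, feat_ch=256, embed_dim=512, depth=4, heads=8):
        super().__init__()
        self.fem_global = FEM(out_ch=feat_ch)   # full-face branch
        self.fem_local = FEM(out_ch=feat_ch)    # local facial-region branch
        self.proj = nn.Conv2d(2 * feat_ch, embed_dim, kernel_size=1)
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=heads, batch_first=True)
        self.vit_block = nn.TransformerEncoder(encoder_layer, num_layers=depth)
        self.channel_adjust = nn.Conv2d(embed_dim, 64, kernel_size=1)  # 1x1 conv
        self.head = nn.Linear(64, 1)  # regresses the depression score

    def forward(self, face, local_face):
        g = self.fem_global(face)             # (B, C, H, W)
        l = self.fem_local(local_face)        # (B, C, H, W)
        fused = torch.cat([g, l], dim=1)      # channel-wise concatenation
        tokens = self.proj(fused)             # (B, D, H, W)
        b, d, h, w = tokens.shape
        seq = tokens.flatten(2).transpose(1, 2)        # (B, H*W, D) token sequence
        seq = self.vit_block(seq)                      # mix local/global information
        feat = seq.transpose(1, 2).reshape(b, d, h, w)
        feat = self.channel_adjust(feat).mean(dim=(2, 3))  # 1x1 conv + global pooling
        return self.head(feat).squeeze(-1)             # predicted depression score


if __name__ == "__main__":
    model = DNet()
    face = torch.randn(2, 3, 224, 224)
    local = torch.randn(2, 3, 224, 224)
    print(model(face, local).shape)  # torch.Size([2])
```

In this sketch the fused feature map is flattened into tokens so a standard transformer encoder can stand in for the ViT Block; the global-average pooling before the fully connected head is likewise an assumption made to keep the example self-contained.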