Geo-TCAM: A Thangka Captioning Method Integrating Topic Modeling with Geometry-Guided Spatial Attention

Abstract

Thangka image captioning, an essential task in cultural heritage preservation, faces challenges due to the complexity of Thangka imagery and the depth of its semantic content. Current deep learning-based methods struggle to extract detailed features and to accurately understand the semantics of Thangka images, often producing incomplete or incorrect captions of key elements such as the main deity and the background. To address these challenges, this paper introduces a novel Thangka captioning model that integrates topic modeling and geometry-guided spatial attention (Geo-TCAM). The model employs a multi-level feature integration strategy to enhance the extraction of fine-grained features such as gestures and objects. By incorporating Latent Dirichlet Allocation (LDA) topic weights into the visual features (TIF), it leverages external domain knowledge for better semantic understanding. Geo-TCAM's geometry-guided facial spatial attention (GFSA) module improves recognition of spatial layout. Experimental results demonstrate significant improvements in captioning performance, with BLEU-1, BLEU-4, METEOR, and CIDEr scores increasing by 11.9%, 17.1%, 9.7%, and 119.5%, respectively, compared to baseline models. On the COCO public dataset, the Geo-TCAM model also demonstrates strong performance, comparable to that of other state-of-the-art models. The Geo-TCAM model thus generates accurate captions for Thangka images, facilitating the digital preservation and dissemination of cultural heritage.
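The abstract describes two fusion steps: injecting LDA topic weights into visual features (TIF), and biasing spatial attention with a geometric prior (GFSA). The following is a minimal NumPy sketch of those two ideas, not the paper's implementation: the projection matrix `W_t`, the Gaussian geometry prior, and the mean-based saliency score are all illustrative assumptions standing in for the model's learned components.

```python
import numpy as np

def geometry_prior(h, w, center, sigma=2.0):
    # Illustrative geometric prior: a Gaussian centered on a detected face/deity
    # location (the paper's GFSA module uses facial geometry; the Gaussian form
    # here is an assumption, not the published formulation).
    ys, xs = np.mgrid[0:h, 0:w]
    d2 = (ys - center[0]) ** 2 + (xs - center[1]) ** 2
    prior = np.exp(-d2 / (2 * sigma ** 2))
    return prior / prior.sum()          # normalize to a probability map

def fuse_topic_visual(visual, topic_weights, W_t):
    # visual: (H*W, D) grid of visual features; topic_weights: (K,) LDA
    # document-topic vector; W_t: (K, D) hypothetical projection that maps
    # topic space into the visual feature space. Additive fusion is a sketch
    # of the topic-integrated features (TIF).
    topic_feat = topic_weights @ W_t    # (D,)
    return visual + topic_feat          # broadcast over all spatial locations

def spatial_attention(features, prior):
    # Score each location, bias the scores by the geometry prior in log space,
    # softmax-normalize, and pool into one attended context vector.
    scores = features.mean(axis=1) + np.log(prior.ravel() + 1e-8)
    e = np.exp(scores - scores.max())
    attn = e / e.sum()                  # (H*W,) attention weights, sum to 1
    return attn @ features              # (D,) context vector for the decoder

# Toy end-to-end run with random stand-ins for CNN features and LDA output.
rng = np.random.default_rng(0)
H, W, D, K = 8, 8, 16, 5
visual = rng.normal(size=(H * W, D))
topics = rng.dirichlet(np.ones(K))      # stand-in for LDA inference output
W_t = rng.normal(size=(K, D))

fused = fuse_topic_visual(visual, topics, W_t)
prior = geometry_prior(H, W, center=(3, 4))
context = spatial_attention(fused, prior)
print(context.shape)  # (16,)
```

In a full captioning pipeline the context vector would condition a recurrent or transformer decoder at each step; here it simply demonstrates how topic weights and a spatial prior can jointly reshape which image regions dominate the pooled representation.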