Character Semantic-Phonetic Structure Enhance Language Models in Classical Chinese
Abstract
Writing systems, as fundamental media for preserving human knowledge and facilitating sociocultural interaction, embody the cognitive characteristics of diverse cultures. Despite the rapid advancement of artificial intelligence and language modeling, mainstream language models remain focused on word-level representations and neglect the rich semantic and structural information embedded within individual characters. In this work, we integrate semantic-phonetic structural information into language modeling, using Classical Chinese as a representative writing system. Our semantic-phonetic-aware language model achieves significant performance gains over the baseline on two core Classical Chinese processing tasks, Word Segmentation and Part-of-Speech Tagging, indicating that internal character structure plays a critical role in enhancing language models' representation and encoding capabilities. Through extensive evaluation and in-depth analysis, we show that the improvement originates from the synergistic interaction between semantic and phonetic components, as well as from the sequential organization of the semantic-phonetic structure. We further argue that striking the right balance between these components is crucial for enhancing the expressiveness and generalizability of language representations. Overall, our findings highlight that the internal organization of characters not only reflects deeper structural principles of the writing system but also opens up a new paradigm for language model design.