Advancing Chinese Survey Document Retrieval: Multilingual Models and Semantic Understanding for Structured Information Extraction

Yixin Xia


Abstract

The rapid digitization of archival documents in China has produced a large volume of survey documents, which calls for efficient retrieval systems for both research and administrative use. This study explores approaches to Chinese survey document retrieval that combine multilingual pre-trained models, semantic understanding techniques, and structured data extraction. It focuses on challenges specific to Chinese survey documents, such as complex character semantics, idiomatic phrases, and hierarchical content structures. By integrating natural language processing (NLP), optical character recognition (OCR), and retrieval models, the study seeks to make documents more accessible, improve search accuracy, and support information retrieval across diverse Chinese survey datasets.

Version published to 10.21203/rs.3.rs-5603829/v1 on Research Square on Dec 10, 2024

