A Machine Learning Approach for Nominative Record Linkage in Chinese Historical Databases

Bruce Yu
Yueran Hou
Yibei Wu
Cameron Campbell

Abstract

We introduce a generic machine learning-based pipeline for nominative linkage of records within and across large-scale Chinese historical datasets. The pipeline addresses key challenges, including character variations, incomplete data, and scalability, that are specific to historical datasets in which names and other attributes are recorded in Chinese characters, not only in China but potentially also in Korea, Japan, and Vietnam. Techniques developed for attributes recorded in phonetic alphabets are of limited use for Chinese characters, not only because homonyms are common but also because characters that look similar enough to be frequently mistaken for each other may sound completely different. Our approach integrates stroke-based character embeddings for efficient blocking, supervised classification with active learning for record matching, and graph-based clustering for final linkage. We demonstrate the effectiveness of this pipeline using the career records of officials in the China Government Employee Database-Qing Jinshenlu (CGED-Q JSL) as a test case. The pipeline achieves better linkage quality than standard probabilistic methods, with substantially longer linked sequences of career records and fewer aberrant transitions. To validate its generalizability, we also apply the pipeline successfully to another database and to a cross-database linkage task. By minimizing the need for manual tuning, our pipeline offers a more accessible and effective solution for Chinese historical data linkage.
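
The final stage named in the abstract, graph-based clustering, can be illustrated with a minimal sketch. The abstract does not say which clustering algorithm the authors use, so the example below assumes the simplest option: treating each pair accepted by the classifier as an edge and taking connected components with a union-find structure. The function name `cluster_matches`, the record IDs, and the matched pairs are all hypothetical and are not drawn from CGED-Q JSL.

```python
from collections import defaultdict


def cluster_matches(record_ids, matched_pairs):
    """Group records into linked clusters via connected components.

    Assumption: every pair the classifier accepted is a hard edge.
    The paper's actual clustering step may be more selective.
    """
    # Each record starts as its own cluster representative.
    parent = {r: r for r in record_ids}

    def find(x):
        # Walk up to the representative, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # Merge the clusters of every matched pair.
    for a, b in matched_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Collect records under their final representative.
    clusters = defaultdict(list)
    for r in record_ids:
        clusters[find(r)].append(r)
    return sorted(sorted(members) for members in clusters.values())


# Hypothetical record IDs and classifier-accepted pairs.
records = ["r1", "r2", "r3", "r4", "r5"]
pairs = [("r1", "r2"), ("r2", "r3"), ("r4", "r5")]
print(cluster_matches(records, pairs))
```

Plain connected components link records transitively, so a single false-positive pair can merge two careers into one sequence. That is one reason a real pipeline might prefer a stricter graph criterion at this step.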

Version published to 10.31235/osf.io/rthvz_v2 on OSF Preprints
Mar 27, 2026
Version published to 10.31235/osf.io/rthvz_v1 on OSF Preprints
Sep 28, 2025

Ekantipur-15Y: A Longitudinal Benchmark Corpus and Semantic Analysis of Nepali News (2010 - 2025)

This article has 2 authors:
1. Diwash Mainali
2. Utsav Mainali
This article has no evaluations. Latest version: Mar 3, 2026

Relation Extraction (RE) Model for Afaan Oromo Text Using Self-Attention Mechanisms

This article has 1 author:
1. Lingerew Bantie
This article has no evaluations. Latest version: Feb 26, 2026

NE-OCR: Unified Optical Character Recognition for 10 Languages of Northeast India

This article has 1 author:
1. Badal Nyalang
This article has no evaluations. Latest version: Mar 20, 2026