Developing and testing a framework for coding general practitioners’ free-text diagnoses in electronic medical records - a reliability study for generating training data in natural language processing

Audrey Wallnöfer
Jakob M. Burgstaller
Katja Weiss
Thomas Rosemann
Oliver Senn
Stefan Markun

Abstract

Background

Diagnoses entered by general practitioners into electronic medical records have great potential for research and practice, but they are often recorded as uncoded free text, which limits their usefulness. Natural language processing (NLP) could assist in coding free-text diagnoses, but NLP models require local training data to unlock their potential. The aim of this study was to develop a framework of research-relevant diagnostic codes, to test the framework using free-text diagnoses from a Swiss primary care database, and to generate training data for NLP modelling.

Methods

The framework of diagnostic codes was developed based on input from local stakeholders and consideration of epidemiological data. After pre-testing, the framework contained 105 diagnostic codes, which were then applied by two raters who independently coded randomly drawn lines of free text (LoFT) from diagnosis lists extracted from the electronic medical records of 3000 patients of 27 general practitioners. Coding frequency and mean occurrence rates (n and %) were calculated, and inter-rater reliability (IRR) of coding was assessed using Cohen’s kappa (κ).
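
As an illustration of the reliability measure named above, the following is a minimal sketch of Cohen’s kappa for one diagnostic code, computed from two raters’ decisions on the same lines of free text. The function name `cohens_kappa` and the example ratings are hypothetical and are not taken from the study; the sketch only shows the standard definition, κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e the agreement expected by chance.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two equally long sequences of categorical labels."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)

    # Observed agreement: share of items on which both raters chose the same label.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n

    # Chance agreement: product of the raters' marginal label proportions, summed over labels.
    freq_a = Counter(rater_a)
    freq_b = Counter(rater_b)
    labels = set(freq_a) | set(freq_b)
    p_e = sum(freq_a[label] * freq_b[label] for label in labels) / (n * n)

    if p_e == 1.0:
        # Both raters used a single identical label throughout; kappa is undefined,
        # and this sketch reports perfect agreement by convention (an assumption).
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical ratings: 1 = code assigned to the line of free text, 0 = not assigned.
example_a = [1, 1, 0, 0, 1, 0, 0, 0, 1, 0]
example_b = [1, 0, 0, 0, 1, 0, 0, 1, 1, 0]
example_kappa = cohens_kappa(example_a, example_b)
```

In this made-up example the raters agree on 8 of 10 lines (p_o = 0.80) and chance agreement is p_e = 0.52, so κ is about 0.58.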

Results

The sample consisted of 26,980 LoFT; in 56.3% of these, no code could be assigned because the text was not a specific diagnosis. The most common diagnostic codes were ‘dorsopathies’ (3.9%, a code covering all types of back problems, including non-specific lower back pain, scoliosis, and others) and ‘other diseases of the circulatory system’ (3.1%). Raters were in almost perfect agreement (κ ≥ 0.81) for 69 of the 105 diagnostic codes, and 28 codes showed substantial agreement (κ between 0.61 and 0.80). Both high coding frequency and almost perfect agreement were found in 37 codes, including codes that are particularly difficult to identify from components of the electronic medical record, such as musculoskeletal conditions, cancer or tobacco use.
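
The agreement bands used above can be expressed as a small lookup. This is a sketch only: the function name `agreement_band` is hypothetical, the abstract names just the two upper bands, and the label for lower values and the handling of values between 0.80 and 0.81 are assumptions made here.

```python
def agreement_band(kappa):
    """Map a kappa value to the agreement bands named in the abstract."""
    if kappa >= 0.81:
        return "almost perfect"
    if kappa >= 0.61:
        # Assumption: values between 0.80 and 0.81 are counted as substantial.
        return "substantial"
    # Assumption: the abstract does not label lower bands, so they are grouped here.
    return "below substantial"


# Hypothetical kappa values for three codes, for illustration only.
bands = [agreement_band(k) for k in (0.85, 0.70, 0.50)]
```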

Conclusion

The coding framework was characterised by a subset of very frequent and highly reliable diagnostic codes, which will be the most valuable targets for training NLP models for automated disease classification based on free-text diagnoses from Swiss general practice.

Version published to 10.1186/s12875-024-02514-1
Jul 16, 2024
Version published to 10.21203/rs.3.rs-4131283/v1 on Research Square
Apr 12, 2024

Updated Approach to Error Rates in Systematic Review Screening: Integrating Active Learning, Large Language Models, and Full-Text Screening Data

This article has 5 authors:
1. Rutger Chris Neeleman
2. Berke Yazan
3. Emily Westerbeek
4. Wouter van Ballegooijen
5. Rens van de Schoot
This article has no evaluations. Latest version: Jan 26, 2026
Large Language Model Biases in Healthcare: A Scoping Review and Call for an Integrated Assessment Framework

This article has 8 authors:
1. Lu He
2. D. Phuong Do
3. Vishesh Girish Shet
4. Omar Farghaly
5. Priya Deshpande
6. Praveen Madiraju
7. Jiancheng Ye
8. Molly Beestrum
This article has no evaluations. Latest version: Jan 16, 2026
Development and internal validation of a machine learning–based prediction model and simplified screening score for in-hospital falls: a retrospective cohort study

This article has 9 authors:
1. Onishi Tatsuki
2. Tatsuyoshi Ikenoue
3. Norihide Itoh
4. Takumi Nishioka
5. Keima Nagasaka
6. Ryo Okochi
7. Haru Adachi
8. Naoko Matsuo
9. Yoshiya Ueno
This article has no evaluations. Latest version: Jan 23, 2026
