From Rule-Based to DeepSeek R1 – A Robust Comparative Evaluation of Fifty Years of Natural Language Processing (NLP) Models To Identify Inflammatory Bowel Disease Cohorts

Abstract

Background

Natural language processing (NLP) can identify cohorts of patients with inflammatory bowel disease (IBD) from free text. However, limited sharing of code, models, and datasets continues to hinder progress, and bias in foundation large language models (LLMs) remains a significant obstacle.

Objective

To evaluate 15 open-source NLP models for identifying IBD cohorts, reporting classification performance from the document level through to the patient level, and to explore explainability, generalisability, bias and cost.
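The abstract does not say how document-level predictions are combined into a patient-level label. As a minimal sketch only, assuming a simple "any positive document" rule (an assumption, not the study's stated method) and hypothetical patient identifiers, the roll-up could look like this:

```python
from collections import defaultdict


def patient_level_labels(doc_predictions):
    """Roll document-level predictions up to one label per patient.

    doc_predictions: iterable of (patient_id, is_ibd) pairs.
    Assumed rule (not taken from the study): a patient is flagged
    as IBD if at least one of their documents is flagged.
    """
    by_patient = defaultdict(bool)
    for patient_id, is_ibd in doc_predictions:
        by_patient[patient_id] = by_patient[patient_id] or bool(is_ibd)
    return dict(by_patient)


# Hypothetical example: two documents for p1, one for p2.
docs = [("p1", False), ("p1", True), ("p2", False)]
labels = patient_level_labels(docs)
```

Other rules, such as a majority vote or a minimum count of positive documents, trade sensitivity for specificity and would fit the same interface.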

Design

Fifteen algorithms were assessed, covering fifty years of NLP development: regular expressions, spaCy, bag of words (BOW), term frequency-inverse document frequency (TF-IDF), Word2Vec, two sentence-based SBERT models, three BERT models (DistilBERT, RoBERTa, BioClinicalBERT), and five LLMs (Mistral-Instruct-0.3-7B, M42-Health/Llama3-8B, DeepSeek-R1-Distill-Qwen-32B, Qwen3-32B, and DeepSeek-R1-Distill-Llama-70B). Models were evaluated on F1 score, bias, environmental cost (in grams of CO2 emitted), and explainability.
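For reference, the F1 score is the harmonic mean of precision and recall. The sketch below computes it for binary labels in plain Python; the label vectors are made up for illustration and are not data from the study.

```python
def f1_score(y_true, y_pred):
    """Binary F1: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t and p)        # true positives
    fp = sum(1 for t, p in pairs if not t and p)    # false positives
    fn = sum(1 for t, p in pairs if t and not p)    # false negatives
    if tp == 0:
        return 0.0  # no correct positives: F1 is defined as 0 here
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Illustrative labels only (1 = IBD, 0 = not IBD).
y_true = [1, 1, 1, 0, 0]
y_pred = [1, 1, 0, 1, 0]
score = f1_score(y_true, y_pred)
```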

Results

A total of 9,311 labelled documents were evaluated. The fine-tuned DistilBERT model achieved the best performance (F1: 94.06%) and emitted less CO2 (230.1 g) than all other BERT and LLM models. BOW was nearly as accurate (F1: 93.38%) at a very low cost (1.63 g CO2). LLMs performed less well (F1: 86.65% to 91.58%), had a higher compute cost (938.5 g to 33,884.4 g CO2), and showed more bias.
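To put the accuracy-versus-cost trade-off in proportion, the reported figures can be compared directly. The snippet below is an illustrative calculation using only the numbers quoted above; the dictionary layout and helper names are hypothetical.

```python
# (F1 in percent, CO2 in grams), as reported in the Results.
reported = {
    "DistilBERT": (94.06, 230.1),
    "BOW": (93.38, 1.63),
}


def f1_gap(a, b):
    """Difference in F1, in percentage points, between models a and b."""
    return reported[a][0] - reported[b][0]


def co2_ratio(a, b):
    """How many times more CO2 model a emitted than model b."""
    return reported[a][1] / reported[b][1]


gap = f1_gap("DistilBERT", "BOW")       # under one percentage point
ratio = co2_ratio("DistilBERT", "BOW")  # roughly 141 times the emissions
```

On these figures, DistilBERT's gain of under one F1 point over BOW comes at roughly 141 times the emissions.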

Conclusion

Older NLP approaches, such as BOW, can outperform modern LLMs in clinical cohort detection when properly trained. While LLMs do not require task-specific pretraining, in this evaluation they were slower, more costly, and less accurate. All models and weights from this study are released as open source to benefit the research community.
