From Rule-Based to DeepSeek R1 – A Robust Comparative Evaluation of Fifty Years of Natural Language Processing (NLP) Models To Identify Inflammatory Bowel Disease Cohorts

Matthew Stammers
Markus Gwiggner
Reza Nouraei
Cheryl Metcalf
James Batchelor

Read the full article

Listed in

This article is not in any list yet, why not save it to one of your lists.

Abstract

1.1

1.1.1

Background

Natural language processing (NLP) can identify cohorts of patients with inflammatory bowel disease (IBD) from free text. However, limited sharing of code, models, and datasets continues to hinder progress, and bias in foundation large language models (LLMs) remains a significant obstacle.

1.1.2

Objective

To evaluate 15 open-source NLP models for identifying IBD cohorts, reporting on document-to-patient-level classification, while exploring explainability, generalisability, bias and cost factors.

1.1.3

Design

Fifteen algorithms were assessed, covering fifty years of NLP development: regular expressions, Spacy, bag of words (BOW), term frequency inverse document frequency (TF IDF), Word2Vec, two sentence-based SBERT models, three BERT models (distilBERT, RoBERTa, bioclinicalBERT), and five large language models (LLMs): [Mistral-Instruct-0.3-7B, M42-Health/Llama3-8B, Deepseek-R1-Distill-Qwen-32B, Qwen3-32B, and Deepseek-R1-Distill-Llama-70B]. Models were evaluated based on F1 score, bias, environmental costs (in grams of CO2 emitted), and explainability.

1.1.4

Results

A total of 9311 labelled documents were evaluated. The fine-tuned DistilBERT model achieved the best performance (F1: 94.06%) and was more efficient (230.1g CO2) than all other BERT and LLM models. BOW was also strong (F1: 93.38%) and very low cost (1.63g CO2). LLMs performed less well (F1: 86.65% to 91.58%) and had a higher compute cost (938.5 to 33884.4g CO2), along with more bias.

1.1.5

Conclusion

Older NLP approaches, such as BOW, can outperform modern LLMs in clinical cohort detection when properly trained. While LLMs do not require task-specific pretraining, they are slower, more costly, and less accurate. All models and weights from this study are released as open source to benefit the research community.

Version published to 10.1101/2025.07.06.25330961 on medRxiv
Jul 7, 2025

Benchmarking Large Language Models on USMLE: Evaluating ChatGPT, DeepSeek, Grok, and Qwen in Clinical Reasoning and Medical Licensing Scenarios

This article has 7 authors:
1. Md Kamrul Siam
2. Angel Varela
3. Md Jobair Hossain Faruk
4. Jerry Q. Cheng
5. Huanying Gu
6. Abdullah Al Maruf
7. Zeyar Aung
This article has no evaluationsLatest version Jun 12, 2025
A Comprehensive and Critical Survey of Large Language Model Inference and Feature Generation

This article has 1 author:
1. Snehil Shrivastava
This article has no evaluationsLatest version Jun 16, 2025
A Comprehensive and Critical Survey of Large Language Model Inference and Feature Generation

This article has 1 author:
1. Snehil Shrivastava
This article has no evaluationsLatest version Jun 16, 2025

Listed in

Abstract

Background

Objective

Design

Results

Conclusion

Article activity feed

Related articles

Benchmarking Large Language Models on USMLE: Evaluating ChatGPT, DeepSeek, Grok, and Qwen in Clinical Reasoning and Medical Licensing Scenarios

A Comprehensive and Critical Survey of Large Language Model Inference and Feature Generation

A Comprehensive and Critical Survey of Large Language Model Inference and Feature Generation