Professionalism Pulse: Development and Validation of a Natural Language Processing Pipeline and Dashboard for Safety Culture Surveillance in NYC Health + Hospitals

Eunbyul Mangut
Regina Wallace

Abstract

Background

Professionalism and effective communication are foundational determinants of patient safety and quality of care. Unprofessional behaviors frequently serve as active precursors to adverse clinical events. However, proactive organizational surveillance is often hindered because incident feedback exists primarily as unstructured, free-text data. This study aimed to develop and validate a Natural Language Processing (NLP) pipeline and interactive dashboard to proactively monitor the “professionalism climate” within NYC Health + Hospitals, the largest municipal healthcare delivery system in the United States.

Methods

A high-fidelity synthetic dataset (N=400) was computationally generated to mirror historical incident logs across 11 acute facilities without using Protected Health Information (PHI). A rule-based NLP pipeline was developed in R using the tidytext package. Unstructured narrative feedback was tokenized and classified into three core domains: Respect, Safety, and Communication. To validate the pipeline’s accuracy, a 25% stratified random sample (n=100) was evaluated against independent, blinded manual coding performed by two reviewers, with inter-rater reliability measured via Cohen’s Kappa. Finally, an interactive Tableau dashboard was developed to operationalize and visualize these metrics for ongoing surveillance.
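The two methodological steps described above, keyword-based domain classification and Cohen's Kappa, can be illustrated with a minimal sketch. It is written in Python for illustration only and is not the authors' R/tidytext pipeline. The keyword lexicon, the tie-breaking rule, the "Unclassified" fallback, and the function names are all assumptions, since the abstract does not publish the actual rules.

```python
import re
from collections import Counter

# Hypothetical keyword lexicon: the preprint's real rules are not given
# in the abstract, so these words are illustrative assumptions only.
LEXICON = {
    "Respect": {"rude", "dismissive", "disrespectful", "yelled", "condescending"},
    "Safety": {"unsafe", "error", "fall", "hazard", "injury"},
    "Communication": {"handoff", "unclear", "miscommunication", "uninformed"},
}
DOMAINS = list(LEXICON)


def tokenize(text):
    """Lowercase the narrative and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def classify(text):
    """Assign the domain with the most keyword hits.

    Ties go to the first domain in DOMAINS order; a narrative with no
    keyword hits is labeled "Unclassified" (both choices are assumptions).
    """
    counts = Counter()
    for token in tokenize(text):
        for domain, keywords in LEXICON.items():
            if token in keywords:
                counts[domain] += 1
    if not counts:
        return "Unclassified"
    return max(DOMAINS, key=lambda d: counts[d])


def cohen_kappa(labels_a, labels_b):
    """Cohen's Kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items with identical labels.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each rater's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    print(classify("The nurse was rude and dismissive at the desk"))
    print(cohen_kappa(["R", "R", "S", "S"], ["R", "R", "S", "R"]))
```

A rule-based design like this keeps every classification traceable to specific keywords, which is easy to audit but depends heavily on how complete the lexicon is.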

Results

The NLP algorithm achieved an overall accuracy of 85.8% (95% CI: 79.0-92.6), with 81.2% sensitivity and 88.9% specificity. The highest domain-specific performance was observed in Communication (88.0% accuracy). Manual validation demonstrated strong inter-rater reliability (κ=0.84). Operational analysis via the dashboard revealed that 61.8% of reports occurred during the Tour 2 shift (15:00 to 23:00), aligning with peak operational volume. Furthermore, Respect-related feedback was reported at a disproportionately high frequency during the Tour 3 shift (23:00 to 07:00), accounting for 50.7% of overnight feedback submissions.
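As a rough check on the reported confidence interval, the sketch below computes a normal-approximation (Wald) interval for a proportion. Two things here are assumptions rather than statements from the abstract: that the interval was computed on the validation sample of n=100, and that the Wald method was used. The function name is likewise illustrative.

```python
import math


def wald_interval(p_hat, n, z=1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    standard_error = math.sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_error
    return p_hat - margin, p_hat + margin


if __name__ == "__main__":
    # Assumed inputs: reported accuracy 0.858, validation sample n=100.
    low, high = wald_interval(0.858, 100)
    print(f"{low:.1%} to {high:.1%}")  # prints "79.0% to 92.6%"
```

Under these assumptions the result matches the reported bounds, which suggests, but does not confirm, that a Wald interval on the validation sample was used.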

Conclusion

Rule-based NLP successfully transforms qualitative healthcare feedback into structured, actionable intelligence with high specificity. Integrating this pipeline into operational dashboards transitions safety culture surveillance from a reactive, manual exercise to a proactive, scalable system, enabling targeted, data-driven interventions by hospital leadership.

Version published to 10.64898/2026.05.19.26353620 on medRxiv
May 22, 2026

