A Randomized-Clinical Trial of Two Ambient Artificial Intelligence Scribes: Measuring Documentation Efficiency and Physician Burnout
Abstract
Importance
Ambient artificial intelligence (AI) scribes record patient encounters and generate visit notes almost instantaneously, representing a promising solution to documentation burden and associated physician burnout. Despite swift and widespread adoption of AI scribes, their impacts have not been examined in randomized-clinical trials.
Objective
To test the effectiveness of two AI scribes in reducing time spent writing notes and associated burnout in a randomized-clinical trial.
Design
Parallel three-arm pragmatic randomized-clinical trial in which physicians were assigned 1:1:1 via covariate-constrained randomization (balancing on time-in-note, baseline burnout score, and clinic days per week) to one of two AI scribe applications (Microsoft DAX or Nabla) or to a usual-care control group. The trial ran from 11/4/2024 to 1/3/2025.
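As a rough illustration of covariate-constrained randomization, the sketch below generates many candidate 1:1:1 allocations, scores each for imbalance on the three balancing covariates, and then draws one allocation at random from the best-balanced subset. The imbalance metric, the number of candidates, the retained fraction, and the synthetic physician data are all assumptions made for illustration; the abstract does not describe the trial's actual procedure at this level of detail.

```python
import random
import statistics

ARMS = ("DAX", "Nabla", "control")  # arm labels taken from the trial design


def imbalance(allocation, covariates):
    """Sum, over covariates, of the spread between arm means (in SD units)."""
    score = 0.0
    n_cov = len(covariates[0])
    for j in range(n_cov):
        column = [row[j] for row in covariates]
        sd = statistics.pstdev(column) or 1.0  # guard against a constant column
        arm_means = [
            statistics.mean(v for v, a in zip(column, allocation) if a == arm)
            for arm in ARMS
        ]
        score += (max(arm_means) - min(arm_means)) / sd
    return score


def constrained_randomize(covariates, n_candidates=2000, keep_fraction=0.1, seed=0):
    """Pick one allocation at random from the best-balanced candidate schemes."""
    rng = random.Random(seed)
    n = len(covariates)
    base = [ARMS[i % len(ARMS)] for i in range(n)]  # near-equal 1:1:1 arm sizes
    candidates = []
    for _ in range(n_candidates):
        alloc = base[:]
        rng.shuffle(alloc)
        candidates.append((imbalance(alloc, covariates), alloc))
    candidates.sort(key=lambda pair: pair[0])
    # Constraint step: keep only the best-balanced fraction of candidates
    acceptable = candidates[: max(1, int(n_candidates * keep_fraction))]
    return rng.choice(acceptable)[1]


# Hypothetical physicians: (time-in-note minutes, burnout score, clinic days per week)
demo_rng = random.Random(42)
physicians = [
    (demo_rng.uniform(2, 12), demo_rng.uniform(10, 50), demo_rng.randint(1, 5))
    for _ in range(30)
]
allocation = constrained_randomize(physicians)
print({arm: allocation.count(arm) for arm in ARMS})
```

Drawing at random from the acceptable set, rather than always taking the single best-balanced allocation, keeps the assignment genuinely random while still ruling out badly imbalanced schemes.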
Setting
A large academic health system in California.
Participants
313 outpatient physicians were recruited based on leadership referrals and department-wide emails. 238 participants representing 14 specialties qualified.
Intervention
Intervention-arm physicians gained access to an AI scribe for two months.
Main Outcomes and Measures
The primary outcome was change from baseline in log-transformed writing time-in-note. Secondary outcomes measured by surveys included Mini-Z 2.0, 4-item physician task load (TL), and Professional Fulfillment Index-Work Exhaustion (PFI-WE) scores to evaluate aspects of burnout, work environment, and stress, as well as targeted questions addressing safety and accuracy.
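To make the primary outcome concrete, the sketch below computes a change from baseline on the natural-log scale and shows how a between-arm difference in log change can be read as a relative (percent) change. The abstract does not state how the reported percentages were derived from the log scale, so the exponentiation step is one common reading and not necessarily the authors' method; the function names and the per-physician values are hypothetical.

```python
import math


def log_change(baseline_minutes, followup_minutes):
    """Change from baseline on the natural-log scale."""
    return math.log(followup_minutes) - math.log(baseline_minutes)


def percent_change(log_diff):
    """Convert a log-scale difference into a relative change in percent."""
    return (math.exp(log_diff) - 1.0) * 100.0


# Hypothetical per-physician mean minutes per note: (baseline, follow-up)
scribe_arm = [(6.0, 5.1), (8.0, 7.0), (4.5, 4.2)]
control_arm = [(6.2, 6.1), (7.5, 7.6), (5.0, 4.9)]


def mean_log_change(arm):
    """Average change from baseline in log time across one arm."""
    return sum(log_change(b, f) for b, f in arm) / len(arm)


# Between-arm difference in mean log change, expressed as a percentage
effect = mean_log_change(scribe_arm) - mean_log_change(control_arm)
print(f"relative change vs control: {percent_change(effect):.1f}%")
```

Working on the log scale means each physician's change is measured relative to their own baseline, so a physician who drops from 10 to 9 minutes counts the same as one who drops from 5 to 4.5.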
Results
DAX was used in 33.5% of 24,696 visits; Nabla was used in 29.5% of 23,653 visits. Nabla users experienced a 9.5% decrease in time-in-note versus the control group (95% CI: -17.2%, -1.8%; p=.02) and a 7.8% decrease versus DAX users (95% CI: -15.5%, -0.1%; p=.05), while DAX users exhibited no significant change versus control (-1.7% [-9.4%, +5.9%]; p=.66). Total Mini-Z score, scaled 10-50 with higher scores indicating improvement, increased among users of any scribe (+2.76 [+1.41, +4.10]; p<.001). Reductions in TL (scale 0-400; -35.8 [-63.7, -7.9]; p=.01) and work exhaustion (PFI-WE, scale 0-4; -0.27 [-0.48, -0.07]; p=.01) were seen among users of any scribe. One Grade 1 (mild) adverse event was reported, and clinically significant inaccuracies were noted “occasionally” on 5-point Likert questions (DAX 2.7 [2.4-3.0] vs. Nabla 2.8 [2.6-3.0]; p=.68).
Conclusion and Relevance
Use of Nabla reduced time-in-note, while use of any scribe led to modest improvements in physician burnout, work exhaustion, and task load. Performance was remarkably similar across two distinct vendor platforms, and occasional inaccuracies observed in either scribe require ongoing physician vigilance.
Trial Registration
ClinicalTrials.gov Identifier: NCT06792890