Performance of Google NotebookLM for AI-assisted data extraction and consensus statement generation in a heterogenous systematic review on inflammatory bowel disease, obesity, and cardiometabolic comorbidities: A Methodological Report

Sami Samaan
Jalpa Devi
Matthew Vincent
Shannon Coombs
Priya Sehgal
Mouhand Mouhamed
Victoria Rai
Amanda M. Johnson
Andres J. Yarur
Edward L. Barnes
Parakkal Deepak

Abstract

Background

Large language models (LLMs) offer promise for systematic review data extraction, but performance in complex multidisciplinary domains and utility for clinical statement generation remain insufficiently described.

Objectives

To evaluate Google NotebookLM for AI-assisted data extraction and RAND/UCLA consensus statement generation in a systematic review of inflammatory bowel disease (IBD), obesity, and cardiometabolic comorbidities.

Methods

Studies were organized into domain-specific notebooks; structured prompts generated standardized evidence tables. Two independent reviewers validated outputs against full-text articles using a four-category error classification. Cell-level accuracy and critical accuracy (cells free of major factual errors) were the primary metrics; workflow time was compared against a published conventional extraction benchmark. Concordance between AI-generated and expert-finalized statements was assessed.

Results

Across 57 articles, 1,710 data cells were extracted; 151 (8.83%) were flagged, yielding 91.17% cell-level accuracy. Major factual errors occurred in only 4 cells (0.23%), for a critical accuracy of 99.77%. Most errors were minor omissions (59.6%) or incomplete extractions (30.5%); domain error rates ranged from 7.08% to 11.33%. The pipeline required 17.7 versus a projected 165.1 person-hours (89.3% reduction). PICO-structured prompting generated 70 candidate statements; 58 of 112 finalized panel statements (51.8%) were AI-derived, and 85.7% were retained in the finalized set.
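The percentages above follow directly from the reported counts, and the arithmetic can be checked with a short sketch. The constant and function names below are illustrative assumptions, not part of the authors' pipeline; only the input numbers come from the Results paragraph.

```python
# Counts and hours as reported in the Results paragraph.
# All names here are hypothetical, chosen only for this check.
TOTAL_CELLS = 1710          # data cells extracted across 57 articles
FLAGGED_CELLS = 151         # cells flagged with any error
MAJOR_ERROR_CELLS = 4       # cells with major factual errors
PIPELINE_HOURS = 17.7       # person-hours for the AI-assisted pipeline
CONVENTIONAL_HOURS = 165.1  # projected person-hours, conventional extraction
AI_DERIVED_STATEMENTS = 58  # finalized statements that were AI-derived
FINAL_STATEMENTS = 112      # all finalized panel statements


def pct(numerator: float, denominator: float) -> float:
    """Return numerator / denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


def summarize() -> dict:
    """Recompute the headline metrics from the reported counts."""
    error_rate = pct(FLAGGED_CELLS, TOTAL_CELLS)
    major_rate = pct(MAJOR_ERROR_CELLS, TOTAL_CELLS)
    return {
        "error_rate": round(error_rate, 2),
        # Cell-level accuracy: share of cells with no flagged error.
        "cell_accuracy": round(100.0 - error_rate, 2),
        "major_error_rate": round(major_rate, 2),
        # Critical accuracy: share of cells free of major factual errors.
        "critical_accuracy": round(100.0 - major_rate, 2),
        "time_reduction": round(
            pct(CONVENTIONAL_HOURS - PIPELINE_HOURS, CONVENTIONAL_HOURS), 1
        ),
        "ai_derived_share": round(
            pct(AI_DERIVED_STATEMENTS, FINAL_STATEMENTS), 1
        ),
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value}%")
```

Each recomputed value matches the figure quoted in the text when rounded to the same precision. The 85.7% retention figure is left out because the abstract does not state its denominator.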

Conclusion

Google NotebookLM demonstrates feasibility as a primary extraction and synthesis tool in a multidisciplinary systematic review, with extractive incompleteness as the principal limitation and substantial time savings over conventional approaches. Its novel application to RAND/UCLA consensus statement generation extends AI-assisted evidence synthesis to clinical consensus generation workflows.

Version published to 10.64898/2026.06.16.26355773 on medRxiv
Jun 26, 2026