Performance of Google NotebookLM for AI-assisted data extraction and consensus statement generation in a heterogeneous systematic review on inflammatory bowel disease, obesity, and cardiometabolic comorbidities: A Methodological Report
Abstract
Background
Large language models (LLMs) offer promise for systematic review data extraction, but performance in complex multidisciplinary domains and utility for clinical statement generation remain insufficiently described.
Objectives
To evaluate Google NotebookLM for AI-assisted data extraction and RAND/UCLA consensus statement generation in a systematic review of inflammatory bowel disease (IBD), obesity, and cardiometabolic comorbidities.
Methods
Studies were organized into domain-specific notebooks; structured prompts generated standardized evidence tables. Two independent reviewers validated outputs against full-text articles using a four-category error classification. Cell-level accuracy and critical accuracy (cells free of major factual errors) were the primary metrics; workflow time was compared against a published conventional extraction benchmark. Concordance between AI-generated and expert-finalized statements was assessed.
Results
Across 57 articles, 1,710 data cells were extracted; 151 (8.83%) were flagged, yielding 91.17% cell-level accuracy. Major factual errors occurred in only 4 cells (0.23%), for a critical accuracy of 99.77%. Most errors were minor omissions (59.6%) or incomplete extractions (30.5%); domain error rates ranged from 7.08% to 11.33%. The pipeline required 17.7 person-hours versus a projected 165.1 person-hours for conventional extraction (89.3% reduction). PICO-structured prompting generated 70 candidate statements; 58 of 112 finalized panel statements (51.8%) were AI-derived, and 85.7% of the candidate statements were retained in the finalized set.
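The headline percentages above follow directly from the reported counts. The short Python sketch below recomputes them as a consistency check; the variable and function names are illustrative assumptions and are not taken from the authors' workflow.

```python
# Counts and hours as reported in the Results paragraph.
TOTAL_CELLS = 1710        # data cells extracted across 57 articles
FLAGGED_CELLS = 151       # cells flagged with any error
MAJOR_ERROR_CELLS = 4     # cells with major factual errors
PIPELINE_HOURS = 17.7     # person-hours for the AI-assisted pipeline
PROJECTED_HOURS = 165.1   # projected person-hours, conventional extraction
AI_DERIVED = 58           # finalized statements that were AI-derived
FINALIZED = 112           # finalized panel statements


def pct(part: float, whole: float, ndigits: int = 2) -> float:
    """Return part/whole as a percentage rounded to ndigits decimals."""
    return round(100 * part / whole, ndigits)


flagged_rate = pct(FLAGGED_CELLS, TOTAL_CELLS)                        # 8.83
cell_accuracy = pct(TOTAL_CELLS - FLAGGED_CELLS, TOTAL_CELLS)         # 91.17
major_error_rate = pct(MAJOR_ERROR_CELLS, TOTAL_CELLS)                # 0.23
critical_accuracy = pct(TOTAL_CELLS - MAJOR_ERROR_CELLS, TOTAL_CELLS) # 99.77
time_reduction = pct(PROJECTED_HOURS - PIPELINE_HOURS, PROJECTED_HOURS, 1)  # 89.3
ai_share = pct(AI_DERIVED, FINALIZED, 1)                              # 51.8

if __name__ == "__main__":
    print(f"Flagged cells:       {flagged_rate}%")
    print(f"Cell-level accuracy: {cell_accuracy}%")
    print(f"Major error rate:    {major_error_rate}%")
    print(f"Critical accuracy:   {critical_accuracy}%")
    print(f"Time reduction:      {time_reduction}%")
    print(f"AI-derived share:    {ai_share}%")
```

Cell-level accuracy counts any flagged cell as an error, whereas critical accuracy counts only cells with major factual errors, which is why the two figures differ by roughly 8.6 percentage points.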
Conclusion
Google NotebookLM demonstrates feasibility as a primary extraction and synthesis tool in a multidisciplinary systematic review, with extractive incompleteness as the principal limitation and substantial time savings over conventional approaches. Its novel application to RAND/UCLA consensus statement generation extends AI-assisted evidence synthesis to clinical consensus generation workflows.