Accelerating Machine Learning in Healthcare: Addressing the Labelling Bottleneck
Abstract
Timely detection of postoperative arrhythmias after cardiac surgery is essential for preventing hemodynamic compromise. Machine learning models remain constrained by the scarcity of labeled datasets that reflect real-world monitoring practices. Existing approaches rely on adult 12-lead electrocardiograms, which are rarely used continuously in pediatric ICUs and fail to capture age-dependent waveform variability. We present a clinically integrated labeling framework designed to overcome this bottleneck. Leveraging a physiologic waveform repository comprising over 1.6 million hours of unlabeled, continuous lead II ECG from more than 9,000 pediatric patients, we implemented a multi-phase strategy combining retrospective data mining, clinician-in-the-loop annotation, and active learning techniques, including uncertainty sampling and embedding-based retrieval.

Initial labeling from MUSE (GE Healthcare) studies and ICU observations produced 154.9 hours of annotated ECG waveforms but required extensive clinician effort and yielded limited inter-patient variability. These two strategies provided sufficient coverage to train a preliminary classifier, enabling representation-aware sampling that markedly improved labeling efficiency. Embedding-guided retrieval achieved a precision of 60.2% for junctional arrhythmias and increased patient diversity compared with clinician-in-the-loop labeling, while reducing annotation time per positive segment. Using this approach, we curated 189.2 hours of expert-labeled ECG from 1,447 unique patients, enriched for junctional arrhythmias, the primary modeling target.

This work addresses a critical barrier to pediatric machine learning development and establishes a scalable methodology for creating clinically relevant datasets, paving the way for real-time, clinician-augmented decision support systems in pediatric critical care.
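
The abstract names uncertainty sampling and embedding-based retrieval but does not specify how either was implemented. The following is a minimal illustrative sketch of both ideas, not the authors' method. It assumes a preliminary classifier that outputs per-segment class probabilities and fixed-length embeddings; predictive entropy as the uncertainty score and cosine similarity as the retrieval metric are assumptions chosen for illustration, and all function names are hypothetical.

```python
import numpy as np


def uncertainty_sample(probs, k):
    """Return indices of the k segments with the highest predictive entropy.

    probs: array of shape (n_segments, n_classes), rows summing to 1.
    Entropy is an assumed uncertainty measure; the source does not name one.
    """
    probs = np.asarray(probs, dtype=float)
    eps = 1e-12  # avoids log(0) for fully confident predictions
    entropy = -np.sum(probs * np.log(probs + eps), axis=1)
    return np.argsort(-entropy, kind="stable")[:k]


def retrieve_similar(query_embeddings, pool_embeddings, k):
    """Return indices of the k unlabeled segments closest to any labeled query.

    Cosine similarity is an assumed metric; embeddings are assumed to come
    from the preliminary classifier and to be non-zero vectors.
    """
    q = np.asarray(query_embeddings, dtype=float)
    p = np.asarray(pool_embeddings, dtype=float)
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    sims = p @ q.T            # shape (n_pool, n_query)
    best = sims.max(axis=1)   # best match of each pool segment to any query
    return np.argsort(-best, kind="stable")[:k]


def retrieval_precision(confirmed):
    """Fraction of retrieved segments a clinician confirmed as the target class."""
    confirmed = np.asarray(confirmed, dtype=bool)
    return float(confirmed.mean()) if confirmed.size else 0.0
```

In a loop of this kind, the retrieved or most uncertain segments would be sent to clinicians for review, and the confirmed fraction gives a retrieval precision comparable in spirit to the 60.2% reported for junctional arrhythmias.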