LLM-Based Web Data Collection for Research Dataset Creation

Thomas Berkane
Marie-Laure Charpignon
Maimuna Majumder

Abstract

Researchers across many fields rely on web data to gain new insights and validate methods. However, assembling accurate and comprehensive datasets typically demands manual review of numerous web pages to identify and record only those data points relevant to specific research objectives. The vast and scattered nature of online information makes this process time-consuming and prone to human error. To address these challenges, we present a human-in-the-loop framework that automates web-scale data collection end-to-end using large language models (LLMs). Given a textual description of a target dataset, our framework (1) automatically formulates search engine queries, (2) navigates the web to identify relevant web pages, (3) extracts the data points of interest, and (4) performs quality control to produce a structured, research-ready dataset. Users remain in the loop throughout, able to inspect and adjust the framework’s decisions to ensure alignment with their needs. We introduce techniques to mitigate both search engine bias and LLM hallucinations during data extraction. Experiments across three diverse data collection tasks show our framework significantly outperforms existing methods, while a user-centered case study demonstrates its practical utility. We open-source our code to help other researchers create custom datasets more efficiently.
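The four stages listed in the abstract can be pictured as a small pipeline with a review hook between stages. The sketch below is only an illustration under stated assumptions, not the authors' released code: every name in it (`collect_dataset`, `Record`, `make_queries`, `search`, `fetch`, `extract`, `review`) is hypothetical, the LLM and search engine are replaced by plain callables, and the quality-control step uses a generic grounding heuristic (keep a record only if each extracted value appears verbatim in the page text) that may differ from the techniques the paper introduces.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class Record:
    url: str      # page the data point was extracted from
    fields: dict  # extracted field name -> value


def collect_dataset(
    description: str,
    make_queries: Callable[[str], list],   # stand-in for LLM query generation
    search: Callable[[str], Iterable],     # stand-in for a search engine
    fetch: Callable[[str], str],           # stand-in for page download
    extract: Callable[[str, str], list],   # stand-in for LLM extraction
    review: Callable[[str, list], list] = lambda stage, items: items,
) -> list:
    """Hypothetical four-stage pipeline; `review` lets a user edit each stage."""
    # (1) Formulate search queries from the dataset description.
    queries = review("queries", make_queries(description))

    # (2) Gather candidate pages, deduplicating URLs across queries.
    urls, seen = [], set()
    for query in queries:
        for url in search(query):
            if url not in seen:
                seen.add(url)
                urls.append(url)
    urls = review("pages", urls)

    # (3) Extract candidate data points from each page.
    records = []
    for url in urls:
        text = fetch(url)
        for fields in extract(description, text):
            # (4) Quality control (assumed heuristic): drop any record with a
            # value that does not occur verbatim in the source text.
            if all(str(value) in text for value in fields.values()):
                records.append(Record(url, fields))
    return review("records", records)


# Toy run with in-memory stubs instead of a real search engine or LLM.
pages = {
    "https://example.org/a": "Clinic Alpha opened in 2019.",
    "https://example.org/b": "Clinic Beta is planned.",
}


def fake_extract(description: str, text: str) -> list:
    if "Alpha" in text:
        return [{"name": "Clinic Alpha", "year": "2019"}]
    # Simulated hallucination: the year is not present on the page.
    return [{"name": "Clinic Beta", "year": "2021"}]


demo_records = collect_dataset(
    "clinic opening years",
    make_queries=lambda d: [d, d + " list"],
    search=lambda q: list(pages),
    fetch=pages.__getitem__,
    extract=fake_extract,
)
```

In this toy run only the Clinic Alpha record survives, because the simulated extraction for the second page contains a year that the page never mentions. Passing a custom `review` callable is where a user would inspect or adjust the queries, pages, or records at each stage.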

Version published to 10.1101/2025.05.23.25328249 on medRxiv
May 25, 2025

Related articles

QModel: A Time-Aware GitHub Mining Framework for Empirical Software Quality Studies

This article has 1 author:
1. Dmytro Polishchuk
This article has no evaluations
Latest version Jan 12, 2026

DiLLaB: Discussion Labeling with LLMs for Building Datasets

This article has 6 authors:
1. Ludimila Gonçalves
2. Márcia Lima
3. André Carvalho
4. Walter Nakamura
5. Igor Steinmacher
6. Tayana Conte
This article has no evaluations
Latest version Jan 28, 2026

Best Practices for Using Large Language Models at Scale

This article has 5 authors:
1. Bhargavee Kannikanti
2. Arjun Coimbatore Nagarasan
3. Alberto Rosas
4. Sriram Kothandaraman
5. Sravan Kumar Kannuri
This article has no evaluations
Latest version Dec 12, 2025
