DiLLaB: Discussion Labeling with LLMs for Building Datasets
Abstract
GitHub Discussions has emerged as a prominent platform for collaborative knowledge exchange in open-source software (OSS) development. However, as participation increases, the platform faces challenges common to Programming Community-based Question Answering (PCQA) environments, particularly the proliferation of duplicate and semantically related questions, which can fragment knowledge and reduce retrieval effectiveness. Although in-context links are often shared between related threads, no labeled dataset or automated method currently exists for identifying semantic relatedness in this setting. We present \texttt{DiLLaB}, a framework that leverages these in-context links for high-precision candidate selection and uses prompt-based Large Language Models (LLMs) to label discussion pairs as \emph{related} or \emph{unrelated}, supporting a graph-based, leakage-free pipeline for dataset construction. We evaluate \texttt{DiLLaB} across seven labeling configurations, ranging from basic prompting to zero- and few-shot strategies, with examples drawn from within or across repositories and using either full or summarized input, to analyze how different prompting setups affect labeling effectiveness. A feasibility study confirms the reliability of link-based signals, and evaluation across five repositories shows that zero-shot prompting achieves strong labeling performance (F1-score $> 0.90$). The resulting dataset enables effective fine-tuning of a \texttt{RoBERTa} classifier, achieving a 48\% improvement in F1-score over a transfer learning baseline. Our results offer a scalable alternative to manual annotation, enable related-post recommendation in GitHub Discussions, and lay a foundation for future research in discussion understanding within NLP for Software Engineering.
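To make the labeling step concrete, the sketch below shows one way the zero-shot configuration could prompt an LLM to label a candidate discussion pair as related or unrelated. The prompt wording, the OpenAI Python client, and the \texttt{gpt-4o-mini} model are illustrative assumptions, not the authors' implementation; the paper evaluates seven configurations and does not fix a single backend here.

```python
# A minimal sketch of zero-shot pair labeling, under assumed prompt wording
# and LLM backend (OpenAI client, gpt-4o-mini); not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are labeling pairs of GitHub Discussions.
Two discussions are RELATED if they ask about the same or a semantically
similar problem; otherwise they are UNRELATED.

Discussion A:
{post_a}

Discussion B:
{post_b}

Answer with exactly one word: RELATED or UNRELATED."""


def label_pair(post_a: str, post_b: str, model: str = "gpt-4o-mini") -> str:
    """Return 'related' or 'unrelated' for a candidate discussion pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # deterministic output suits dataset construction
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(post_a=post_a, post_b=post_b),
        }],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "related" if answer.startswith("RELATED") else "unrelated"


if __name__ == "__main__":
    a = "How do I configure the linter to ignore generated files?"
    b = "Is there a way to exclude auto-generated code from lint checks?"
    print(label_pair(a, b))  # expected: related
```

In a full pipeline, pairs like these would first be drawn from the link-based candidate selection step, and the resulting labels would feed the graph-based, leakage-free dataset split before fine-tuning the \texttt{RoBERTa} classifier.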