DiLLaB: Discussion Labeling with LLMs for Building Datasets
Abstract
GitHub Discussions has emerged as a prominent platform for collaborative knowledge exchange in open-source software (OSS) development. However, as participation increases, the platform faces challenges common to Programming Community-based Question Answering (PCQA) environments, particularly the proliferation of duplicate and semantically related questions, which can fragment knowledge and reduce retrieval effectiveness. Although in-context links are often shared between related threads, no labeled dataset or automated method currently exists for identifying semantic relatedness in this setting. We present \texttt{DiLLaB}, a framework that leverages these in-context links for high-precision candidate selection and uses prompt-based Large Language Models (LLMs) to label discussion pairs as \emph{related} or \emph{unrelated}, supporting a graph-based, leakage-free pipeline for dataset construction. We evaluate \texttt{DiLLaB} across seven labeling configurations, ranging from basic prompting to zero- and few-shot strategies, with examples drawn from within or across repositories and using either full or summarized input, to analyze how different prompting setups affect labeling effectiveness. A feasibility study confirms the reliability of link-based signals, and evaluation across five repositories shows that zero-shot prompting achieves strong labeling performance (F1-score $> 0.90$). The resulting dataset enables effective fine-tuning of a \texttt{RoBERTa} classifier, achieving a 48\% improvement in F1-score over a transfer learning baseline. Our results offer a scalable alternative to manual annotation, enable related-post recommendation in GitHub Discussions, and lay a foundation for future research in discussion understanding within NLP for Software Engineering.
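To make the labeling step concrete, the sketch below shows one way the zero-shot configuration could prompt an LLM to label a candidate discussion pair as related or unrelated. The prompt wording, the OpenAI Python client, and the \texttt{gpt-4o-mini} model are illustrative assumptions, not the authors' implementation; the paper evaluates seven configurations and does not fix a single backend here.

```python
# A minimal sketch of zero-shot pair labeling, under assumed prompt wording
# and LLM backend (OpenAI client, gpt-4o-mini); not the paper's exact setup.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT_TEMPLATE = """You are labeling pairs of GitHub Discussions.
Two discussions are RELATED if they ask about the same or a semantically
similar problem; otherwise they are UNRELATED.

Discussion A:
{post_a}

Discussion B:
{post_b}

Answer with exactly one word: RELATED or UNRELATED."""


def label_pair(post_a: str, post_b: str, model: str = "gpt-4o-mini") -> str:
    """Return 'related' or 'unrelated' for a candidate discussion pair."""
    response = client.chat.completions.create(
        model=model,
        temperature=0.0,  # deterministic output suits dataset construction
        messages=[{
            "role": "user",
            "content": PROMPT_TEMPLATE.format(post_a=post_a, post_b=post_b),
        }],
    )
    answer = response.choices[0].message.content.strip().upper()
    return "related" if answer.startswith("RELATED") else "unrelated"


if __name__ == "__main__":
    a = "How do I configure the linter to ignore generated files?"
    b = "Is there a way to exclude auto-generated code from lint checks?"
    print(label_pair(a, b))  # expected: related
```

In a full pipeline, pairs like these would first be drawn from the link-based candidate selection step, and the resulting labels would feed the graph-based, leakage-free dataset split before fine-tuning the \texttt{RoBERTa} classifier.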