DiLLaB: Discussion Labeling with LLMs for Building Datasets


Abstract

GitHub Discussions has emerged as a prominent platform for collaborative knowledge exchange in open-source software (OSS) development. However, as participation increases, the platform faces challenges common to Programming Community-based Question and Answering (PCQA) environments, particularly the proliferation of duplicate and semantically related questions, which can fragment knowledge and reduce retrieval effectiveness. Although in-context links are often shared between related threads, no labeled dataset or automated method currently exists for identifying semantic relatedness in this setting. We present \texttt{DiLLaB}, a framework that leverages these in-context links for high-precision candidate selection and uses prompt-based Large Language Models (LLMs) to label discussion pairs as \emph{related} or \emph{unrelated}, supporting a graph-based, leakage-free pipeline for dataset construction. We evaluate \texttt{DiLLaB} across seven distinct labeling configurations, spanning from basic prompting to zero- and few-shot strategies, with examples drawn from within or across repositories and using either full or summarized input, to analyze how different prompting setups affect labeling effectiveness. A feasibility study confirms the reliability of link-based signals, and evaluation across five repositories shows that zero-shot prompting achieves strong labeling performance (F1-score $> 0.90$). The resulting dataset enables effective fine-tuning of a \texttt{RoBERTa} classifier, achieving a 48\% improvement in F1-score over a transfer learning baseline. Our results offer a scalable alternative to manual annotation, enable related-post recommendation in GitHub Discussions, and lay a foundation for future research in discussion understanding within NLP for Software Engineering.
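The zero-shot labeling step the abstract describes can be sketched as below. This is a minimal illustration only: the prompt wording, function names, and answer-parsing rule are assumptions for exposition, not the exact prompts or parsing used by \texttt{DiLLaB}.

```python
# Hypothetical sketch of zero-shot LLM pair labeling for GitHub Discussions.
# The actual DiLLaB prompts and parsing may differ.

def build_zero_shot_prompt(post_a: str, post_b: str) -> str:
    """Compose a zero-shot prompt asking an LLM whether two
    discussion posts are semantically related."""
    return (
        "You are labeling pairs of GitHub Discussions posts.\n"
        "Answer with exactly one word: related or unrelated.\n\n"
        f"Post A:\n{post_a}\n\n"
        f"Post B:\n{post_b}\n\n"
        "Answer:"
    )

def parse_label(llm_response: str) -> str:
    """Normalize a free-form LLM answer to a binary label,
    defaulting to 'unrelated' when the answer is ambiguous."""
    answer = llm_response.strip().lower()
    return "related" if answer.startswith("related") else "unrelated"

# Example usage with two (invented) discussion titles:
prompt = build_zero_shot_prompt(
    "How do I enable Discussions on a fork?",
    "Discussions tab missing on forked repository",
)
label = parse_label("Related")  # the model's reply would go here
```

In a full pipeline, the prompt would be sent to an LLM and the parsed label attached to the candidate pair selected via in-context links; few-shot variants would prepend labeled example pairs to the same prompt.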