LLM-Based Persona-Driven Text Data Augmentation

Hyeon Seong Jeong
Han Kyeong Ko
Taehoon Kim

Abstract

The rise of drug‑related crime in South Korea, especially via online messengers, reveals clear limits in keyword‑based or network‑tracking detection methods. To address this in a low‑resource setting, we propose a large‑language‑model (LLM) persona‑driven data‑augmentation framework. Buyer and seller personas replicate authentic linguistic patterns, slang and delivery practices, generating realistic, context‑rich dialogue. Using text‑embedding similarity, type–token ratio (TTR), perplexity, dialogue coherence and ROUGE, we show that 15 000 augmented dialogues closely mirror 87 real conversations while boosting lexical variety and contextual consistency. Results confirm that persona‑driven augmentation mitigates data scarcity and improves illicit‑dialogue detectors, offering a transferable strategy for other sensitive, low‑data domains such as voice phishing or fraudulent trade.
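
Of the metrics named in the abstract, the type–token ratio is the simplest to illustrate. The following is a minimal sketch, not the authors' implementation: it assumes lowercased, whitespace-split tokens, and the function names `type_token_ratio` and `corpus_ttr` are hypothetical. The preprint's actual tokenization is not stated here, and Korean text would likely call for a morphological tokenizer instead.

```python
def type_token_ratio(text: str) -> float:
    """Return unique tokens / total tokens.

    Assumption: tokens are lowercased, whitespace-separated strings.
    """
    tokens = text.lower().split()
    if not tokens:
        return 0.0  # define TTR of empty text as 0 to avoid division by zero
    return len(set(tokens)) / len(tokens)


def corpus_ttr(dialogues: list[str]) -> float:
    """TTR computed over all dialogues pooled into one token stream."""
    return type_token_ratio(" ".join(dialogues))
```

TTR generally falls as a corpus grows, so a comparison between a small real set and a much larger augmented set would probably need size-matched samples to be meaningful.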

Version published to 10.20944/preprints202504.1926.v1
Apr 23, 2025

Related articles

Integrating Explainability for Sentiment Interpretation, Misclassification, and Bias Detection in Women-in-STEM Social Media

This article has 2 authors:
1. Shereen Fouad
2. Ezzaldin Alkooheji
This article has no evaluations. Latest version: Jan 12, 2026

From Generation to Detection: Leveraging Empirically Derived Linguistic Hints for LLM-Based Fake News Detection

This article has 1 author:
1. Piyush Ghasiya
This article has no evaluations. Latest version: Jan 28, 2026

Can large language models effectively reshape online implicit hate speech? An integrative modelling approach

This article has 6 authors:
1. Yinghui Huang
2. Qixia Feng
3. Hui Liu
4. Weiqing Li
5. Ying Ma
6. Zongkui Zhou
This article has no evaluations. Latest version: Jan 14, 2026
