LLM-Based Persona-Driven Text Data Augmentation

Abstract

The rise of drug‑related crime in South Korea, especially via online messengers, exposes the limits of keyword‑based and network‑tracking detection methods. To address this in a low‑resource setting, we propose a large‑language‑model (LLM) persona‑driven data‑augmentation framework. Buyer and seller personas replicate authentic linguistic patterns, slang and delivery practices, generating realistic, context‑rich dialogue. Using text‑embedding similarity, type–token ratio (TTR), perplexity, dialogue coherence and ROUGE, we show that 15 000 augmented dialogues closely mirror 87 real conversations while increasing lexical variety and contextual consistency. The results indicate that persona‑driven augmentation mitigates data scarcity and improves illicit‑dialogue detectors, offering a strategy that can transfer to other sensitive, low‑data domains such as voice phishing or fraudulent trade.
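
To make two of the evaluation metrics named above concrete, the following is a minimal sketch of type–token ratio (TTR) and a unigram ROUGE‑1 F1 score. It is illustrative only and is not the authors' implementation: the function names, the lowercase whitespace tokenizer and the placeholder sentences are all assumptions. Korean text in particular would normally be split with a morphological analyzer rather than on whitespace.

```python
from collections import Counter


def tokenize(text: str) -> list[str]:
    # Assumption: lowercase whitespace tokenization, chosen only for illustration.
    return text.lower().split()


def type_token_ratio(dialogues: list[str]) -> float:
    """Distinct tokens divided by total tokens over a whole corpus."""
    tokens = [tok for d in dialogues for tok in tokenize(d)]
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def rouge1_f1(candidate: str, reference: str) -> float:
    """Unigram-overlap F1 between a generated and a reference utterance."""
    cand = Counter(tokenize(candidate))
    ref = Counter(tokenize(reference))
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Placeholder sentences, not data from the article.
    real = ["the package arrives today", "where is the package"]
    augmented = ["the package arrives tomorrow", "when does the parcel arrive"]
    print("TTR real:", type_token_ratio(real))
    print("TTR augmented:", type_token_ratio(augmented))
    print("ROUGE-1 F1:", rouge1_f1(augmented[0], real[0]))
```

A higher TTR on the augmented set than on the real set would correspond to the gain in lexical variety the abstract describes, while ROUGE‑1 measures how much surface vocabulary a generated dialogue shares with a real one.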
