Large-Scale Hybrid Dialogue Data Processing for Transformer-Based Generative Chatbots Using Pretrained DeBERTa Embeddings
Abstract
This paper presents a scalable generative chatbot built on a Transformer-based encoder–decoder architecture with pretrained DeBERTa embeddings. The model is trained on a hybrid, large-scale dialogue corpus comprising over 120K question–answer pairs drawn from real-world datasets (QuAC, DailyDialog) and curated synthetic dialogues generated by large language models. The architecture incorporates multi-head self-attention, positional encoding, residual connections, and a pre-norm strategy to enhance contextual understanding and generalization. Experimental results report a training accuracy of 99% and a BLEU score of 90.1%, highlighting the model’s effectiveness at processing massive, heterogeneous conversational datasets and generating coherent responses. This work contributes to big data analytics in NLP by integrating large-scale dataset curation with advanced Transformer-based modeling for conversational AI.
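The abstract does not include implementation details, so the following is only a minimal sketch of the kind of architecture it describes: pretrained DeBERTa hidden states used as encoder memory for a pre-norm Transformer decoder with sinusoidal positional encoding. The class names (ChatbotSketch, SinusoidalPositionalEncoding), the checkpoint microsoft/deberta-v3-base, and hyperparameters such as six decoder layers are illustrative assumptions, not the authors' configuration.

```python
# Hypothetical sketch (not the authors' code): frozen DeBERTa encoder + pre-norm Transformer decoder.
import math
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class SinusoidalPositionalEncoding(nn.Module):
    """Standard sinusoidal positional encoding added to decoder token embeddings."""
    def __init__(self, d_model: int, max_len: int = 512):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return x + self.pe[: x.size(1)]

class ChatbotSketch(nn.Module):
    """Encoder-decoder sketch: DeBERTa hidden states feed a pre-norm decoder."""
    def __init__(self, model_name: str = "microsoft/deberta-v3-base",
                 n_layers: int = 6, n_heads: int = 8):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)
        self.encoder.requires_grad_(False)            # keep pretrained embeddings fixed (assumption)
        d_model = self.encoder.config.hidden_size     # 768 for deberta-v3-base
        vocab = self.encoder.config.vocab_size
        self.tgt_embed = nn.Embedding(vocab, d_model)
        self.pos_enc = SinusoidalPositionalEncoding(d_model)
        layer = nn.TransformerDecoderLayer(d_model, n_heads, dim_feedforward=4 * d_model,
                                           batch_first=True, norm_first=True)  # pre-norm, residual connections built in
        self.decoder = nn.TransformerDecoder(layer, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab)

    def forward(self, src_ids, src_mask, tgt_ids):
        # Encode the question with DeBERTa; decode the answer with masked self-attention.
        memory = self.encoder(input_ids=src_ids, attention_mask=src_mask).last_hidden_state
        tgt = self.pos_enc(self.tgt_embed(tgt_ids))
        causal = nn.Transformer.generate_square_subsequent_mask(tgt_ids.size(1))
        out = self.decoder(tgt, memory, tgt_mask=causal,
                           memory_key_padding_mask=(src_mask == 0))
        return self.lm_head(out)

if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")
    model = ChatbotSketch()
    q = tok(["What time does the library open?"], return_tensors="pt", padding=True)
    a = tok(["It opens at nine."], return_tensors="pt", padding=True)
    logits = model(q["input_ids"], q["attention_mask"], a["input_ids"])
    print(logits.shape)  # (batch, target_len, vocab_size)
```

In this sketch the decoder is trained with teacher forcing on question–answer pairs (cross-entropy over the logits above); whether the paper freezes or fine-tunes the DeBERTa encoder is not stated in the abstract.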