Experimental Evaluation of Machine Learning Models for Goal-oriented Customer Service Chatbot with Pipeline Architecture
Abstract
Integrating machine learning (ML) into customer service chatbots has significantly enhanced their ability to understand and respond to user queries. However, without rigorous evaluation, such systems may produce unnatural or inconsistent responses that degrade the user experience. In this study, we present an experimental evaluation approach tailored for goal-oriented customer service chatbots built using a pipeline architecture, focusing on three key components: Natural Language Understanding (NLU), Dialogue Management (DM), and Natural Language Generation (NLG). The proposed method is model-agnostic and emphasizes component-wise benchmarking through hyperparameter optimization and comparative analysis of candidate models. Specifically, we evaluate BERT and LSTM for the NLU component, DQN and DDQN for DM, and GPT-2 and DialoGPT for NLG. Experiments are conducted using the MultiWOZ dataset, with performance evaluated by intent accuracy, dialogue success rate, and BLEU, METEOR, and ROUGE scores. Results show that BERT achieves superior intent detection, while LSTM excels in slot filling. DDQN outperforms DQN in task success, dialogue efficiency, and reward accumulation. GPT-2 surpasses DialoGPT in text generation quality. These findings not only highlight the strengths of individual models but also provide a reusable evaluation framework for optimizing chatbot performance across components, offering practical insights for future development in both research and real-world applications.
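To make the NLG metrics named above concrete, the following is a minimal sketch (not taken from the paper) of how BLEU, METEOR, and ROUGE-L could be computed for a single generated response, assuming the nltk and rouge-score packages are available; the reference and hypothesis strings are invented for illustration.

\begin{verbatim}
# Minimal sketch: scoring one generated response against a gold
# reference with BLEU, METEOR, and ROUGE-L (hypothetical data).
# Assumes nltk and rouge-score are installed and the NLTK
# 'wordnet' corpus has been downloaded (needed by METEOR).
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.meteor_score import meteor_score
from rouge_score import rouge_scorer

reference = "your table for two is booked at 7 pm"   # gold response
hypothesis = "i booked a table for two at 7 pm"      # model output

ref_tokens, hyp_tokens = reference.split(), hypothesis.split()

# Sentence-level BLEU with smoothing, since short dialogue turns
# often have no higher-order n-gram overlap.
bleu = sentence_bleu([ref_tokens], hyp_tokens,
                     smoothing_function=SmoothingFunction().method1)

# METEOR; recent NLTK releases require pre-tokenized inputs.
meteor = meteor_score([ref_tokens], hyp_tokens)

# ROUGE-L F-measure via Google's rouge-score package
# (score() takes the target first, then the prediction).
rouge_l = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True) \
    .score(reference, hypothesis)["rougeL"].fmeasure

print(f"BLEU={bleu:.3f}  METEOR={meteor:.3f}  ROUGE-L={rouge_l:.3f}")
\end{verbatim}

In a component-wise evaluation such as the one described here, scores like these would be averaged over all system turns in the MultiWOZ test set for each candidate NLG model (GPT-2 and DialoGPT) before comparison.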