Comprehensive Evaluation and Design Strategies for Japanese Counseling AIs Using Large Language Models
Abstract
This study simultaneously evaluated three types of systems in Japanese counseling dialogues: counselor AIs (GPT-4-turbo in zero-shot and Structured Multi-Dialogue Prompt [SMDP] conditions, and Claude-3-Opus with SMDP), a client AI, and counseling-evaluation AIs (o3, Claude-3.7-Sonnet, and Gemini-2.5-pro). Introducing the SMDP significantly improved human experts' global ratings based on the Motivational Interviewing Treatment Integrity Coding Manual 4.2.1, and no notable performance differences were detected among the major large language models. Although the evaluation AIs did not differ significantly from human raters on Cultivating Change Talk, they tended to overrate scores, particularly for Softening Sustain Talk and the Comprehensive Evaluation, and exhibited model-specific grading biases. The client AI displayed limited emotional expression, indicating a need to enhance conversational naturalness. Finally, the study presents recommendations for improving each AI's performance through prompt engineering, retrieval-augmented generation, and fine-tuning.