Evaluating ChatGPT’s Semantic Alignment with Community Answers: A Topic-Aware Analysis Using BERTScore and BERTopic
Abstract
This study evaluates the semantic alignment of ChatGPT’s responses with human-selected best answers in an open-domain question answering (QA) setting, using data from the Yahoo! Answers platform. Unlike prior research focused on domain-specific or educational datasets, this work investigates ChatGPT’s general-purpose QA capabilities across a diverse topical landscape. We apply BERTopic to extract latent themes from 500 full-question samples and use BERTScore metrics (precision, recall, F1) to quantify semantic similarity between ChatGPT-generated answers and top-rated community responses. Results show that ChatGPT achieves a strong average F1 score of 0.827, indicating high overall alignment with human judgments. Nonetheless, topic-level analysis reveals important performance differences: the model performs strongly on factual and encyclopedic questions but is less capable of answering subjective, ambiguous, or advice-seeking ones. We propose a topic-sensitive evaluation framework for assessing large language models in open-domain QA, contributing to current benchmarking practice, the interpretation of model performance, and the design of effective conversational AI systems.
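To make the evaluation pipeline concrete, the following is a minimal sketch (not the authors’ released code) of how BERTopic and BERTScore can be combined for topic-level scoring as described above. The list variables (questions, chatgpt_answers, best_answers) are illustrative placeholders standing in for the 500 aligned question/answer samples.

```python
from collections import defaultdict

from bertopic import BERTopic
from bert_score import score

# Placeholder corpora; in the study these would hold the 500 Yahoo! Answers
# questions, the ChatGPT responses, and the human-selected best answers,
# all aligned by index.
questions = [...]        # list[str], one entry per question
chatgpt_answers = [...]  # list[str], aligned with `questions`
best_answers = [...]     # list[str], aligned with `questions`

# Step 1: discover latent topics among the questions.
# (BERTopic needs a reasonably sized corpus to cluster meaningfully.)
topic_model = BERTopic(language="english")
topic_ids, _ = topic_model.fit_transform(questions)

# Step 2: compute BERTScore precision/recall/F1 between ChatGPT answers
# (candidates) and community best answers (references).
P, R, F1 = score(chatgpt_answers, best_answers, lang="en")

# Step 3: aggregate F1 per topic to surface topic-level differences.
per_topic = defaultdict(list)
for tid, f1 in zip(topic_ids, F1.tolist()):
    per_topic[tid].append(f1)

for tid, f1s in sorted(per_topic.items()):
    print(f"topic {tid}: mean F1 = {sum(f1s) / len(f1s):.3f}")
```

Under this sketch, the overall mean of F1 corresponds to the aggregate 0.827 reported above, while the per-topic means expose the factual-versus-subjective performance gap discussed in the analysis.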