Exploring Large Language Models' Responses to Moral Reasoning Dilemmas
Abstract
This study investigates how various large language models (LLMs) generate responses to moral reasoning dilemmas. It specifically examines LLM-generated responses using the Defining Issues Test (DIT-2) and the Intermediate Concepts Measure (ICM) for Educational Leaders. Using a neo-Kohlbergian approach to moral reasoning, the study evaluates responses from multiple LLM platforms: ChatGPT-3.5, ChatGPT-4, ChatGPT-4O, Grok Premium Plus, Claude 3.5 Sonnet, Gemini, and Gemini Advanced. On the DIT-2, Claude 3.5 Sonnet achieved the highest post-conventional moral reasoning (P) score and N2 score (P-score 72, N2 score 71.10), followed by Gemini Advanced (P-score 64, N2 score 60.31) and Gemini (P-score 58, N2 score 52.11). The other LLMs performed as follows: Grok (P-score 48, N2 score 47.98), ChatGPT-4O (P-score 44, N2 score 55.07), ChatGPT-4 (P-score 44, N2 score 46.53), and ChatGPT-3.5 (P-score 18, N2 score 36.20). For the ICM Educational Leaders version, Gemini Advanced had the highest total ICM score of 0.90, followed by Claude 3.5 Sonnet and Gemini (both 0.86), ChatGPT-4O and ChatGPT-4 (both 0.78), Grok (0.61), and ChatGPT-3.5 (0.32). The findings indicate that some LLMs can generate responses consistent with sophisticated moral reasoning patterns, producing scores comparable to or exceeding those of graduate-level human participants (whose P-scores typically range from 38.5 to 42.3). The study also provides a methodological framework, consisting of standardized assessment protocols and comparative analysis techniques, for larger-scale research to improve our understanding of AI's potential in moral reasoning.
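As a worked illustration of the comparison drawn in the final sentence, the minimal Python sketch below tabulates only the DIT-2 figures quoted in this abstract and flags each model's P-score against the cited graduate-level human band of 38.5 to 42.3. The dictionary and variable names are illustrative, not from the study's materials.

```python
# Sketch: compare the DIT-2 scores reported in the abstract against the
# typical graduate-level human P-score band (38.5-42.3) cited there.

# Model -> (P-score, N2 score), copied verbatim from the abstract.
DIT2_SCORES = {
    "Claude 3.5 Sonnet": (72, 71.10),
    "Gemini Advanced": (64, 60.31),
    "Gemini": (58, 52.11),
    "Grok Premium Plus": (48, 47.98),
    "ChatGPT-4O": (44, 55.07),
    "ChatGPT-4": (44, 46.53),
    "ChatGPT-3.5": (18, 36.20),
}

# Graduate-level human P-score range cited in the abstract.
HUMAN_P_LOW, HUMAN_P_HIGH = 38.5, 42.3

# Print models in descending P-score order with a verdict per model.
for model, (p_score, n2_score) in sorted(
    DIT2_SCORES.items(), key=lambda item: item[1][0], reverse=True
):
    if p_score > HUMAN_P_HIGH:
        verdict = "above the graduate-level band"
    elif p_score >= HUMAN_P_LOW:
        verdict = "within the graduate-level band"
    else:
        verdict = "below the graduate-level band"
    print(f"{model:<18} P={p_score:<3} N2={n2_score:<6} -> {verdict}")
```

Run as written, this reproduces the abstract's claim that every model except ChatGPT-3.5 meets or exceeds the typical graduate-level human range.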