Exploring Large Language Models' Responses to Moral Reasoning Dilemmas
Abstract
This study investigates how various large language models (LLMs) generate responses to moral reasoning dilemmas. It specifically examines LLM-generated responses using the Defining Issues Test (DIT-2) and the Intermediate Concepts Measure (ICM) for Educational Leaders. Using a neo-Kohlbergian approach to moral reasoning, the study evaluates responses from multiple LLM platforms: ChatGPT-3.5, ChatGPT-4, ChatGPT-4O, Grok Premium Plus, Claude 3.5 Sonnet, Gemini, and Gemini Advanced. On the DIT-2, Claude 3.5 Sonnet achieved the highest post-conventional moral reasoning (P) score and N2 score (P-score 72, N2 score 71.10), followed by Gemini Advanced (P-score 64, N2 score 60.31) and Gemini (P-score 58, N2 score 52.11). The other LLMs performed as follows: Grok (P-score 48, N2 score 47.98), ChatGPT-4O (P-score 44, N2 score 55.07), ChatGPT-4 (P-score 44, N2 score 46.53), and ChatGPT-3.5 (P-score 18, N2 score 36.20). For the ICM Educational Leaders version, Gemini Advanced had the highest total ICM score of 0.90, followed by Claude 3.5 Sonnet and Gemini (both 0.86), ChatGPT-4O and ChatGPT-4 (both 0.78), Grok (0.61), and ChatGPT-3.5 (0.32). The findings indicate that some LLMs can generate responses consistent with sophisticated moral reasoning patterns, producing scores comparable to or exceeding those of graduate-level human participants (whose P-scores typically range from 38.5 to 42.3). The study also provides a methodological framework, consisting of standardized assessment protocols and comparative analysis techniques, for larger-scale research to improve our understanding of AI's potential in moral reasoning.
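As a worked illustration of the comparison drawn in the final sentence, the minimal Python sketch below tabulates only the DIT-2 figures quoted in this abstract and flags each model's P-score against the cited graduate-level human band of 38.5 to 42.3. The dictionary and variable names are illustrative, not from the study's materials.

```python
# Sketch: compare the DIT-2 scores reported in the abstract against the
# typical graduate-level human P-score band (38.5-42.3) cited there.

# Model -> (P-score, N2 score), copied verbatim from the abstract.
DIT2_SCORES = {
    "Claude 3.5 Sonnet": (72, 71.10),
    "Gemini Advanced": (64, 60.31),
    "Gemini": (58, 52.11),
    "Grok Premium Plus": (48, 47.98),
    "ChatGPT-4O": (44, 55.07),
    "ChatGPT-4": (44, 46.53),
    "ChatGPT-3.5": (18, 36.20),
}

# Graduate-level human P-score range cited in the abstract.
HUMAN_P_LOW, HUMAN_P_HIGH = 38.5, 42.3

# Print models in descending P-score order with a verdict per model.
for model, (p_score, n2_score) in sorted(
    DIT2_SCORES.items(), key=lambda item: item[1][0], reverse=True
):
    if p_score > HUMAN_P_HIGH:
        verdict = "above the graduate-level band"
    elif p_score >= HUMAN_P_LOW:
        verdict = "within the graduate-level band"
    else:
        verdict = "below the graduate-level band"
    print(f"{model:<18} P={p_score:<3} N2={n2_score:<6} -> {verdict}")
```

Run as written, this reproduces the abstract's claim that every model except ChatGPT-3.5 meets or exceeds the typical graduate-level human range.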