Beyond Text Generation: Assessing Large Language Models' Ability to Follow Rules and Reason Logically
Abstract
The growing interest in advanced large language models (LLMs) has sparked debate about how best to use them to enhance human productivity, including teaching and learning outcomes. However, a neglected issue in this debate is whether these chatbots can follow strict rules and use reason to solve problems in novel contexts. To address this knowledge gap, we investigate the ability of five LLMs (ChatGPT-4o, Claude, Gemini, Meta AI, and Mistral) to solve and create word ladder puzzles, assessing their rule-adherence and logical reasoning capabilities. Our two-phase methodology involves: 1) explicit instruction and word ladder puzzle-solving tasks to evaluate rule understanding, followed by 2) assessing LLMs' ability to create and solve word ladder puzzles while adhering to the rules. Additionally, we test their ability to implicitly recognize and avoid HIPAA privacy rule violations in a real-world scenario. Our findings reveal that while LLMs can articulate the rules of word ladder puzzles and generate examples, they systematically fail to apply these rules and use logical reasoning in practice. Notably, all LLMs except Claude prioritized task completion (text writing) over ethical considerations in the HIPAA test. These findings expose critical flaws in LLMs' rule-following and reasoning capabilities and therefore raise concerns about their reliability in tasks requiring strict rule-following and logical reasoning. We urge caution when integrating LLMs into critical fields, including education, and highlight the need for further research into their capabilities and limitations to ensure responsible AI development.
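For readers unfamiliar with the puzzle, the rules the study evaluates can be stated programmatically. The sketch below is our own illustration, not code from the paper: it checks whether a candidate word ladder obeys the two core rules, namely that every word in the chain is a valid dictionary word and that consecutive words differ by exactly one letter. The function names and the small word list are hypothetical.

```python
def differs_by_one_letter(a: str, b: str) -> bool:
    """True if a and b have the same length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def is_valid_ladder(ladder: list[str], dictionary: set[str]) -> bool:
    """Check the two core word ladder rules: every word is in the dictionary,
    and each consecutive pair of words differs by exactly one letter."""
    if any(word not in dictionary for word in ladder):
        return False
    return all(differs_by_one_letter(prev, nxt)
               for prev, nxt in zip(ladder, ladder[1:]))


# Example: COLD -> CORD -> CARD -> WARD -> WARM (toy dictionary for illustration)
words = {"cold", "cord", "card", "ward", "warm"}
print(is_valid_ladder(["cold", "cord", "card", "ward", "warm"], words))  # True
print(is_valid_ladder(["cold", "warm"], words))                          # False
```

A check of this kind is what "rule adherence" amounts to in the puzzle-solving phase: a solution is either valid under the stated constraints or it is not, which is what makes word ladders a clean probe of whether an LLM can apply rules it can otherwise articulate.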