Beyond Text Generation: Assessing Large Language Models' Ability to Follow Rules and Reason Logically
Abstract
The growing interest in advanced large language models (LLMs) has sparked debate about how best to use them to enhance human productivity, including teaching and learning outcomes. However, a neglected issue in this debate is whether these chatbots can follow strict rules and use reason to solve problems in novel contexts. To address this knowledge gap, we investigate the ability of five LLMs (ChatGPT-4o, Claude, Gemini, Meta AI, and Mistral) to solve and create word ladder puzzles, assessing their rule-adherence and logical reasoning capabilities. Our two-phase methodology involves: 1) explicit instruction and word ladder puzzle-solving tasks to evaluate rule understanding, followed by 2) assessing LLMs' ability to create and solve word ladder puzzles while adhering to the rules. Additionally, we test their ability to implicitly recognize and avoid HIPAA privacy rule violations in a real-world scenario. Our findings reveal that while LLMs can articulate the rules of word ladder puzzles and generate examples, they systematically fail to apply these rules and use logical reasoning in practice. Notably, all LLMs except Claude prioritized task completion (text writing) over ethical considerations in the HIPAA test. These findings expose critical flaws in LLMs' rule-following and reasoning capabilities and therefore raise concerns about their reliability in tasks requiring strict rule-following and logical reasoning. We urge caution when integrating LLMs into critical fields, including education, and highlight the need for further research into their capabilities and limitations to ensure responsible AI development.
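For readers unfamiliar with the puzzle, the rules the study evaluates can be stated programmatically. The sketch below is our own illustration, not code from the paper: it checks whether a candidate word ladder obeys the two core rules, namely that every word in the chain is a valid dictionary word and that consecutive words differ by exactly one letter. The function names and the small word list are hypothetical.

```python
def differs_by_one_letter(a: str, b: str) -> bool:
    """True if a and b have the same length and differ in exactly one position."""
    if len(a) != len(b):
        return False
    return sum(x != y for x, y in zip(a, b)) == 1


def is_valid_ladder(ladder: list[str], dictionary: set[str]) -> bool:
    """Check the two core word ladder rules: every word is in the dictionary,
    and each consecutive pair of words differs by exactly one letter."""
    if any(word not in dictionary for word in ladder):
        return False
    return all(differs_by_one_letter(prev, nxt)
               for prev, nxt in zip(ladder, ladder[1:]))


# Example: COLD -> CORD -> CARD -> WARD -> WARM (toy dictionary for illustration)
words = {"cold", "cord", "card", "ward", "warm"}
print(is_valid_ladder(["cold", "cord", "card", "ward", "warm"], words))  # True
print(is_valid_ladder(["cold", "warm"], words))                          # False
```

A check of this kind is what "rule adherence" amounts to in the puzzle-solving phase: a solution is either valid under the stated constraints or it is not, which is what makes word ladders a clean probe of whether an LLM can apply rules it can otherwise articulate.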