Mind the Gap: From Plausible to Valid Self-Explanations in Large Language Models
Abstract
This paper investigates the reliability of explanations generated by large language models (LLMs) when prompted to explain their previous output. We evaluate two kinds of such self-explanations (SE) – extractive and counterfactual – using state-of-the-art LLMs (1B to 70B parameters) on two different classification tasks (objective and subjective). In line with Agarwal et al. (2024), our findings indicate a gap between perceived and actual model reasoning: while SE largely correlate with human judgment (i.e. are plausible), they do not fully and accurately follow the model’s decision process (i.e. are not faithful). Additionally, we show that counterfactual SE are not even necessarily valid in the sense of actually changing the LLM’s prediction. Our results suggest that extractive SE provide the LLM’s “guess” at an explanation based on training data. Conversely, counterfactual SE can help understand the LLM’s reasoning: We show that the issue of validity can be resolved by sampling counterfactual candidates at high temperature – followed by a validity check – and introducing a formula to estimate the number of tries needed to generate valid explanations. This simple method produces plausible and valid explanations that offer a faster alternative to SHAP.
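The sample-then-check procedure described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes a standard geometric-trials estimate for the number of tries (the paper's actual formula may differ), and `generate_candidate` / `prediction_flipped` are hypothetical stand-ins for the high-temperature LLM sampler and the classifier check.

```python
import math
from typing import Callable, Optional

def estimate_tries(p_valid: float, confidence: float = 0.95) -> int:
    """Geometric-trials estimate (an assumption, not the paper's formula):
    number of independent samples needed so that, with probability
    `confidence`, at least one candidate is valid, given a per-sample
    validity probability `p_valid`."""
    return math.ceil(math.log(1 - confidence) / math.log(1 - p_valid))

def sample_valid_counterfactual(
    generate_candidate: Callable[[], str],      # high-temperature LLM sample (hypothetical stub)
    prediction_flipped: Callable[[str], bool],  # validity check: does the edit change the label?
    max_tries: int,
) -> Optional[str]:
    """Sample counterfactual candidates until one actually flips the
    model's prediction, i.e. passes the validity check."""
    for _ in range(max_tries):
        candidate = generate_candidate()
        if prediction_flipped(candidate):
            return candidate  # first valid counterfactual found
    return None  # no valid counterfactual within the budget
```

For example, if roughly half of sampled candidates are valid, `estimate_tries(0.5)` yields a budget of 5 samples for 95% confidence of finding at least one.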