The Challenge of Debiasing NLI Models: Why Hypothesis-Only Confidence Is Insufficient
Abstract
Pre-trained models achieve high accuracy on NLI benchmarks but may rely on dataset artifacts rather than genuine reasoning. We investigate the ELECTRA-small model (Clark et al., 2020) on SNLI (Bowman et al., 2015), finding that a hypothesis-only baseline reaches 89.40% accuracy, just 0.29 percentage points below the full model's 89.69%. This reveals severe hypothesis bias: the model can make predictions largely without considering premise-hypothesis relationships. Through qualitative analysis, we identify three primary error patterns, all driven by hypothesis-only artifacts: exact word overlap, semantic associations, and action overlap. To address this bias, we implement ensemble debiasing, systematically exploring weighting strengths (α = 0.3, 0.5, 0.9). However, this approach degrades performance, for example increasing contradiction → neutral errors from 231 to 240. Our analysis suggests that hypothesis-only confidence does not cleanly separate spurious shortcuts from legitimate linguistic signals, highlighting a core challenge in debiasing NLI models.
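The abstract does not spell out how the hypothesis-only baseline is constructed. A common setup, sketched below, fine-tunes the same ELECTRA-small checkpoint while feeding the tokenizer only the hypothesis field of each SNLI example; the checkpoint name `google/electra-small-discriminator` and the Hugging Face `transformers`/`datasets` APIs are assumptions on our part, not details taken from the paper.

```python
# Minimal sketch of a hypothesis-only SNLI baseline (assumed setup,
# not the authors' exact code). Requires: transformers, datasets, torch.
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "google/electra-small-discriminator"  # assumed checkpoint

tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_NAME, num_labels=3  # entailment / neutral / contradiction
)

# SNLI marks examples without a gold label as -1; drop them.
snli = load_dataset("snli").filter(lambda ex: ex["label"] != -1)

def encode_hypothesis_only(example):
    # Deliberately omit the premise: any accuracy above chance must
    # come from hypothesis-side artifacts alone.
    return tokenizer(example["hypothesis"], truncation=True, max_length=128)

encoded = snli.map(encode_hypothesis_only)
# Fine-tune `model` on `encoded["train"]` with a standard Trainer loop;
# the paper reports 89.40% accuracy for this baseline.
```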
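The exact ensemble debiasing formulation is likewise not given in the abstract. The sketch below shows one standard product-of-experts style variant, in which a frozen hypothesis-only model's log-probabilities are added, scaled by α, to the main model's logits during training; the function name `ensemble_debias_logits` is illustrative.

```python
import torch
import torch.nn.functional as F

def ensemble_debias_logits(main_logits, bias_logits, alpha=0.5):
    """Combine main-model and hypothesis-only (bias) logits in log space.

    Product-of-experts style combination: the ensemble score is the main
    model's log-probability plus alpha times the bias model's. Training
    on these combined logits discourages the main model from relying on
    signals the bias model already captures. `alpha` is the weighting
    strength (the paper explores 0.3, 0.5, 0.9).
    """
    log_p_main = F.log_softmax(main_logits, dim=-1)
    log_p_bias = F.log_softmax(bias_logits, dim=-1)
    return log_p_main + alpha * log_p_bias

# Hypothetical usage with a batch of 3-way NLI logits.
main_logits = torch.randn(8, 3)   # full premise+hypothesis model
bias_logits = torch.randn(8, 3)   # frozen hypothesis-only model
labels = torch.randint(0, 3, (8,))

ensemble = ensemble_debias_logits(main_logits, bias_logits, alpha=0.5)
loss = F.cross_entropy(ensemble, labels)  # backprop through main model only
```

At inference time only the main model is used; the bias model participates solely in the training loss, so larger α pushes the main model harder away from hypothesis-only signals, which is consistent with the trade-off the abstract reports.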