Evaluation Benchmark Study for XAI Methods in Arabic Sentiment Analysis

Abstract

Explainable Artificial Intelligence (XAI) is essential for interpreting transformer-based models, yet the faithfulness and stability of explanation methods in non-English languages remain underexplored. This work presents a comprehensive benchmark of token-level XAI methods for Arabic sentiment analysis, evaluating LIME, SHAP, Integrated Gradients, DeepLIFT, and multiple ensemble variants across two transformer architectures (CAMeLBERT and AraBERT). We assess explanations using five established faithfulness metrics and complement score-based evaluation with rank-based aggregation via Borda count. We show that selective ensembling, particularly combining LIME and SHAP, yields a statistically significant but modest improvement over individual methods, strengthening ranking stability and robustness rather than absolute explanation quality. Bootstrap confidence intervals and paired Wilcoxon tests confirm the consistency of this effect. Our analysis further highlights persistent limitations of faithfulness metrics, including low correlation with Leave-One-Out perturbations, underscoring ongoing challenges in XAI evaluation. Overall, this study provides a rigorous, reproducible benchmark and practical guidance for explanation-method selection in Arabic NLP.
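
To make the evaluation pipeline concrete, below is a minimal sketch, not the authors' released code, of how the three core steps described in the abstract might be implemented: averaging normalized LIME and SHAP attributions into a selective ensemble, aggregating per-metric method rankings with a Borda count, and testing the paired improvement with a Wilcoxon signed-rank test plus a bootstrap confidence interval. The attribution arrays and score values are illustrative placeholders, and the function names are hypothetical.

```python
import numpy as np
from scipy.stats import wilcoxon

def ensemble_attributions(lime_attr: np.ndarray, shap_attr: np.ndarray) -> np.ndarray:
    """Selective LIME+SHAP ensemble: average the L1-normalized token attributions."""
    lime_n = lime_attr / (np.abs(lime_attr).sum() + 1e-12)
    shap_n = shap_attr / (np.abs(shap_attr).sum() + 1e-12)
    return (lime_n + shap_n) / 2.0

def borda_count(scores_by_metric: np.ndarray) -> np.ndarray:
    """Rank-based aggregation: each faithfulness metric ranks the methods
    (higher score = better rank); Borda points are summed across metrics."""
    points = np.zeros(scores_by_metric.shape[1])
    for metric_scores in scores_by_metric:
        # argsort of argsort yields 0-based ranks ascending by score,
        # so the best-scoring method earns the most points per metric
        points += metric_scores.argsort().argsort()
    return points

# Illustrative per-instance faithfulness scores for two methods (synthetic data)
rng = np.random.default_rng(0)
single = rng.normal(0.60, 0.05, size=200)             # e.g., SHAP alone
ensemble = single + rng.normal(0.02, 0.03, size=200)  # modest ensemble gain

# Paired Wilcoxon signed-rank test for the ensemble-vs-single comparison
stat, p = wilcoxon(ensemble, single)
print(f"Wilcoxon p-value: {p:.4f}")

# Bootstrap 95% CI for the mean paired difference
diffs = ensemble - single
boot = [rng.choice(diffs, size=diffs.size, replace=True).mean() for _ in range(2000)]
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"Mean improvement: {diffs.mean():.4f}, 95% CI [{lo:.4f}, {hi:.4f}]")
```

On synthetic data like the above, the paired test and bootstrap interval illustrate the paper's reported pattern: a small but consistent positive difference whose confidence interval excludes zero, i.e., a statistically significant yet modest gain.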
