Interpretability and Trust in Large Language and Agentic Models: A Survey of Methods, Metrics, and Applications
Abstract
Large language models (LLMs) and the agentic systems that embed them are being deployed across finance, healthcare, law and other high-stakes domains. Their emergence has intensified concerns about interpretability (the ability to understand how the models produce their outputs) and trust (the confidence that the models behave reliably and ethically). The opaque internal representations of deep neural models mean that decisions may be unpredictable or unfair, undermining public confidence and limiting adoption. This paper surveys the state of interpretability and trust for both standalone LLMs and agentic AI systems, synthesising methodological advances, evaluation metrics and real-world applications. We organise methods into feature-attribution techniques such as LIME and SHAP, example-based and counterfactual explanations, process-level and mechanistic interpretability, and system-level approaches tailored to agentic, multi-agent systems. We then review evaluation frameworks that measure explanation quality, fairness, robustness and other trust dimensions, including recent benchmarks such as TrustLLM and psychometric scales for human–LLM trust. We discuss how interpretability interacts with safety, robustness, privacy and ethics, and how adaptive monitoring and balanced evaluation frameworks can promote trustworthy deployment. Finally, we highlight open research challenges in ensuring that increasingly autonomous agentic systems remain transparent, accountable and aligned with human values.
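To make the feature-attribution category concrete, the sketch below shows a minimal occlusion-style (leave-one-out) attribution over input tokens, a simpler relative of LIME and SHAP rather than either method itself. The `score` function is a hypothetical stand-in for a model's scalar confidence in a particular output; the toy scorer in the usage example is purely illustrative and does not correspond to any system surveyed here.

```python
# Minimal occlusion-style feature attribution over input tokens.
# A token's attribution is the drop in the model's score when that
# token is masked out; larger drops indicate more important tokens.

from typing import Callable, List, Tuple


def occlusion_attribution(
    tokens: List[str],
    score: Callable[[List[str]], float],
    mask_token: str = "[MASK]",
) -> List[Tuple[str, float]]:
    """Return (token, attribution) pairs for a scalar scoring function."""
    baseline = score(tokens)
    attributions = []
    for i, tok in enumerate(tokens):
        masked = tokens[:i] + [mask_token] + tokens[i + 1:]
        attributions.append((tok, baseline - score(masked)))
    return attributions


if __name__ == "__main__":
    # Hypothetical scoring function standing in for a real model's
    # class probability or log-likelihood for a chosen output.
    def toy_score(tokens: List[str]) -> float:
        positive = {"reliable", "transparent", "trustworthy"}
        return sum(1.0 for t in tokens if t in positive) / max(len(tokens), 1)

    sentence = "the agent gave a transparent and reliable answer".split()
    for token, weight in occlusion_attribution(sentence, toy_score):
        print(f"{token:>12s}  {weight:+.3f}")
```

LIME and SHAP refine this basic perturb-and-measure idea: LIME fits a local surrogate model over many random perturbations, while SHAP averages marginal contributions over coalitions of features to satisfy Shapley-value axioms.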