From Testing to Evaluation of NLP and LLM Systems: An Analysis of Researchers' and Practitioners' Perspectives through Systematic Literature Review and Developers' Community Platforms Mining


Abstract

Natural Language Processing (NLP) systems and Large Language Models (LLMs) are core components of applications driven by Artificial Intelligence, now widespread in many fields including healthcare, finance, and legal services. The evaluation of quality attributes of NLP/LLM-based systems is currently of great interest to researchers as well as practitioners. This study presents an analysis of the evaluation of quality attributes of NLP/LLM-based systems from the perspectives of both researchers and practitioners. The former is based on a systematic review of the scientific literature; the latter is based on data mined from professional platforms (StackOverflow, Data Science StackExchange, AI StackExchange) used by developer communities to share knowledge and technical solutions. The systematic literature review features a quantitative analysis of: 1) the quality attributes that researchers target in their NLP/LLM evaluation studies; 2) the tasks, datasets, and models used; and 3) the evaluation methods employed in research studies. The mining of discussion data from professional platforms features a trend analysis of the significance of quality attributes and of the difficulty of their evaluation from the developers' perspective. Overall, the comparison between researchers' and practitioners' perspectives provides insights into how challenges related to the evaluation of NLP/LLM quality attributes are addressed, reveals several interesting differences, and draws useful implications for both communities.