Comparative Analysis of Evaluation Methods for Generative Artificial Intelligence Systems and Development of Selection Algorithm

Abstract

With the development of generative artificial intelligence and the active adoption of large language models (LLMs) across a wide range of domains, the objective evaluation of the quality of such AI systems becomes a critical task. Traditional machine learning metrics are often inapplicable, since the responses of LLM-based systems demonstrate high variability in wording while maintaining semantic correctness. This paper analyzes existing approaches to evaluating the quality of systems built on generative AI, including lexical methods, semantic embeddings, and hybrid approaches based on LLM-as-a-Judge and natural language inference (NLI). Particular attention is paid to the development of an algorithm for selecting the optimal evaluation strategy depending on task requirements, including evaluation latency, the correctness and interpretability of the results, and the stability and reproducibility of the obtained scores. For comparison, the paper presents the results of applying different evaluation methods to assessing the accuracy and relevance of AI system responses on a set of 500 test examples; the methods demonstrate correlations with expert assessments ranging from 0.67 to 0.92, depending on the chosen approach. The proposed algorithm can be used to build a suitable evaluation process for AI systems in various domains.
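To make the contrast between the evaluation families concrete, the following Python sketch compares a lexical overlap score with an embedding-based semantic score and adds a toy decision rule in the spirit of the selection algorithm described above. This is an illustration only: the function names, the sentence-transformers model choice, the latency threshold, and the decision logic are assumptions of this sketch, not the paper's actual algorithm.

# Illustrative sketch; method names, thresholds, and model choice are assumptions.
from collections import Counter

def token_f1(reference: str, candidate: str) -> float:
    """Lexical overlap score: fast and deterministic, but blind to paraphrase."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    overlap = sum((Counter(ref) & Counter(cand)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(cand), overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

def embedding_similarity(reference: str, candidate: str) -> float:
    """Semantic score: cosine similarity of sentence embeddings.
    Requires `pip install sentence-transformers`; the model name is an example."""
    from sentence_transformers import SentenceTransformer, util
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode([reference, candidate])
    return float(util.cos_sim(emb[0], emb[1]))

def choose_method(latency_budget_ms: float, needs_interpretability: bool,
                  has_references: bool) -> str:
    """Toy decision rule trading off latency, interpretability, and
    reference availability (not the paper's published algorithm)."""
    if latency_budget_ms < 50:
        return "lexical"        # cheapest and fully reproducible
    if needs_interpretability and has_references:
        return "nli"            # entailment labels are inspectable
    if has_references:
        return "embedding"      # robust to paraphrase, still fast
    return "llm_as_judge"       # reference-free, but slower and less stable

print(token_f1("Paris is the capital of France",
               "The capital of France is Paris"))   # 1.0 despite reordering
print(choose_method(latency_budget_ms=200,
                    needs_interpretability=False,
                    has_references=True))           # "embedding"

Even this toy version shows why a selection step matters: the lexical score rewards surface overlap regardless of meaning, while the embedding score tolerates rewording at a higher computational cost, so the right choice depends on the latency and interpretability constraints of the deployment.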