Evaluation of Large Language Models: Review of Metrics, Applications, and Methodologies


Abstract

Large Language Models (LLMs) have revolutionized various domains, including finance, medicine, and education. This review paper provides a comprehensive survey of the key metrics and methodologies employed to evaluate LLMs. We discuss the importance of evaluation, explore a wide range of metrics covering aspects such as accuracy, coherence, relevance, and safety, and examine different evaluation frameworks and techniques. We also address the challenges in LLM evaluation and highlight best practices for ensuring reliable and trustworthy AI systems. The survey draws on a wide range of recent research and practical insights to offer a holistic view of the current state of LLM evaluation. We survey a comprehensive evaluation framework that integrates quantitative metrics, such as entropy-based stability measures and domain-specific scoring systems for medical diagnostics and financial analysis, while addressing persistent challenges, including hallucination rates (reported at 28% of outputs in current research) and geographical biases in model responses. The study proposes standardized benchmarks and hybrid human-AI evaluation pipelines to enhance reliability, supported by algorithmic innovations in training protocols and retrieval-augmented generation (RAG) architectures. Our findings underscore the necessity of robust, domain-adapted evaluation methodologies to ensure the safe deployment of LLMs in high-stakes applications. Through a systematic analysis of more than 70 studies, this paper shows that while LLMs achieve near-human performance on structured tasks such as certification exams, they exhibit critical limitations in open-ended reasoning and output consistency. Our analysis covers foundational concepts in prompt engineering, evaluation methodologies from industry and academia, and practical tools for implementing these assessments. The paper examines key challenges in LLM evaluation, including bias detection, hallucination measurement, and context retention, while proposing standardized approaches for comparative analysis. We demonstrate how different evaluation frameworks can be applied across domains such as technical documentation, creative writing, and factual question answering. The findings provide practitioners with a structured approach to selecting appropriate evaluation metrics based on use-case requirements, model characteristics, and desired outcomes.
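To make the idea of an entropy-based stability measure concrete, the sketch below shows one minimal way such a metric could be computed; the exact formulation varies across the surveyed studies, and this version assumes repeated sampling of a single prompt, exact-match normalization of answers, and Shannon entropy over the resulting answer distribution. The function names and the example samples are illustrative, not taken from any specific study.

import math
from collections import Counter
from typing import Iterable

def response_entropy(responses: Iterable[str]) -> float:
    # Shannon entropy (in bits) of the empirical distribution of responses.
    # Responses are normalized (lowercased, stripped) so trivially different
    # strings count as the same answer. Entropy of 0 means the model gave the
    # same answer every time; higher values indicate less stable output.
    normalized = [r.strip().lower() for r in responses]
    counts = Counter(normalized)
    total = sum(counts.values())
    entropy = 0.0
    for count in counts.values():
        p = count / total
        entropy -= p * math.log2(p)
    return entropy

def stability_score(responses: Iterable[str]) -> float:
    # Map entropy onto [0, 1], where 1.0 means perfectly consistent output.
    responses = list(responses)
    if len(responses) <= 1:
        return 1.0
    max_entropy = math.log2(len(responses))  # worst case: all answers distinct
    return 1.0 - response_entropy(responses) / max_entropy

# Example: five samples of the same prompt from a hypothetical model.
samples = ["Paris", "Paris", "paris", "Lyon", "Paris"]
print(f"entropy = {response_entropy(samples):.3f} bits")
print(f"stability = {stability_score(samples):.3f}")

A low stability score under repeated sampling is one signal of the output-consistency limitations discussed above; in practice such a measure would be combined with the domain-specific scoring and human review that the surveyed evaluation pipelines describe.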
