RLHF-Aligned Open LLMs: A Comparative Survey
Abstract
We survey recent open-weight large language models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and related AI-assisted methods, focusing on LLaMA 2 (7B/13B chat variants), LLaMA 3 (8B, 70B), Mistral 7B, Mixtral 8×7B (sparse mixture-of-experts), Falcon 7B-Instruct, OpenAssistant-based models, Alpaca 7B, and Zephyr 7B, with closed models (GPT-4, Claude 3) included for reference. For each model we describe its alignment strategy (PPO, rejection sampling, DPO, RLAIF), reward-modeling approach, architecture, and fine-tuning details (datasets, procedures, hyperparameters). We evaluate all models on multi-turn dialogue and factuality benchmarks (MT-Bench, TruthfulQA) and on safety/alignment criteria (helpfulness and harmlessness from HH-RLHF), reporting reward-model scores, helpfulness/harmlessness ratings, factual accuracy, output diversity, and calibration. Alongside the survey, we present SAWYER, our five-stage open pipeline (red-teaming with AI critique, instruction fine-tuning, reward-model training, PPO alignment, and deployment) that we used to reproduce PPO/DPO tuning on a GPT-2 backbone. SAWYER's PPO variant achieved mean reward scores of 2.4–2.5 (a 30% gain over supervised fine-tuning) while preserving output diversity and fluency. Our results confirm that DPO-style distillation and AI-driven critique loops yield efficient alignment, and we highlight which strategies work best at each model scale and task.
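
For readers unfamiliar with the preference-optimization objectives mentioned above, the sketch below shows a minimal DPO loss in plain PyTorch. It is an illustration only, not SAWYER's implementation or that of any surveyed model; the function and tensor names (`dpo_loss`, `policy_chosen_logps`, etc.) and the default `beta` are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument holds the summed log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy vs. reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style logistic loss on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In a full training run, the summed log-probabilities would be computed by scoring chosen/rejected response pairs from a preference dataset (e.g., HH-RLHF) under the policy and a frozen reference copy of the same model.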