RLHF-Aligned Open LLMs: A Comparative Survey

Abstract

We survey recent open-weight large language models (LLMs) fine-tuned with Reinforcement Learning from Human Feedback (RLHF) and related AI-assisted methods, focusing on LLaMA 2 (7B/13B chat variants), LLaMA 3 (8B, 70B), Mistral 7B, Mixtral 8×7B (sparse mixture-of-experts), Falcon 7B-Instruct, OpenAssistant-based models, Alpaca 7B, and Zephyr 7B; closed models (GPT-4, Claude 3) are included for reference. For each model we describe its alignment strategy (PPO, rejection sampling, DPO, RLAIF), reward-modeling approach, architecture, and fine-tuning details (datasets, procedures, hyperparameters). We evaluate all models on multi-turn dialogue and factuality benchmarks (MT-Bench, TruthfulQA) and on safety/alignment criteria drawn from HH-RLHF; reported metrics cover reward-model scores, helpfulness and harmlessness, factual accuracy, output diversity, and calibration. Alongside the survey, we present SAWYER, a five-stage open pipeline (red-teaming with AI critique, instruction fine-tuning, reward-model training, PPO alignment, and deployment) that we used to reproduce PPO and DPO tuning on a GPT-2 backbone. SAWYER's PPO variant achieved mean reward scores of 2.4–2.5, a 30% gain over supervised fine-tuning, while preserving output diversity and fluency. Our results confirm that DPO-style distillation and AI-driven critique loops yield efficient alignment, and we highlight which strategies work best at each scale and task.
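To make the DPO objective referenced above concrete, the snippet below is a minimal illustrative sketch (not the SAWYER implementation or any surveyed model's code); it assumes precomputed sequence-level log-probabilities for the preferred ("chosen") and dispreferred ("rejected") responses under the trainable policy and a frozen reference model, and the function name and tensor shapes are hypothetical.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument is a 1-D tensor of summed token log-probabilities for the
    chosen or rejected response under the trainable policy or the frozen
    reference model. `beta` scales the implicit KL penalty toward the
    reference model.
    """
    # Log-ratio of policy to reference acts as an implicit reward.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry preference likelihood: chosen should beat rejected.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities (batch of 4 preference pairs).
if __name__ == "__main__":
    b = 4
    loss = dpo_loss(torch.randn(b), torch.randn(b),
                    torch.randn(b), torch.randn(b))
    print(f"DPO loss: {loss.item():.4f}")
```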
