RLHF-Aligned Open LLMs: A Comparative Survey
Abstract
We survey recent open-weight large language models (LLMs) fine-tuned via Reinforcement Learning from Human Feedback (RLHF) and related AI-assisted methods, focusing on LLaMA 2 (7B/13B chat variants), LLaMA 3 (8B, 70B), Mistral 7B, Mixtral 8×7B (sparse mixture-of-experts), Falcon 7B-Instruct, OpenAssistant-based models, Alpaca 7B, and Zephyr 7B, with closed models (GPT-4, Claude 3) included for reference. For each model we describe its alignment strategy (PPO, rejection sampling, DPO, RLAIF), reward-modeling approach, architecture, and fine-tuning details (datasets, procedures, hyperparameters). We evaluate all models on multi-turn dialogue and factuality benchmarks (MT-Bench, TruthfulQA) and on safety/alignment criteria (helpfulness and harmlessness from HH-RLHF), reporting reward-model scores, helpfulness/harmlessness ratings, factual accuracy, output diversity, and calibration. Alongside the survey, we present SAWYER, our five-stage open pipeline (red-teaming with AI critique, instruction fine-tuning, reward-model training, PPO alignment, and deployment) that we used to reproduce PPO/DPO tuning on a GPT-2 backbone. SAWYER's PPO variant achieved mean reward scores of 2.4–2.5 (a 30% gain over supervised fine-tuning) while preserving output diversity and fluency. Our results confirm that DPO-style distillation and AI-driven critique loops yield efficient alignment, and we highlight which strategies work best at each model scale and task.
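
For readers unfamiliar with the preference-optimization objectives mentioned above, the sketch below shows a minimal DPO loss in plain PyTorch. It is an illustration only, not SAWYER's implementation or that of any surveyed model; the function and tensor names (`dpo_loss`, `policy_chosen_logps`, etc.) and the default `beta` are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Direct Preference Optimization loss over a batch of preference pairs.

    Each argument holds the summed log-probability of the chosen or rejected
    response under the trainable policy or the frozen reference model.
    """
    # Implicit rewards: scaled log-ratio of policy vs. reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Bradley-Terry style logistic loss on the reward margin.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In a full training run, the summed log-probabilities would be computed by scoring chosen/rejected response pairs from a preference dataset (e.g., HH-RLHF) under the policy and a frozen reference copy of the same model.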