Hugging Face Introduces the RLOO Algorithm: Lower Memory Use Brings Reinforcement Learning Back into the RLHF Mainstream

Original: Putting RL back in RLHF

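In recent years, direct alignment methods such as DPO have become popular because they avoid the complexity of reinforcement learning (RL), but online RL still has distinct advantages. Hugging Face has published a blog post introducing the RLOO (REINFORCE Leave-One-Out) algorithm as implemented in the TRL library. RLOO generates multiple samples and computes a leave-one-out baseline to reduce variance. This removes the need for PPO's large critic network and saves GPU memory, while achieving alignment results comparable to or better than PPO, making online RL a practical choice again.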

In recent years, methods such as Direct Preference Optimization (DPO) have become mainstream for large language model (LLM) alignment, as they eliminate the need to train a reward model and perform complex online reinforcement learning (RL). However, traditional online RL (such as PPO) still holds unique advantages in terms of exploration capability and preventing model collapse. To lower the barrier to online RL, Hugging Face has introduced the RLOO (REINFORCE Leave-One-Out) algorithm into its TRL (Transformer Reinforcement Learning) library.
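
The leave-one-out baseline named in RLOO can be illustrated with a minimal sketch. It assumes k completions have been sampled for a single prompt and each has already been given a scalar reward; the function name `rloo_advantages` and the reward values are hypothetical, and this is plain Python for illustration, not the TRL implementation.

```python
def rloo_advantages(rewards):
    """Leave-one-out advantages for k sampled completions of one prompt.

    Illustrative sketch only: `rewards` is a hypothetical list of scalar
    rewards, one per sampled completion.
    """
    k = len(rewards)
    if k < 2:
        raise ValueError("a leave-one-out baseline needs at least 2 samples")
    total = sum(rewards)
    # The baseline for sample i is the mean reward of the other k - 1 samples,
    # so no learned critic (value network) is required.
    return [r - (total - r) / (k - 1) for r in rewards]


# Hypothetical rewards for four completions of the same prompt.
rewards = [1.0, 2.0, 3.0, 6.0]
advantages = rloo_advantages(rewards)
print(advantages)
```

Each sample is compared against the average of its siblings, so a completion is reinforced only when it beats the other samples for the same prompt. Because the baseline comes from the samples themselves, this is where the memory saving over PPO's critic comes from.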

Summaries are AI-generated; the original article is authoritative.