Hugging Face Blog | Aug 8, 2023, 12:00 AM | important 80

Fine-Tuning Llama 2 with DPO: A Hugging Face TRL Implementation Guide

Original: Fine-tune Llama 2 with DPO

### Background and Pain Points

Traditional RLHF (Reinforcement Learning from Human Feedback), while achieving enormous success with models…

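Hugging Face has published a practical tutorial on fine-tuning Llama 2 with Direct Preference Optimization (DPO), using the DPOTrainer from its TRL (Transformer Reinforcement Learning) library. DPO is a newer alternative to traditional RLHF: it needs neither a separately trained reward model nor a complex PPO reinforcement learning stage. Instead, it optimizes the model directly on human preference data (pairs of preferred and rejected responses), which substantially lowers the barrier to alignment and the compute it requires.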
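To make "optimizing directly on preference pairs" concrete, here is a minimal sketch of the per-pair DPO objective in plain Python. It is not TRL code: `dpo_loss` is a hypothetical helper, the loss formula is taken from the DPO paper rather than from this summary, and the `prompt` / `chosen` / `rejected` record layout follows what the TRL documentation describes for DPOTrainer. The example text and the `beta=0.1` default are illustrative assumptions.

```python
import math

# One preference record in the prompt/chosen/rejected layout.
# The text itself is made up for illustration.
example = {
    "prompt": "Question: How do I reverse a list in Python?\n\nAnswer: ",
    "chosen": "Use my_list[::-1] or reversed(my_list).",
    "rejected": "Lists cannot be reversed.",
}


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one preference pair (hypothetical helper, not a TRL API).

    Each argument is the summed log-probability of a full response under
    either the model being trained (policy) or the frozen reference model.
    """
    # How much more (or less) likely each response is under the policy
    # than under the reference model.
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp

    # Implicit reward margin between the preferred and rejected response.
    margin = beta * (chosen_logratio - rejected_logratio)

    # -log(sigmoid(margin)) written as a numerically stable softplus(-margin).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
```

No reward model appears anywhere in this computation: the only inputs are log-probabilities from the policy and a frozen reference copy. When the policy still equals the reference, the margin is zero and the loss is `log(2)`; it falls as the policy shifts probability toward the chosen response and away from the rejected one.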


Tags: llama, trl, transformers, dpo, fine-tuning, rlhf, alignment

Summaries are AI-generated; the original article is authoritative.