Hugging Face Blog · Jan 19, 2021, 12:00 AM

Fit More Parameters and Train Faster in Hugging Face With ZeRO via DeepSpeed and FairScale

Original: Fit More and Train Faster With ZeRO via DeepSpeed and FairScale

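Hugging Face has announced that its Trainer now integrates ZeRO (Zero Redundancy Optimizer) from Microsoft DeepSpeed and Facebook FairScale. ZeRO significantly reduces GPU memory usage by sharding optimizer states, gradients, and model parameters across multiple GPUs. Developers can now train very large Transformer models that previously would not fit on limited hardware, and can substantially improve training efficiency.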
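As a rough illustration of how this integration is typically switched on, the sketch below writes a minimal DeepSpeed configuration that enables ZeRO stage 2 with fp16. The helper name `write_ds_config` and the file name `ds_config.json` are assumptions made for this example. The `zero_optimization` and `fp16` keys are standard DeepSpeed config sections, but the values shown are illustrative, not the article's recommended settings.

```python
import json
from pathlib import Path


def write_ds_config(path: str = "ds_config.json") -> dict:
    """Write a minimal DeepSpeed config enabling ZeRO stage 2 (illustrative values)."""
    config = {
        # Mixed precision keeps fp16 weights and gradients on the GPU.
        "fp16": {"enabled": True},
        # Stage 2 shards optimizer states and gradients across GPUs.
        "zero_optimization": {"stage": 2},
    }
    Path(path).write_text(json.dumps(config, indent=2))
    return config


if __name__ == "__main__":
    cfg = write_ds_config()
    print(json.dumps(cfg, indent=2))
    # Assumed usage: pass the file to the Trainer through
    # TrainingArguments(deepspeed="ds_config.json") and start the script
    # with the `deepspeed` launcher.
```

The Trainer reads the file through the `deepspeed` argument of `TrainingArguments`. Which other config sections are needed depends on the installed versions of `transformers` and `deepspeed`.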

As the parameter counts of Transformer models (such as GPT and T5) grow exponentially, deep learning runs into a severe "Memory Wall". With limited GPU memory, training models with tens or even hundreds of billions of parameters becomes extremely difficult: besides the model parameters themselves, optimizer states (e.g., for Adam) and gradients also consume large amounts of memory.
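To make the memory pressure concrete, here is a back-of-the-envelope sketch of model-state memory per GPU. It assumes mixed-precision Adam at roughly 16 bytes per parameter: 2 for fp16 weights, 2 for fp16 gradients, and 12 for the fp32 weight copy, momentum, and variance. The function name `zero_bytes_per_gpu` is hypothetical, and the estimate ignores activations, temporary buffers, and fragmentation.

```python
def zero_bytes_per_gpu(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate model-state bytes per GPU under a given ZeRO stage.

    stage 0: plain data parallelism (everything replicated)
    stage 1: optimizer states sharded
    stage 2: optimizer states and gradients sharded
    stage 3: optimizer states, gradients, and parameters sharded
    """
    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights copy + Adam momentum + variance
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    # Illustrative setting: 7.5B parameters on 64 GPUs.
    for stage in range(4):
        gb = zero_bytes_per_gpu(7_500_000_000, 64, stage) / 1e9
        print(f"ZeRO stage {stage}: {gb:.1f} GB of model states per GPU")
```

Under these assumptions, a 7.5B-parameter model needs about 120 GB of model states per GPU with no sharding. Sharding all three components across 64 GPUs brings that down to under 2 GB.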

Summaries are AI-generated; the original article is authoritative.