Hugging Face Blog, Aug 23, 2023, 12:00 AM

Making LLMs lighter with AutoGPTQ and transformers

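Hugging Face has officially integrated AutoGPTQ into the transformers ecosystem, adding support for loading and running 4-bit GPTQ-quantized models directly. The update substantially lowers the GPU memory needed to run LLMs (for example, a 70B model can run on a single consumer-grade GPU) and also speeds up inference. Developers can enable it with only minor code changes and can use the thousands of ready-made GPTQ models on the Hub without extra steps.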
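
As a rough illustration of the "minor code changes" mentioned above, the sketch below loads a pre-quantized GPTQ checkpoint through the usual `from_pretrained` call. It assumes `torch`, `transformers`, `optimum`, `auto-gptq`, and `accelerate` are installed and that a CUDA GPU is available. The helper names `load_gptq_model` and `generate` are hypothetical, and the default model ID is just one example of a GPTQ repository on the Hub, not a recommendation from the summary.

```python
def load_gptq_model(model_id="TheBloke/Llama-2-7b-Chat-GPTQ"):
    """Load a pre-quantized GPTQ checkpoint from the Hub (illustrative sketch)."""
    # Imports live inside the function so that defining it has no side effects;
    # calling it needs torch, transformers, optimum, auto-gptq, and accelerate.
    import torch
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # The quantization settings are read from the checkpoint's own config,
    # so the call looks the same as for a non-quantized model.
    model = AutoModelForCausalLM.from_pretrained(
        model_id,
        torch_dtype=torch.float16,
        device_map="auto",  # let accelerate place the weights on available devices
    )
    return tokenizer, model


def generate(tokenizer, model, prompt, max_new_tokens=32):
    """Generate a short continuation of `prompt` with the loaded model."""
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)


# Example usage (downloads several GB of weights, so it is left commented out):
# tokenizer, model = load_gptq_model()
# print(generate(tokenizer, model, "Quantization lets large models"))
```

Nothing GPTQ-specific appears in the calling code: the same `from_pretrained` path is used, and the checkpoint's metadata tells the library how to handle the 4-bit weights.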

This Hugging Face blog post introduces an update that integrates AutoGPTQ into the `transformers` and `optimum` libraries. GPTQ is an accurate and efficient post-training quantization algorithm that can compress model weights to 4 bits, cutting the weight memory footprint to roughly one-quarter of its fp16 size, with negligible loss in model accuracy as measured by perplexity.
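
To make the one-quarter figure concrete, here is a minimal pure-Python sketch of storing weights as 4-bit codes. It uses plain round-to-nearest quantization on a single group, which is an assumption made for simplicity: GPTQ itself chooses the quantized values with a more careful, error-compensating procedure, and real checkpoints also store a scale and zero point per group, which this size comparison ignores. All function names and the sample weights are hypothetical.

```python
def quantize_4bit(weights):
    """Round-to-nearest quantization of one group to 4-bit codes in 0..15."""
    lo, hi = min(weights), max(weights)
    # 16 levels give 15 steps; fall back to 1.0 if every weight is identical.
    scale = (hi - lo) / 15 or 1.0
    codes = [round((w - lo) / scale) for w in weights]
    return codes, scale, lo


def pack_nibbles(codes):
    """Pack two 4-bit codes into each byte, low nibble first."""
    if len(codes) % 2:
        codes = codes + [0]  # pad to an even count
    return bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, len(codes), 2))


def unpack_nibbles(packed, count):
    """Recover the first `count` 4-bit codes from packed bytes."""
    codes = []
    for byte in packed:
        codes.append(byte & 0x0F)
        codes.append(byte >> 4)
    return codes[:count]


def dequantize(codes, scale, lo):
    """Map 4-bit codes back to approximate floating-point weights."""
    return [lo + c * scale for c in codes]


weights = [-0.82, -0.31, -0.05, 0.0, 0.12, 0.27, 0.44, 0.9]
codes, scale, lo = quantize_4bit(weights)
packed = pack_nibbles(codes)
restored = dequantize(unpack_nibbles(packed, len(weights)), scale, lo)

fp16_bytes = 2 * len(weights)       # fp16 uses 2 bytes per weight
ratio = len(packed) / fp16_bytes    # packed 4-bit uses half a byte per weight
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"packed/fp16 size ratio: {ratio}, max abs error: {max_error:.4f}")
```

The size ratio follows directly from the bit widths (4 bits versus 16), while the reconstruction error is bounded by half a quantization step. Reducing the accuracy impact of that rounding is the part GPTQ improves on.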

Tags: llama, open-source, transformers, optimum, autogptq, quantization, gptq, llm-inference, gpu
