The deployment of large language models (LLMs) has long faced a dual bottleneck of VRAM capacity and memory bandwidth. Microsoft previously introduced the…
GGML is a lightweight, zero-dependency C/C++ tensor library developed by Georgi Gerganov. It was originally designed to enable efficient local inference of the…
### Background and Challenges

As generative AI technology evolves, image and video generation models are increasingly transitioning from traditional UNet…
Meta's Llama 3.1 represents a major milestone in the open-source AI landscape. The most notable model is the 405B (405 billion parameter) version — the first…
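For a rough sense of scale, the sketch below estimates the memory needed just to hold the weights of a 405-billion-parameter model at a few precisions. The helper name `weight_footprint_gb` and the chosen bit widths are illustrative assumptions, not figures taken from the Llama 3.1 release. Activations, the KV cache, and runtime overhead are ignored.

```python
def weight_footprint_gb(num_params: int, bits_per_param: int) -> float:
    """Approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


# 405 billion parameters, as stated for the largest Llama 3.1 model.
NUM_PARAMS = 405_000_000_000

# Bit widths are assumptions for illustration: 16-bit, 8-bit, and 4-bit weights.
for bits in (16, 8, 4):
    print(f"{bits:>2}-bit weights: ~{weight_footprint_gb(NUM_PARAMS, bits):.1f} GB")
```

Even at 4 bits per weight, the estimate exceeds the memory of any single consumer GPU, which is why models of this size are usually sharded across several accelerators.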
Following Apple's major Core ML updates announced at WWDC 24, Hugging Face published a practical guide detailing how to convert the popular open-source large…
During large language model (LLM) inference, the self-attention mechanism must store the Key and Value vectors of previous tokens (i.e…
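The cost of storing Keys and Values for every past token can be estimated with simple arithmetic. The sketch below is a minimal estimate, assuming a standard decoder where each layer caches one Key and one Value vector per attention head per token. The function name `kv_cache_bytes` and the example configuration are hypothetical and do not describe any specific model.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes per value assumes 16-bit storage
) -> int:
    """Bytes needed to cache Keys and Values for all layers and tokens."""
    # Factor of 2: one tensor for Keys and one for Values in every layer.
    return (
        2 * num_layers * num_kv_heads * head_dim
        * seq_len * batch_size * bytes_per_value
    )


# Hypothetical configuration, chosen only to illustrate the arithmetic.
size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
print(f"KV cache: {size / 2**30:.1f} GiB")
```

The estimate grows linearly with both sequence length and batch size, so long contexts or large batches can make the cache rival the model weights in size.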
As RAG (Retrieval-Augmented Generation) and semantic search have become widespread, the maintenance costs of vector databases — especially RAM overhead — have…
This technical blog post from Hugging Face details how to locally deploy and run Microsoft's lightweight Phi-2 language model (2.7 billion parameters) on a…
Hugging Face has officially introduced Quanto, a brand-new quantization library designed for PyTorch, which has been integrated as a backend into the Hugging…
This Hugging Face blog post explores in detail how to use the `Optimum Intel` library to accelerate inference for the StarCoder code-generation model on Intel…
Hugging Face announced the launch of a new open-source library called "Optimum-NVIDIA," the result of a deep collaboration with NVIDIA, aimed at seamlessly…
This technical guide from Hugging Face systematically introduces the core strategies for deploying and optimizing large language models (LLMs) in production…
As the parameter count of large language models (LLMs) has grown dramatically, running and fine-tuning these models on consumer-grade GPUs or limited hardware…
This official Hugging Face blog post introduces a major update that integrates AutoGPTQ into the `transformers` and `optimum` libraries. GPTQ (Generalized…
This blog post, co-authored by Hugging Face and Zama — a cryptography company specializing in Fully Homomorphic Encryption (FHE) — explores how to address a…
Since the release of Stable Diffusion XL (SDXL), its exceptional image generation quality has attracted widespread attention. However, its massive 1.3 billion…
In the era of rapidly advancing generative AI, deploying large deep learning models to users' personal devices (edge devices) has long been a major challenge…
In the current boom of generative AI, image generation models like Stable Diffusion have become widely popular thanks to their remarkable capabilities…
This official Hugging Face blog post introduces a deep integration with the `bitsandbytes` library, formally adding 4-bit quantization support to…
This article introduces the latest outcome of a collaboration between Hugging Face and Intel: "Q8-Chat," a project designed to demonstrate how to efficiently…
### Core Background and Challenges

DeepFloyd IF is an advanced text-to-image model released by DeepFloyd, a research lab under Stability AI. Unlike the…
This technical blog post from Hugging Face provides a detailed guide on optimizing and accelerating Stable Diffusion model inference on Intel CPUs…
This technical blog post from Hugging Face introduces how to combine TRL (Transformer Reinforcement Learning) and PEFT (Parameter-Efficient Fine-Tuning)…
This article is the second installment of a Hugging Face series on accelerating PyTorch Transformer models on Intel's 4th-generation Xeon Scalable Processors…
"Document AI" has been a key driver of enterprise digital transformation in recent years. It aims to automate the processing of unstructured documents such as…
As Transformer models become increasingly prevalent in natural language processing (NLP) and computer vision (CV), efficiently deploying these large models in…
This technical blog post from Hugging Face documents in detail the practical process of optimizing inference for BLOOM, the open-source multilingual large…
This article introduces the deep integration between Hugging Face and the bitsandbytes library, aimed at solving the enormous memory challenges posed by…
When deploying Transformer models in production, reducing inference latency and increasing throughput while keeping computational costs under control has…
This case study focuses on the performance of "Hugging Face Infinity" — Hugging Face's high-performance inference container solution — on modern CPUs…