As generative AI advances rapidly, deploying massive models to resource-constrained edge devices — such as smartphones, smart hardware, and AI PCs — has become…
As diffusion models (such as Flux.1 and Stable Diffusion 3) continue to grow in parameter count, often reaching billions or even tens of billions…
As large language models (LLMs) and vision language models (VLMs) continue to scale up, running these models on limited hardware resources — such as…
As the hardware performance of mobile devices continues to improve, "edge inference" — running large language models (LLMs) directly on smartphones — has…
This official Hugging Face blog post takes an in-depth look at the current state of open-source video generation models within the Diffusers ecosystem. As…
Hugging Face recently published an in-depth analysis of its well-known Open LLM Leaderboard, examining the carbon dioxide (CO₂) emissions generated during…
This article provides a detailed look at how to use Hugging Face's `optimum-intel` library and Intel's OpenVINO GenAI toolkit to optimize and deploy generative…
The deployment of large language models (LLMs) has long faced a dual bottleneck of VRAM capacity and memory bandwidth. Microsoft previously introduced the…
GGML is a lightweight, zero-dependency C/C++ tensor library developed by Georgi Gerganov. It was originally designed to enable efficient local inference of the…
### Background and Challenges

As generative AI technology evolves, image and video generation models are increasingly transitioning from traditional UNet…
Meta's Llama 3.1 represents a major milestone in the open-source AI landscape. The most notable model is the 405B (405-billion-parameter) version, the first…
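To give a sense of scale for a 405-billion-parameter model, the arithmetic below estimates how much memory the weights alone would occupy at a few precisions. This is a rough sketch: the helper name `weight_memory_bytes` and the chosen bit widths are illustrative assumptions, and the figures ignore activations, the KV cache, and any framework overhead.

```python
def weight_memory_bytes(num_params: int, bits_per_param: int) -> int:
    """Bytes needed to hold the weights alone (no activations, no KV cache)."""
    return num_params * bits_per_param // 8


if __name__ == "__main__":
    llama_405b = 405_000_000_000  # parameter count stated in the text
    for bits in (16, 8, 4):  # illustrative precisions, assumed for this sketch
        gb = weight_memory_bytes(llama_405b, bits) / 1e9
        print(f"{bits:>2}-bit weights: {gb:,.1f} GB")
```

At 16 bits per parameter this comes to 810 GB, which is why lower-precision formats matter so much for a model of this size.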
Following Apple's major Core ML updates announced at WWDC 24, Hugging Face published a practical guide detailing how to convert the popular open-source large…
During the inference process of large language models (LLMs), the self-attention mechanism needs to store the Key and Value vectors of historical tokens (i.e…
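The memory cost of storing Keys and Values for past tokens can be estimated with simple arithmetic. The sketch below is an illustration under stated assumptions: the function name `kv_cache_bytes` and the example configuration (32 layers, 32 KV heads, head dimension 128, 16-bit values) are hypothetical and do not describe any specific model from the post.

```python
def kv_cache_bytes(
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    seq_len: int,
    batch_size: int = 1,
    bytes_per_value: int = 2,  # 2 bytes corresponds to fp16/bf16 (assumed)
) -> int:
    """Approximate KV cache size for a decoder-only transformer."""
    # Factor of 2: each layer stores one Key tensor and one Value tensor.
    per_token = 2 * num_layers * num_kv_heads * head_dim * bytes_per_value
    return per_token * seq_len * batch_size


if __name__ == "__main__":
    # Hypothetical configuration, chosen only to make the numbers concrete.
    size = kv_cache_bytes(num_layers=32, num_kv_heads=32, head_dim=128, seq_len=4096)
    print(f"KV cache: {size / 1024**3:.2f} GiB")
```

For this assumed configuration the cache reaches 2 GiB at a 4096-token context, and it grows linearly with both sequence length and batch size.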
As RAG (Retrieval-Augmented Generation) and semantic search have become widespread, the maintenance costs of vector databases — especially RAM overhead — have…
This technical blog post from Hugging Face details how to locally deploy and run Microsoft's lightweight Phi-2 language model (2.7 billion parameters) on a…
Hugging Face has officially introduced Quanto, a brand-new quantization library designed for PyTorch, which has been integrated as a backend into the Hugging…
This Hugging Face blog post explores in detail how to use the Optimum Intel library to accelerate inference for the StarCoder code-generation model on Intel…
Hugging Face announced the launch of a new open-source library called "Optimum-NVIDIA," the result of a deep collaboration with NVIDIA, aimed at seamlessly…
This technical guide from Hugging Face systematically introduces the core strategies for deploying and optimizing large language models (LLMs) in production…
As the parameter count of large language models (LLMs) has grown dramatically, running and fine-tuning these models on consumer-grade GPUs or limited hardware…
This official Hugging Face blog post introduces a major update that integrates AutoGPTQ into the `transformers` and `optimum` libraries. GPTQ (Generalized…
This blog post, co-authored by Hugging Face and Zama — a cryptography company specializing in Fully Homomorphic Encryption (FHE) — explores how to address a…
Since the release of Stable Diffusion XL (SDXL), its exceptional image generation quality has attracted widespread attention. However, its massive 1.3 billion…
In the era of rapidly advancing generative AI, deploying large deep learning models to users' personal devices (edge devices) has long been a major challenge…
In the current boom of generative AI, image generation models like Stable Diffusion have become widely popular thanks to their remarkable capabilities…
This official Hugging Face blog post introduces a deep integration with the `bitsandbytes` library, formally adding 4-bit quantization support to…
This article introduces the latest outcome of a collaboration between Hugging Face and Intel: "Q8-Chat," a project designed to demonstrate how to efficiently…
### Core Background and Challenges

DeepFloyd IF is an advanced text-to-image model released by DeepFloyd, a research lab under Stability AI. Unlike the…
This technical blog post from Hugging Face provides a detailed guide on optimizing and accelerating Stable Diffusion model inference on Intel CPUs…
This technical blog post from Hugging Face introduces how to combine TRL (Transformer Reinforcement Learning) and PEFT (Parameter-Efficient Fine-Tuning)…