As large language models (LLMs) such as Llama 2 become more widely adopted, achieving efficient and cost-effective inference in production environments has…
As large language models (LLMs) and Retrieval-Augmented Generation (RAG) technology become increasingly widespread, embedding models have become an…
The official Hugging Face blog has announced "Inference for PROs", a new upgraded service for PRO subscribers ($9 per month). This service is designed to…
As the parameter count of large language models (LLMs) has grown dramatically, running and fine-tuning these models on consumer-grade GPUs or limited hardware…
This case study examines how Fetch, a leading consumer rewards platform in the United States, leveraged the collaboration between Amazon SageMaker and Hugging…
This official Hugging Face blog post systematically maps out the complete ecosystem Hugging Face has built around open-source large language models (LLMs). As…
This official Hugging Face blog post explains how to use the hosted service Inference Endpoints to deploy large language models (LLMs). With the rapid…
The Falcon large language models (including Falcon-40B and Falcon-7B), developed by Abu Dhabi's Technology Innovation Institute (TII), have…
This article explains how to accelerate the deployment and inference of Hugging Face Transformers models using AWS Inferentia2 (Inf2 instances), AWS's…
This article presents the results of a collaboration between Hugging Face and the Intel Habana team, focusing on how to leverage Intel's Habana Gaudi2 deep…
This case study from Mantis NLP details the core reasons behind their decision to migrate their machine learning model deployment workflow from traditional…
This article is the second installment of a Hugging Face series on accelerating PyTorch Transformer models on Intel's 4th-generation Xeon Scalable Processors…
This article is the first installment in a collaboration series between Hugging Face and Intel, focusing on how to accelerate PyTorch Transformer models using…
As the world's largest open-source AI model hub, Hugging Face not only provides model hosting but has also built a complete inference ecosystem. This article…
As Transformer models become increasingly prevalent in natural language processing (NLP) and computer vision (CV), efficiently deploying these large models in…
Hugging Face Inference Endpoints is a fully managed service designed for developers and enterprises, built to solve the pain points of deploying machine…
As the parameter counts of large language models (LLMs) grow rapidly, loading and running these models on limited hardware has become a major pain point…
BLOOM is a massive open-source multilingual model with 176 billion parameters. Running BLOOM at FP16 precision requires at least 352 GB of video memory (VRAM)…
This article introduces the deep integration between Hugging Face and the bitsandbytes library, aimed at solving the enormous memory challenges posed by…
When deploying Transformer models in production, latency and throughput are typically the key factors determining the quality of the user experience. ONNX…
When deploying Transformer models in production, reducing inference latency and increasing throughput while keeping computational costs under control has…
With the rise of open-source large language models, deploying these models in cloud environments in a secure, stable, and scalable manner has become a critical…
This blog post is the second part of a technical guide co-authored by Hugging Face and Intel, designed to show developers how to push the inference performance…
In this technical blog post, the Hugging Face team explains in detail how it achieved up to a 100x inference speedup for Transformer models for customers of…