This article presents the results of a collaboration between Hugging Face and the Intel Habana team, focusing on how to leverage Intel's Habana Gaudi2 deep…
This case study from Mantis NLP details the core reasons behind their decision to migrate their machine learning model deployment workflow from traditional…
This article is the second installment of a Hugging Face series on accelerating PyTorch Transformer models on Intel's 4th-generation Xeon Scalable Processors…
This article is the first installment in a collaboration series between Hugging Face and Intel, focusing on how to accelerate PyTorch Transformer models using…
As the world's largest open-source AI model hub, Hugging Face not only provides model hosting but has also built a complete inference ecosystem. This article…
As Transformer models become increasingly prevalent in natural language processing (NLP) and computer vision (CV), efficiently deploying these large models in…
Hugging Face Inference Endpoints is a fully managed service designed for developers and enterprises, built to solve the pain points of deploying machine…
As the parameter counts of large language models (LLMs) grow exponentially, loading and running these models on limited hardware has become a major pain point…
BLOOM is a massive open-source multilingual model with 176 billion parameters. Running BLOOM at FP16 precision requires at least 352 GB of GPU memory (VRAM)…
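The 352 GB figure follows from simple arithmetic: 176 billion parameters at 2 bytes each in FP16. Below is a minimal sketch of that calculation, assuming decimal gigabytes (1 GB = 10^9 bytes) and counting model weights only, so activations and other runtime buffers are ignored. The helper name `weight_memory_gb` is hypothetical, and the 8-bit row is included purely for comparison.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Return the memory needed for model weights alone, in decimal GB.

    Assumption: 1 GB = 1e9 bytes; activations and runtime buffers are ignored.
    """
    return num_params * bytes_per_param / 1e9


BLOOM_PARAMS = 176e9  # 176 billion parameters, as stated in the summary

fp16_gb = weight_memory_gb(BLOOM_PARAMS, 2)  # FP16 stores 2 bytes per parameter
int8_gb = weight_memory_gb(BLOOM_PARAMS, 1)  # 8-bit storage, for comparison only

print(f"FP16 weights: {fp16_gb:.0f} GB")
print(f"INT8 weights: {int8_gb:.0f} GB")
```

Halving the bytes per parameter halves the weight footprint, which is why lower-precision storage matters so much for a model of this size.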
This article introduces the deep integration between Hugging Face and the bitsandbytes library, aimed at solving the enormous memory challenges posed by…
When deploying Transformer models in production, latency and throughput are typically the key factors determining the quality of the user experience. ONNX…
When deploying Transformer models in production, reducing inference latency and increasing throughput while keeping computational costs under control has…
With the rise of open-source large language models, deploying these models in cloud environments in a secure, stable, and scalable manner has become a critical…
This blog post is the second part of a technical guide co-authored by Hugging Face and Intel, designed to show developers how to push the inference performance…
In this technical blog post, the Hugging Face team details how it achieved an inference speedup of up to 100x for Transformer models for customers of…