Hugging Face's official blog has announced that its widely adopted open-source large model inference framework, Text Generation Inference (TGI), now officially…
Hugging Face recently announced a major upgrade to its hosted model deployment service, "Inference Endpoints," introducing a brand-new and far more modern…
On February 18, 2025, Hugging Face announced the addition of three new partners to its serverless inference ecosystem: Hyperbolic, Nebius AI Studio, and Novita…
On February 14, 2025, Hugging Face — the leading open-source AI community — officially announced the integration of high-performance AI inference platform…
In the current era of generative AI sweeping the globe, many developers habitually feed all tasks — including simple text classification, sentiment analysis…
As DeepSeek-R1 swept through the AI landscape on the strength of its powerful reasoning capabilities, how to safely and efficiently deploy and fine-tune these…
Hugging Face has officially launched the "Inference Providers" feature on the Hugging Face Hub — a major update designed to address the pain points developers…
Text Generation Inference (TGI), Hugging Face's open-source LLM inference and deployment framework, has received a major architectural update, officially…
The AI deployment platform Replicate has announced the official availability of NVIDIA L40S GPU compute on its platform. This update aims to provide developers…
The deployment of large language models (LLMs) has long faced a dual bottleneck of VRAM capacity and memory bandwidth. Microsoft previously introduced the…
GGML is a lightweight, zero-dependency C/C++ tensor library developed by Georgi Gerganov. It was originally designed to enable efficient local inference of the…
Hugging Face and NVIDIA announced a major partnership in late July 2024, officially launching a serverless inference service powered by NVIDIA NIM (NVIDIA…
The Hugging Face official blog has introduced a major update to its open-source text generation inference engine, Text Generation Inference (TGI): the…
Hugging Face announced a deep partnership with Google Cloud, officially integrating Google Cloud TPUs (Tensor Processing Units) into the Hugging Face platform…
The official blog of Replicate, the popular AI model hosting and deployment platform, has announced that NVIDIA H100 Tensor Core GPUs will soon be officially…
This official Hugging Face blog post takes an in-depth look at how to benchmark Text Generation Inference (TGI), Hugging Face's open-source LLM inference and…
Hugging Face has announced official support for AWS Inferentia2 (Inf2) instances within its hosted Inference Endpoints service. This update gives developers…
As enterprise demand for Retrieval-Augmented Generation (RAG) technology surges, how to maintain high performance while controlling hardware costs has become…
This article introduces how to run privacy-preserving inference based on Fully Homomorphic Encryption (FHE) on Hugging Face Endpoints. In traditional…
Hugging Face announced the launch of a new open-source library called "Optimum-NVIDIA," the result of a deep collaboration with NVIDIA, aimed at seamlessly…
In real-world generative AI applications, fine-tuning for specific tasks or clients is a common requirement. However, deploying a full base model for every…
As large language models (LLMs) such as Llama 2 become more widely adopted, achieving efficient and cost-effective inference in production environments has…
As large language models (LLMs) and Retrieval-Augmented Generation (RAG) technology become increasingly widespread, embedding models have become an…
The Hugging Face official blog has announced a new "Inference for PROs" upgraded service for PRO subscribers (at $9 per month). This service is designed to…
As the parameter count of large language models (LLMs) has grown dramatically, running and fine-tuning these models on consumer-grade GPUs or limited hardware…
This case study examines how Fetch, a leading consumer rewards platform in the United States, leveraged the collaboration between Amazon SageMaker and Hugging…
This official Hugging Face blog post systematically maps out the complete ecosystem it has built around open-source large language models (LLMs). As…
This official Hugging Face blog post introduces how to use their hosted service "Inference Endpoints" to deploy large language models (LLMs). With the rapid…
The Falcon series of large language models (including Falcon-40B and Falcon-7B), developed by Abu Dhabi's Technology Innovation Institute (TII), have…
This article explains how to accelerate the deployment and inference of Hugging Face Transformers models using AWS Inferentia2 (Inf2 instances) — AWS's…