In robot learning and embodied AI, enabling controllers based on deep learning or vision-language-action models (VLAs) to run in real time has…
SGLang (Structured Generation Language) is a high-performance LLM inference and serving framework developed by the LMSYS team, renowned for its efficient…
Hugging Face announced a deep partnership with Groq, a chip company focused on ultra-fast AI inference, formally bringing Groq into the Hugging Face "Inference…
Hugging Face officially announced a partnership with Featherless AI, a serverless GPU inference platform, integrating it into the Hugging Face Inference…
Replicate, a managed AI inference platform, has announced a deep partnership with Hugging Face, the leading open-source AI community, officially bringing…
Hugging Face recently announced a brand-new, ultra-fast optimized deployment solution for OpenAI's open-source speech recognition model Whisper on its hosted…
When deploying large language models (LLMs), maintaining low latency and high throughput under many concurrent requests is one of the greatest…
### The Unique Challenges and Memory Bottlenecks of LLM Inference

Traditional web services primarily handle concurrent requests through multi-threading or…
Hugging Face's official blog has announced that its widely adopted open-source large model inference framework, Text Generation Inference (TGI), now officially…
Hugging Face recently announced a major upgrade to its hosted model deployment service, "Inference Endpoints," introducing a brand-new and far more modern…
Vercel has officially announced that three prominent AI infrastructure service providers — Groq, fal, and DeepInfra — have formally joined the Vercel…
On February 18, 2025, Hugging Face announced the addition of three new partners to its serverless inference ecosystem: Hyperbolic, Nebius AI Studio, and Novita…
On February 14, 2025, Hugging Face — the leading open-source AI community — officially announced the integration of high-performance AI inference platform…
In the current era of generative AI sweeping the globe, many developers habitually feed all tasks — including simple text classification, sentiment analysis…
As DeepSeek-R1 swept through the AI landscape on the strength of its powerful reasoning capabilities, how to safely and efficiently deploy and fine-tune these…
Hugging Face has officially launched the "Inference Providers" feature on the Hugging Face Hub — a major update designed to address the pain points developers…
Text Generation Inference (TGI), Hugging Face's open-source LLM inference and deployment framework, has received a major architectural update, officially…
The AI deployment platform Replicate has announced the official availability of NVIDIA L40S GPU compute on its platform. This update aims to provide developers…
The deployment of large language models (LLMs) has long faced a dual bottleneck of VRAM capacity and memory bandwidth. Microsoft previously introduced the…
GGML is a lightweight, zero-dependency C/C++ tensor library developed by Georgi Gerganov. It was originally designed to enable efficient local inference of the…
Hugging Face and NVIDIA announced a major partnership in late July 2024, officially launching a serverless inference service powered by NVIDIA NIM (NVIDIA…
The Hugging Face official blog has introduced a major update to its open-source text generation inference engine, Text Generation Inference (TGI): the…
Hugging Face announced a deep partnership with Google Cloud, officially integrating Google Cloud TPUs (Tensor Processing Units) into the Hugging Face platform…
The official blog of Replicate, the popular AI model hosting and deployment platform, has announced that NVIDIA H100 Tensor Core GPUs will soon be officially…
This official Hugging Face blog post takes an in-depth look at how to benchmark Text Generation Inference (TGI), Hugging Face's open-source LLM inference and…
Hugging Face has announced official support for AWS Inferentia2 (Inf2) instances within its hosted Inference Endpoints service. This update gives developers…
As enterprise demand for Retrieval-Augmented Generation (RAG) technology surges, how to maintain high performance while controlling hardware costs has become…
This article explains how to run privacy-preserving inference based on Fully Homomorphic Encryption (FHE) on Hugging Face Endpoints. In traditional…
In real-world generative AI applications, fine-tuning for specific tasks or clients is a common requirement. However, deploying a full base model for every…
Hugging Face announced the launch of a new open-source library called "Optimum-NVIDIA," the result of a deep collaboration with NVIDIA, aimed at seamlessly…