Mistral AI published an engineering deep dive on a memory leak found during vLLM disaggregated serving tests. The leak appeared only with a specific stack combining Mistral Medium 3.1, NIXL, UCX, graph compilation, and P/D disaggregation: resident set size (RSS) grew steadily even though heap profilers showed nothing unusual. The team used pmap, bpftrace, and targeted GDB automation to trace the issue to UCX mmap hooks, then applied configuration fixes plus a vLLM patch.
As demand for deploying large language models (LLMs) in production surges, improving inference efficiency and reducing costs has become a…
This blog post from the ServiceNow AI team examines the major transition of vLLM, the open-source large language model inference engine, from V0 to…
This technical blog post from Hugging Face takes a "First Principles" approach to provide a deep analysis of one of the most critical optimization techniques…
SGLang (Structured Generation Language) is a high-performance LLM inference and serving framework developed by the LMSYS team, renowned for its efficient…
As the context windows of large language models (LLMs) continue to expand (from the early 4k and 8k to the now-common 32k and even 128k or more), users have…
When deploying large language models (LLMs), maintaining low latency and high throughput under many concurrent requests is one of the greatest…
Hugging Face's official blog has announced that its widely adopted open-source large language model inference framework, Text Generation Inference (TGI), now officially…