Latest in AI

Production-Ready W4A8: vLLM Integration and Quality Recovery Techniques
Cohere Blog · 2 days ago · Tutorial
Cohere’s post appears to explain how W4A8 quantization can be prepared for production inference through vLLM integration. From the title, the focus is likely on deployment mechanics and techniques for recovering model quality after aggressive quantization. Because no article body is available, specific benchmarks, supported models, implementation steps, and measured quality gains cannot be confirmed.
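As general background on what the W4A8 label means (4-bit weights, 8-bit activations), here is a minimal pure-Python sketch of symmetric quantization and an integer dot product. It is not taken from the Cohere post or from vLLM: the function names `quantize_symmetric` and `w4a8_dot`, the single per-tensor scale, and the sample values are all assumptions made for illustration.

```python
def quantize_symmetric(values, bits):
    """Map floats to signed integers in [-qmax, qmax] using one shared scale."""
    qmax = 2 ** (bits - 1) - 1          # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    quantized = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return quantized, scale


def w4a8_dot(weights, activations):
    """Dot product with 4-bit weights and 8-bit activations (hypothetical helper)."""
    q_w, scale_w = quantize_symmetric(weights, bits=4)
    q_a, scale_a = quantize_symmetric(activations, bits=8)
    # Accumulate in integers, then rescale once at the end.
    acc = sum(w * a for w, a in zip(q_w, q_a))
    return acc * scale_w * scale_a


# Illustrative values only; not from any benchmark.
weights = [0.12, -0.53, 0.91, -0.27]
activations = [1.4, -2.0, 0.25, 3.0]
exact = sum(w * a for w, a in zip(weights, activations))
approx = w4a8_dot(weights, activations)
error = abs(exact - approx)
```

The gap between `exact` and `approx` is the kind of quantization error that quality-recovery techniques aim to reduce; real deployments typically use per-channel or per-group scales rather than the single scale assumed here.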
Furiosa AI inference chip could be a game changer for local LLMs
r/LocalLLaMA top day · 4 days ago · Hardware
An r/LocalLLaMA post discusses Furiosa AI’s RNGD inference chip, citing TSMC 5nm, Hynix HBM3, 48GB VRAM, 1.5TB/s bandwidth, and 180W TDP. The author argues it could matter for local LLM users if Furiosa opens its programming interface and works with llama.cpp on a GGML backend. The post later clarifies Furiosa is not selling to consumers; this is a wish and market commentary, not a launch.
Packed twin inference doubles Qwen3.6-27B throughput on one MI50
r/LocalLLaMA top day · 5 days ago · Benchmark
A LocalLLaMA user shared an early packed-twin-inference experiment for local LLM acceleration. The idea resembles speculative decoding, but uses the same quantized model side-by-side instead of a smaller draft model. On a single AMD MI50, the author reports Qwen3.6-27B improving from 19.4 to 38.1 tok/s, with Q8-or-lower quantization as the main target.
A llama.cpp CLI Command Builder
r/LocalLLaMA top day · 5 days ago · New Tool
An r/LocalLLaMA post introduces a llama.cpp CLI Command Builder with no accounts, email, pop-ups, cookies, or ads. It stores information locally in the browser and includes editable fields for flags and arguments found in the documentation. Users can build CLI or server commands, log run information, and compare which configurations work best for their hardware; only Linux is currently supported.
Pipeline parallelism in llama.cpp may be wasting your VRAM
r/LocalLLaMA top day · 5 days ago · Benchmark
The author compared three llama.cpp Vulkan builds: default 4 sched copies, 1 sched copy, and no pipeline parallelism. In their Qwen GGUF test, input and output throughput were nearly identical across all configurations. However, the default setting used about 1.5GB more VRAM for compute buffers and reduced usable context from roughly 113K tokens to around 88K, though parallel-request benefits were not tested.
Xiaomi Claims 1,000+ TPS on a 1T Model Using a Standard 8-GPU Server ★ 72
r/LocalLLaMA top day · 6 days ago · Benchmark
Xiaomi announced MiMo-V2.5-Pro-UltraSpeed with TileRT, claiming over 1,000 tokens/s decode speed on a 1-trillion-parameter MoE model. The company says it runs on a single standard 8-GPU commodity node, not wafer-scale or SRAM-heavy specialized hardware. The claimed stack combines FP4 MoE expert quantization, DFlash speculative decoding, and TileRT low-latency inference kernels, but independent validation is still needed.
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax ★ 72
r/LocalLLaMA top day · 6 days ago · New Tool
Luce Spark is an open-source MoE offload system for running 33B-35B A3B models on 16GB-class GPUs. It keeps frequently routed experts on GPU, stores the long tail in system RAM, and swaps cold experts through a bounded async cache. The author reports 13.3 GiB for Qwen3.6 35B-A3B and about 100 tok/s with Spark optimizations, but notes real 16GB GPU testing is still missing.
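The hot/cold split described in this summary can be pictured with a small sketch: a pinned set of hot experts plus a bounded least-recently-used cache for cold ones. This is not Luce Spark's code. The `ExpertCache` class, the LRU eviction policy, the synchronous loading, and the string stand-ins for weights are all assumptions; the summary only says the cache is bounded and asynchronous.

```python
from collections import OrderedDict


class ExpertCache:
    """Pinned hot experts plus a bounded LRU cache for cold ones (hypothetical)."""

    def __init__(self, hot_experts, cold_slots, load_from_ram):
        self.hot = set(hot_experts)        # treated as always resident on the GPU
        self.cold = OrderedDict()          # expert id -> weights, oldest first
        self.cold_slots = cold_slots       # upper bound on cached cold experts (>= 1)
        self.load_from_ram = load_from_ram # callable standing in for a RAM-to-GPU copy
        self.transfers = 0                 # number of copies performed so far

    def fetch(self, expert_id):
        if expert_id in self.hot:
            return f"gpu:{expert_id}"
        if expert_id in self.cold:
            # Cache hit: mark as most recently used.
            self.cold.move_to_end(expert_id)
            return self.cold[expert_id]
        if len(self.cold) >= self.cold_slots:
            # Cache full: evict the least recently used cold expert.
            self.cold.popitem(last=False)
        self.transfers += 1
        self.cold[expert_id] = self.load_from_ram(expert_id)
        return self.cold[expert_id]


# Illustrative routing trace; expert ids are made up.
cache = ExpertCache(hot_experts=[0, 1], cold_slots=2,
                    load_from_ram=lambda i: f"ram->gpu:{i}")
for expert_id in [0, 5, 5, 7, 1, 9, 5]:
    cache.fetch(expert_id)
```

In this trace the hot experts never trigger a copy, while the cold experts cause four transfers because only two cold slots exist. A real system would overlap those copies with compute, which this synchronous sketch does not attempt.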
Qwen 3.6 27B KV Cache Quantization Benchmarks: KVarN, Turbo, and TCQ Evaluated
r/LocalLLaMA top day · 7 days ago · Benchmark
Reddit user Anbeeld shared comprehensive KV cache quantization benchmarks for Qwen 3.6 27B across 75 configuration pairs. Using BeeLlama.cpp (a custom llama.cpp fork), the test evaluates q8, q6, q5, and q4 quantization levels. It specifically highlights advanced implementations like KVarN, TurboQuant, and TCQ to optimize long-context inference efficiency.
Xcena raises $135M betting AI’s bottleneck is memory, not compute
TechCrunch AI · 16 days ago · Hardware
South Korean chip startup Xcena raised a $135 million Series B at a $570 million valuation, bringing total funding to $185 million. The company argues AI inference is increasingly constrained by memory movement, not just GPU compute. Its prototype MX1 chip uses CXL to process data closer to DRAM, with Samsung foundry mass production planned by late 2026 and revenue targeted for 2027.
Protecting against inference theft
Vercel Changelog · 16 days ago · Commentary
Only the title is available, so specific Vercel product changes or implementation steps cannot be confirmed. The topic appears to focus on protecting AI inference resources from unauthorized access, abuse, or cost-draining traffic. For teams deploying AI apps, the practical takeaway is to treat inference endpoints as high-value backend assets requiring access control, monitoring, and abuse prevention.
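One common form of abuse prevention for a costly endpoint is per-client rate limiting. The sketch below shows a token bucket as a generic illustration of that takeaway. It does not describe anything Vercel ships: the `TokenBucket` class, its parameters, and the injected fake clock are assumptions chosen to keep the example deterministic.

```python
import time


class TokenBucket:
    """Allow short bursts up to `capacity`, then a steady refill rate (hypothetical)."""

    def __init__(self, capacity, refill_per_second, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_second = refill_per_second
        self.clock = clock                 # injectable so the example is testable
        self.tokens = float(capacity)
        self.last = clock()

    def allow(self, cost=1.0):
        now = self.clock()
        elapsed = now - self.last
        self.last = now
        # Refill for the time that has passed, never above capacity.
        self.tokens = min(self.capacity,
                          self.tokens + elapsed * self.refill_per_second)
        if self.tokens >= cost:
            self.tokens -= cost
            return True
        return False


# A fake clock stands in for real time; one bucket would exist per API key.
fake_now = [0.0]
bucket = TokenBucket(capacity=3, refill_per_second=1.0, clock=lambda: fake_now[0])
burst = [bucket.allow() for _ in range(5)]   # five requests at the same instant
fake_now[0] = 2.0                            # two seconds later
after_wait = bucket.allow()
```

Charging a higher `cost` for expensive requests, such as long generations, is one way to tie the limit to inference spend rather than to request count.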
AI Enters the Inference Era: AMD's Lisa Su Expects the CPU Market to Grow 35% Annually, with Architectures Trending Toward "1:1" ★ 75
INSIDE 硬塞 AI · 23 days ago · Opinion
AMD CEO Lisa Su recently shared her latest views on the AI hardware market, pointing out that the AI industry is approaching a critical inflection point…
Unlocking Asynchronous Mechanisms in Continuous Batching ★ 75
Hugging Face Blog · 31 days ago · Release
As the demand for deploying large language models (LLMs) in production environments surges, how to improve inference efficiency and reduce costs has become a…
Building Blocks for Foundation Model Training and Inference on AWS ★ 75
Hugging Face Blog · 33 days ago · Tutorial
In the era of generative AI, training and deploying foundation models with billions of parameters faces enormous computational and architectural challenges…
DeepInfra Officially Joins Hugging Face Inference Providers 🔥 ★ 72
Hugging Face Blog · 46 days ago · Release
Hugging Face's official blog has announced that DeepInfra — a well-known high-performance, low-cost serverless inference platform — has officially joined…
Understanding Continuous Batching from First Principles ★ 80
Hugging Face Blog · 201 days ago · Tutorial
This technical blog post from Hugging Face takes a "First Principles" approach to provide a deep analysis of one of the most critical optimization techniques…
OVHcloud Officially Joins Hugging Face Inference Providers, Emphasizing European Data Sovereignty and Cost-Effective Compute ★ 72
Hugging Face Blog · 202 days ago · Release
Hugging Face has announced a new partnership with OVHcloud, Europe's leading cloud infrastructure provider, officially incorporating OVHcloud into Hugging Face…
Run a Vision Language Model (VLM) on Intel CPUs in Just Three Simple Steps ★ 70
Hugging Face Blog · 242 days ago · Tutorial
Visual Language Models (VLMs) combine computer vision with natural language processing, enabling complex tasks such as image captioning and visual question…
Scaleway Officially Joins Hugging Face Inference Providers 🔥 ★ 70
Hugging Face Blog · 268 days ago · Release
Hugging Face has announced a deep partnership with Scaleway, a leading European cloud infrastructure provider, with Scaleway officially joining the Hugging…
Hugging Face Inference Providers Welcomes a New Partner: Public AI Goes Live 🔥 ★ 70
Hugging Face Blog · 270 days ago · Release
Hugging Face continues to expand its "Inference Providers" program, aimed at enabling developers to run open-source models from Hugging Face Hub in the…
Speeding Up Model Startup and Inference with Torch Compile Caching ★ 75
Replicate Blog · 279 days ago · Tutorial
When deploying modern AI models (such as LLaMA, Flux, or Stable Diffusion), `torch.compile` — introduced in PyTorch 2.0 — is a powerful performance…
Accelerating Deployment of a Wide Range of LLMs on Hugging Face with NVIDIA NIM ★ 80
Hugging Face Blog · 328 days ago · Release
Hugging Face and NVIDIA have announced a new collaboration to bring NVIDIA NIM (NVIDIA Inference Microservices) into the Hugging Face ecosystem, with the goal…
How Replicate Optimized FLUX.1 Kontext [dev]: A Deep Dive into the Taylor Seer Optimization Technique ★ 75
Replicate Blog · 334 days ago · Tutorial
In the generative AI space, the FLUX.1 model developed by Black Forest Labs is renowned for its outstanding image quality and text rendering capabilities…
Asynchronous Robot Inference: Decoupling Action Prediction and Execution ★ 75
Hugging Face Blog · 339 days ago · Opinion
In the fields of robot learning and embodied AI, enabling controllers based on deep learning or large language/vision models (VLAs) to run in real time has…
SGLang Integrates a Hugging Face Transformers Backend: Greatly Improved Model Compatibility and Development Flexibility ★ 75
Hugging Face Blog · 356 days ago · Release
SGLang (Structured Generation Language) is a high-performance LLM inference and serving framework developed by the LMSYS team, renowned for its efficient…
Groq Officially Joins Hugging Face Inference Providers, Supporting Ultra-Fast Open-Source Model Inference ★ 75
Hugging Face Blog · 363 days ago · Release
Hugging Face announced a deep partnership with Groq, a chip company focused on ultra-fast AI inference, formally bringing Groq into the Hugging Face "Inference…
Featherless AI Officially Joins Hugging Face Inference Providers ★ 75
Hugging Face Blog · 367 days ago · Release
Hugging Face officially announced a partnership with Featherless AI, a serverless GPU inference platform, integrating it into the Hugging Face Inference…
Run More Than 30,000 LoRA Models on Hugging Face with Replicate ★ 75
Replicate Blog · 395 days ago · New Tool
The managed AI inference platform Replicate has announced a deep partnership with Hugging Face, the giant of the open-source AI community, officially bringing…
Hugging Face Launches an Ultra-Fast Whisper Speech-to-Text Deployment on Inference Endpoints ★ 75
Hugging Face Blog · 397 days ago · New Tool
Hugging Face recently announced a brand-new, ultra-fast optimized deployment solution for OpenAI's open-source speech recognition model Whisper on its hosted…
Prefill and Decode Under Concurrent Requests: Key Techniques for Optimizing LLM Inference Performance ★ 82
Hugging Face Blog · 424 days ago · Tutorial
When deploying large language models (LLMs), maintaining low latency and high throughput under high concurrency (concurrent requests) is one of the greatest…
Efficient Request Queueing: A Key Strategy for Optimizing LLM Inference Performance ★ 75
Hugging Face Blog · 438 days ago · Tutorial
The Unique Challenges and Memory Bottlenecks of LLM Inference: Traditional web services primarily handle concurrent requests through multi-threading or…
