Latest in AI

Showing: inference-optimization · Researchers

Why MoE Models Benefit More from Speculative Decoding
Cohere Blog · 2 days ago · Benchmark
Cohere analyzes why speculative decoding behaves differently on Mixture-of-Experts models than on dense LLMs. Its benchmarks show MoE speedups can peak at moderate batch sizes because sparse expert routing keeps verification bandwidth-bound. The post also finds that temporal expert overlap and fixed overhead amortization make multi-token verification cheaper than simple worst-case models predict.
Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question
r/LocalLLaMA (top day) · 4 days ago · Benchmark
A Reddit user is running Qwen3.6-MTP-27B-MTP in Q4_K_M GGUF format with llama.cpp server on a 32GB Tesla V100. They report one peak of 55 tokens per second, but typical throughput is closer to 44-48 TPS. The post asks whether flags such as parallelism, speculative MTP draft settings, KV cache quantization, flash attention, and a 262K context window are limiting performance without improving output quality.
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
r/LocalLLaMA (top day) · 5 days ago · Paper
OSCAR applies rotation matrices, precomputed offline from spectral covariance analysis, to reshape KV tensor distributions before 2-bit quantization, suppressing outliers and reducing rounding error. Because the rotations are computed ahead of time and nothing is learned at runtime, the added inference overhead is reported to be negligible. GGUF downloads for Gemma-4-12B-it, Qwen3-32B, and Qwen3-4B-Thinking are available, along with llama.cpp and sglang integrations and an arXiv paper.
Watch agents fight: a live challenge to speed up Gemma 4 E4B inference on a single A10G
r/LocalLLaMA (top day) · 5 days ago · Benchmark
A public HuggingFace Spaces dashboard hosts a live competition where AI agents race to optimize Gemma 4 E4B inference throughput on a single NVIDIA A10G GPU. The challenge gamifies ML inference engineering, letting anyone watch agents explore quantization and scheduling strategies in real time. Optimization recipes surfaced by the competition offer practical value for developers targeting single-GPU self-hosted Gemma 4 deployments.
OpenAI gpt-oss Speed-Up Tricks You Can Use Directly in Transformers 🫵
Hugging Face Blog · 276 days ago · Tutorial · ★ 82
Background and the LLM inference bottleneck: When running large language models (LLMs), autoregressive generation is inherently "memory-bandwidth-bound"…
Make Your ZeroGPU Spaces Fly: Eliminating Cold-Start Latency with PyTorch Ahead-of-Time (AOT) Compilation
Hugging Face Blog · 285 days ago · Tutorial · ★ 75
Hugging Face's ZeroGPU Spaces offers developers a free and efficient way to deploy GPU-accelerated AI applications. However, ZeroGPU uses a dynamic allocation…
Fast LoRA Inference for Flux with Diffusers and PEFT
Hugging Face Blog · 326 days ago · Tutorial · ★ 80
This technical guide from Hugging Face takes an in-depth look at how to accelerate LoRA (Low-Rank Adaptation) inference for Flux.1, the powerful open-source…
Implementing a KV Cache from Scratch in nanoVLM
Hugging Face Blog · 375 days ago · Tutorial · ★ 75
In the inference process of large language models (LLMs) and vision-language models (VLMs), autoregressive decoding is a major performance bottleneck. Each…
Bamba: An Inference-Efficient Hybrid Mamba2 Open-Source Model Is Released
Hugging Face Blog · 543 days ago · Release · ★ 75
Background and architectural innovation: As large language models (LLMs) have advanced rapidly, the traditional Transformer architecture faces severe…
Faster Text Generation with Self-Speculative Decoding: Meta Introduces LayerSkip
Hugging Face Blog · 571 days ago · Release · ★ 78
The slow autoregressive generation speed of large language models (LLMs) has long been a major bottleneck in real-world deployment. While "speculative…
Universal Assisted Generation: Assisted Generation with Any Assistant Model for Much Faster Decoding
Hugging Face Blog · 593 days ago · Release · ★ 85
In the deployment and inference of large language models (LLMs), reducing generation latency has always been a critical challenge. The traditional approach of…
Faster Hugging Face Assisted Generation with Dynamic Speculation
Hugging Face Blog · 614 days ago · Release · ★ 75
Hugging Face has published a technical blog post on "Dynamic Speculation," aimed at optimizing the inference speed of large language models (LLMs)…
Intel Gaudi Supports Faster Assisted Generation, Significantly Improving LLM Inference Speed
Hugging Face Blog · 740 days ago · Release
Hugging Face, in collaboration with Intel, has announced official support for "Assisted Generation" (also commonly known as Speculative Decoding) on Intel…
Accelerating SD Turbo and SDXL Turbo Inference with ONNX Runtime and Olive
Hugging Face Blog · 881 days ago · Tutorial · ★ 75
SD Turbo and SDXL Turbo are single-step/few-step text-to-image models from Stability AI, with their core innovation being Adversarial Diffusion Distillation…
Speeding Up Whisper Inference 2x with Speculative Decoding
Hugging Face Blog · 907 days ago · Tutorial · ★ 75
The Hugging Face official blog introduces how to use "Speculative Decoding" to more than double the inference speed of OpenAI's Whisper speech-to-text model…
Accelerating over 130,000 Hugging Face Models with ONNX Runtime
Hugging Face Blog · 984 days ago · New Tool · ★ 75
Hugging Face officially announced a deep collaboration with Microsoft to integrate ONNX Runtime (ORT) into the Hugging Face ecosystem. This partnership enables…
Accelerating Stable Diffusion XL Inference with JAX on Cloud TPU v5e
Hugging Face Blog · 985 days ago · Tutorial · ★ 70
With the widespread adoption of high-quality open-source image generation models like Stable Diffusion XL (SDXL), reducing inference latency and controlling…
Introducing RWKV: A New RNN Architecture with the Advantages of Transformers
Hugging Face Blog · 1,126 days ago · Release · ★ 75
Hugging Face has announced official support for RWKV (Receptance Weighted Key Value) models in its `transformers` library. RWKV is an innovative architecture…
Hugging Face Introduces Assisted Generation: A New Direction Toward Low-Latency Text Generation
Hugging Face Blog · 1,130 days ago · Release · ★ 85
Large language models (LLMs) typically generate text using an "autoregressive" mechanism, meaning the model must generate one token at a time. Each generation…
Faster Text Generation with TensorFlow and XLA
Hugging Face Blog · 1,418 days ago · Tutorial
This Hugging Face technical blog post takes an in-depth look at how to use TensorFlow's XLA (Accelerated Linear Algebra) compiler to dramatically speed up the…
Accelerating BERT Inference with Hugging Face Transformers and AWS Inferentia
Hugging Face Blog · 1,551 days ago · Tutorial
When deploying large language models such as BERT in production environments, inference latency and computational cost are often two major pain points for…
Faster TensorFlow Models in Hugging Face Transformers and a TF Serving Deployment Guide
Hugging Face Blog · 1,965 days ago · Tutorial
When deploying Transformer models in production environments, latency and throughput are often the deciding factors for a project's success. Hugging Face…
