Latest in AI

Showing:kv-cacheDevelopersClear ×

🔥 Trending today

anthropic6 export-controls4 model-access3 amazon3 national-security2 open-source2 ai-regulation2 government-policy2 enterprise-ai2 compliance2

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

FlashMemory-DeepSeek-V4: Ultra-Long Context via Lookahead Sparse Attention
r/LocalLLaMA top day4 days agoPaper
FlashMemory-DeepSeek-V4 introduces Lookahead Sparse Attention (LSA), a predictive inference paradigm that retains only query-critical KV chunks in GPU memory instead of the full cache. A Neural Memory Indexer, trained independently using a backbone-free dual-encoder strategy, proactively forecasts which historical tokens will matter next. The system compresses average KV cache footprint by 86.5% and exceeds 90% compression at 500K-token scales, while delivering a slight accuracy gain of +0.6% on long-context benchmarks.
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
r/LocalLLaMA top day5 days agoPaper
OSCAR applies offline-precomputed rotation matrices—derived from spectral covariance analysis—to reshape KV tensor distributions before 2-bit quantization, suppressing outliers and reducing rounding error. The rotation adds negligible inference overhead since it requires no runtime learning. GGUF downloads for Gemma-4-12B-it, Qwen3-32B, and Qwen3-4B-Thinking are available, with llama.cpp and sglang integrations and an arXiv paper.
Qwen3.6-35B-A3B Tool Calling Benchmark: ByteShape vs Unsloth GGUFs
r/LocalLLaMA top day6 days agoBenchmark
The post benchmarks eight Qwen3.6-35B-A3B GGUF quants from ByteShape and Unsloth using llama.cpp and tool-eval-bench. It compares f16, q8_0, and q4_0 KV cache quantization under short and long-context pressure, totaling 144 runs and roughly 300 GPU-hours. The author reports no clear ByteShape versus Unsloth winner, q8_0 as close to a free lunch, q4_0 as weaker, and long context as a major tool-calling degradation factor.
llama.cpp PR #24277 avoids KV cell copies in kv-cache
r/LocalLLaMA top day6 days agoRelease
ggml-org/llama.cpp merged PR #24277 by ggerganov, titled “kv-cache: avoid kv cells copies.” The Reddit post says the change improves MTP performance for Gemma-4 and was merged the previous day. It is available starting with the b9551 release, making it relevant for local inference users tracking llama.cpp performance updates.
Qwen 3.6 27B KV Cache Quantization Benchmarks: KVarN, Turbo, and TCQ Evaluated
r/LocalLLaMA top day7 days agoBenchmark
Reddit user Anbeeld shared comprehensive KV cache quantization benchmarks for Qwen 3.6 27B across 75 configuration pairs. Using BeeLlama.cpp (a custom llama.cpp fork), the test evaluates q8, q6, q5, and q4 quantization levels. It specifically highlights advanced implementations like KVarN, TurboQuant, and TCQ to optimize long-context inference efficiency.
從第一性原理理解連續批處理（Continuous Batching）★ 80
Hugging Face Blog201 days agoTutorial
This technical blog post from Hugging Face takes a "First Principles" approach to provide a deep analysis of one of the most critical optimization techniques…
你可以直接用在 Transformers 的 OpenAI gpt-oss 加速妙招 🫵★ 82
Hugging Face Blog276 days agoTutorial
### Background and the LLM Inference Bottleneck When running large language models (LLMs), autoregressive generation is inherently "memory-bandwidth-bound"…
長 Prompt 如何阻塞其他請求？優化 LLM 推理效能與解決隊頭阻塞的關鍵策略★ 80
Hugging Face Blog367 days agoTutorial
As the context windows of large language models (LLMs) continue to expand — from the early 4k and 8k, to the now-common 32k and even 128k or more — users have…
從零開始在 nanoVLM 中實作 KV Cache★ 75
Hugging Face Blog375 days agoTutorial
In the inference process of large language models (LLMs) and vision-language models (VLMs), autoregressive decoding is a major performance bottleneck. Each…
併發請求下的 Prefill 與 Decode：優化 LLM 推論效能的關鍵技術★ 82
Hugging Face Blog424 days agoTutorial
When deploying large language models (LLMs), maintaining low latency and high throughput under high concurrency (concurrent requests) is one of the greatest…
效率化請求佇列：優化 LLM 推論效能的關鍵策略★ 75
Hugging Face Blog438 days agoTutorial
### The Unique Challenges and Memory Bottlenecks of LLM Inference Traditional web services primarily handle concurrent requests through multi-threading or…
使用 KVPress 掌握大語言模型（LLM）的長文本處理能力★ 75
Hugging Face Blog507 days agoNew Tool
In the current trajectory of large language model (LLM) development, support for long contexts has become a standard requirement. However, as input text length…
解鎖更長的文本生成：深入探討 Key-Value (KV) 快取量化技術★ 80
Hugging Face Blog759 days agoTutorial
During the inference process of large language models (LLMs), the self-attention mechanism needs to store the Key and Value vectors of historical tokens (i.e…