Latest in AI

Showing: inference

NVIDIA Blackwell Leads First Agentic AI Infrastructure Benchmark ★ 72
NVIDIA Blog · yesterday · Benchmark
NVIDIA reports that its GB300 NVL72 platform leads the first published AgentPerf results from Artificial Analysis, a benchmark designed for agentic AI infrastructure. The benchmark uses DeepSeek V4 Pro and coding-agent-style workloads with long sequences, simulated tool delays, and concurrency targets. NVIDIA attributes the gains to rack-scale Blackwell design, CUDA optimizations, and TensorRT LLM, claiming up to 20x more agents per megawatt than HGX H200.
Production-Ready W4A8: vLLM Integration and Quality Recovery Techniques
Cohere Blog · 2 days ago · Tutorial
Cohere’s post appears to explain how W4A8 quantization can be prepared for production inference through vLLM integration. From the title, the focus is likely on deployment mechanics and techniques for recovering model quality after aggressive quantization. Because no article body is available, specific benchmarks, supported models, implementation steps, and measured quality gains cannot be confirmed.
Furiosa AI inference chip could be a game changer for local LLMs
r/LocalLLaMA top day · 4 days ago · Hardware
An r/LocalLLaMA post discusses Furiosa AI’s RNGD inference chip, citing TSMC 5nm, Hynix HBM3, 48GB VRAM, 1.5TB/s bandwidth, and 180W TDP. The author argues it could matter for local LLM users if Furiosa opens its programming interface and works with llama.cpp on a GGML backend. The post later clarifies that Furiosa is not selling to consumers; this is a wish and market commentary, not a launch.
Packed twin inference doubles Qwen3.6-27B throughput on one MI50
r/LocalLLaMA top day · 5 days ago · Benchmark
A LocalLLaMA user shared an early packed-twin-inference experiment for local LLM acceleration. The idea resembles speculative decoding, but uses the same quantized model side-by-side instead of a smaller draft model. On a single AMD MI50, the author reports Qwen3.6-27B improving from 19.4 to 38.1 tk/s, with Q8-or-lower quantization as the main target.
A llama.cpp CLI Command Builder
r/LocalLLaMA top day · 5 days ago · New Tool
An r/LocalLLaMA post introduces a llama.cpp CLI Command Builder with no accounts, email, pop-ups, cookies, or ads. It stores information locally in the browser and includes editable fields for flags and arguments found in the documentation. Users can build CLI or server commands, log run information, and compare which configurations work best for their hardware; only Linux is currently supported.
Pipeline parallelism in llama.cpp may be wasting your VRAM
r/LocalLLaMA top day · 5 days ago · Benchmark
The author compared three llama.cpp Vulkan builds: default 4 sched copies, 1 sched copy, and no pipeline parallelism. In their Qwen GGUF test, input and output throughput were nearly identical across all configurations. However, the default setting used about 1.5GB more VRAM for compute buffers and reduced usable context from roughly 113K tokens to around 88K, though parallel-request benefits were not tested.
Xiaomi Claims 1,000+ TPS on a 1T Model Using a Standard 8-GPU Server ★ 72
r/LocalLLaMA top day · 6 days ago · Benchmark
Xiaomi announced MiMo-V2.5-Pro-UltraSpeed with TileRT, claiming over 1,000 tokens/s decode speed on a 1-trillion-parameter MoE model. The company says it runs on a single standard 8-GPU commodity node, not wafer-scale or SRAM-heavy specialized hardware. The claimed stack combines FP4 MoE expert quantization, DFlash speculative decoding, and TileRT low-latency inference kernels, but independent validation is still needed.
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax ★ 72
r/LocalLLaMA top day · 6 days ago · New Tool
Luce Spark is an open-source MoE offload system for running 33B-35B A3B models on 16GB-class GPUs. It keeps frequently routed experts on GPU, stores the long tail in system RAM, and swaps cold experts through a bounded async cache. The author reports 13.3 GiB for Qwen3.6 35B-A3B and about 100 tok/s with Spark optimizations, but notes real 16GB GPU testing is still missing.
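The hot/cold split described above can be illustrated with a minimal sketch. This is not Luce Spark's code: the class name `ExpertCache`, the `load_from_ram` callback, and the LRU eviction policy are all assumptions made for illustration, and the loading here is synchronous, whereas the post describes an async cache.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy two-tier expert store: pinned hot experts plus a bounded LRU for cold ones."""

    def __init__(self, hot_ids, capacity, load_from_ram):
        # Hot experts are loaded once and never evicted.
        self.hot = {eid: load_from_ram(eid) for eid in hot_ids}
        self.capacity = capacity      # max cold experts resident at once (must be >= 1)
        self.cold = OrderedDict()     # LRU order: least recently used first
        self.load_from_ram = load_from_ram
        self.swaps = 0                # count of cold-expert loads from system RAM

    def get(self, eid):
        if eid in self.hot:
            return self.hot[eid]
        if eid in self.cold:
            self.cold.move_to_end(eid)        # mark as most recently used
            return self.cold[eid]
        # Miss: fetch from system RAM, evicting the least recently used entry if full.
        self.swaps += 1
        if len(self.cold) >= self.capacity:
            self.cold.popitem(last=False)
        self.cold[eid] = self.load_from_ram(eid)
        return self.cold[eid]


if __name__ == "__main__":
    # Hypothetical routing trace: expert 0 is hot, experts 5-7 are long-tail.
    cache = ExpertCache(hot_ids=[0, 1], capacity=2,
                        load_from_ram=lambda eid: f"weights-{eid}")
    for eid in [0, 5, 6, 5, 7, 6]:
        cache.get(eid)
    print(cache.swaps, list(cache.cold))
```

The bound on `capacity` is what keeps GPU memory use predictable: a larger value trades VRAM for fewer swaps.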
How the UK Is Turning Sovereign AI Ambition Into Action With NVIDIA Technologies ★ 72
NVIDIA Blog · 6 days ago · Business
NVIDIA says the UK’s “AI maker” strategy is moving into deployment through domestic AI cloud infrastructure, Isambard-AI, and the Sovereign AI Fund. UK startups are using NVIDIA technologies for coding agents, self-improving AI, inference optimization, and biological foundation models. The post also covers NVIDIA’s UK startup investment, developer training, 6G collaboration, and enterprise AI projects moving from pilots into production.
Qwen 3.6 27B KV Cache Quantization Benchmarks: KVarN, Turbo, and TCQ Evaluated
r/LocalLLaMA top day · 7 days ago · Benchmark
Reddit user Anbeeld shared comprehensive KV cache quantization benchmarks for Qwen 3.6 27B across 75 configuration pairs. Using BeeLlama.cpp (a custom llama.cpp fork), the test evaluates q8, q6, q5, and q4 quantization levels. It specifically highlights advanced implementations like KVarN, TurboQuant, and TCQ to optimize long-context inference efficiency.
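As background for the bit widths being compared, here is a minimal sketch of plain symmetric absmax quantization applied to one block of cache values. It is not KVarN, TurboQuant, or TCQ, and it does not reproduce the llama.cpp storage formats; the function names and the sample block are hypothetical.

```python
def quantize_block(values, bits):
    """Symmetric absmax quantization of one block to signed `bits`-bit integer codes."""
    qmax = 2 ** (bits - 1) - 1                  # 127 for 8 bits, 7 for 4 bits
    amax = max(abs(v) for v in values)
    scale = amax / qmax if amax > 0 else 1.0    # avoid dividing by zero on all-zero blocks
    codes = [max(-qmax, min(qmax, round(v / scale))) for v in values]
    return scale, codes


def dequantize_block(scale, codes):
    """Map integer codes back to approximate floating-point values."""
    return [scale * c for c in codes]


def max_abs_error(values, bits):
    """Largest round-trip error for one block at the given bit width."""
    scale, codes = quantize_block(values, bits)
    restored = dequantize_block(scale, codes)
    return max(abs(a - b) for a, b in zip(values, restored))


if __name__ == "__main__":
    # Hypothetical slice of a key vector; real caches hold float16 tensors.
    block = [0.12, -0.98, 0.45, 0.03, -0.27, 0.81, -0.64, 0.30]
    for bits in (8, 6, 5, 4):
        print(bits, max_abs_error(block, bits))
```

With round-to-nearest, the per-value error is bounded by half the scale, which is why each bit removed roughly doubles the worst-case error and why more elaborate schemes are worth benchmarking at q4.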
Launch HN: General Instinct (YC P26) - Frontier models on edge devices
Hacker News (AI keywords) · 9 days ago · New Tool
General Instinct is a YC P26 company introduced through a Launch HN post. Its headline positioning is bringing frontier models to edge devices, suggesting local or embedded AI deployment rather than purely cloud-based inference. Since no article body is available, details such as supported models, hardware, benchmarks, pricing, and developer tooling cannot be verified from the provided source.
Qualcomm Unveils Dragonfly Data Center Brand for the Agentic AI Era
INSIDE 硬塞 AI · 13 days ago · Hardware
At Computex 2026, Qualcomm described AI agents as a major driver of cross-device hardware upgrades. The company unveiled Dragonfly, a new data center brand focused on inference computing. The announcement outlines a broader strategy spanning endpoint devices and cloud infrastructure, although the source does not provide specifications, performance figures, or deployment timelines.
After Nvidia’s $20B not-acqui-hire, AI chip startup Groq reportedly raising $650M
TechCrunch AI · 16 days ago · Hardware
TechCrunch cites an Axios report that AI chipmaker Groq is seeking $650 million in internal funding. The company is reportedly pivoting from hardware toward AI inference, the stage focused on how models respond to prompts. The report comes after Nvidia’s $20 billion not-acqui-hire, underscoring continued investor attention around AI compute and inference infrastructure.
Xcena raises $135M betting AI’s bottleneck is memory, not compute
TechCrunch AI · 16 days ago · Hardware
South Korean chip startup Xcena raised a $135 million Series B at a $570 million valuation, bringing total funding to $185 million. The company argues AI inference is increasingly constrained by memory movement, not just GPU compute. Its prototype MX1 chip uses CXL to process data closer to DRAM, with Samsung foundry mass production planned by late 2026 and revenue targeted for 2027.
Protecting against inference theft
Vercel Changelog · 16 days ago · Commentary
Only the title is available, so specific Vercel product changes or implementation steps cannot be confirmed. The topic appears to focus on protecting AI inference resources from unauthorized access, abuse, or cost-draining traffic. For teams deploying AI apps, the practical takeaway is to treat inference endpoints as high-value backend assets requiring access control, monitoring, and abuse prevention.
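One common form of abuse prevention for an inference endpoint is a per-key rate limit. The sketch below is a generic token bucket, not anything described in the Vercel changelog; the class name `TokenBucket`, its parameters, and the injectable `clock` are assumptions for illustration.

```python
import time


class TokenBucket:
    """Per-key token bucket: each key can burst up to `capacity` requests."""

    def __init__(self, capacity, refill_per_sec, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_sec = refill_per_sec
        self.clock = clock            # injectable so the limiter can be tested without sleeping
        self.state = {}               # api_key -> (tokens_left, last_seen_time)

    def allow(self, api_key, cost=1.0):
        """Return True and spend `cost` tokens if the key has budget, else False."""
        now = self.clock()
        tokens, last = self.state.get(api_key, (self.capacity, now))
        # Refill in proportion to elapsed time, never exceeding the bucket size.
        tokens = min(self.capacity, tokens + (now - last) * self.refill_per_sec)
        if tokens >= cost:
            self.state[api_key] = (tokens - cost, now)
            return True
        self.state[api_key] = (tokens, now)
        return False


if __name__ == "__main__":
    bucket = TokenBucket(capacity=2, refill_per_sec=1.0)
    print([bucket.allow("demo-key") for _ in range(3)])
```

Passing a larger `cost` for expensive requests (for example, long prompts) lets the same bucket limit spend rather than just request count.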
Has the hunt for AI compute uncovered the next Cerebras?
TechCrunch AI · 17 days ago · Hardware
TechCrunch reports that General Compute has raised a $15 million seed round at a $60 million post-money valuation to build an AI inference neocloud. The company is ordering $300 million of SambaNova SN50 chips, betting they can outperform GPUs and rival specialized chips for inference. The story frames inference speed, deployment flexibility, and lower power needs as key battlegrounds in AI infrastructure.
New AI Infra Decacorns: Fireworks, Baseten, and OpenRouter ★ 78
Latent Space · 18 days ago · Business
AI infrastructure startups Fireworks and Baseten have reportedly reached massive valuations, reflecting intense investor interest in developer-focused inference and deployment platforms. OpenRouter, the popular LLM API aggregator, is also on a rapid growth trajectory. This funding wave highlights a major capital shift toward cost-effective, developer-friendly API and hosting solutions.
OpenRouter more than doubles valuation to $1.3B in a year
TechCrunch AI · 19 days ago · Business
OpenRouter, an AI gateway startup founded in 2023, raised a $113 million Series B led by CapitalG. The round reportedly values the company at about $1.3 billion post-money, more than doubling from its estimated $547 million valuation after its June 2025 Series A. The company says it now offers access to over 400 models, has 8 million global users, and processes 100 trillion tokens per month.
AI Enters the Inference Era: AMD's Lisa Su Expects 35% Annual CPU Market Growth, With Architectures Trending Toward "1:1" ★ 75
INSIDE 硬塞 AI · 23 days ago · Opinion
AMD CEO Lisa Su recently shared her latest views on the AI hardware market, pointing out that the AI industry is approaching a critical inflection point…
Unlocking Asynchronous Mechanisms in Continuous Batching ★ 75
Hugging Face Blog · 31 days ago · Release
As the demand for deploying large language models (LLMs) in production environments surges, how to improve inference efficiency and reduce costs has become a…
Building Blocks for Foundation Model Training and Inference on AWS ★ 75
Hugging Face Blog · 33 days ago · Tutorial
In the era of generative AI, training and deploying foundation models with billions of parameters faces enormous computational and architectural challenges…
DeepInfra Officially Joins Hugging Face Inference Providers 🔥 ★ 72
Hugging Face Blog · 46 days ago · Release
Hugging Face's official blog has announced that DeepInfra — a well-known high-performance, low-cost serverless inference platform — has officially joined…
Understanding Continuous Batching From First Principles ★ 80
Hugging Face Blog · 201 days ago · Tutorial
This technical blog post from Hugging Face takes a "First Principles" approach to provide a deep analysis of one of the most critical optimization techniques…
OVHcloud Officially Joins Hugging Face Inference Providers, Emphasizing European Data Sovereignty and Cost-Effective Compute ★ 72
Hugging Face Blog · 202 days ago · Release
Hugging Face has announced a new partnership with OVHcloud, Europe's leading cloud infrastructure provider, officially incorporating OVHcloud into Hugging Face…
Run Vision Language Models (VLMs) on Intel CPUs in Just Three Simple Steps ★ 70
Hugging Face Blog · 242 days ago · Tutorial
Visual Language Models (VLMs) combine computer vision with natural language processing, enabling complex tasks such as image captioning and visual question…
Scaleway Officially Joins Hugging Face Inference Providers 🔥 ★ 70
Hugging Face Blog · 268 days ago · Release
Hugging Face has announced a deep partnership with Scaleway, a leading European cloud infrastructure provider, with Scaleway officially joining the Hugging…
Hugging Face Inference Providers Welcomes a New Partner: Public AI Goes Live 🔥 ★ 70
Hugging Face Blog · 270 days ago · Release
Hugging Face continues to expand its "Inference Providers" program, aimed at enabling developers to run open-source models from Hugging Face Hub in the…
Using Torch Compile Caching to Speed Up Model Startup and Inference ★ 75
Replicate Blog · 279 days ago · Tutorial
When deploying modern AI models (such as LLaMA, Flux, or Stable Diffusion), `torch.compile` — introduced in PyTorch 2.0 — is a powerful performance…
Accelerating Diverse LLM Deployments on Hugging Face With NVIDIA NIM ★ 80
Hugging Face Blog · 328 days ago · Release
Hugging Face and NVIDIA have announced a new collaboration to bring NVIDIA NIM (NVIDIA Inference Microservices) into the Hugging Face ecosystem, with the goal…
How Replicate Optimized FLUX.1 Kontext [dev]: A Deep Dive Into the Taylor Seer Optimization Technique ★ 75
Replicate Blog · 334 days ago · Tutorial
In the generative AI space, the FLUX.1 model developed by Black Forest Labs is renowned for its outstanding image quality and text rendering capabilities…
