Google’s DiffusionGemma is an experimental open model, released under Apache 2.0, that uses text diffusion instead of standard autoregressive decoding. The 26B MoE model activates 3.8B parameters during inference and is designed for low-latency local workflows. Google claims up to 4x faster generation on dedicated GPUs but notes that output quality is below that of standard Gemma 4, which it still recommends for production-quality use cases.
Intel presented the Arc Pro B70 GPU at MPTS2026 as a professional GPU for AI-assisted media creation and teaching labs. The article highlights 32GB GDDR6 memory, second-gen Xe² architecture, 32 Xe cores, XMX acceleration, and up to 367 TOPS INT8 performance. Lenovo ThinkStation workstations and GUNNIR’s Arc Pro B70 TF 32G are positioned as ecosystem solutions for local AIGC, rendering, virtual production, and data-sensitive education deployments.
An r/LocalLLaMA post says a Bilibili creator has shown a single-slot, half-height PCIe V100 with NVLink on a custom PCB. The card is described as 16 cm long, passively cooled by default, and capped at 75W, with another version supporting up to 300W. The 16GB model is expected to cost around or below ¥1500, and a 32GB version is reportedly planned, but neither is available for purchase yet.
A Reddit user reminds the local LLM community that lowering GPU power limits yields outsized energy savings at minimal performance cost. On dual Radeon VII cards, cutting the limit from 250W to 100W per card reduced inference speed by less than 10%. Because LLM inference is memory-bound rather than compute-bound, it tolerates reduced GPU clock speeds far better than training or rendering tasks do.
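The trade-off in the post can be checked with a little arithmetic. The sketch below is illustrative only: the function name `energy_per_token_ratio` is hypothetical, it assumes each card actually draws its configured power limit under load, and it takes the reported "less than 10%" slowdown at its worst case (90% of the original speed retained).

```python
def energy_per_token_ratio(power_before_w, power_after_w, speed_retained):
    """Energy per token after lowering the power limit, relative to before.

    Energy per token is power divided by throughput, so the ratio is
    (power_after / power_before) / speed_retained.
    Assumes the card draws its full configured power limit in both cases.
    """
    if not 0 < speed_retained <= 1:
        raise ValueError("speed_retained must be in (0, 1]")
    return (power_after_w / power_before_w) / speed_retained


# Figures from the post: 250W -> 100W per card, under 10% slowdown.
# 0.90 is the worst case allowed by "less than 10%".
ratio = energy_per_token_ratio(250, 100, 0.90)
savings = 1 - ratio

print(f"energy per token: {ratio:.1%} of baseline")
print(f"energy saved per token: {savings:.1%}")
```

Under these assumptions each token costs roughly 44% of the baseline energy, so the saving per token is at least about 56%. Real savings depend on how much power the card actually draws at each limit, which the post does not report.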
Xiaomi announced MiMo-V2.5-Pro-UltraSpeed with TileRT, claiming over 1,000 tokens/s decode speed on a 1-trillion-parameter MoE model. The company says it runs on a single standard 8-GPU commodity node, not wafer-scale or SRAM-heavy specialized hardware. The claimed stack combines FP4 MoE expert quantization, DFlash speculative decoding, and TileRT low-latency inference kernels, but independent validation is still needed.
AMD CEO Lisa Su recently shared her latest views on the AI hardware market, pointing out that the AI industry is approaching a critical inflection point…
Hugging Face's official blog has announced that the company has formed a deep partnership with Unsloth, the…
Background and Challenge: Why Is CUDA Programming So Hard for AI? CUDA (Compute Unified Device Architecture) is a parallel computing platform and…
Hugging Face recently announced a major update for AMD GPU users and developers, aimed at simplifying the process of building, packaging, and sharing ROCm…
Against the backdrop of explosive global growth in artificial intelligence, compute has become the core resource that determines technological competitiveness…
Hugging Face has announced a deep partnership with Scaleway, a leading European cloud infrastructure provider, with Scaleway officially joining the Hugging…
As the architecture and scale of deep learning models (such as large language models, or LLMs) continue to expand, standard PyTorch operators sometimes fall…
As AMD Instinct MI300 series GPUs (such as the MI300X) gradually increase their market share in the AI compute market, how to perform low-level optimization…
Hugging Face officially announced a partnership with Featherless AI, a serverless GPU inference platform, integrating it into the Hugging Face Inference…
Hugging Face has announced a new partnership with AI chip giant NVIDIA, launching "Training Cluster as a Service" (TCaaS). The introduction of this service…
Replicate, the well-known AI model cloud hosting platform, has announced that it is officially introducing and supporting NVIDIA H100 GPUs within its…
After a week that was expected to be turbulent but turned out to be quite calm, the latest issue of AINews briefly declares that "nothing major…
On February 18, 2025, Hugging Face announced the addition of three new partners to its serverless inference ecosystem: Hyperbolic, Nebius AI Studio, and Novita…
The AI deployment platform Replicate has announced the official availability of NVIDIA L40S GPU compute on its platform. This update aims to provide developers…
Hugging Face and NVIDIA announced a major partnership in late July 2024, officially launching a serverless inference service powered by NVIDIA NIM (NVIDIA…
Replicate has published its technical newsletter, Replicate Intelligence #4, summarizing recent major developments in the AI field as well as the latest…
The official blog of Replicate, the popular AI model hosting and deployment platform, has announced that NVIDIA H100 Tensor Core GPUs will soon be officially…
Hugging Face has announced a deep partnership with NVIDIA to directly integrate NVIDIA DGX Cloud services into the Hugging Face platform. This collaboration…
Hugging Face announced the launch of a new open-source library called "Optimum-NVIDIA," the result of a deep collaboration with NVIDIA, aimed at seamlessly…
Hugging Face's official blog announced a deep partnership with chip giant AMD, launching `optimum-amd`, an open-source library optimized specifically for AMD…
This Hugging Face official blog post introduces a major update that integrates AutoGPTQ into the `transformers` and `optimum` libraries. GPTQ (Generalized…
In June 2023, Hugging Face officially announced a long-term strategic partnership with chip giant AMD. The core objective of this collaboration is to optimize…
Hugging Face officially announced a new platform pricing structure, designed to provide more flexible and affordable options for community members…