Latest in AI

Showing:ResearchersOpen-sourceClear ×

🔥 Trending today

anthropic7 export-controls5 model-access3 ai-infrastructure3 spacex3 amazon3 national-security2 open-source2 governance2 ai-policy2

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

AINews: Fable and Mythos Access Suspended Over Cybersecurity Risk★ 76
Latent SpaceyesterdayIncident
Anthropic’s Claude Fable 5 and Mythos 5 were abruptly suspended after a US export-control directive tied to a possible jailbreak and national cybersecurity risk. The roundup frames the event as a new “model sovereignty” warning for teams relying on closed frontier APIs. It also covers Kimi-K2.7-Code, MiniMax M3, DeepSWE replacing SWE-Bench Pro, agent-inference benchmarks, sandboxing, and Gemini-SQL2.
Open Source AI Must Win
Hacker News (AI keywords)yesterdayOpinion
With no article body provided, the only supported reading is that this is an opinion piece advocating for open source AI. The title frames open source AI not merely as one option among many, but as something that “must win.” It likely targets readers interested in AI governance, developer ecosystems, model access, and competition, but no specific claims or evidence are available.
olmo-eval: An Evaluation Workbench for the Model Development Loop
Hugging Face Blog2 days agoNew Tool
The Hugging Face Blog post announces olmo-eval, described as an evaluation workbench for the model development loop. Based on the title alone, the project appears focused on helping teams evaluate models during iterative development rather than only after release. No article body was provided, so specific features, supported benchmarks, integrations, metrics, or usage details cannot be confirmed.
Open Reproduction of DeepSeek-R1
Hacker News (AI keywords)3 days agoRelease
The linked item is a GitHub project titled “Open Reproduction of DeepSeek-R1,” with no article body provided. From the title alone, it appears to be an effort to recreate or document DeepSeek-R1 in an open manner. The main relevance is for researchers and ML engineers interested in reproducible reasoning-model training, evaluation, and open-source alternatives.
Silia: A Tiny Transformer Architecture for Sub-10M Parameter Models
r/LocalLLaMA top day3 days agoPaper
A student from India shared their first paper on r/LocalLLaMA, proposing Silia, a Transformer architecture for extremely small models. The idea is to merge attention-style dynamic mixing with SwiGLU-like nonlinear transformation, aiming to save parameters in models under roughly 10M parameters. The author frames the work as an early, small-scale exploration, limited by old hardware and restricted access to larger compute.
NVIDIA Releases NVFP4-Quantized DiffusionGemma 26B A4B IT on Hugging Face
r/LocalLLaMA top day3 days agoRelease
NVIDIA has released DiffusionGemma 26B A4B IT NVFP4 on Hugging Face, a quantized version of Google DeepMind's open-weights multimodal model. Built on a Mixture-of-Experts architecture with 25.2B total but only 3.8B active parameters, it generates text in parallel 256-token blocks using discrete diffusion, exceeding 1,100 tokens per second on H100 hardware. The model supports a 256K-token context, text/image/video inputs, native function calling, reasoning mode, and 35+ languages.
[AINews] Open Models, Model Labs vs Agent Labs, and the Untrainable★ 72
Latent Space3 days agoCommentary
This AINews issue uses Sarah Guo’s essay as a lens for current AI industry debates: where open models matter, how agent labs differ from model labs, and what cannot be trained away. It also recaps discourse around Anthropic Fable/Mythos, Fable 5’s capabilities, Google’s DiffusionGemma, and maturing agent infrastructure. The central takeaway is that durable value may lie in integration, customer translation, maintenance, and intent rather than model scores alone.
Offline CPU Voice Loop for Ollama and LM Studio Agents
r/LocalLLaMA top day3 days agoNew Tool
A r/LocalLLaMA post introduces an offline voice loop for talking to local models through Ollama, LM Studio, or vLLM. The stack uses Silero VAD, Parakeet TDT 0.6B v3 STT, and Supertonic TTS 3, all running on CPU so GPU memory stays available for the LLM. The author reports measured CPU-only benchmarks, agent integrations, cross-platform installers, and an MIT-licensed GitHub release.
AMD Highlights Unified Memory Architecture for Future AI Systems
r/LocalLLaMA top day3 days agoHardware
A Reddit post in r/LocalLLaMA links to coverage of AMD discussing unified memory architecture and its role in future product roadmaps. The post says AMD believes UMA could help shape next-generation architectures and notes Ryzen AI MAX 400 series systems, also referred to by the community as Gorgon Halo. It frames the topic as part of an ongoing LocalLLaMA discussion about whether unified-memory x86 systems could matter for local AI workloads.
qwen3.6-27b Users Report Repeated Tool Call Loops
r/LocalLLaMA top day3 days agoIncident
A Reddit user on r/LocalLLaMA says qwen3.6-27b can fall into repeated tool-call loops during use. They report spending two days adjusting parameters such as temperature and top-k without resolving the issue. The post is a troubleshooting question rather than a confirmed bug report, asking whether other local model users have seen similar behavior.
Benchmarking Google Eloquent Exposes Major On-Device Dictation Reliability Issues
r/LocalLLaMA top day3 days agoBenchmark
A LocalLLaMA user tried to benchmark Google’s new fully local dictation app, Eloquent, against open ASR models such as Qwen3-ASR and NVIDIA Parakeet V3. The tester reported that roughly half of dictations returned only fragments, even during manual use. When Eloquent produced complete transcripts, its word error rate was competitive, but the missing-output behavior made the app unreliable for evaluation and practical use.
DiffusionGemma: Google Launches High-Speed Open-Weight Gemma Diffusion Model★ 76
Simon Willison's Weblog3 days agoRelease
Simon Willison highlights Google’s new DiffusionGemma, an Apache 2 licensed open-weight Gemma model. He connects it to last year’s brief Gemini Diffusion preview, which he measured at 857 tokens per second. NVIDIA is currently hosting the model for free on its NIM cloud API, where Willison generated 2,409 tokens in 4.4 seconds, implying at least 500 tokens per second.
Google DeepMind Releases DiffusionGemma: Open Source Model with 4x Local AI Execution Speed Improvement
Ars Technica AI3 days agoRelease
Google DeepMind has released DiffusionGemma, an open-source model that brings diffusion-based generation to text tasks. Unlike autoregressive LLMs that generate one token at a time, diffusion models can produce outputs in parallel, dramatically cutting latency. The result is reportedly a 4x speed improvement for local AI inference, making on-device deployment significantly more practical.
LocalLLaMA User Weighs QAT Gemma 31B GGUF Quants for RTX 3060
r/LocalLLaMA top day3 days agoCommentary
A Reddit user with an RTX 3060 12GB and 32GB DDR3 RAM is evaluating new QAT-based Gemma 31B GGUF quantizations. They currently run an older Unsloth Gemma 31B IQ3_XXS build at long context, with some tensor and mmproj offloading to CPU. The post asks which Q2-Q3 quant to choose, whether QAT changes quality expectations, and whether MTP would help or hurt under tight VRAM limits.
πfs: the data-free filesystem that “stores” data in π
Hacker News (AI keywords)3 days agoNew Tool
πfs is an open-source FUSE-style filesystem built around a deliberately absurd idea: data does not need to be stored if it can be located in pi. It records metadata such as file names and positions in pi, then reconstructs content from those locations. The project is more technical humor and conceptual demonstration than practical storage or AI tooling.
llama.cpp Merges MTP Optimization Removing Padding and Extra D2D Copies
r/LocalLLaMA top day3 days agoRelease
llama.cpp merged PR #24086, which changes ggml_gated_delta_net so MTP passes snapshot count K as an operation parameter instead of deriving it from tensor shape. The change removes a padding workaround and copies emitted snapshots into the recurrent cache with a single strided ggml_cpy. Benchmarks on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf showed about a 4% throughput gain, with wall time falling from 21.71s to 20.91s.
DiffusionGemma: 4x faster text generation★ 74
Google DeepMind Blog4 days agoRelease
Google’s DiffusionGemma is an Apache 2.0 experimental open model using text diffusion instead of standard autoregressive decoding. The 26B MoE model activates 3.8B parameters during inference and is designed for low-latency local workflows. Google claims up to 4x faster generation on dedicated GPUs, while noting that output quality is below standard Gemma 4 and production-quality use cases should still prefer Gemma 4.
Lemonade v10.7 Adds Omni Models, Benchmarks, and Cross-Vendor GPU Support
r/LocalLLaMA top day4 days agoRelease
Lemonade v10.7 marks a project-level shift toward working-group-driven development, with 19 contributors involved in the release. The update improves LMX-Omni virtual models for Open WebUI and OpenAI-compatible multimedia clients, introduces the `lemonade bench` CLI, and expands backend support. CUDA, Vulkan, llama.cpp, stable-diffusion.cpp, FastFlowLM, and vLLM are part of the broader push toward cross-vendor local AI performance.
NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI
NVIDIA Blog4 days agoRelease
Google DeepMind released DiffusionGemma, an experimental open model built for fast text generation. NVIDIA says it optimized the model for GeForce RTX GPUs, RTX PRO platforms, and DGX Spark systems. Instead of generating text one word at a time, DiffusionGemma produces multiple words in parallel to reduce latency for single-user workloads.
DiffusionGemma: The Developer Guide — Google Developers Blog
r/LocalLLaMA top day4 days agoTutorial
Google has released a comprehensive developer guide for DiffusionGemma, a text-generation model that uses masked diffusion rather than autoregressive next-token prediction. Unlike standard Gemma models, DiffusionGemma iteratively denoises a fully masked sequence to produce output, enabling a fundamentally different generation paradigm. The guide targets developers looking to integrate or experiment with diffusion-based LLMs using Google's tooling.
DiffusionGemma: 4x Faster Text Generation★ 76
Hacker News (AI keywords)4 days agoRelease
Google released DiffusionGemma, a 26B MoE experimental open model using text diffusion instead of token-by-token autoregressive decoding. It can generate blocks of text in parallel, reaching up to 4x faster output on dedicated GPUs. The model targets local, speed-sensitive workflows, but Google says its output quality is below standard Gemma 4 and recommends Gemma 4 for quality-critical production use.
SenseNova U1 Adds an Infographic-Specific Fine-Tune
r/LocalLLaMA top day4 days agoRelease
A Reddit post highlights a new infographic-specific fine-tune for SenseNova U1-8B-MoT, trained with an extended multi-task phase for structured visual output. The reported benchmarks show large gains in IGenBench infographic accuracy and chart understanding, with smaller improvement in text rendering. Aesthetic score appears roughly unchanged, suggesting the update mainly improves information structure and visual reasoning rather than overall visual polish.
Bonsai LM 1-bit and 1.58-bit Benchmarks on Jetson Orin Nano Super
r/LocalLLaMA top day4 days agoBenchmark
A LocalLLaMA post benchmarks five Bonsai LM models, from 1.7B to about 8B parameters, on a $250 Jetson Orin Nano Super 8GB using llama.cpp CUDA. The tests compare 7W, 15W, 25W, and MAXN modes across latency, throughput, energy per token, and thermals. The main takeaway is that 25W is usually the best efficiency/performance point for models up to 4B, while Bonsai-8B may favor 15W for lower power.
MooreThreads Releases MusaCoder-27B Code LLM on Hugging Face
r/LocalLLaMA top day4 days agoRelease
MooreThreads, a Chinese GPU semiconductor company best known for its MUSA compute platform, has released MusaCoder-27B on Hugging Face alongside a technical paper on arXiv. The 27B-parameter model is positioned as a code-generation LLM, extending MooreThreads' ambitions beyond hardware into the AI model layer. Its public availability on Hugging Face signals an open-weights approach, making it accessible to local-inference practitioners and researchers evaluating alternatives to Western-origin coding models.
OpenLumara Creator Challenges Reddit to Hack Its Public Agent Instance
r/LocalLLaMA top day4 days agoIncident
The creator of OpenLumara posted a public challenge asking r/LocalLLaMA users to try breaking into a Discord-hosted instance of the local-model agent. They claimed common prompt-engineering attacks would not work because modules and sandboxes were heavily locked down. The post later listed several successful findings, including missing path traversal protection, an authorization-check bypass, and another undisclosed exploit pending a fix.
Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question
r/LocalLLaMA top day4 days agoBenchmark
A Reddit user is running Qwen3.6-MTP-27B-MTP in Q4_K_M GGUF format with llama.cpp server on a 32GB Tesla V100. They report one peak of 55 tokens per second, but typical throughput is closer to 44-48 TPS. The post asks whether flags such as parallelism, speculative MTP draft settings, KV cache quantization, flash attention, and a 262K context window are limiting performance without improving output quality.
How Useful Is qwopus Compared With Qwen3.6 27B for Coding?
r/LocalLLaMA top day4 days agoOpinion
A Reddit user on r/LocalLLaMA asks for practical comparisons between qwopus and Qwen3.6 27B, specifically for coding work. They note conflicting community opinions, with some users calling qwopus worse and others saying it is much better. In their own simple tests, they did not notice clear differences and want feedback from people using these models for agentic coding.
Charting Local LLM Releases: 2025 Was the Peak, Not 2026
r/LocalLLaMA top day4 days agoCommentary
A r/LocalLLaMA community member shared visualizations tracking the volume of local LLM releases over time. Contrary to the perception that 2026 has been an unusually prolific year, the data indicates the actual release peak occurred in 2025. The poster attributes the misperception to the outsized quality improvements in 2026 making it feel more eventful than it quantitatively was.
Gemma 4 12B Unified Audio Loses Speech Attention with Large System Prompts
r/LocalLLaMA top day4 days agoCommentary
A developer building a single-pass voice assistant with Gemma 4 12B unified (encoder-free audio/vision/text model) finds that audio attention collapses once the system prompt grows to ~21k tokens. The model then ignores or hallucinates instead of responding to the spoken input. The issue reproduces identically on vLLM, llama.cpp, and LiteRT-LM, pointing to an architectural attention-saturation limit rather than a stack-specific bug.
Without Open Source LLMs, US AI Companies Could Have Monopolized the Technology
r/LocalLLaMA top day4 days agoOpinion
This r/LocalLLaMA post argues that open-source LLMs are an ethical duty because AI has broad social impact. The author worries that without open models, US AI companies could have monopolized access and potentially limited availability to US firms. They also frame China’s release of powerful open-source LLMs as a contribution to humanity, despite political disagreements.

Page 1Next →

Latest in AI

AINews: Fable and Mythos Access Suspended Over Cybersecurity Risk★ 76

Open Source AI Must Win

olmo-eval: An Evaluation Workbench for the Model Development Loop

Open Reproduction of DeepSeek-R1

Silia: A Tiny Transformer Architecture for Sub-10M Parameter Models

NVIDIA Releases NVFP4-Quantized DiffusionGemma 26B A4B IT on Hugging Face

[AINews] Open Models, Model Labs vs Agent Labs, and the Untrainable★ 72

Offline CPU Voice Loop for Ollama and LM Studio Agents

AMD Highlights Unified Memory Architecture for Future AI Systems

qwen3.6-27b Users Report Repeated Tool Call Loops

Benchmarking Google Eloquent Exposes Major On-Device Dictation Reliability Issues

DiffusionGemma: Google Launches High-Speed Open-Weight Gemma Diffusion Model★ 76

Google DeepMind Releases DiffusionGemma: Open Source Model with 4x Local AI Execution Speed Improvement

LocalLLaMA User Weighs QAT Gemma 31B GGUF Quants for RTX 3060

πfs: the data-free filesystem that “stores” data in π

llama.cpp Merges MTP Optimization Removing Padding and Extra D2D Copies

DiffusionGemma: 4x faster text generation★ 74

Lemonade v10.7 Adds Omni Models, Benchmarks, and Cross-Vendor GPU Support

NVIDIA Accelerates Google DeepMind’s DiffusionGemma for Local AI

DiffusionGemma: The Developer Guide — Google Developers Blog

DiffusionGemma: 4x Faster Text Generation★ 76

SenseNova U1 Adds an Infographic-Specific Fine-Tune

Bonsai LM 1-bit and 1.58-bit Benchmarks on Jetson Orin Nano Super

MooreThreads Releases MusaCoder-27B Code LLM on Hugging Face

OpenLumara Creator Challenges Reddit to Hack Its Public Agent Instance

Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question

How Useful Is qwopus Compared With Qwen3.6 27B for Coding?

Charting Local LLM Releases: 2025 Was the Peak, Not 2026

Gemma 4 12B Unified Audio Loses Speech Attention with Large System Prompts

Without Open Source LLMs, US AI Companies Could Have Monopolized the Technology