Latest in AI

AINews: Fable and Mythos Access Suspended Over Cybersecurity Risk ★ 76
Latent Space | yesterday | Incident
Anthropic’s Claude Fable 5 and Mythos 5 were abruptly suspended after a US export-control directive tied to a possible jailbreak and national cybersecurity risk. The roundup frames the event as a new “model sovereignty” warning for teams relying on closed frontier APIs. It also covers Kimi-K2.7-Code, MiniMax M3, DeepSWE replacing SWE-Bench Pro, agent-inference benchmarks, sandboxing, and Gemini-SQL2.
qwen3.6-27b Users Report Repeated Tool Call Loops
r/LocalLLaMA top day | 3 days ago | Incident
A Reddit user on r/LocalLLaMA says qwen3.6-27b can fall into repeated tool-call loops during use. They report spending two days adjusting parameters such as temperature and top-k without resolving the issue. The post is a troubleshooting question rather than a confirmed bug report, asking whether other local model users have seen similar behavior.
Benchmarking Google Eloquent Exposes Major On-Device Dictation Reliability Issues
r/LocalLLaMA top day | 3 days ago | Benchmark
A LocalLLaMA user tried to benchmark Google’s new fully local dictation app, Eloquent, against open ASR models such as Qwen3-ASR and NVIDIA Parakeet V3. The tester reported that roughly half of dictations returned only fragments, even during manual use. When Eloquent produced complete transcripts, its word error rate was competitive, but the missing-output behavior made the app unreliable for evaluation and practical use.
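To see why fragment-only outputs wreck an evaluation even when complete transcripts score well, here is a minimal sketch of word error rate (word-level edit distance divided by reference length). The sample sentences are invented for illustration and are not taken from the post's benchmark; `word_error_rate` is a hypothetical helper, not part of any tool mentioned above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Invented example: one complete transcript, one fragment.
reference = "please schedule the meeting for ten tomorrow morning"
complete = "please schedule a meeting for ten tomorrow morning"
fragment = "please schedule"

wer_complete = word_error_rate(reference, complete)  # one substitution in 8 words
wer_fragment = word_error_rate(reference, fragment)  # six deletions in 8 words
print(wer_complete, wer_fragment)
```

A single wrong word gives 12.5% WER here, while the fragment gives 75%, so a tester has to decide whether to score fragments as failures or exclude them, and either choice changes the headline number.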
llama.cpp Merges MTP Optimization Removing Padding and Extra D2D Copies
r/LocalLLaMA top day | 4 days ago | Release
llama.cpp merged PR #24086, which changes ggml_gated_delta_net so MTP passes snapshot count K as an operation parameter instead of deriving it from tensor shape. The change removes a padding workaround and copies emitted snapshots into the recurrent cache with a single strided ggml_cpy. Benchmarks on DGX Spark with Qwen3.6-35B-A3B-UD-Q4_K_M.gguf showed about a 4% throughput gain, with wall time falling from 21.71s to 20.91s.
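The "about 4%" figure can be checked from the two wall times. The arithmetic below assumes both runs processed the same number of tokens, which the summary does not state explicitly.

```python
# Wall times reported for the DGX Spark benchmark, in seconds.
before_s = 21.71
after_s = 20.91

# Fraction of wall time saved.
time_saved_pct = (before_s - after_s) / before_s * 100

# Throughput gain, assuming identical work in both runs:
# tokens/before vs tokens/after reduces to before/after.
throughput_gain_pct = (before_s / after_s - 1) * 100

print(f"time saved: {time_saved_pct:.2f}%")
print(f"throughput gain: {throughput_gain_pct:.2f}%")
```

Both values land a little under 4%, which is consistent with the rounded figure in the summary.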
Qwen3.6-MTP-27B on Tesla V100: llama.cpp Throughput Tuning Question
r/LocalLLaMA top day | 4 days ago | Benchmark
A Reddit user is running Qwen3.6-MTP-27B-MTP in Q4_K_M GGUF format with llama.cpp server on a 32GB Tesla V100. They report one peak of 55 tokens per second, but typical throughput is closer to 44-48 TPS. The post asks whether flags such as parallelism, speculative MTP draft settings, KV cache quantization, flash attention, and a 262K context window are limiting performance without improving output quality.
How Useful Is qwopus Compared With Qwen3.6 27B for Coding?
r/LocalLLaMA top day | 4 days ago | Opinion
A Reddit user on r/LocalLLaMA asks for practical comparisons between qwopus and Qwen3.6 27B, specifically for coding work. They note conflicting community opinions, with some users calling qwopus worse and others saying it is much better. In their own simple tests, they did not notice clear differences and want feedback from people using these models for agentic coding.
Without Open Source LLMs, US AI Companies Could Have Monopolized the Technology
r/LocalLLaMA top day | 4 days ago | Opinion
This r/LocalLLaMA post argues that open-source LLMs are an ethical duty because AI has broad social impact. The author worries that without open models, US AI companies could have monopolized access and potentially limited availability to US firms. They also frame China’s release of powerful open-source LLMs as a contribution to humanity, despite political disagreements.
New to Local LLMs: Overwhelmed by Tool Choices, Model Naming, and Quantization
r/LocalLLaMA top day | 4 days ago | Tutorial
A first-time local LLM user installed ollama on Windows with gemma4 and qwen3.6, but quickly hit a wall of confusion around GUI tool selection, model size tradeoffs, and cryptic quantization naming like Q4_K_M and IQ4_XS. Despite owning high-end hardware (RTX 5090, 64GB DDR5, 9950X3D), the user lacks the foundational knowledge to make informed choices. The post highlights ongoing onboarding gaps in the local LLM ecosystem, where fragmented tooling and jargon-heavy documentation create steep barriers for newcomers.
OSCAR RotationZoo - Offline Spectral Covariance-Aware Rotation for 2-bit KV Cache Quantization
r/LocalLLaMA top day | 4 days ago | Paper
OSCAR applies offline-precomputed rotation matrices—derived from spectral covariance analysis—to reshape KV tensor distributions before 2-bit quantization, suppressing outliers and reducing rounding error. The rotation adds negligible inference overhead since it requires no runtime learning. GGUF downloads for Gemma-4-12B-it, Qwen3-32B, and Qwen3-4B-Thinking are available, with llama.cpp and sglang integrations and an arXiv paper.
Cohere North Mini Code 1.0
r/LocalLLaMA top day | 5 days ago | Release
CohereLabs’ North Mini Code 1.0 appears to have moved from early access to final release, with weights available on Hugging Face. The Reddit post describes it as a 30B A3B coding model. Its Artificial Analysis overall score of 28 trails Qwen 3.6 35B at 43, but its coding index score of 33 is close to Qwen’s 35 and above Gemma 4 26B’s 22.
Jetson Orin NX Build for Hermes Agent + Benchmarking
r/LocalLLaMA top day | 5 days ago | Hardware
The post describes turning an unused Jetson Orin NX into a compact local LLM server for Hermes Agent testing. The goals were low noise, over 10 tok/s generation, 300 tok/s prompt processing, at least 65K context, and a custom case. After testing Gemma 4, Qwen 3.6, and many quant variants, the author reports Gemma 4 26B A4B UD Q2_K_XL reaching 66K context and 10.21 tok/s near 60K context.
TinySearch v0.2.0: Lightweight Open Web-Search Tool for Local LLMs Now Defaults to SearXNG
r/LocalLLaMA top day | 5 days ago | Release
TinySearch is a lightweight open-source MCP/FastAPI tool that crawls, chunks, and reranks web results into an 8k-token context blob for small local LLMs. Version 0.2.0 replaces DuckDuckGo with SearXNG as the default backend after DDG began rate-limiting and CAPTCHAing automated requests. Users can point it at a self-hosted SearXNG instance; it integrates with Cline, Roo, and OpenCode agent setups.
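The chunk, rerank, and pack flow described above can be sketched roughly as follows. This is not TinySearch's code: the function names, the term-overlap scorer, and the use of word counts as a stand-in for tokens are all assumptions made for illustration, and a real reranker would be considerably stronger.

```python
def chunk_words(text: str, size: int) -> list[str]:
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def overlap_score(query: str, chunk: str) -> int:
    """Count chunk words that also appear in the query (toy reranker)."""
    query_terms = set(query.lower().split())
    return sum(1 for w in chunk.lower().split() if w in query_terms)


def build_context(query: str, pages: list[str],
                  chunk_size: int = 50, budget: int = 8000) -> str:
    """Rank all chunks by score, then pack them greedily under the budget.

    `budget` is counted in words here, as a rough proxy for tokens.
    """
    chunks = [c for page in pages for c in chunk_words(page, chunk_size)]
    ranked = sorted(chunks, key=lambda c: overlap_score(query, c), reverse=True)
    picked, used = [], 0
    for chunk in ranked:
        n = len(chunk.split())
        if used + n > budget:
            continue  # skip chunks that would overflow the budget
        picked.append(chunk)
        used += n
    return "\n\n".join(picked)


# Hypothetical pages standing in for crawled search results.
pages = [
    "llama cpp supports webgpu " * 3,
    "cooking pasta takes ten minutes " * 3,
]
context = build_context("llama webgpu", pages, chunk_size=6, budget=6)
print(context)
```

With the tiny budget used in the example, only the single best-matching chunk survives, which is the behavior a small-context local model needs from a search tool.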
ggml-webgpu improves prefill speeds for k-quants in llama.cpp PR
r/LocalLLaMA top day | 5 days ago | Benchmark
llama.cpp PR #24225 improves ggml-webgpu matrix multiplication performance for k-quants and refactors matmul paths for Q4/Q5/Q8 and k-quants. In pp512 tests on an M2 Pro, reported speedups range from about 1.33x to 3.78x across Q2_K, Q3_K, Q4_K, Q5_K, and Q6_K. The largest gains appear on Q3_K models, including Qwen and Gemma examples.
Packed twin inference doubles Qwen3.6-27B throughput on one MI50
r/LocalLLaMA top day | 5 days ago | Benchmark
A LocalLLaMA user shared an early packed-twin-inference experiment for local LLM acceleration. The idea resembles speculative decoding, but uses the same quantized model side-by-side instead of a smaller draft model. On a single AMD MI50, the author reports Qwen3.6-27B improving from 19.4 to 38.1 tok/s, with Q8-or-lower quantization as the main target.
JetBrains Mellum 2: a really good and performant model
r/LocalLLaMA top day | 5 days ago | Benchmark
An r/LocalLLaMA user shared informal impressions of JetBrains Mellum 2, focusing on local coding-style tasks and tool calls. On an AMD Radeon RX 7900 XT with llama.cpp Vulkan and 131K context, the model reportedly generated around 111 tokens/s and stayed above 100 tokens/s near full context. The author stresses this is not a scientific benchmark, but a practical workflow-oriented test.
Omi Med STT v1: Open-Weight Medical ASR Fine-Tuned from Parakeet 0.6B ★ 72
r/LocalLLaMA top day | 5 days ago | Release
Omi Health’s founder says he fine-tuned NVIDIA Parakeet TDT 0.6B v2 for clinical speech and released Omi Med STT v1 under CC-BY-4.0. The runtime supports Mac, Windows, and Linux, auto-selecting MLX, NeMo, or GGUF/parakeet.cpp backends. In the author’s held-out medical benchmark, it reports 2.37% medical-WER and 145× realtime on local A10 compute.
Pipeline parallelism in llama.cpp may be wasting your VRAM
r/LocalLLaMA top day | 5 days ago | Benchmark
The author compared three llama.cpp Vulkan builds: default 4 sched copies, 1 sched copy, and no pipeline parallelism. In their Qwen GGUF test, input and output throughput were nearly identical across all configurations. However, the default setting used about 1.5GB more VRAM for compute buffers and reduced usable context from roughly 113K tokens to around 88K, though parallel-request benefits were not tested.
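A back-of-the-envelope reading of those numbers, assuming the extra compute-buffer VRAM would otherwise have gone entirely to KV cache and treating "1.5GB" as 1.5e9 bytes. Both are assumptions; the post may have measured differently.

```python
# Figures from the summary above (approximate).
extra_vram_bytes = 1.5e9        # extra compute-buffer VRAM with default settings
context_without_pp = 113_000    # usable context with pipeline parallelism off
context_with_pp = 88_000        # usable context with the default 4 sched copies

tokens_lost = context_without_pp - context_with_pp
context_lost_pct = tokens_lost / context_without_pp * 100

# Implied KV-cache cost per token, under the assumption stated above.
bytes_per_token = extra_vram_bytes / tokens_lost

print(f"context lost: {tokens_lost} tokens ({context_lost_pct:.1f}%)")
print(f"implied KV cost: {bytes_per_token / 1000:.0f} kB per token")
```

Under these assumptions the default setting costs about a fifth of the usable context for a single-request workload, which matches the post's framing; the untested parallel-request case could still justify the buffers.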
Arguing with an AI bot posting outdated Llama 3.1 takes
r/LocalLLaMA top day | 5 days ago | Commentary
An r/LocalLLaMA post jokes about arguing with an AI bot that posted outdated commentary involving Llama 3.1. The author says such bots should enable web search instead of relying on stale knowledge. The post also mocks exaggerated model testimonial posts, using Qwen3.6 27B as a sarcastic example, making it more of a community quality complaint than technical news.
Qwen3.6-35B-A3B Tool Calling Benchmark: ByteShape vs Unsloth GGUFs
r/LocalLLaMA top day | 5 days ago | Benchmark
The post benchmarks eight Qwen3.6-35B-A3B GGUF quants from ByteShape and Unsloth using llama.cpp and tool-eval-bench. It compares f16, q8_0, and q4_0 KV cache quantization under short and long-context pressure, totaling 144 runs and roughly 300 GPU-hours. The author reports no clear ByteShape versus Unsloth winner, q8_0 as close to a free lunch, q4_0 as weaker, and long context as a major tool-calling degradation factor.
LocalLLaMA post tier list
r/LocalLLaMA top day | 6 days ago | Opinion
The author proposes a tier list for r/LocalLLaMA posts in response to complaints about declining post quality. Top-tier posts include new local model releases with GGUF/MLX or benchmark data, meaningful optimizations, complete hardware performance reports, and well-analyzed research. Low-tier posts include repeated toy benchmarks, unrelated cloud AI chatter, AI-generated slop, and thinly disguised ads for Claude-wrapper startups.
An Implementation of NanoQuant: A Flexible Binary Quantization Method
r/LocalLLaMA top day | 6 days ago | New Tool
An r/LocalLLaMA post presents an unofficial PyTorch implementation of NanoQuant, a 2026 post-training quantization method for dense transformers. The method factorizes weights into scaling vectors and binary matrices, then quantizes and fine-tunes blocks sequentially to reduce hardware requirements. Early Qwen3-0.6B and Qwen3-4B experiments are promising for base models, but instruct quality remains weak and highly dependent on calibration data.
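For readers new to the idea of scale-times-binary weights, the sketch below shows the simplest version of it: each row is replaced by its sign pattern and one scale equal to the row's mean absolute value. This is a generic row-scaled binarization chosen for illustration, not NanoQuant's actual factorization or the linked implementation, and the weight values are made up.

```python
def binarize_rows(weights: list[list[float]]) -> tuple[list[float], list[list[int]]]:
    """Return (per-row scale, sign matrix) so that W ~ scale[i] * sign[i][j]."""
    signs = [[1 if w >= 0 else -1 for w in row] for row in weights]
    # The mean absolute value is the scale that minimizes squared error
    # for a fixed sign pattern.
    scales = [sum(abs(w) for w in row) / len(row) for row in weights]
    return scales, signs


def reconstruct(scales: list[float], signs: list[list[int]]) -> list[list[float]]:
    """Rebuild the approximate weight matrix from scales and signs."""
    return [[s * b for b in row] for s, row in zip(scales, signs)]


# Toy 2x4 weight matrix (invented values).
W = [
    [0.5, -1.5, 1.0, -1.0],
    [0.25, 0.25, -0.25, 0.25],
]
scales, signs = binarize_rows(W)
W_hat = reconstruct(scales, signs)
print(scales)
print(W_hat)
```

The second row is recovered exactly because all its magnitudes are equal, while the first row loses its magnitude variation. That loss is the kind of error a calibration and fine-tuning stage would have to recover.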
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax ★ 72
r/LocalLLaMA top day | 6 days ago | New Tool
Luce Spark is an open-source MoE offload system for running 33B-35B A3B models on 16GB-class GPUs. It keeps frequently routed experts on GPU, stores the long tail in system RAM, and swaps cold experts through a bounded async cache. The author reports 13.3 GiB for Qwen3.6 35B-A3B and about 100 tok/s with Spark optimizations, but notes real 16GB GPU testing is still missing.
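The hot/cold split can be pictured as a pinned set plus a bounded least-recently-used cache. The class below is a hypothetical, synchronous toy that only counts simulated transfers; it does not reflect Luce Spark's actual data structures, eviction policy, or async prefetching.

```python
from collections import OrderedDict


class ExpertCache:
    """Toy model: pinned hot experts plus a bounded LRU cache for cold ones."""

    def __init__(self, pinned: list[int], capacity: int) -> None:
        self.pinned = set(pinned)   # hot experts, always resident on GPU
        self.capacity = capacity    # GPU slots available for cold experts
        self.cold = OrderedDict()   # least recently used entry comes first
        self.loads = 0              # simulated RAM-to-GPU transfers

    def fetch(self, expert_id: int) -> str:
        if expert_id in self.pinned:
            return "pinned"
        if expert_id in self.cold:
            self.cold.move_to_end(expert_id)  # mark as most recently used
            return "cache_hit"
        if len(self.cold) >= self.capacity:
            self.cold.popitem(last=False)     # evict least recently used
        self.cold[expert_id] = True
        self.loads += 1
        return "loaded"


# Invented routing trace: expert 0 is hot, the rest are cold.
cache = ExpertCache(pinned=[0, 1], capacity=2)
results = [cache.fetch(e) for e in [0, 5, 6, 5, 7, 6]]
print(results, cache.loads)
```

In this trace, four of the six fetches trigger a transfer, which shows why throughput in such a design depends heavily on how skewed the routing distribution is and how large the cold cache can be.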
[3090] Gemma4 QAT + MTP quick TPS numbers
r/LocalLLaMA top day | 6 days ago | Benchmark
An r/LocalLLaMA user shared quick throughput numbers for Gemma4 QAT with MTP speculative decoding on an RTX 3090 24GB setup. They report roughly 1.2-1.8x TPS improvement, with Gemma 4 31B moving from about 40 tok/s to 70-80 tok/s. The author frames this as a rough benchmark, using 11 task categories and noting stochastic variation from temperature 1.0.
mtmd adds video input support in llama.cpp ★ 72
r/LocalLLaMA top day | 6 days ago | Release
ggml-org/llama.cpp merged PR #24269, adding video input support to mtmd through mtmd-cli and /chat/completions, which also enables the web UI path. The implementation invokes a locally installed ffmpeg subprocess instead of bundling codec support, and currently extracts visual frames only, with no audio support yet. It was tested with Qwen3-VL-2B in CLI and Gemma 4 E4B in web UI, making local multimodal video experiments more accessible.
Building Pakistan Notice Helper: A Small AI Tool for a Very Local Safety Problem
Hugging Face Blog | 6 days ago | New Tool
Pakistan Notice Helper is a Build Small Hackathon project focused on suspicious notices in Pakistan, including bank, courier, tax, telecom, police, and government-style messages. It accepts text or screenshots, supports English and Urdu, and returns risk labels, red flags, explanations, and safer next steps. The author discusses choosing Qwen3.5 4B Q8 with llama.cpp, Modal, Gradio, and Hugging Face Spaces after balancing quality, cost, latency, cold starts, and safety constraints.
Leanstral: Open-Source Foundation for Trustworthy Vibe-Coding ★ 76
Mistral AI News | 6 days ago | Release
Mistral AI introduced Leanstral, an open-source code agent designed for Lean 4 and formal proof engineering. The model is available through Apache 2.0 weights, Mistral Vibe, and a Labs API endpoint. Mistral positions it as a cost-efficient alternative for verified coding workflows, with FLTEval benchmarks comparing it against Claude family models and large open-source competitors.
Introducing Mistral Small 4 ★ 76
Mistral AI News | 6 days ago | Release
Mistral AI introduced Mistral Small 4 as the next major release in the Mistral Small family. It combines reasoning, multimodal, and agentic coding capabilities into one open model with configurable reasoning effort. The model uses a MoE architecture, supports a 256k context window and text-image inputs, and is available through Mistral API, AI Studio, Hugging Face, NVIDIA NIM, and common inference stacks.
Remote agents in Vibe, powered by Mistral Medium 3.5 ★ 78
Mistral AI News | 6 days ago | New Tool
Mistral Medium 3.5 is a 128B dense model in public preview, combining instruction-following, reasoning, and coding with a 256k context window. It becomes the default model for Le Chat and Mistral Vibe. Vibe now supports remote coding agents that run asynchronously in the cloud, while Le Chat adds Work mode for longer multi-step tasks across connected tools.
Introducing Mistral Small 4 ★ 78
Mistral AI News | 6 days ago | Release
Mistral Small 4 is the next major release in the Mistral Small family, unifying Magistral-style reasoning, Pixtral-style multimodality, and Devstral-style coding agents. It uses a MoE architecture with 119B total parameters, 6B active parameters per token, a 256k context window, and configurable reasoning effort. The model is available via Mistral API, AI Studio, Hugging Face, open-source serving stacks, and NVIDIA deployment options.
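The gap between total and active parameters is the core of the MoE trade-off, and it is quick to quantify from the figures above. The calculation uses only the two parameter counts in the summary.

```python
# Parameter counts from the summary above.
total_params = 119e9   # all experts plus shared layers
active_params = 6e9    # parameters used per token

active_fraction = active_params / total_params
print(f"active per token: {active_fraction:.1%} of all weights")
```

Roughly 5% of the weights do work on any given token, so per-token compute resembles a much smaller dense model, while memory still has to hold all 119B parameters.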
Qwen3.7-Plus launches as a multimodal agent base for recreating desktop software
量子位 QbitAI | 6 days ago | Release
QbitAI’s headline says Qwen3.7-Plus has launched and positions it as a new foundation for multimodal agents. The highlighted capability is one-click recreation of professional desktop software, suggesting UI understanding and app-generation workflows. Since no article body is available, technical details, availability, benchmarks, licensing, and real-world reliability cannot be verified from the provided source.
