Latest in AI

Showing:local-llmClear ×

🔥 Trending today

anthropic7 export-controls4 model-access3 spacex3 amazon3 national-security2 open-source2 governance2 ai-policy2 ai-regulation2

Topic

Release New Tool Tutorial Business Paper Benchmark Opinion Regulation

For

General Developers Designers Product Founders Marketing Researchers Students

Was BitNet a dead end? What happened to ternary LLMs?
r/LocalLLaMA top day5 days agoCommentary
A r/LocalLLaMA user questions whether BitNet and ternary LLMs were a dead end after earlier promise around efficient low-bit models. The post notes that the largest ternary model appears to remain around 2B parameters. It asks why frontier open-weight AI labs are not visibly pursuing the approach, but provides no technical evidence or definitive answer.
When every other post is an AI benchmark, best-model question, or slop app
r/LocalLLaMA top day6 days agoCommentary
This r/LocalLLaMA post is a meme-like complaint about the subreddit’s recent content quality. The author points to repeated AI-generated benchmark reports, recurring “best model” questions, and hastily built apps or engines presented as groundbreaking. It is not a technical release or evidence-based analysis, but it reflects frustration with noise, hype, and low-effort AI-generated discussion in local model communities.
LocalLLaMA post urges users not to join SpaceX, OpenAI, Anthropic IPOs
r/LocalLLaMA top day6 days agoOpinion
A popular r/LocalLLaMA post urges local LLM supporters not to invest in IPOs tied to SpaceX, OpenAI, or Anthropic. The author argues that frontier labs drive up demand and prices for GPUs, RAM, SSDs, HDDs, and NAS hardware, making local inference harder. The post also questions AI company valuations, but its claims are mostly opinion and speculation without cited evidence.
An Implementation of NanoQuant: A Flexible Binary Quantization Method
r/LocalLLaMA top day6 days agoNew Tool
A r/LocalLLaMA post presents an unofficial PyTorch implementation of NanoQuant, a 2026 post-training quantization method for dense transformers. The method factorizes weights into scaling vectors and binary matrices, then quantizes and fine-tunes blocks sequentially to reduce hardware requirements. Early Qwen3-0.6B and Qwen3-4B experiments are promising for base models, but instruct quality remains weak and highly dependent on calibration data.
I bundled a fully local LLM inside my Unity game
r/LocalLLaMA top day6 days agoRelease
A developer shared a Unity game, Simulation Simulator, that bundles a local LLM with no internet, cloud service, or API key required. The game is a campfire chat simulator about DMT, simulation theory, and a monitor-headed friend, with five endings driven by natural AI interaction. The author sees this as a path toward richer NPCs, while noting local TTS and translation are still too slow for smooth gameplay.
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax★ 72
r/LocalLLaMA top day6 days agoNew Tool
Luce Spark is an open-source MoE offload system for running 33B-35B A3B models on 16GB-class GPUs. It keeps frequently routed experts on GPU, stores the long tail in system RAM, and swaps cold experts through a bounded async cache. The author reports 13.3 GiB for Qwen3.6 35B-A3B and about 100 tok/s with Spark optimizations, but notes real 16GB GPU testing is still missing.
[3090] Gemma4 QAT + MTP quick TPS numbers
r/LocalLLaMA top day6 days agoBenchmark
A r/LocalLLaMA user shared quick throughput numbers for Gemma4 QAT with MTP speculative decoding on an RTX 3090 24GB setup. They report roughly 1.2-1.8x TPS improvement, with Gemma 4 31B moving from about 40 tok/s to 70-80 tok/s. The author frames this as a rough benchmark, using 11 task categories and noting stochastic variation from temp 1.0.
Gemma 4 Chat Template now has preserve thinking
r/LocalLLaMA top day6 days agoRelease
A r/LocalLLaMA post notes that Gemma 4’s chat template now has “preserve thinking.” The linked discussion points to google/gemma-4-31B-it on Hugging Face, suggesting a template-level change rather than a new model release or benchmark. The original post does not provide detailed usage notes, defaults, compatibility information, or measured effects.
What was your local daily driver for coding last week?
r/LocalLLaMA top day6 days agoCommentary
This r/LocalLLaMA post is a brief community poll asking users what their local coding daily driver was last week. The post asks commenters to share their favorite model and quant, but the provided text does not include poll options, results, or specific model names. Its value is mainly as a community signal for tracking local LLM coding preferences.
llama.cpp PR #24277 avoids KV cell copies in kv-cache
r/LocalLLaMA top day6 days agoRelease
ggml-org/llama.cpp merged PR #24277 by ggerganov, titled “kv-cache: avoid kv cells copies.” The Reddit post says the change improves MTP performance for Gemma-4 and was merged the previous day. It is available starting with the b9551 release, making it relevant for local inference users tracking llama.cpp performance updates.
Thoughts on Gemma4 12B vs 26A4B: Which Is Better?
r/LocalLLaMA top day6 days agoOpinion
The post asks the LocalLLaMA community to compare Gemma4 12B and 26A4B, explicitly excluding the 31B model from discussion. The user is mainly interested in creative tasks, writing, and chatting, with coding treated as optional rather than central. No benchmarks or examples are provided, so the post is best read as a model-selection question about subjective quality and practical use.
Google's Official Gemma 4 QAT Q4_0 GGUFs Have Higher Precision Than Unsloth's Q4_K_XL
r/LocalLLaMA top day6 days agoCommentary
An analysis of Gemma 4 QAT GGUF files reveals that Google's official 'Q4_0' releases actually employ a mixed-precision strategy. For smaller models like E2B and E4B, Google keeps critical token embeddings in Q6_K and certain projection weights in F16. This makes Google's Q4_0 files larger and more precise than Unsloth's 'Q4_K_XL' versions, which default to standard Q4_0 for almost all tensors.
Gemma 4 31B FP8 Matches Claude Sonnet 4.6 Medium in Custom Benchmark★ 75
r/LocalLLaMA top day6 days agoBenchmark
A Reddit user shared benchmark results showing Google's Gemma 4 31B (FP8) performing on par with Claude Sonnet 4.6 Medium. The custom evaluation harness tested complex tasks including Neo4j Cypher queries, entity extraction, agentic tool calling, Python coding, and multi-vector retrieval synthesis. This highlights how quantized mid-sized open-source models are closing the gap with leading proprietary frontier models.
User Shares Gemma 4 QAT Experience: Improved Quality and MTP Speedups
r/LocalLLaMA top day6 days agoOpinion
A Reddit user shared their experience with the Gemma 4 31B QAT (Quantization-Aware Training) model. Compared to traditional GGUF quants like Q6_K_L, the QAT version delivers noticeable quality improvements in roleplay and long-context tasks. Additionally, combining the QAT model with Multi-Token Prediction (MTP) yielded massive speedups, boosting generation speeds from ~20 t/s to up to 50 t/s.
club-3090 Adds Experimental FP8 Support for Qwen3.6-27B
r/LocalLLaMA top day6 days agoNew Tool
The open-source project club-3090 has rolled out experimental FP8 quantization support for Qwen3.6-27B. This update is highly anticipated by dual RTX 3090 users, allowing them to run the model with significantly reduced VRAM requirements. According to reports, the official Qwen3.6-27B-FP8 model performs virtually identically to the original unquantized BF16 version.
Qwen 3.6 27B DeepSWE Benchmark Results Highlight Gap Between Local and Closed-Source Models
r/LocalLLaMA top day6 days agoBenchmark
A community benchmark of Qwen 3.6 27B on DeepSWE yielded a score of 1.79% (18/20th place), slightly outperforming Haiku 4.5. Run on a single RTX 6000 Blackwell GPU via vLLM with reasoning enabled, the test averaged 32 minutes and 44k output tokens per task. The author notes that while Qwen 3.6 27B represents a 'poor man's local SOTA,' the massive gap compared to frontier closed models suggests local LLMs are struggling to keep pace in complex coding.
Exploring 2-bit QAT: Can Ultra-Compressed Large Models Outperform 4-bit Models Half Their Size?
r/LocalLLaMA top day6 days agoCommentary
A popular Reddit thread on r/LocalLLaMA discusses the potential of 2-bit Quantization Aware Training (QAT) for large MoE models (120B to 400B). While current QAT efforts focus on 4-bit, users speculate whether a 2-bit QAT model could fit into consumer hardware (64GB/128GB RAM) and outperform a 4-bit model of half its size. This approach is proposed as a practical alternative to training ternary (1.58-bit) LLMs from scratch.
MTP and QAT: What is the Relation? Running Gemma 4 31B in llama.cpp
r/LocalLLaMA top day6 days agoCommentary
A popular Reddit thread addresses user confusion over running Gemma 4 31B locally. It distinguishes between MTP (Multi-Token Prediction for inference speedup) and QAT (Quantization-Aware Training for preserving 4-bit quality). It also confirms that llama.cpp's new MTP support requires updated GGUF files and a secondary draft model file for acceleration.
Control 3D Avatars with Natural Language Using "Program as Weights" (programasweights)
r/LocalLLaMA top day7 days agoNew Tool
Developer Yuntian Deng introduced "programasweights," a framework that compiles plain-English descriptions into tiny, local action programs (loops, parallel tracks) to control 3D avatars. Instead of pre-defined buttons, users can command complex sequences like "wave while walking, then jump." The runtime code is open-source and runs entirely offline in the browser or via Python.
GMKtec Announces EVO-X3 Mini PC, Teases 192GB Ryzen AI MAX+ 495 "Strix Halo" Monster★ 78
r/LocalLLaMA top day7 days agoHardware
GMKtec has announced its EVO-X3 mini PC with upgraded I/O, including OCuLink and Wi-Fi 7. More importantly for local AI enthusiasts, the company teased a future model powered by AMD's flagship "Strix Halo" Ryzen AI MAX+ 495 APU. This upcoming monster will support up to 192GB of LPDDR5X memory, offering a highly anticipated, cost-effective alternative to Apple Silicon for running large local LLMs.
Qwen3.6 35B-A3B on a Laptop: A Local LLM "Zero to One" Milestone
r/LocalLLaMA top day7 days agoOpinion
A Reddit user detailed running Qwen3.6 35B-A3B (IQ3_XXS quantization) on an ASUS Zenbook Pro 14 (RTX 4060 8GB VRAM, 64GB RAM). Using llama.cpp, they achieved 27 TPS at 32k context and 18 TPS at 256k context. This setup serves as a highly capable, fully private local agent for file operations, CLI execution, and brainstorming, bypassing cloud privacy concerns.
Managing Multiple MCP Servers: How to Prevent Context Pollution and Token Waste
r/LocalLLaMA top day7 days agoCommentary
A popular Reddit thread on r/LocalLLaMA addresses the challenge of loading multiple Model Context Protocol (MCP) servers at startup, which floods the context window with tool definitions. Users are discussing potential solutions, including using MCP proxies/hubs to route requests through a single endpoint or implementing lazy-loading. This highlights a growing need for better orchestration tools as the local MCP ecosystem expands.
start-llama: A Handy CLI Launcher for llama-server with Easy Customization
r/LocalLLaMA top day7 days agoNew Tool
A developer has released 'start-llama', a command-line utility designed to simplify launching llama-server (llama.cpp). It allows users to manage sensible default configurations, support multiple server binaries, and apply per-model or command-line overrides. This tool streamlines local LLM deployment into a single, easily configurable step.
llama.cpp Gemma4 MTP Support Merged
r/LocalLLaMA top day7 days agoRelease
llama.cpp PR #23398 was merged on June 7, 2026, adding MTP support for Gemma4 models. The author reports over 2x average speedup on dense models, no observed speedup on MoE, and replicated AIME-26 results around 87%. Support currently covers 31B and 26B-4B variants, while E4B and E2B are not supported yet; multi-GPU may need extra draft-device configuration.
Fine-tuning an LLM to write docs like it's 1995
Hacker News (AI keywords)9 days agoTutorial
The author builds a corpus from old Microsoft manuals, cleans OCR text, generates instruction-style JSONL examples, and fine-tunes Llama 3.1 8B and Qwen 2.5 7B with QLoRA. Tests cover malloc(), a fictional Win32 API, and a deliberately anachronistic REST API prompt. Qwen fine-tunes transfer the period documentation style best, but the experiment also shows hallucination risks, tuning complexity, and why these models augment rather than replace technical writers.
Hugging Face 推出 transformers-to-mlx：讓 Apple Silicon 運行 AI 模型更簡單的重大整合★ 80
Hugging Face Blog59 days agoRelease
This article from the official Hugging Face blog, titled "The PR you would have opened yourself," focuses on the introduction of a brand-new technical…
llama.cpp 全新功能：更強大的模型管理機制（Model Management）與 Hugging Face 深度整合★ 85
Hugging Face Blog185 days agoRelease
The popular local large language model (LLM) inference tool `llama.cpp` has recently partnered with Hugging Face to launch a new "Model Management" mechanism…
介紹 AnyLanguageModel：適用於 Apple 平台的本地與遠端 LLM 統一 API★ 75
Hugging Face Blog206 days agoNew Tool
Hugging Face has officially launched a new open-source Swift framework called "AnyLanguageModel," designed to address the pain points faced by developers on…
大眾智能（Mass Intelligence）：從 GPT-5 到邊緣小模型，強大 AI 正在走向普及化★ 85
One Useful Thing (Mollick)289 days agoOpinion
In this article exploring "Mass Intelligence," University of Pennsylvania Wharton School professor Ethan Mollick reveals an imminent future: high-level…
NVIDIA Llama Nemotron Nano VLM 正式登陸 Hugging Face Hub★ 75
Hugging Face Blog351 days agoRelease
NVIDIA has partnered with Hugging Face to officially bring its latest lightweight vision-language model (VLM) — the **NVIDIA Llama Nemotron Nano VLM** — to the…

← PreviousPage 2Next →

Latest in AI

Was BitNet a dead end? What happened to ternary LLMs?

When every other post is an AI benchmark, best-model question, or slop app

LocalLLaMA post urges users not to join SpaceX, OpenAI, Anthropic IPOs

An Implementation of NanoQuant: A Flexible Binary Quantization Method

I bundled a fully local LLM inside my Unity game

Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax★ 72

[3090] Gemma4 QAT + MTP quick TPS numbers

Gemma 4 Chat Template now has preserve thinking

What was your local daily driver for coding last week?

llama.cpp PR #24277 avoids KV cell copies in kv-cache

Thoughts on Gemma4 12B vs 26A4B: Which Is Better?

Google's Official Gemma 4 QAT Q4_0 GGUFs Have Higher Precision Than Unsloth's Q4_K_XL

Gemma 4 31B FP8 Matches Claude Sonnet 4.6 Medium in Custom Benchmark★ 75

User Shares Gemma 4 QAT Experience: Improved Quality and MTP Speedups

club-3090 Adds Experimental FP8 Support for Qwen3.6-27B

Qwen 3.6 27B DeepSWE Benchmark Results Highlight Gap Between Local and Closed-Source Models

Exploring 2-bit QAT: Can Ultra-Compressed Large Models Outperform 4-bit Models Half Their Size?

MTP and QAT: What is the Relation? Running Gemma 4 31B in llama.cpp

Control 3D Avatars with Natural Language Using "Program as Weights" (programasweights)

GMKtec Announces EVO-X3 Mini PC, Teases 192GB Ryzen AI MAX+ 495 "Strix Halo" Monster★ 78

Qwen3.6 35B-A3B on a Laptop: A Local LLM "Zero to One" Milestone

Managing Multiple MCP Servers: How to Prevent Context Pollution and Token Waste

start-llama: A Handy CLI Launcher for llama-server with Easy Customization

llama.cpp Gemma4 MTP Support Merged

Fine-tuning an LLM to write docs like it's 1995

Hugging Face 推出 transformers-to-mlx：讓 Apple Silicon 運行 AI 模型更簡單的重大整合★ 80

llama.cpp 全新功能：更強大的模型管理機制（Model Management）與 Hugging Face 深度整合★ 85

介紹 AnyLanguageModel：適用於 Apple 平台的本地與遠端 LLM 統一 API★ 75

大眾智能（Mass Intelligence）：從 GPT-5 到邊緣小模型，強大 AI 正在走向普及化★ 85

NVIDIA Llama Nemotron Nano VLM 正式登陸 Hugging Face Hub★ 75