Latest in AI

An Implementation of NanoQuant: A Flexible Binary Quantization Method
r/LocalLLaMA top day · 6 days ago · New Tool
An r/LocalLLaMA post presents an unofficial PyTorch implementation of NanoQuant, a 2026 post-training quantization method for dense transformers. The method factorizes weights into scaling vectors and binary matrices, then quantizes and fine-tunes blocks sequentially to reduce hardware requirements. Early Qwen3-0.6B and Qwen3-4B experiments are promising for base models, but instruct quality remains weak and highly dependent on calibration data.
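To make the "scaling vectors and binary matrices" idea concrete, here is a minimal sketch of a generic sign-and-scale binarization: each weight row is approximated by one scale times a vector of +1/-1 values. This is not NanoQuant's actual algorithm, which the summary does not spell out; the function names `binarize_rows`, `reconstruct`, and `mean_squared_error` are hypothetical and chosen only for illustration.

```python
def binarize_rows(weights):
    """Approximate each row w as scale * sign(w), with scale = mean(|w|).

    Illustrative only: a generic binarization, not the NanoQuant method.
    """
    scales, signs = [], []
    for row in weights:
        scales.append(sum(abs(x) for x in row) / len(row))
        signs.append([1 if x >= 0 else -1 for x in row])
    return scales, signs


def reconstruct(scales, signs):
    """Rebuild the approximate weights from per-row scales and sign bits."""
    return [[s * b for b in row] for s, row in zip(scales, signs)]


def mean_squared_error(a, b):
    """Average squared difference between two same-shaped matrices."""
    diffs = [(x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    return sum(diffs) / len(diffs)


# Toy 2x2 weight matrix; real layers have thousands of rows and columns.
W = [[0.5, -1.5], [2.0, 2.0]]
scales, signs = binarize_rows(W)
W_hat = reconstruct(scales, signs)
error = mean_squared_error(W, W_hat)
```

Storing one sign bit per weight plus one scale per row is what drives the memory savings. The reconstruction error is the price, which is why methods in this family typically follow quantization with a calibration or fine-tuning step.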
I bundled a fully local LLM inside my Unity game
r/LocalLLaMA top day · 6 days ago · Release
A developer shared a Unity game, Simulation Simulator, that bundles a local LLM with no internet, cloud service, or API key required. The game is a campfire chat simulator about DMT, simulation theory, and a monitor-headed friend, with five endings driven by natural AI interaction. The author sees this as a path toward richer NPCs, while noting local TTS and translation are still too slow for smooth gameplay.
Luce Spark: a 35B MoE on a 16 GB GPU, without the offload tax ★ 72
r/LocalLLaMA top day · 6 days ago · New Tool
Luce Spark is an open-source MoE offload system for running 33B-35B A3B models on 16GB-class GPUs. It keeps frequently routed experts on GPU, stores the long tail in system RAM, and swaps cold experts through a bounded async cache. The author reports 13.3 GiB for Qwen3.6 35B-A3B and about 100 tok/s with Spark optimizations, but notes real 16GB GPU testing is still missing.
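The "hot experts on GPU, cold experts swapped through a bounded cache" pattern can be sketched with a small least-recently-used cache. This is an assumption-laden illustration, not Luce Spark's code: the class `ExpertCache`, the `load_fn` callback, and the LRU eviction policy are all hypothetical, and the sketch is synchronous where the project describes an async cache.

```python
from collections import OrderedDict


class ExpertCache:
    """Bounded LRU cache standing in for GPU-resident expert weights.

    Hypothetical sketch: load_fn represents fetching an expert from
    system RAM; the eviction policy here is plain LRU.
    """

    def __init__(self, capacity, load_fn):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn
        self._resident = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, expert_id):
        """Return expert weights, loading and evicting as needed."""
        if expert_id in self._resident:
            # Mark as most recently used so hot experts stay resident.
            self._resident.move_to_end(expert_id)
            self.hits += 1
            return self._resident[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)
        self._resident[expert_id] = weights
        if len(self._resident) > self.capacity:
            # Evict the least recently used expert.
            self._resident.popitem(last=False)
        return weights

    def resident_ids(self):
        """Expert ids currently held, least to most recently used."""
        return list(self._resident)


# Usage: two slots, router requests experts 0, 1, 0, 2 in order.
cache = ExpertCache(capacity=2, load_fn=lambda i: f"weights-{i}")
for expert in (0, 1, 0, 2):
    cache.get(expert)
```

In this toy run expert 1 is evicted when expert 2 arrives, because expert 0 was touched more recently. A real system would also need to overlap the transfer with compute, which is the part this sketch leaves out.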
[3090] Gemma4 QAT + MTP quick TPS numbers
r/LocalLLaMA top day · 6 days ago · Benchmark
An r/LocalLLaMA user shared quick throughput numbers for Gemma 4 QAT with MTP speculative decoding on an RTX 3090 24GB setup. They report a roughly 1.2-1.8x TPS improvement, with Gemma 4 31B moving from about 40 tok/s to 70-80 tok/s. The author frames this as a rough benchmark, using 11 task categories and noting stochastic variation from sampling at temperature 1.0.
Gemma 4 Chat Template now has preserve thinking
r/LocalLLaMA top day · 6 days ago · Release
An r/LocalLLaMA post notes that Gemma 4’s chat template now has “preserve thinking.” The linked discussion points to google/gemma-4-31B-it on Hugging Face, suggesting a template-level change rather than a new model release or benchmark. The original post does not provide detailed usage notes, defaults, compatibility information, or measured effects.
What was your local daily driver for coding last week?
r/LocalLLaMA top day · 6 days ago · Commentary
This r/LocalLLaMA post is a brief community poll asking users what their local coding daily driver was last week. The post asks commenters to share their favorite model and quant, but the provided text does not include poll options, results, or specific model names. Its value is mainly as a community signal for tracking local LLM coding preferences.
llama.cpp PR #24277 avoids KV cell copies in kv-cache
r/LocalLLaMA top day · 6 days ago · Release
ggml-org/llama.cpp merged PR #24277 by ggerganov, titled “kv-cache: avoid kv cells copies.” The Reddit post says the change improves MTP performance for Gemma-4 and was merged the previous day. It is available starting with the b9551 release, making it relevant for local inference users tracking llama.cpp performance updates.
Thoughts on Gemma4 12B vs 26A4B: Which Is Better?
r/LocalLLaMA top day · 6 days ago · Opinion
The post asks the LocalLLaMA community to compare Gemma4 12B and 26A4B, explicitly excluding the 31B model from discussion. The user is mainly interested in creative tasks, writing, and chatting, with coding treated as optional rather than central. No benchmarks or examples are provided, so the post is best read as a model-selection question about subjective quality and practical use.
Google's Official Gemma 4 QAT Q4_0 GGUFs Have Higher Precision Than Unsloth's Q4_K_XL
r/LocalLLaMA top day · 6 days ago · Commentary
An analysis of Gemma 4 QAT GGUF files reveals that Google's official 'Q4_0' releases actually employ a mixed-precision strategy. For smaller models like E2B and E4B, Google keeps critical token embeddings in Q6_K and certain projection weights in F16. This makes Google's Q4_0 files larger and more precise than Unsloth's 'Q4_K_XL' versions, which default to standard Q4_0 for almost all tensors.
Gemma 4 31B FP8 Matches Claude Sonnet 4.6 Medium in Custom Benchmark ★ 75
r/LocalLLaMA top day · 6 days ago · Benchmark
A Reddit user shared benchmark results showing Google's Gemma 4 31B (FP8) performing on par with Claude Sonnet 4.6 Medium. The custom evaluation harness tested complex tasks including Neo4j Cypher queries, entity extraction, agentic tool calling, Python coding, and multi-vector retrieval synthesis. This highlights how quantized mid-sized open-source models are closing the gap with leading proprietary frontier models.
Qwen 3.6 27B DeepSWE Benchmark Results Highlight Gap Between Local and Closed-Source Models
r/LocalLLaMA top day · 7 days ago · Benchmark
A community benchmark of Qwen 3.6 27B on DeepSWE yielded a score of 1.79%, placing 18th of 20 and slightly outperforming Haiku 4.5. Run on a single RTX 6000 Blackwell GPU via vLLM with reasoning enabled, the test averaged 32 minutes and 44k output tokens per task. The author notes that while Qwen 3.6 27B represents a 'poor man's local SOTA,' the massive gap compared to frontier closed models suggests local LLMs are struggling to keep pace in complex coding.
Exploring 2-bit QAT: Can Ultra-Compressed Large Models Outperform 4-bit Models Half Their Size?
r/LocalLLaMA top day · 7 days ago · Commentary
A popular thread on r/LocalLLaMA discusses the potential of 2-bit Quantization-Aware Training (QAT) for large MoE models (120B to 400B). While current QAT efforts focus on 4-bit, users speculate whether a 2-bit QAT model could fit into consumer hardware (64GB/128GB RAM) and outperform a 4-bit model of half its size. This approach is proposed as a practical alternative to training ternary (1.58-bit) LLMs from scratch.
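The size comparison in that thread comes down to simple arithmetic: weight storage is roughly parameters times bits per weight, divided by eight. The sketch below works that out under stated assumptions: it counts only raw weight bits and ignores quantization scales, KV cache, and activations, so real files will be somewhat larger. The helper name `weight_gib` is hypothetical.

```python
def weight_gib(params_billion, bits_per_weight):
    """Approximate raw weight storage in GiB.

    Ignores quantization metadata (scales, zero points), KV cache,
    and activations, so treat the result as a lower bound.
    """
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Sizes named in the thread: 120B to 400B models at 2 bits,
# compared with 4-bit models of half the parameter count.
for params in (120, 400):
    two_bit = weight_gib(params, 2)
    four_bit_half = weight_gib(params / 2, 4)
    print(f"{params}B @ 2-bit: {two_bit:.1f} GiB | "
          f"{params // 2}B @ 4-bit: {four_bit_half:.1f} GiB")
```

By this estimate a 2-bit model and a 4-bit model of half the parameter count occupy the same weight memory. The thread's question is therefore purely about quality at equal footprint, and this arithmetic cannot answer it.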
Control 3D Avatars with Natural Language Using "Program as Weights" (programasweights)
r/LocalLLaMA top day · 7 days ago · New Tool
Developer Yuntian Deng introduced "programasweights," a framework that compiles plain-English descriptions into tiny, local action programs (loops, parallel tracks) to control 3D avatars. Instead of pre-defined buttons, users can command complex sequences like "wave while walking, then jump." The runtime code is open-source and runs entirely offline in the browser or via Python.
llama.cpp Gemma4 MTP Support Merged
r/LocalLLaMA top day · 7 days ago · Release
llama.cpp PR #23398 was merged on June 7, 2026, adding MTP support for Gemma4 models. The author reports over 2x average speedup on dense models, no observed speedup on MoE, and replicated AIME-26 results around 87%. Support currently covers 31B and 26B-4B variants, while E4B and E2B are not supported yet; multi-GPU may need extra draft-device configuration.
Fine-tuning an LLM to write docs like it's 1995
Hacker News (AI keywords) · 9 days ago · Tutorial
The author builds a corpus from old Microsoft manuals, cleans OCR text, generates instruction-style JSONL examples, and fine-tunes Llama 3.1 8B and Qwen 2.5 7B with QLoRA. Tests cover malloc(), a fictional Win32 API, and a deliberately anachronistic REST API prompt. Qwen fine-tunes transfer the period documentation style best, but the experiment also shows hallucination risks, tuning complexity, and why these models augment rather than replace technical writers.
Hugging Face Launches transformers-to-mlx: A Major Integration That Makes Running AI Models on Apple Silicon Easier ★ 80
Hugging Face Blog · 59 days ago · Release
This article from the official Hugging Face blog, titled "The PR you would have opened yourself," focuses on the introduction of a brand-new technical…
New llama.cpp Feature: More Powerful Model Management and Deep Hugging Face Integration ★ 85
Hugging Face Blog · 185 days ago · Release
The popular local large language model (LLM) inference tool `llama.cpp` has recently partnered with Hugging Face to launch a new "Model Management" mechanism…
Mass Intelligence: From GPT-5 to Small Edge Models, Powerful AI Is Going Mainstream ★ 85
One Useful Thing (Mollick) · 289 days ago · Opinion
In this article exploring "Mass Intelligence," University of Pennsylvania Wharton School professor Ethan Mollick reveals an imminent future: high-level…
NVIDIA Llama Nemotron Nano VLM Officially Lands on Hugging Face Hub ★ 75
Hugging Face Blog · 351 days ago · Release
NVIDIA has partnered with Hugging Face to officially bring its latest lightweight vision-language model (VLM) — the **NVIDIA Llama Nemotron Nano VLM** — to the…
Open R1: How to Run OlympicCoder Locally with LM Studio for Software Development ★ 75
Hugging Face Blog · 451 days ago · Tutorial
Hugging Face has recently released an updated practical guide for the Open R1 project, walking developers through how to locally deploy and run "OlympicCoder"…
Introduction to GGML: The Key Technology for Running Large Language Models Efficiently on Consumer Hardware ★ 80
Hugging Face Blog · 670 days ago · Tutorial
GGML is a lightweight, zero-dependency C/C++ tensor library developed by Georgi Gerganov. It was originally designed to enable efficient local inference of the…
WWDC 24: Running Mistral 7B on Apple Devices with Core ML ★ 75
Hugging Face Blog · 692 days ago · Tutorial
Following Apple's major Core ML updates announced at WWDC 24, Hugging Face published a practical guide detailing how to convert the popular open-source large…
A Chatbot on Your Laptop: Running Phi-2 on Intel Meteor Lake ★ 70
Hugging Face Blog · 816 days ago · Tutorial
This technical blog post from Hugging Face details how to locally deploy and run Microsoft's lightweight Phi-2 language model (2.7 billion parameters) on a…
A Complete Guide to Running Llama 2 Locally: Mac, Linux, Windows, and Phones ★ 70
Replicate Blog · 1,058 days ago · Tutorial
This guide from Replicate provides detailed instructions on how to run Meta's open-source large language model Llama 2 locally on various operating systems…
LLaMA Weekly, Week 3: The Open-Source AI Ecosystem's Explosive Start
Replicate Blog · 1,184 days ago · Commentary
Within just three weeks of Meta releasing the LLaMA (Large Language Model Meta AI) model, the open-source community demonstrated an astonishing pace of…
