Google’s DiffusionGemma is an Apache 2.0 experimental open model using text diffusion instead of standard autoregressive decoding. The 26B MoE model activates 3.8B parameters during inference and is designed for low-latency local workflows. Google claims up to 4x faster generation on dedicated GPUs, while noting that output quality is below standard Gemma 4 and production-quality use cases should still prefer Gemma 4.
Google is upgrading NotebookLM from a note-focused assistant into a research agent capable of multi-step work. The updated tool can analyze across documents, search the web, and help automate broader research workflows. It can also export results into formats such as presentations and documents, making it more useful for students, researchers, educators, and content creators who need to move from source material to finished outputs.
A r/LocalLLaMA user is looking for benchmarks comparing Gemma 4 4-bit QAT models, via Unsloth, against standard 8-bit non-QAT quantized models. They understand QAT is expected to preserve much of the BF16 baseline accuracy, but want hard numbers against traditional 8-bit PTQ. The post highlights scattered feedback but no clear head-to-head evaluation yet.
Google is rolling out broad updates to NotebookLM, its AI-powered note-taking and research app launched in 2023. The app now uses Google’s upgraded Gemini 3.5 model, which the company says should provide more accurate and reliable responses. The update also adds a cloud computer and help finding sources, expanding NotebookLM beyond source-based Q&A into a broader research assistant workflow.
A r/LocalLLaMA user shared quick throughput numbers for Gemma4 QAT with MTP speculative decoding on an RTX 3090 24GB setup. They report roughly 1.2-1.8x TPS improvement, with Gemma 4 31B moving from about 40 tok/s to 70-80 tok/s. The author frames this as a rough benchmark, using 11 task categories and noting stochastic variation from temp 1.0.
The article explains how modern LLMs convert text into token IDs, embeddings, and position-aware vectors before passing them through stacked transformer blocks. It covers attention, multi-head attention, KV cache, GQA, feed-forward networks, MoE, residual streams, normalization, and decoding. Its goal is educational: helping readers understand the common architecture behind many current model families and read model cards or papers more confidently.
Prominent scholar Ethan Mollick, in his latest article, points out that we have officially crossed beyond the era of simple "Chatbots" and entered what he…
On February 12, 2026, Google DeepMind announced the launch of its most advanced reasoning mode update — Gemini 3 Deep Think. This model is Google's…
Google DeepMind recently published an article exploring how its deep-reasoning model, "Gemini Deep Think," is transforming the landscape of mathematics and…
This official blog post from Google DeepMind details how AI is making a tangible impact in educational settings and helping to reduce the burden on teachers…
Google DeepMind has announced that its latest reasoning model, "Gemini 2.5 Deep Think," has achieved gold-medal-level performance at the International…
Wharton School professor Ethan Mollick has put together a highly personal and practical operating guide for the AI landscape of late 2025. He emphasizes that…
University of Pennsylvania Wharton School professor Ethan Mollick, in his latest article, compares the experience of collaborating with generative AI (such as…
University of Pennsylvania Wharton School professor Ethan Mollick recently published an extremely practical AI quick guide, "Using AI Right Now: A Quick…