A LocalLLaMA post benchmarks five Bonsai LM models, from 1.7B to about 8B parameters, on a $250 Jetson Orin Nano Super 8GB using llama.cpp with CUDA. The tests compare the 7W, 15W, 25W, and MAXN power modes across latency, throughput, energy per token, and thermals. The main takeaway is that 25W is usually the best balance of efficiency and performance for models up to 4B, while Bonsai-8B may be better run at 15W to reduce power draw.
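As a rough illustration of the energy-per-token metric, the sketch below divides average power draw by generation throughput, which gives joules per token. The function names and the sample figures are hypothetical placeholders, not measurements from the post, and a mode's nominal cap (for example 25W) is not necessarily the board's actual draw.

```python
def joules_per_token(avg_power_watts: float, tokens_per_second: float) -> float:
    """Energy per generated token: watts are joules per second, so W / (tok/s) = J/tok."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return avg_power_watts / tokens_per_second


def most_efficient_mode(measurements: dict) -> str:
    """Return the power mode with the lowest energy per token.

    `measurements` maps a mode name to (average watts, tokens per second).
    """
    return min(measurements, key=lambda mode: joules_per_token(*measurements[mode]))


# Hypothetical placeholder numbers for illustration only (NOT from the post).
example = {
    "7W": (7.0, 5.0),
    "15W": (15.0, 12.0),
    "25W": (25.0, 25.0),
}

if __name__ == "__main__":
    for mode, (watts, tps) in example.items():
        print(f"{mode}: {joules_per_token(watts, tps):.2f} J/token")
    print("most efficient:", most_efficient_mode(example))
```

A higher power mode can still be the more efficient choice when throughput rises faster than power draw, which is the kind of trade-off the benchmark reports.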
The post describes turning an unused Jetson Orin NX into a compact local LLM server for Hermes Agent testing. The goals were low noise, over 10 tok/s generation, 300 tok/s prompt processing, at least 65K context, and a custom case. After testing Gemma 4, Qwen 3.6, and many quant variants, the author reports Gemma 4 26B A4B UD Q2_K_XL reaching 66K context and 10.21 tok/s near 60K context.
QbitAI’s headline says a domestic Chinese team has built a 4B-parameter “cognitive model” suitable for edge deployment. The framing links it to a model direction previously associated with Andrej Karpathy. Since the article body was not provided, details such as the model name, architecture, benchmark results, hardware requirements, open-source status, and licensing remain unverified.
The Reddit post links to ggml-org/llama.cpp Pull Request #24282, which adds MTP support for Gemma-4 E2B and E4B assistants. The submitter frames it as useful for tiny Gemma models on phones, low-end machines, Raspberry Pi, or similarly constrained devices. The post does not include benchmarks, merge status, or setup instructions, so it should be treated as a development signal rather than a finished release.
Mistral AI introduced Mistral 3, a new open model family under Apache 2.0. It includes Mistral Large 3, a 675B-parameter sparse MoE with 41B active parameters, plus Ministral 3 models at 3B, 8B, and 14B. The release targets frontier open-weight use, multimodal and multilingual workflows, enterprise customization, and efficient local or edge deployments.
Mistral AI introduced Mistral 3, a new open model family including Mistral Large 3 and Ministral 3 models at 3B, 8B, and 14B sizes. Large 3 is a 675B-parameter sparse MoE model with 41B active parameters, while Ministral 3 targets local and edge use cases. The models are released under Apache 2.0 and are available through Mistral AI Studio, Hugging Face, Amazon Bedrock, and other platforms.
A r/LocalLLaMA user says they have tested many local TTS tools, but none match ElevenLabs for expressiveness, voices, and cloning. They list moss-nano and Kokoro as the best edge-device candidates so far, with edgeTTS as a free/cloud option. The post asks for community experience connecting agents such as Hermes, openclaw, or opencode to Telegram voice notes or real-time voice conversations.
A developer has shared a practical guide to clustering three NVIDIA Jetson Orin Nano Super boards, making use of their Ampere CUDA cores and unified memory. The project is part of 'smolcluster,' an initiative to make distributed AI training and inference accessible on everyday hardware such as Macs, Raspberry Pis, and Jetsons. The series explores whether heterogeneous clusters, which mix different hardware architectures, can run local LLMs effectively.
Google released new Gemma 4 checkpoints optimized with Quantization-Aware Training to preserve quality after compression. The release includes Q4_0 checkpoints and a mobile-focused quantization format that can reduce Gemma 4 E2B memory use to about 1GB, or below 1GB for a text-only configuration. The models are available through Hugging Face and supported across llama.cpp, Ollama, LM Studio, LiteRT-LM, Transformers.js, SGLang, vLLM, MLX, and Unsloth.
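To make the memory figure more concrete, here is a back-of-the-envelope sketch of weight storage under llama.cpp's Q4_0 layout, where each block of 32 weights is stored as 16 bytes of 4-bit values plus a 2-byte fp16 scale (4.5 bits per weight). The 2-billion parameter count is an assumed round number for illustration, not the actual Gemma 4 E2B count, and the estimate ignores tensors kept in other formats, the KV cache, and activations. It also does not model the separate mobile-focused format mentioned above.

```python
import math

Q4_0_BLOCK_WEIGHTS = 32        # weights per Q4_0 block
Q4_0_BLOCK_BYTES = 2 + 16      # fp16 scale (2 bytes) + 32 four-bit values (16 bytes)


def q4_0_bits_per_weight() -> float:
    """Effective storage cost per weight, including the per-block scale."""
    return Q4_0_BLOCK_BYTES * 8 / Q4_0_BLOCK_WEIGHTS


def q4_0_weight_bytes(n_params: int) -> int:
    """Approximate bytes needed if every weight were stored as Q4_0."""
    blocks = math.ceil(n_params / Q4_0_BLOCK_WEIGHTS)
    return blocks * Q4_0_BLOCK_BYTES


if __name__ == "__main__":
    assumed_params = 2_000_000_000  # hypothetical round figure, not an official count
    size_gb = q4_0_weight_bytes(assumed_params) / 1e9
    print(f"{q4_0_bits_per_weight()} bits/weight, about {size_gb:.3f} GB of weights")
```

Under these assumptions the weights alone land slightly above 1 GB, which is in the same range as the figure quoted for the release.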
At Computex 2026, NXP focused on Physical AI and introduced its Neural Axis architecture for edge devices. The architecture emphasizes low latency, high security, and hardware-based trust for real-time responses. The article frames this as important for robotics, autonomous vehicles, and other physical-world AI deployments where safe operation is essential.
Aitech announced it will integrate NVIDIA IGX Thor into its space supercomputer for low Earth orbit missions. The goal is to provide onboard AI edge computing and enable real-time inference directly in orbit. By processing more data in space, the system aims to reduce dependence on ground communications and extend AI compute beyond Earth-based infrastructure.
In this episode of the Latent Space podcast, the hosts and guest host Noah Smith (author of the well-known economics and technology blog Noahpinion)…
Google and Hugging Face have jointly announced a new generation of open-weight models — "Gemma 4." This model represents a major breakthrough in on-device AI…
IBM has officially launched its new lightweight multimodal model on Hugging Face — the Granite 4.0 3B Vision. With 3 billion (3B) parameters, this model is…
Import AI 448, written by Jack Clark, takes a deep dive into the latest developments in AI R&D, automated hardware optimization, and the…
Hugging Face has entered into a deep collaboration with semiconductor giant NXP Semiconductors, aimed at solving the challenge of deploying advanced…
A historic milestone has arrived in the open-source AI world: GGML and llama.cpp — the open-source projects founded by Georgi Gerganov that laid the…
As large language models (LLMs) develop in two divergent directions — with extremely large cloud-based models at one end and lightweight "Nano"-scale models…
Against the backdrop of explosive global growth in artificial intelligence, compute has become the core resource that determines technological competitiveness…
As healthcare demands increase and medical staffing shortages worsen, the development of medical robots — such as robots for ward supply delivery, assisted…
This article, jointly published by IBM and Hugging Face, delves into the technical details and application scenarios of the brand-new ultra-lightweight model…
Google DeepMind has officially announced the addition of a highly distinctive and specialized new member to its open-source model family — Gemma 3 270M. This…
Vision-language models (VLMs) combine computer vision with natural language processing, enabling complex tasks such as image captioning and visual question…
Arm has officially announced on the Hugging Face blog that it will actively participate in the upcoming PyTorch Conference. As the Arm architecture gains…
As AI Agent applications become increasingly widespread, running large language models (LLMs) efficiently on personal computers (such as AI PCs powered by…
Writer, a leading provider of enterprise AI solutions, has officially announced the launch of its new "Palmyra-mini" model series on the Hugging Face platform…
In this article exploring "Mass Intelligence," University of Pennsylvania Wharton School professor Ethan Mollick reveals an imminent future: high-level…
As generative AI advances rapidly, deploying massive models to resource-constrained edge devices — such as smartphones, smart hardware, and AI PCs — has become…
Arm and Hugging Face have announced a collaboration to launch "Neural Super Sampling (NSS)" technology and related models, officially bringing AI-driven image…
NVIDIA has partnered with Hugging Face to officially bring its latest lightweight vision-language model (VLM), the NVIDIA Llama Nemotron Nano VLM, to the…