Unsloth uploaded a GGUF version of Cohere's North-Mini-Code 1.0 to Hugging Face, making local inference possible for this 30B A3B MoE coding-focused model. The poster links the release to llama.cpp PR #24260, suggesting new architecture support may be required. No benchmarks or test results have been shared yet; this is an early community resource post.
A r/LocalLLaMA user shared informal impressions of JetBrains Mellum 2, focusing on local coding-style tasks and tool calls. On an AMD Radeon RX 7900 XT with llama.cpp Vulkan and 131K context, the model reportedly generated around 111 tokens/s and stayed above 100 tokens/s near full context. The author stresses this is not a scientific benchmark, but a practical workflow-oriented test.
Xiaomi announced MiMo-V2.5-Pro-UltraSpeed with TileRT, claiming over 1,000 tokens/s decode speed on a 1-trillion-parameter MoE model. The company says it runs on a single standard 8-GPU commodity node, not wafer-scale or SRAM-heavy specialized hardware. The claimed stack combines FP4 MoE expert quantization, DFlash speculative decoding, and TileRT low-latency inference kernels, but independent validation is still needed.
Mistral AI introduced Mistral 3, a new open model family under Apache 2.0. It includes Mistral Large 3, a 675B-parameter sparse MoE with 41B active parameters, plus Ministral 3 models at 3B, 8B, and 14B. The release targets frontier open-weight use, multimodal and multilingual workflows, enterprise customization, and efficient local or edge deployments.
Mistral AI introduced Mistral Small 4 as the next major release in the Mistral Small family. It combines reasoning, multimodal, and agentic coding capabilities into one open model with configurable reasoning effort. The model uses a MoE architecture, supports a 256k context window and text-image inputs, and is available through Mistral API, AI Studio, Hugging Face, NVIDIA NIM, and common inference stacks.
Mistral AI introduced Mistral 3, a new open model family including Mistral Large 3 and Ministral 3 models at 3B, 8B, and 14B sizes. Large 3 is a 675B-parameter sparse MoE model with 41B active parameters, while Ministral 3 targets local and edge use cases. The models are released under Apache 2.0 and are available through Mistral AI Studio, Hugging Face, Amazon Bedrock, and other platforms.
Mistral Small 4 is the next major release in the Mistral Small family, unifying Magistral-style reasoning, Pixtral-style multimodality, and Devstral-style coding agents. It uses a MoE architecture with 119B total parameters, 6B active parameters per token, a 256k context window, and configurable reasoning effort. The model is available via Mistral API, AI Studio, Hugging Face, open-source serving stacks, and NVIDIA deployment options.
A popular Reddit thread on r/LocalLLaMA discusses the potential of 2-bit Quantization Aware Training (QAT) for large MoE models (120B to 400B). While current QAT efforts focus on 4-bit, users speculate whether a 2-bit QAT model could fit into consumer hardware (64GB/128GB RAM) and outperform a 4-bit model of half its size. This approach is proposed as a practical alternative to training ternary (1.58-bit) LLMs from scratch.
According to AINews, the AI research team Thinking Machines (affectionately nicknamed "Team Thinky" by the community) has recently unveiled a new native…
Mixture of Experts (MoE) has become the mainstream architecture for current large language models (LLMs). This article takes an in-depth look at how MoE…
This blog post from Hugging Face reviews the full year of technical evolution since the "DeepSeek Moment" at the start of 2025 — the release of DeepSeek-V3 and…
The DeepSeek-V3 and R1 models released in January 2025 have been hailed as the "DeepSeek Moment" in the AI world. This upheaval not only shattered the myth…
The Technology Innovation Institute (TII) of Abu Dhabi has officially launched the new Falcon 3 open-source model family on Hugging Face. This marks a major…
Snowflake recently launched a brand-new open-source large language model called "Snowflake Arctic" — a Mixture of Experts (MoE) model designed for…
In the large language model (LLM) space, the Mixture of Experts (MoE) architecture (as seen in models like Mixtral 8x7B) has proven capable of dramatically…
Looking back on 2023, the most notable trend in the AI landscape was the explosive growth of open-source large language models (Open LLMs). In this annual…
Mixture of Experts (MoE) has become a core technology for improving the performance and efficiency of today's large language models (LLMs). Traditional "dense…
French AI startup Mistral AI officially released its highly anticipated open-source Mixture of Experts (MoE) model — Mixtral 8x7B. The model caused a sensation…