JetBrains Mellum 2: a really good and performant model

Original: Jetbrains Mellum 2: a really good and performant model

A Reddit user reports strong local speed and tool-use results from JetBrains Mellum 2 on AMD hardware.

An r/LocalLLaMA user shared informal impressions of JetBrains Mellum 2, focusing on local coding-style tasks and tool calls. On an AMD Radeon RX 7900 XT with the llama.cpp Vulkan backend and a 131K-token context, the model reportedly generated around 111 tokens/s and stayed above 100 tokens/s near full context. The author stresses that this is a practical, workflow-oriented test, not a scientific benchmark.

This r/LocalLLaMA post describes one user's personal experience testing JetBrains Mellum 2. The focus is not strict academic benchmarking but the model's usefulness for daily development tasks, tool calls, and local inference speed. The author tested JetBrains/Mellum2-12B-A2.5B-Thinking, a 12B-parameter mixture-of-experts (MoE) model that activates about 2.5B parameters per token. The test machine had an AMD Radeon RX 7900 XT (20GB), an AMD Ryzen 9 3900X, and 128GB of DDR4 RAM. The backend was llama.cpp Vulkan build b9544, with the context set to 131,072 tokens and a bf16 KV cache. The author reported prompt evaluation at about 492.7 tokens/s and generation at about 111.2 tokens/s (roughly 9 ms per token), and stated that generation speed did not fall below 100 tokens/s even at around 130K tokens of context.
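The reported figures can be sanity-checked with simple arithmetic. The Python sketch below converts the quoted throughput numbers into per-token latency and estimates how long it would take to process a full 131,072-token prompt. The function names are illustrative, and the prompt-fill estimate assumes the prompt evaluation rate stays constant across the whole context, which the post does not claim; in practice the rate usually drops as the context grows, so treat the result as a rough lower bound.

```python
# Figures quoted in the post (informal, single-machine measurements).
GEN_TPS = 111.2           # generation speed, tokens/s
GEN_TPS_FLOOR = 100.0     # reported lower bound near full context, tokens/s
PROMPT_TPS = 492.7        # prompt evaluation speed, tokens/s
CONTEXT_TOKENS = 131_072  # configured context size


def ms_per_token(tokens_per_second: float) -> float:
    """Convert throughput (tokens/s) into latency (milliseconds per token)."""
    return 1000.0 / tokens_per_second


def seconds_to_process(num_tokens: int, tokens_per_second: float) -> float:
    """Time to process num_tokens at a constant rate (an assumption, see above)."""
    return num_tokens / tokens_per_second


gen_latency_ms = ms_per_token(GEN_TPS)          # about 9 ms, matching the post
floor_latency_ms = ms_per_token(GEN_TPS_FLOOR)  # 10 ms at the reported floor
prompt_fill_s = seconds_to_process(CONTEXT_TOKENS, PROMPT_TPS)

print(f"generation latency: {gen_latency_ms:.2f} ms/token")
print(f"latency at 100 tokens/s floor: {floor_latency_ms:.2f} ms/token")
print(f"full-context prompt fill (constant-rate estimate): {prompt_fill_s / 60:.1f} min")
```

Under that constant-rate assumption, filling the entire context would take a little over four minutes, while generation latency stays at or below 10 ms per token.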

