A LocalLLaMA post benchmarks five Bonsai LM models, ranging from 1.7B to about 8B parameters, on a $250 Jetson Orin Nano Super 8GB using llama.cpp with CUDA. The tests compare the 7W, 15W, 25W, and MAXN power modes on latency, throughput, energy per token, and thermals. The main takeaway is that 25W is usually the best balance of efficiency and performance for models up to 4B, while Bonsai-8B may favor 15W for its lower power draw.
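For readers unfamiliar with the energy-per-token metric, a minimal sketch of how it is typically derived: joules per token equal average power in watts divided by throughput in tokens per second. The function names and the numbers below are hypothetical placeholders for illustration, not measurements or code from the post.

```python
def energy_per_token(avg_power_watts: float, tokens_per_second: float) -> float:
    """Joules per generated token = average power (W) / throughput (tokens/s)."""
    if tokens_per_second <= 0:
        raise ValueError("tokens_per_second must be positive")
    return avg_power_watts / tokens_per_second


def most_efficient(runs: dict[str, tuple[float, float]]) -> str:
    """Return the power-mode label with the lowest joules per token."""
    return min(runs, key=lambda mode: energy_per_token(*runs[mode]))


# Placeholder numbers for illustration only, NOT results from the post.
# Each entry maps a power-mode label to (average watts, tokens per second).
example_runs = {
    "15W": (12.0, 8.0),
    "25W": (18.0, 15.0),
}

if __name__ == "__main__":
    for mode, (watts, tps) in example_runs.items():
        print(f"{mode}: {energy_per_token(watts, tps):.2f} J/token")
    print("lowest energy per token:", most_efficient(example_runs))
```

A higher power mode can still come out ahead on this metric whenever throughput rises faster than power draw, which is one way a mid-range mode can end up as the efficiency sweet spot.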
A Reddit user shared benchmark results in which Google's Gemma 4 31B (FP8) performed on par with Claude Sonnet 4.6 Medium. The custom evaluation harness covered complex tasks including Neo4j Cypher queries, entity extraction, agentic tool calling, Python coding, and multi-vector retrieval synthesis. The result suggests that quantized mid-sized open-source models are closing the gap with proprietary frontier models.