The post describes turning an unused Jetson Orin NX into a compact local LLM server for Hermes Agent testing. The goals were low noise, over 10 tok/s generation, 300 tok/s prompt processing, at least 65K context, and a custom case. After testing Gemma 4, Qwen 3.6, and many quant variants, the author reports Gemma 4 26B A4B UD Q2_K_XL reaching 66K context and 10.21 tok/s near 60K context.
Omi Health’s founder says he fine-tuned NVIDIA Parakeet TDT 0.6B v2 for clinical speech and released Omi Med STT v1 under CC-BY-4.0. The runtime supports Mac, Windows, and Linux, auto-selecting MLX, NeMo, or GGUF/parakeet.cpp backends. In the author’s held-out medical benchmark, it reports 2.37% medical-WER and 145× realtime on local A10 compute.
A community benchmark of Qwen 3.6 27B on DeepSWE yielded a score of 1.79% (18/20th place), slightly outperforming Haiku 4.5. Run on a single RTX 6000 Blackwell GPU via vLLM with reasoning enabled, the test averaged 32 minutes and 44k output tokens per task. The author notes that while Qwen 3.6 27B represents a 'poor man's local SOTA,' the massive gap compared to frontier closed models suggests local LLMs are struggling to keep pace in complex coding.