NVFP4 Support Merged in llama.cpp: How to Use 4-bit Blackwell Quantization
Original: NVFP4 on llama.cpp?
A Reddit discussion explores how to convert and run NVFP4 (NVIDIA FP4) quantized models on llama.cpp using Blackwell GPUs.
Now that native NVFP4 (NVIDIA FP4) support has been merged into llama.cpp, users are working out how to use the format on Blackwell GPUs such as the RTX 50 series. The discussion centers on two questions: how to convert NVFP4 safetensors checkpoints (for example, Gemma 4 QAT) to GGUF, and whether an importance matrix (imatrix) is required for the conversion. Posters expect the format to speed up local LLM inference on Blackwell hardware.
In the LocalLLaMA community, a thread asking how to use NVFP4 with llama.cpp has drawn a lot of attention. NVIDIA Blackwell GPUs (such as the RTX 50 series) are becoming more common, and `llama.cpp` has merged native support for NVFP4, NVIDIA's 4-bit floating-point format. People who run large models locally see this as a chance to get better performance out of that hardware.
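For readers unfamiliar with the format, here is a minimal Python sketch of the idea behind NVFP4 as NVIDIA describes it: weights are grouped into blocks of 16, each element is stored as a 4-bit E2M1 float, and each block carries its own scale. This is an illustration only, not llama.cpp's implementation. The function names are hypothetical, and the sketch keeps the block scale as a plain Python float, whereas the real format stores it as FP8 (E4M3) alongside a per-tensor FP32 scale.

```python
# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit),
# listed in the order of their 3-bit magnitude codes 0b000..0b111.
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # NVFP4 scales elements in blocks of 16


def quantize_block(block):
    """Quantize one block of floats to (scale, list of 4-bit codes).

    Simplification: the scale stays a Python float. Real NVFP4 rounds it
    to FP8 E4M3 and applies an additional per-tensor FP32 scale.
    """
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto 6.0, the largest E2M1 value.
    scale = amax / 6.0 if amax > 0 else 1.0
    codes = []
    for x in block:
        mag = abs(x) / scale
        # Pick the nearest representable magnitude.
        idx = min(range(len(E2M1_VALUES)), key=lambda i: abs(E2M1_VALUES[i] - mag))
        sign = 1 if x < 0 else 0
        codes.append((sign << 3) | idx)  # bit 3 = sign, bits 0..2 = magnitude
    return scale, codes


def dequantize_block(scale, codes):
    """Reconstruct approximate floats from a scale and 4-bit codes."""
    out = []
    for c in codes:
        mag = E2M1_VALUES[c & 0x7] * scale
        out.append(-mag if c & 0x8 else mag)
    return out


if __name__ == "__main__":
    weights = [0.12, -0.40, 0.03, 0.90, -0.75, 0.25, 0.0, 0.6,
               -0.05, 0.33, 0.48, -0.9, 0.7, -0.2, 0.15, 0.01]
    assert len(weights) == BLOCK_SIZE
    scale, codes = quantize_block(weights)
    restored = dequantize_block(scale, codes)
    for w, r in zip(weights, restored):
        print(f"{w:+.3f} -> {r:+.3f}")
```

Because only eight magnitudes are available per element, the per-block scale does most of the work of preserving dynamic range. That is one reason checkpoints trained with quantization in mind, such as the QAT models mentioned in the thread, are of interest for this format.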