The open-source project club-3090 has rolled out experimental FP8 quantization support for Qwen3.6-27B. This update is highly anticipated by dual RTX 3090 users, allowing them to run the model with significantly reduced VRAM requirements. According to reports, the official Qwen3.6-27B-FP8 model performs virtually identically to the original unquantized BF16 version.
A Reddit user highlighted a limitation in llama-server's router mode (`--models-preset`): child processes spawn and initialize CUDA contexts on all available GPUs, even when pinned to a single card. When other GPUs are fully utilized by a large model, launching a smaller model fails with a CUDA OOM error because it cannot allocate the context stub on the maxed-out cards. Currently, child processes inherit the base environment, preventing per-model `CUDA_VISIBLE_DEVICES` configuration.
As the parameter counts of generative AI and large language models (LLMs) push into the tens and hundreds of billions, the memory of a single GPU has long been…
This classic technical blog post from Hugging Face systematically guides developers in understanding and mastering distributed training techniques within the…
Hugging Face has officially released a new open-source library called `Accelerate` — a lightweight helper library designed for PyTorch that aims to solve the…