Luce Spark is an open-source MoE offload system for running 33B-35B A3B models on 16GB-class GPUs. It keeps frequently routed experts on GPU, stores the long tail in system RAM, and swaps cold experts through a bounded async cache. The author reports 13.3 GiB for Qwen3.6 35B-A3B and about 100 tok/s with Spark optimizations, but notes real 16GB GPU testing is still missing.
As diffusion models (such as Flux.1 and Stable Diffusion 3) continue to grow in parameter count — often reaching tens of billions or even hundreds of billions…
In the generative AI domain, latent diffusion models (such as Stable Diffusion, FLUX.1, etc.) operate in two main stages: first, denoising and generation take…
### Background and Challenges As generative AI technology evolves, image and video generation models are increasingly transitioning from traditional UNet…
During the inference process of large language models (LLMs), the self-attention mechanism needs to store the Key and Value vectors of historical tokens (i.e…
### Core Background and Challenges DeepFloyd IF is an advanced text-to-image model released by DeepFloyd, a research lab under Stability AI. Unlike the…
This technical blog post from Hugging Face documents in detail the practical process of optimizing inference for BLOOM, the open-source multilingual large…