Heaps do lie: debugging a memory leak in vLLM | EveryCorner

This Mistral AI engineering deep dive documents a debugging case that begins with a familiar AI infrastructure symptom and ends unusually deep in the stack. During pre-production testing of vLLM disaggregated inference, the team found that system memory rose linearly, by about 400MB per minute, with no error messages or crashes; at that rate the process would hit OOM after several hours. The problem reproduced only under a specific combination: vLLM, Mistral Medium 3.1, Prefill/Decode disaggregation, NIXL, and graph compilation enabled.
The team first suspected a high-level Python or vLLM logic issue, but Memray, Guppy 3, and Heaptrack revealed no leak within the heap. Heaptrack instead showed a stable heap that did not account for the RSS, which suggested that the leak was not in memory managed through the usual malloc/free path but in anonymous mmap regions or lower-level system allocations. The team then used pmap to inspect the process's memory mappings from /proc and found that some anonymous mappings kept growing and that their addresses changed.
To find out what was creating those mappings, they used BPFtrace to trace the mmap, munmap, and mremap syscalls. BPFtrace showed that the suspicious allocations came from mmap and were issued through the glibc raw syscall wrapper, bypassing the usual LD_PRELOAD hooks. Because BPFtrace could not reliably capture the complete user-space stack, the team automated GDB, setting conditional breakpoints at the specific syscall addresses and cross-referencing the addresses returned by the allocations with the growing regions observed in pmap. The stack ultimately pointed to UCX: to achieve InfiniBand/RDMA performance, UCX hooks mmap/munmap and maintains a registration cache, but this broad interception also captured ordinary memory operations from Python, NIXL, and vLLM. The root cause was that UCX places regions into an invalidation queue after munmap, and the default limit on unreleased regions is infinite, so the queue and its related allocations kept growing.
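The pmap step above can be approximated in a few lines of Python. The following is a minimal sketch, not the team's actual tooling: it assumes a Linux host, reads /proc/<pid>/smaps directly instead of calling pmap, and treats any mapping without a pathname as anonymous. The function names (anon_rss_kb, growth, watch) are hypothetical and chosen only for illustration.

```python
import re
import time
from typing import Dict, List, Tuple

Region = Tuple[int, int]  # (start address, end address)

# Header line of an smaps entry: "start-end perms offset dev inode [pathname]"
HEADER = re.compile(r"^([0-9a-f]+)-([0-9a-f]+)\s+(\S+)\s+\S+\s+\S+\s+\S+\s*(.*)$")


def anon_rss_kb(smaps_text: str) -> Dict[Region, int]:
    """Map each anonymous (pathname-less) region to its resident size in kB."""
    regions: Dict[Region, int] = {}
    current = None
    for line in smaps_text.splitlines():
        match = HEADER.match(line)
        if match:
            start, end, _perms, path = match.groups()
            # Named mappings (files, [heap], [stack]) are skipped.
            current = (int(start, 16), int(end, 16)) if not path.strip() else None
            if current is not None:
                regions[current] = 0
        elif current is not None and line.startswith("Rss:"):
            regions[current] = int(line.split()[1])
    return regions


def growth(before: Dict[Region, int], after: Dict[Region, int]) -> Tuple[int, List[Region]]:
    """Return (change in anonymous RSS in kB, regions that are new in `after`)."""
    delta_kb = sum(after.values()) - sum(before.values())
    new_regions = sorted(set(after) - set(before))
    return delta_kb, new_regions


def watch(pid: int, interval_s: float = 60.0, samples: int = 5) -> None:
    """Print anonymous RSS growth between successive snapshots of one process."""
    def snapshot() -> Dict[Region, int]:
        with open(f"/proc/{pid}/smaps") as handle:
            return anon_rss_kb(handle.read())

    previous = snapshot()
    for _ in range(samples):
        time.sleep(interval_s)
        current = snapshot()
        delta_kb, new_regions = growth(previous, current)
        print(f"anon RSS delta: {delta_kb} kB, new regions: {len(new_regions)}")
        previous = current
```

Calling watch(pid) on the serving process would print one line per interval. A leak like the one described would show up as a steadily positive delta, together with regions whose addresses keep changing, while heap profilers report nothing unusual.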
Mistral's fix for this vLLM setup was to set UCX_MEM_MMAP_HOOK_MODE=none: NIXL only needs to register one large contiguous block of KVCache memory, so disabling the hook does not affect performance. An alternative is to set UCX_RCACHE_MAX_UNRELEASED=1024. The team also worked with the vLLM, NIXL, and UCX maintainers to drive fixes and future changes to the defaults. The value of the article lies not in a new model release but in showing what debugging a large AI serving stack really involves: abstraction layers improve efficiency, but when performance-oriented low-level dependencies are involved, engineers may still need to trace all the way down to kernel tracing, syscalls, and memory hooks.
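As a rough illustration of how either setting could be applied, the sketch below builds a child-process environment carrying one of the two UCX variables named above. The variable names and values come from the article. The helper names (ucx_env, launch) and the idea of wrapping the server start in subprocess are assumptions for illustration only, as is the premise that UCX reads these variables when it initializes, which is why they are set before the process starts.

```python
import os
import subprocess
from typing import Dict, List, Optional


def ucx_env(
    base: Optional[Dict[str, str]] = None,
    disable_hook: bool = True,
    max_unreleased: int = 1024,
) -> Dict[str, str]:
    """Return a copy of the environment with one of the two workarounds applied."""
    env = dict(os.environ if base is None else base)
    if disable_hook:
        # Primary workaround: turn off UCX's mmap/munmap interception.
        env["UCX_MEM_MMAP_HOOK_MODE"] = "none"
    else:
        # Alternative: keep the hook but bound the unreleased-region queue.
        env["UCX_RCACHE_MAX_UNRELEASED"] = str(max_unreleased)
    return env


def launch(cmd: List[str], disable_hook: bool = True) -> subprocess.Popen:
    """Start `cmd` (a hypothetical serving command) with the UCX setting in place."""
    return subprocess.Popen(cmd, env=ucx_env(disable_hook=disable_hook))
```

Exporting the same variable in the shell or in the deployment manifest would have the same effect. The wrapper only makes explicit that the setting has to be in place before the process that loads UCX is started.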