COMPLETED Systems

Infer-GC

GC-aware LLM inference engine applying garbage collection semantics (reference counting, generational collection, memory pooling) to Transformer KV cache management — 37.8% GPU memory reduction for long contexts, 37.6% for high concurrency, 52.9% for shared prefixes. 15.3K LOC, 68 tests.

status COMPLETED
type Systems
started 2024-08-01
completed 2025-01-01
stack Python PyTorch CUDA PagedAttention Prometheus

// DESCRIPTION

The Problem: LLM Inference Suffers from Memory Fragmentation and Unbounded KV Cache Growth

Running large language model inference at scale is a memory management problem as much as a compute problem. The KV cache — the stored key-value tensors for all previous tokens in a conversation — grows linearly with sequence length and can consume dozens of gigabytes of GPU memory for long-context requests. Three specific failure modes degrade production LLM serving:

  1. GPU memory fragmentation: As requests of varying lengths complete and new requests begin, the KV cache memory becomes fragmented. Small free regions between large allocations cannot serve new requests even when total free memory is sufficient. This mirrors the classic heap fragmentation problem that motivated garbage collection in programming languages decades ago.
  2. Unbounded KV cache growth: Naive implementations allocate KV cache greedily and never release it until the request completes. For speculative execution, beam search, and multi-turn conversations, this means cache entries that will never be needed again keep consuming GPU memory for the lifetime of the request.
  3. Tensor leaks: In complex inference pipelines with speculative decoding, rejected drafts, and early stopping, tensors that are no longer needed can be retained by Python reference cycles, causing gradual GPU memory exhaustion that forces periodic server restarts.
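
To make the scale of the growth problem concrete, here is a back-of-envelope estimate of per-request KV cache size, assuming a Llama-3-8B-like configuration (32 layers, 8 KV heads under grouped-query attention, head dimension 128, fp16); the exact figures are an illustration, not a measurement from the project:

```python
# Back-of-envelope KV cache size for a Llama-3-8B-like model.
# Assumed config: 32 layers, 8 KV heads (GQA), head dim 128, fp16.
def kv_cache_bytes(seq_len: int,
                   n_layers: int = 32,
                   n_kv_heads: int = 8,
                   head_dim: int = 128,
                   dtype_bytes: int = 2) -> int:
    # 2x for keys and values, stored per layer and per KV head.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes * seq_len

per_token = kv_cache_bytes(1)       # 128 KiB per token
long_ctx = kv_cache_bytes(32_768)   # 4 GiB for a single 32K-token request
```

Under these assumptions a single 32K-token request consumes roughly 4 GiB of GPU memory for its KV cache alone, which is why reclaiming dead cache matters.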

Situation & Task: Apply GC Semantics to the KV Cache

Infer-GC's central insight is that the KV cache management problem is structurally identical to the heap management problem that garbage collectors were designed to solve. The KV cache is a pool of fixed-format tensors with known reference patterns: tokens reference the cache entries of their attention window; completed generations release all their cache entries; prefix sharing means multiple requests reference the same cache entries.

By applying well-understood GC techniques — reference counting, generational collection, size-class memory pooling — to the KV cache, Infer-GC achieves automatic memory management that eliminates fragmentation, enables incremental collection during inference pauses, and supports copy-on-write prefix sharing.

The project is implemented in 15,300 lines of Python and CUDA with 68 tests covering unit correctness, memory accounting, and end-to-end generation throughput. Prometheus metrics expose real-time memory pressure, collection frequency, and cache hit rates.

Innovation: Three GC Techniques Applied to KV Cache

Reference Counting with Cycle Detection: Each KV cache block maintains a reference count. When a generation step completes and tokens are consumed or rejected (in speculative decoding), the reference counts of the corresponding cache blocks are decremented. When a block's count reaches zero, it is immediately returned to the free pool. A periodic cycle detector (inspired by CPython's cyclic GC) handles the rare case where beam search creates reference cycles between hypothesis states.
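
A minimal sketch of the per-block reference counting described above (class and method names are illustrative, not Infer-GC's actual API; cycle detection is omitted):

```python
from dataclasses import dataclass, field

@dataclass
class BlockManager:
    """Illustrative refcounted block manager: blocks return to the
    free list the instant their count reaches zero."""
    refcounts: dict = field(default_factory=dict)
    free_list: list = field(default_factory=list)
    next_id: int = 0

    def allocate(self) -> int:
        # Reuse a freed block if available, else mint a new block id.
        block = self.free_list.pop() if self.free_list else self._new_block()
        self.refcounts[block] = 1
        return block

    def _new_block(self) -> int:
        self.next_id += 1
        return self.next_id

    def incref(self, block: int) -> None:
        self.refcounts[block] += 1

    def decref(self, block: int) -> None:
        self.refcounts[block] -= 1
        if self.refcounts[block] == 0:      # immediate reclamation
            del self.refcounts[block]
            self.free_list.append(block)

mgr = BlockManager()
b = mgr.allocate()
mgr.incref(b)       # a second consumer shares the block
mgr.decref(b)
mgr.decref(b)       # count hits zero: block returns to the free pool
```

The key property is that reclamation is incremental and immediate, with no stop-the-world sweep on the common path.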

Generational Collection: KV cache blocks are classified into three generations based on age (measured in inference steps since allocation). Young-generation blocks (recent tokens) are the most likely to be released soon; old-generation blocks (prefix tokens shared across many requests) are unlikely to be released. Collection sweeps focus on young-generation blocks, spending minimal time on old-generation blocks that are still active. This exploits the generational hypothesis from garbage collection research: most allocations die young.
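
The age-based classification and young-focused sweep can be sketched as follows (the generation thresholds here are assumptions for illustration, not Infer-GC's tuned values):

```python
# Illustrative age thresholds, measured in inference steps.
YOUNG_MAX_AGE = 16
MIDDLE_MAX_AGE = 256

def generation_of(age_in_steps: int) -> str:
    if age_in_steps < YOUNG_MAX_AGE:
        return "young"
    if age_in_steps < MIDDLE_MAX_AGE:
        return "middle"
    return "old"

def sweep(blocks, current_step):
    # Scan only the young generation, where most blocks die,
    # leaving long-lived shared prefixes untouched.
    return [b for b in blocks
            if generation_of(current_step - b["allocated_at"]) == "young"
            and b["refcount"] == 0]

blocks = [
    {"allocated_at": 95, "refcount": 0},   # young and dead: collected
    {"allocated_at": 95, "refcount": 2},   # young but live: kept
    {"allocated_at": 0,  "refcount": 0},   # older generation: skipped
]
collected = sweep(blocks, current_step=100)
```

Restricting the sweep to young blocks keeps collection cost proportional to recent allocation activity rather than to total cache size.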

Size-Class Memory Pooling: Rather than allocating each KV cache block from the general CUDA memory allocator (which is slow and fragmentation-prone), Infer-GC maintains pre-allocated pools of blocks in standard size classes (64, 128, 256, 512, 1024 tokens). Allocations are satisfied from the appropriate size-class pool in O(1) time. The pool eliminates fragmentation by ensuring free blocks are always the right size for new requests.
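
A sketch of the size-class pool, using the size classes named in the text (plain Python objects stand in for pre-allocated CUDA tensors; the class is illustrative, not the project's API):

```python
import bisect

SIZE_CLASSES = [64, 128, 256, 512, 1024]  # tokens per block

class SizeClassPool:
    """Illustrative pool: every size class is pre-allocated up front,
    so the hot path never touches the general CUDA allocator."""

    def __init__(self, blocks_per_class: int = 4):
        self.pools = {c: [object() for _ in range(blocks_per_class)]
                      for c in SIZE_CLASSES}

    def allocate(self, n_tokens: int):
        # Round the request up to the smallest class that fits.
        # O(1) in practice: the number of classes is small and fixed.
        idx = bisect.bisect_left(SIZE_CLASSES, n_tokens)
        size_class = SIZE_CLASSES[idx]
        return size_class, self.pools[size_class].pop()

    def free(self, size_class: int, block) -> None:
        self.pools[size_class].append(block)

pool = SizeClassPool()
cls, blk = pool.allocate(200)   # rounds up to the 256-token class
pool.free(cls, blk)
```

Because a freed block always goes back to its own size class, every free block is by construction usable by any future request of that class, which is what eliminates external fragmentation.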

PagedAttention with Copy-on-Write: The attention mechanism is implemented using PagedAttention, where the KV cache is divided into fixed-size pages rather than contiguous allocations. Copy-on-write semantics allow multiple requests to share KV pages for common prefixes (system prompts, few-shot examples) without copying, with pages duplicated only when one request needs to modify shared content. This is the mechanism behind the 52.9% memory reduction for shared-prefix workloads.
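
A minimal copy-on-write page table sketch illustrating the sharing semantics (names and structure are hypothetical; the real implementation operates on fixed-size GPU pages inside the PagedAttention kernels):

```python
class CowPageTable:
    """Illustrative copy-on-write table: physical pages are shared
    until a writer needs a private copy."""

    def __init__(self):
        self.pages = {}       # physical page id -> page contents
        self.refcount = {}    # physical page id -> number of sharers
        self.next_id = 0

    def new_page(self, data) -> int:
        pid = self.next_id
        self.next_id += 1
        self.pages[pid] = data
        self.refcount[pid] = 1
        return pid

    def share(self, page_id: int) -> int:
        # Another request maps the same physical page; no copy made.
        self.refcount[page_id] += 1
        return page_id

    def write(self, page_id: int, data):
        if self.refcount[page_id] > 1:      # shared: copy before write
            self.refcount[page_id] -= 1
            return self.new_page(data)
        self.pages[page_id] = data          # exclusive: write in place
        return page_id

table = CowPageTable()
prefix = table.new_page(["system", "prompt"])
table.share(prefix)                    # second request reuses the page
private = table.write(prefix, ["edited"])  # copy triggered; prefix intact
```

The shared prefix is stored once in physical memory no matter how many requests map it, and a copy is paid only by the request that diverges.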

Results: Three Workload Scenarios

All results are compared against a baseline vLLM-style serving engine without GC-aware memory management, on an A100 GPU with Llama-3-8B:

  • Long context (8K–32K token sequences): 37.8% GPU memory reduction. The generational collector aggressively reclaims cache for completed prefix segments, preventing the unbounded growth that forces the baseline to reject long-context requests.
  • High concurrency (200+ simultaneous requests): 37.6% GPU memory reduction. Size-class pooling eliminates fragmentation, allowing the engine to serve 37% more concurrent requests in the same GPU memory budget.
  • Shared prefix workloads (same system prompt across requests): 52.9% GPU memory reduction. Copy-on-write PagedAttention stores the shared prefix KV cache once, serving hundreds of concurrent requests from the same physical memory.

Generation throughput (tokens/sec) is maintained within 2% of baseline across all three scenarios. The GC overhead — collection sweeps and reference count updates — occurs during inference micro-pauses (attention computation overlaps with collection) and does not appear on the critical path.

Engineering and Observability

The 15,300-line codebase is structured around a clean separation between the GC core (reference counting, generational classification, pool management) and the attention engine (PagedAttention, KV cache access patterns). The 68 tests include memory accounting tests (verifying that allocations and frees balance exactly), regression tests for the cycle detector, and end-to-end generation tests that measure both correctness and memory usage.

Prometheus metrics expose: current memory pressure per generation, collection frequency and duration, cache hit rate for shared prefixes, and allocation failure rate. This observability enables operators to tune GC parameters (collection thresholds, pool sizes) for their specific workload characteristics without modifying the core code.
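
To illustrate, here is what such metrics could look like rendered in the Prometheus text exposition format (the metric names below are hypothetical; the source does not list the actual names):

```python
def render_metrics(snapshot: dict) -> str:
    """Render a metrics snapshot in Prometheus text exposition format.
    Metric names are illustrative stand-ins for the real ones."""
    lines = []
    # Memory pressure labeled by GC generation.
    for gen, used in snapshot["memory_pressure_bytes"].items():
        lines.append(
            f'infer_gc_memory_pressure_bytes{{generation="{gen}"}} {used}')
    lines.append(
        f'infer_gc_collections_total {snapshot["collections_total"]}')
    lines.append(
        f'infer_gc_prefix_cache_hit_ratio {snapshot["prefix_hit_ratio"]}')
    return "\n".join(lines)

text = render_metrics({
    "memory_pressure_bytes": {"young": 1_073_741_824, "old": 268_435_456},
    "collections_total": 42,
    "prefix_hit_ratio": 0.87,
})
```

In practice these would be registered via a client library such as prometheus_client and scraped over HTTP; labeling memory pressure by generation is what lets operators see where a tuned collection threshold would pay off.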

// HIGHLIGHTS

  • 37.8% GPU memory reduction for long-context workloads (8K–32K tokens) — generational collection reclaims completed prefix cache
  • 37.6% GPU memory reduction for high-concurrency (200+ requests) — size-class pooling eliminates fragmentation
  • 52.9% GPU memory reduction for shared-prefix workloads — copy-on-write PagedAttention shares KV pages across hundreds of requests
  • Generation throughput maintained within 2% of baseline — GC overhead overlaps with attention computation
  • Three GC mechanisms: reference counting + cycle detection, generational collection (3 generations), size-class memory pooling (64–1024 token blocks)
  • PagedAttention with copy-on-write prefix sharing — multiple requests share identical KV cache pages for common system prompts
  • 15,300 LOC; 68 tests including memory accounting, cycle detector regression, and end-to-end generation correctness
  • Prometheus observability: per-generation memory pressure, collection latency, cache hit rate — operator-tunable GC parameters