COMPLETED Research

Infer-GC

Garbage Collection-style memory management for LLM inference achieving 37-52% GPU memory reduction

// DESCRIPTION

Infer-GC applies garbage collection concepts from programming languages to optimize memory usage during LLM inference.

Core Concept

Traditional LLM Memory:
┌──────────────────────────────────────────────────┐
│ KV Cache grows unbounded → OOM                   │
│ [Token 1][Token 2][Token 3]...[Token N] 💥      │
└──────────────────────────────────────────────────┘

Infer-GC Memory:
┌──────────────────────────────────────────────────┐
│ GC identifies and reclaims unused memory         │
│ [Active][Active][Freed][Active][Freed][Active]   │
│ Reference counting + Memory pooling              │
└──────────────────────────────────────────────────┘
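To make the core idea concrete, here is a minimal reference-counting sketch. It is an illustration only, not the Infer-GC implementation; the names Block and KVBlockPool are hypothetical. It shows how a fixed-size KV-cache block is returned to a free list the moment no live sequence references it.

from dataclasses import dataclass


@dataclass
class Block:
    block_id: int
    refcount: int = 0          # number of live sequences referencing this block


class KVBlockPool:
    """Hands out KV-cache blocks and reclaims them when their refcount hits zero."""

    def __init__(self, num_blocks: int):
        self.blocks = [Block(i) for i in range(num_blocks)]
        self.free_ids = list(range(num_blocks))

    def allocate(self) -> Block:
        if not self.free_ids:
            raise MemoryError("KV cache exhausted")
        block = self.blocks[self.free_ids.pop()]
        block.refcount = 1
        return block

    def retain(self, block: Block) -> None:
        block.refcount += 1     # e.g. a forked/beam sequence shares the same prefix block

    def release(self, block: Block) -> None:
        block.refcount -= 1
        if block.refcount == 0:
            self.free_ids.append(block.block_id)   # reclaimed and immediately reusable


# Two sequences share a prompt prefix; the block is only freed once both release it.
pool = KVBlockPool(num_blocks=4)
prefix = pool.allocate()
pool.retain(prefix)
pool.release(prefix)            # first sequence finishes
pool.release(prefix)            # second finishes -> block returns to the free list
print(len(pool.free_ids))       # 4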

GC Strategies

  • Reference Counting: Track active token references
  • Memory Pooling: Reuse freed memory blocks
  • KV Cache Optimization: Compress old cache entries
  • Lazy Deallocation: Batch memory reclamation (see the allocator sketch below)
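Memory pooling and lazy deallocation can be combined in a single allocator. The sketch below is an assumption about how this could look on top of PyTorch; the PooledKVAllocator name, the block shape, and the reclaim threshold are invented for the example rather than taken from the project.

from collections import deque

import torch


class PooledKVAllocator:
    """Reuses freed KV-cache blocks (pooling) and reclaims them in batches (lazy)."""

    def __init__(self, block_shape=(16, 32, 128), dtype=torch.float16,
                 reclaim_batch=64):
        self.block_shape = block_shape      # (tokens, heads, head_dim) per block; illustrative
        self.dtype = dtype
        self.free_pool = deque()            # recycled blocks ready for reuse
        self.pending_free = []              # blocks waiting for batched reclamation
        self.reclaim_batch = reclaim_batch

    def allocate(self) -> torch.Tensor:
        # Memory pooling: prefer a recycled block over a fresh allocation.
        if self.free_pool:
            return self.free_pool.popleft()
        return torch.empty(self.block_shape, dtype=self.dtype)

    def free(self, block: torch.Tensor) -> None:
        # Lazy deallocation: queue the block, reclaim once a batch has accumulated.
        self.pending_free.append(block)
        if len(self.pending_free) >= self.reclaim_batch:
            self.free_pool.extend(self.pending_free)
            self.pending_free.clear()


allocator = PooledKVAllocator(reclaim_batch=2)
a, b = allocator.allocate(), allocator.allocate()
allocator.free(a)
allocator.free(b)                  # second free triggers the batched reclamation
assert allocator.allocate() is a   # the recycled block is handed back, not re-allocated

Batching the reclamation amortizes bookkeeping cost across many tokens, which is one plausible reason the throughput impact stays small.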

Results

Model         Memory Reduction   Throughput Impact
LLaMA-7B      37%                -2%
LLaMA-13B     45%                -3%
Mistral-7B    52%                -1%

Key Innovation

Unlike traditional memory optimizations, which require modifying the model itself, Infer-GC works as a runtime layer that can be applied to any transformer-based model.
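One way such a runtime layer could hook in is sketched below: wrap the model's per-token decode step and trigger a GC pass only when GPU memory pressure is high. The InferGCRuntime class and the gc_pass callback are hypothetical names for this illustration; only the torch.cuda calls are standard PyTorch APIs.

from typing import Callable

import torch


class InferGCRuntime:
    """Wraps an arbitrary per-token decode step and runs a GC pass between steps."""

    def __init__(self, decode_step: Callable, gc_pass: Callable[[], None],
                 pressure_bytes: int = 8 * 1024 ** 3):
        self.decode_step = decode_step    # the unmodified model's decode function
        self.gc_pass = gc_pass            # e.g. release dead blocks in a KV block pool
        self.pressure_bytes = pressure_bytes

    def step(self, *args, **kwargs):
        out = self.decode_step(*args, **kwargs)
        # Pay the reclamation cost only under memory pressure, leaving the
        # common fast path untouched.
        if torch.cuda.is_available() and torch.cuda.memory_allocated() > self.pressure_bytes:
            self.gc_pass()
            torch.cuda.empty_cache()
        return out

Because no model weights or forward code are touched, this style of wrapper is what makes the approach a drop-in addition to an existing inference loop.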

// HIGHLIGHTS

  • Novel application of GC concepts to ML
  • 37-52% memory reduction demonstrated
  • Minimal throughput impact (<3%)
  • Drop-in addition to existing inference pipelines

TECH_STACK

Python PyTorch CUDA Transformers

PROJECT_INFO

started: 2024-08-01
completed: 2025-01-01
status: COMPLETED
type: Research