LinkerLLM

Multi-model LLM serving system with weight section sharing inspired by runtime linkers

status ACTIVE
type Research
started 2024-11-01
stack Python PyTorch CUDA Transformers

// DESCRIPTION

Weight-Sharing Runtime for Multi-Model LLM Serving

LinkerLLM is a serving runtime that exploits weight sharing across multiple fine-tuned LLM variants to dramatically reduce memory consumption and improve throughput in multi-tenant deployment scenarios. When organizations serve dozens of task-specific fine-tunes of the same base model, LinkerLLM identifies and deduplicates shared weight blocks at the tensor level.
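Tensor-level deduplication can be pictured as content-addressing each weight block and keeping one shared copy per distinct hash. This is an illustrative sketch only; the function and data-structure names (`dedup_weights`, the byte-blob representation of tensors) are hypothetical, not the actual LinkerLLM interface.

```python
import hashlib

def dedup_weights(models):
    """Share identical weight blocks across model variants by content hash.

    `models` maps model name -> {tensor name -> raw bytes}. Returns a single
    shared store plus per-model lookup tables pointing into it, so a tensor
    that appears in many fine-tunes is kept only once.
    (Hypothetical sketch, not the real LinkerLLM API.)
    """
    store = {}   # content hash -> one shared copy of the weight block
    tables = {}  # model name -> {tensor name -> content hash}
    for model, tensors in models.items():
        table = {}
        for name, blob in tensors.items():
            h = hashlib.sha256(blob).hexdigest()
            store.setdefault(h, blob)  # first writer stores; later models share
            table[name] = h
        tables[model] = table
    return store, tables
```

Two fine-tunes of the same base then reference one physical copy of every unchanged layer, and only their differing tensors consume extra memory.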

The runtime implements a copy-on-write memory management strategy where base model weights are loaded once and shared across all variants, with only the delta weights (LoRA adapters, fine-tuned layers) allocated per-model. A custom CUDA kernel handles the dynamic weight composition at inference time, fusing base and delta weights with minimal latency overhead.
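The composition the kernel performs is the standard LoRA update y = Wx + alpha * B(Ax): the full base matrix W is shared read-only, while only the small per-variant adapter matrices A and B are allocated per model. A minimal pure-Python sketch of that arithmetic (function names hypothetical; the real system does this in a fused CUDA kernel):

```python
def matvec(m, v):
    # Plain matrix-vector product over nested lists (stand-in for a GPU GEMV).
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(x, W, A, B, alpha=1.0):
    """y = W x + alpha * B (A x).

    W is the shared base weight (d_out x d_in); A (r x d_in) and B
    (d_out x r) are the low-rank per-variant deltas, with r << d_in.
    (Hypothetical sketch of the composition, not the CUDA kernel itself.)
    """
    base = matvec(W, x)            # shared across every variant
    delta = matvec(B, matvec(A, x))  # per-variant low-rank correction
    return [b + alpha * d for b, d in zip(base, delta)]
```

Because A and B have rank r much smaller than the hidden size, the per-variant memory cost is a small fraction of the base model's.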

In benchmarks with 16 concurrent model variants of Llama-2-7B, LinkerLLM achieves 12x memory reduction compared to naive per-model loading, while maintaining 95%+ of single-model inference throughput. The scheduler intelligently batches requests across variants that share the same base layers, further amortizing compute costs.
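The cross-variant batching idea reduces to grouping pending requests by their shared base model, so one forward pass over the common layers serves many variants at once. A minimal sketch under assumed request shapes (the `base`/`variant` keys are hypothetical, not LinkerLLM's scheduler interface):

```python
from collections import defaultdict

def batch_by_base(requests):
    """Group requests that share the same base model into one batch.

    `requests` is a list of dicts with at least a "base" key naming the
    shared base model; each batch can run the common layers once and
    apply per-variant deltas afterwards. (Hypothetical sketch.)
    """
    batches = defaultdict(list)
    for req in requests:
        batches[req["base"]].append(req)
    return dict(batches)
```

In practice the scheduler would also weigh sequence lengths and adapter ranks when forming batches, but the grouping key is the shared base.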

The system supports hot-swapping of model variants without service interruption and integrates with standard serving frameworks (vLLM, TGI) as a drop-in memory management backend.
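Zero-downtime swapping is commonly achieved by atomically replacing the table that maps variant names to adapters, so in-flight readers see either the old or the new table but never a partially updated one. A stdlib-only sketch of that pattern (class and method names are assumptions, not the real API):

```python
import threading

class VariantRegistry:
    """Hot-swap model variants by replacing the adapter table wholesale.

    Writers copy-and-swap under a lock; readers take a lock-free snapshot
    of the current table, so serving is never interrupted mid-update.
    (Hypothetical sketch of the swap pattern.)
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._adapters = {}  # variant name -> adapter object

    def swap(self, name, adapter):
        with self._lock:
            new_table = dict(self._adapters)  # copy
            new_table[name] = adapter
            self._adapters = new_table        # atomic reference swap

    def get(self, name):
        return self._adapters.get(name)  # reads a consistent snapshot
```

The same copy-and-swap move also handles removal: build a new table without the retiring variant and publish it once its in-flight requests drain.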

// HIGHLIGHTS

  • 12x memory reduction when serving 16 concurrent fine-tuned model variants
  • Copy-on-write tensor management with custom CUDA kernel for dynamic weight composition
  • 95%+ throughput retention compared to dedicated single-model serving
  • Hot-swappable model variants with zero-downtime deployment
  • Compatible with vLLM and TGI as a drop-in memory management backend