LinkerLLM
Multi-model LLM serving system with weight section sharing inspired by runtime linkers
// DESCRIPTION
Weight-Sharing Runtime for Multi-Model LLM Serving
LinkerLLM is a serving runtime that exploits weight sharing across multiple fine-tuned LLM variants to cut memory consumption and improve throughput in multi-tenant deployments. When an organization serves dozens of task-specific fine-tunes of the same base model, LinkerLLM identifies and deduplicates the weight blocks they share at the tensor level.
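The deduplication idea is easiest to see in miniature. The sketch below assumes each checkpoint is available as an ordinary name-to-tensor dictionary and identifies identical tensors across variants by content hash, so each unique block is stored once; the function name and return shape are illustrative, not LinkerLLM's actual API.

```python
import hashlib
import torch

def dedup_state_dicts(state_dicts):
    """Deduplicate identical tensors across checkpoints by content hash.

    `state_dicts` maps a variant name to its {param_name: tensor} dict.
    Returns a shared tensor store plus per-variant views into it.
    (Illustrative sketch, not LinkerLLM's actual interface.)
    """
    store = {}          # content hash -> single shared tensor
    variant_views = {}  # variant -> {param_name: content hash}
    for variant, sd in state_dicts.items():
        views = {}
        for name, tensor in sd.items():
            # Hash the raw bytes of the tensor (assumes a numpy-compatible
            # dtype). Layers a fine-tune left untouched collapse to the
            # same entry as the base model's copy.
            raw = tensor.detach().cpu().contiguous().numpy().tobytes()
            key = hashlib.sha256(raw).hexdigest()
            store.setdefault(key, tensor)
            views[name] = key
        variant_views[variant] = views
    return store, variant_views
```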
The runtime implements a copy-on-write memory management strategy where base model weights are loaded once and shared across all variants, with only the delta weights (LoRA adapters, fine-tuned layers) allocated per-model. A custom CUDA kernel handles the dynamic weight composition at inference time, fusing base and delta weights with minimal latency overhead.
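As a rough illustration of the copy-on-write composition, written in plain PyTorch rather than the fused CUDA kernel, the layer below keeps one shared base weight and applies a per-variant LoRA delta on the fly. The class name and constructor signature are assumptions made for the sketch; the math is standard LoRA composition, y = x (W + (alpha/r) B A)^T.

```python
import torch

class COWLinear(torch.nn.Module):
    """Linear layer sharing one base weight across variants (copy-on-write).

    The base weight is never mutated; each variant contributes only its
    low-rank LoRA factors. Illustrative sketch of the composition the
    fused kernel performs at inference time.
    """
    def __init__(self, base_weight, lora_A=None, lora_B=None, alpha=16.0):
        super().__init__()
        self.base_weight = base_weight  # shared, read-only reference
        self.lora_A = lora_A            # (r, in_features), per-variant
        self.lora_B = lora_B            # (out_features, r), per-variant
        self.alpha = alpha

    def forward(self, x):
        y = torch.nn.functional.linear(x, self.base_weight)
        if self.lora_A is not None:
            r = self.lora_A.shape[0]
            # Delta path: two skinny matmuls instead of materializing
            # W + BA, so the shared base tensor is never copied per variant.
            y = y + (self.alpha / r) * torch.nn.functional.linear(
                torch.nn.functional.linear(x, self.lora_A), self.lora_B)
        return y
```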
In benchmarks with 16 concurrent model variants of Llama-2-7B, LinkerLLM achieves 12x memory reduction compared to naive per-model loading, while maintaining 95%+ of single-model inference throughput. The scheduler intelligently batches requests across variants that share the same base layers, further amortizing compute costs.
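A minimal version of such base-aware batching might look like the following, assuming each request records which base model its variant derives from; the type and field names are illustrative, not the scheduler's real data model.

```python
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Request:
    variant_id: str  # which fine-tune the prompt targets
    base_id: str     # base model the variant derives from
    prompt: str

def batch_by_base(requests, max_batch=32):
    """Group pending requests by shared base model so one pass over the
    shared layers serves many variants at once (illustrative sketch)."""
    buckets = defaultdict(list)
    for req in requests:
        buckets[req.base_id].append(req)
    batches = []
    for base_id, reqs in buckets.items():
        # Base-layer compute is amortized across every variant in a batch;
        # only the per-variant delta matmuls differ.
        for i in range(0, len(reqs), max_batch):
            batches.append((base_id, reqs[i:i + max_batch]))
    return batches
```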
The system supports hot-swapping of model variants without service interruption and integrates with standard serving frameworks (vLLM, TGI) as a drop-in memory management backend.
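One way to achieve zero-downtime swaps is to publish the variant table atomically so readers never observe a half-loaded adapter. The registry below is a hypothetical sketch of that idea in plain Python, not LinkerLLM's confirmed interface.

```python
import threading

class VariantRegistry:
    """Hypothetical sketch of hot-swappable variant registration.

    Adapters are published by replacing an immutable snapshot dict, so
    in-flight lookups always see a consistent table; the swap itself is
    a single reference assignment.
    """
    def __init__(self):
        self._lock = threading.Lock()
        self._variants = {}  # name -> adapter tensors / metadata

    def register(self, name, adapter):
        with self._lock:
            # Build the new table off to the side, then publish atomically.
            new = dict(self._variants)
            new[name] = adapter
            self._variants = new

    def retire(self, name):
        with self._lock:
            new = dict(self._variants)
            new.pop(name, None)
            self._variants = new

    def lookup(self, name):
        # Lock-free read: the snapshot reference is replaced, never
        # mutated in place.
        return self._variants.get(name)
```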
// HIGHLIGHTS
- 12x memory reduction when serving 16 concurrent fine-tuned model variants
- Copy-on-write tensor management with custom CUDA kernel for dynamic weight composition
- 95%+ throughput retention compared to dedicated single-model serving
- Hot-swappable model variants with zero-downtime deployment
- Compatible with vLLM and TGI as a drop-in memory management backend