WIP Research

LinkerLLM

A multi-model LLM serving system that shares weight sections across models, inspired by how runtime linkers map shared library code into multiple processes

// DESCRIPTION

Exploring how runtime linker concepts can enable efficient multi-model serving on limited hardware.

Concept

Traditional: Model A [Weights A] + Model B [Weights B] = 2x Memory

LinkerLLM: Model A [Shared + A-specific]
         + Model B [Shared + B-specific] = ~1.3x Memory
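
The ~1.3x figure follows from a simple relation: if a fraction s of the parameters can be stored once and shared, serving two models costs (2 - s) times the memory of one. A minimal sketch in Python (the 70% sharing fraction here is an illustrative assumption, not a measured result):

  # Memory multiplier for serving n models when a fraction `shared`
  # of their parameters is identical and stored only once.
  def memory_multiplier(shared: float, n_models: int = 2) -> float:
      return shared + n_models * (1.0 - shared)

  print(memory_multiplier(0.7))  # -> the ~1.3x figure above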

Approach

  • Weight Section Identification: find which parameter sections are shared across models and which are unique to each (fingerprinting sketch below)
  • Dynamic Loading: load shared sections once, then swap only the model-specific sections when switching models (see the server sketch after this list)
  • Memory Mapping: memory-map checkpoints so weight sections are paged in on demand rather than read eagerly (mmap sketch below)
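
A minimal sketch of section identification, assuming both checkpoints are plain PyTorch state dicts: fingerprint each parameter tensor and treat exact byte matches as shareable. Exact equality is the simplest possible criterion; relaxing it to tolerance-based matching for fine-tuned weights is part of the open research question.

  import hashlib
  import torch

  def fingerprint(t: torch.Tensor) -> str:
      # Hash the raw bytes; identical bytes mean the section can be stored once.
      return hashlib.sha256(
          t.detach().cpu().flatten().view(torch.uint8).numpy().tobytes()
      ).hexdigest()

  def partition_weights(state_a: dict, state_b: dict):
      shared, unique_a, unique_b = {}, {}, {}
      for name, tensor in state_a.items():
          other = state_b.get(name)
          if other is not None and other.shape == tensor.shape \
                  and fingerprint(other) == fingerprint(tensor):
              shared[name] = tensor   # stored once, referenced by both models
          else:
              unique_a[name] = tensor
      for name, tensor in state_b.items():
          if name not in shared:
              unique_b[name] = tensor
      return shared, unique_a, unique_b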
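
Dynamic loading can then keep one resident module and swap only the unique sections. A minimal sketch, assuming all served models share an architecture (same parameter names and shapes); `LinkedModelServer` is a hypothetical name, not an existing API:

  import torch

  class LinkedModelServer:
      """Shared sections loaded once; model-specific sections swapped on demand."""

      def __init__(self, model: torch.nn.Module, shared: dict, specifics: dict):
          self.model = model
          self.specifics = specifics  # {model_name: {param_name: tensor}}
          self.model.load_state_dict(shared, strict=False)  # shared: loaded once
          self.active = None

      def switch_to(self, name: str) -> torch.nn.Module:
          if name != self.active:
              # Only the unique sections are copied in; shared weights stay put.
              self.model.load_state_dict(self.specifics[name], strict=False)
              self.active = name
          return self.model

Under the sharing assumption above, a model switch then costs a copy of roughly 30% of the weights instead of a full reload.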
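
For the access-pattern side, safetensors (already standard in the Transformers ecosystem) memory-maps checkpoints, so tensors are paged in on first access rather than read eagerly. A sketch; the filename is hypothetical:

  from safetensors import safe_open

  # The file is memory-mapped: only the sections actually requested
  # are paged into RAM, which keeps inactive models cheap to hold.
  with safe_open("model_b_unique.safetensors", framework="pt", device="cpu") as f:
      unique_b = {name: f.get_tensor(name) for name in f.keys()}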

Target Hardware

Dual RTX 3090s (2 × 24 GB = 48 GB VRAM total) serving multiple 7B+ models concurrently
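
Rough VRAM budget, assuming fp16 weights (2 bytes per parameter) and the illustrative 70% sharing fraction from above:

  params_7b  = 7e9
  weights_gb = params_7b * 2 / 1024**3   # ~13 GB of fp16 weights per model
  naive_two  = 2.0 * weights_gb          # ~26 GB: two models, no sharing
  linked_two = 1.3 * weights_gb          # ~17 GB: ~70% of sections shared
  headroom   = 48 - linked_two           # ~31 GB left for KV cache etc.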

// HIGHLIGHTS

  • Novel runtime-linker-inspired approach to weight sharing
  • Multi-model serving research
  • Experimental framework in progress

TECH_STACK

Python PyTorch CUDA Transformers

PROJECT_INFO

started: 2024-11-01
status: WIP
type: Research