Knowledge Distillation (KD) transfers knowledge from a large "teacher" model to a smaller "student" model. This makes it possible to deploy capable models in resource-constrained environments.
Why Knowledge Distillation?
- Deployment: Run models on edge devices
- Cost: Reduce inference costs
- Latency: Faster response times
- Privacy: On-device processing
Methods Overview
Logit-Based Methods
- Standard KD: Match temperature-softened output distributions (see the loss sketch after this list)
- DKD: Decouple target-class and non-target-class knowledge
- DIST: Match inter-class and intra-class relationships
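To make the logit-based family concrete, here is a minimal PyTorch sketch of the standard KD loss; the function name and the temperature default are illustrative assumptions, not taken from FisherKD-Unified.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            temperature: float = 4.0) -> torch.Tensor:
    """Standard KD: KL divergence between temperature-softened distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Multiply by T^2 so soft-target gradients stay on the same scale as the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```

In practice this term is mixed with the ordinary cross-entropy on ground-truth labels (see the combined-loss sketch under Implementation Tips).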
Feature-Based Methods
- FitNets: Match intermediate layer representations (hint-loss sketch after this list)
- AT (Attention Transfer): Transfer attention maps
- PKD: Patient Knowledge Distillation for transformers
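As an example of feature matching, below is a minimal FitNets-style hint-loss sketch, assuming convolutional feature maps with matching spatial size; the class name and the 1x1-conv channel adapter are illustrative choices, not the repo's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint loss: regress student features toward teacher features."""

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # A learned 1x1 conv adapts the student's channel count to the teacher's.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # MSE between adapted student features and detached teacher features.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())
```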
Relation-Based Methods
- RKD: Relational Knowledge Distillation, which transfers pairwise distance and angle relations between samples (distance-term sketch after this list)
- CRD: Contrastive Representation Distillation, which aligns student and teacher representations with a contrastive objective
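For illustration, a simplified sketch of RKD's distance-wise term: rather than matching features directly, it matches the normalized pairwise-distance structure within a batch of embeddings. The function name is an assumption.

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """RKD distance term: match pairwise-distance structure rather than raw features."""
    def normalized_pdist(x: torch.Tensor) -> torch.Tensor:
        d = torch.cdist(x, x, p=2)        # (batch, batch) pairwise distances
        mean_d = d[d > 0].mean()          # normalize out scale differences
        return d / (mean_d + 1e-8)

    with torch.no_grad():
        teacher_d = normalized_pdist(teacher_emb)
    student_d = normalized_pdist(student_emb)
    return F.smooth_l1_loss(student_d, teacher_d)
```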
Advanced Methods
- Fisher KD: Uses Fisher Information to weight important parameters (see the background sketch after this list)
- Adaptive KD: Dynamically adjusts distillation based on sample difficulty
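The exact Fisher KD formulation is specific to this repo. As general background only, the sketch below shows one common way Fisher information is used: estimating a diagonal (empirical) Fisher from squared gradients and using it to weight per-parameter importance against a same-shape anchor (e.g., a pretrained checkpoint of the student). Function names and the weighting scheme are assumptions, not the repo's implementation.

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn, n_batches: int = 10) -> dict:
    """Estimate a diagonal empirical Fisher from squared per-parameter gradients."""
    fisher = {name: torch.zeros_like(p)
              for name, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for i, (inputs, targets) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
    return {name: f / n_batches for name, f in fisher.items()}

def fisher_weighted_penalty(student, anchor_params: dict, fisher: dict):
    """Penalize deviation from anchor parameters, weighted by Fisher importance (assumed usage)."""
    penalty = 0.0
    for name, p in student.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    return penalty
```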
Implementation Tips
- Temperature tuning is critical (usually 3-20)
- Layer mapping between different architectures requires care
- Combine multiple methods (e.g., a logit loss plus a feature loss) for best results, as in the sketch below
- Monitor both accuracy and efficiency metrics
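Putting the tips together, here is a hedged sketch of a combined objective mixing hard-label cross-entropy, the softened-logit term, and an optional feature term. The weights alpha and beta and the temperature of 4 are illustrative starting points, not recommended values from FisherKD-Unified.

```python
import torch
import torch.nn.functional as F

def combined_distillation_loss(student_logits, teacher_logits, labels,
                               temperature: float = 4.0, alpha: float = 0.7,
                               beta: float = 1.0, feature_term: torch.Tensor = None):
    """Weighted mix of hard-label CE, softened-logit KD, and an optional feature loss."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    total = (1.0 - alpha) * ce + alpha * kd
    if feature_term is not None:
        total = total + beta * feature_term  # e.g., a FitNets or RKD term from the sketches above
    return total
```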
FisherKD-Unified implements all these methods in a unified framework for easy comparison.