Knowledge Distillation (KD) transfers knowledge from a large "teacher" model to a smaller "student" model. This makes it possible to deploy capable models in resource-constrained environments.
Why Knowledge Distillation?
- Deployment: Run models on edge devices
- Cost: Reduce inference costs
- Latency: Faster response times
- Privacy: On-device processing
Methods Overview
Logit-Based Methods
- Standard KD: Match temperature-softened output distributions (see the loss sketch after this list)
- DKD: Decouple target-class and non-target-class knowledge
- DIST: Match inter-class and intra-class relationships
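To make the logit-based family concrete, here is a minimal PyTorch sketch of the standard KD loss; the function name and the temperature default are illustrative assumptions, not taken from FisherKD-Unified.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
            temperature: float = 4.0) -> torch.Tensor:
    """Standard KD: KL divergence between temperature-softened distributions."""
    log_p_student = F.log_softmax(student_logits / temperature, dim=-1)
    p_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    # Multiply by T^2 so soft-target gradients stay on the same scale as the hard-label loss.
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * temperature ** 2
```

In practice this term is mixed with the ordinary cross-entropy on ground-truth labels (see the combined-loss sketch under Implementation Tips).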
Feature-Based Methods
- FitNets: Match intermediate layer representations (hint-loss sketch after this list)
- AT (Attention Transfer): Transfer attention maps
- PKD: Patient Knowledge Distillation for transformers
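As an example of feature matching, below is a minimal FitNets-style hint-loss sketch, assuming convolutional feature maps with matching spatial size; the class name and the 1x1-conv channel adapter are illustrative choices, not the repo's API.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint loss: regress student features toward teacher features."""

    def __init__(self, student_channels: int, teacher_channels: int):
        super().__init__()
        # A learned 1x1 conv adapts the student's channel count to the teacher's.
        self.regressor = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, student_feat: torch.Tensor, teacher_feat: torch.Tensor) -> torch.Tensor:
        # MSE between adapted student features and detached teacher features.
        return F.mse_loss(self.regressor(student_feat), teacher_feat.detach())
```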
Relation-Based Methods
- RKD: Relational Knowledge Distillation, which transfers pairwise distance and angle relations between samples (distance-term sketch after this list)
- CRD: Contrastive Representation Distillation, which aligns student and teacher representations with a contrastive objective
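For illustration, a simplified sketch of RKD's distance-wise term: rather than matching features directly, it matches the normalized pairwise-distance structure within a batch of embeddings. The function name is an assumption.

```python
import torch
import torch.nn.functional as F

def rkd_distance_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    """RKD distance term: match pairwise-distance structure rather than raw features."""
    def normalized_pdist(x: torch.Tensor) -> torch.Tensor:
        d = torch.cdist(x, x, p=2)        # (batch, batch) pairwise distances
        mean_d = d[d > 0].mean()          # normalize out scale differences
        return d / (mean_d + 1e-8)

    with torch.no_grad():
        teacher_d = normalized_pdist(teacher_emb)
    student_d = normalized_pdist(student_emb)
    return F.smooth_l1_loss(student_d, teacher_d)
```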
Advanced Methods
- Fisher KD: Uses Fisher Information to weight important parameters (see the background sketch after this list)
- Adaptive KD: Dynamically adjusts distillation based on sample difficulty
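The exact Fisher KD formulation is specific to this repo. As general background only, the sketch below shows one common way Fisher information is used: estimating a diagonal (empirical) Fisher from squared gradients and using it to weight per-parameter importance against a same-shape anchor (e.g., a pretrained checkpoint of the student). Function names and the weighting scheme are assumptions, not the repo's implementation.

```python
import torch

def diagonal_fisher(model, data_loader, loss_fn, n_batches: int = 10) -> dict:
    """Estimate a diagonal empirical Fisher from squared per-parameter gradients."""
    fisher = {name: torch.zeros_like(p)
              for name, p in model.named_parameters() if p.requires_grad}
    model.eval()
    for i, (inputs, targets) in enumerate(data_loader):
        if i >= n_batches:
            break
        model.zero_grad()
        loss_fn(model(inputs), targets).backward()
        for name, p in model.named_parameters():
            if p.grad is not None:
                fisher[name] += p.grad.detach() ** 2
    return {name: f / n_batches for name, f in fisher.items()}

def fisher_weighted_penalty(student, anchor_params: dict, fisher: dict):
    """Penalize deviation from anchor parameters, weighted by Fisher importance (assumed usage)."""
    penalty = 0.0
    for name, p in student.named_parameters():
        if name in fisher:
            penalty = penalty + (fisher[name] * (p - anchor_params[name]) ** 2).sum()
    return penalty
```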
Implementation Tips
- Temperature tuning is critical (usually 3-20)
- Layer mapping between different architectures requires care
- Combine multiple methods (e.g., a logit loss plus a feature loss) for best results, as in the sketch below
- Monitor both accuracy and efficiency metrics
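Putting the tips together, here is a hedged sketch of a combined objective mixing hard-label cross-entropy, the softened-logit term, and an optional feature term. The weights alpha and beta and the temperature of 4 are illustrative starting points, not recommended values from FisherKD-Unified.

```python
import torch
import torch.nn.functional as F

def combined_distillation_loss(student_logits, teacher_logits, labels,
                               temperature: float = 4.0, alpha: float = 0.7,
                               beta: float = 1.0, feature_term: torch.Tensor = None):
    """Weighted mix of hard-label CE, softened-logit KD, and an optional feature loss."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * temperature ** 2
    total = (1.0 - alpha) * ce + alpha * kd
    if feature_term is not None:
        total = total + beta * feature_term  # e.g., a FitNets or RKD term from the sketches above
    return total
```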
FisherKD-Unified implements all these methods in a unified framework for easy comparison.