Knowledge Distillation Methods Comparison

Advanced
Last updated: January 15, 2026, 08:45
Created: January 15, 2026

Comprehensive comparison of 12+ knowledge distillation methods with implementation details

machine-learning knowledge-distillation model-compression


Overview

Knowledge Distillation (KD) trains a compact student model to mimic a larger teacher model, retaining most of the teacher's performance at a fraction of the inference cost.

Method Categories

1. Logit-Based Methods

Standard KD (Hinton et al., 2015)

\[L_{KD} = \alpha L_{CE}(y, p_s) + (1-\alpha) T^2 KL(p_t^T || p_s^T)\]

Where:
- \(p_t^T, p_s^T\): softened teacher/student outputs at temperature \(T\)
- \(\alpha\): balancing coefficient

import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soften both distributions with temperature T
    soft_targets = F.softmax(teacher_logits / T, dim=-1)
    soft_student = F.log_softmax(student_logits / T, dim=-1)

    # KL term is scaled by T^2 so gradient magnitudes stay comparable across temperatures
    distill_loss = F.kl_div(soft_student, soft_targets, reduction='batchmean') * T * T
    ce_loss = F.cross_entropy(student_logits, labels)

    return alpha * ce_loss + (1 - alpha) * distill_loss

DKD (Decoupled KD)

Separates target-class knowledge distillation (TCKD), which transfers the teacher's confidence in the ground-truth class, from non-target-class knowledge distillation (NCKD), which transfers the teacher's relative ranking of the remaining classes, and weights the two terms independently.
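A minimal sketch of how the decoupled loss could be computed, assuming integer class labels and logit tensors of shape (batch, classes); the alpha/beta weights and the small-epsilon KL helper are illustrative, not the reference implementation.

import torch
import torch.nn.functional as F

def _kl(p, q, eps=1e-8):
    # KL(p || q) for probability tensors; sum over classes, mean over the batch
    return (p * ((p + eps).log() - (q + eps).log())).sum(-1).mean()

def dkd_loss(student_logits, teacher_logits, labels, T=4.0, alpha=1.0, beta=8.0):
    # Boolean mask marking the ground-truth (target) class of each sample
    mask = F.one_hot(labels, num_classes=student_logits.size(-1)).bool()

    p_s = F.softmax(student_logits / T, dim=-1)
    p_t = F.softmax(teacher_logits / T, dim=-1)

    # TCKD: compare the binary (target vs. all non-target) distributions
    p_s_bin = torch.stack([(p_s * mask).sum(-1), (p_s * ~mask).sum(-1)], dim=-1)
    p_t_bin = torch.stack([(p_t * mask).sum(-1), (p_t * ~mask).sum(-1)], dim=-1)
    tckd = _kl(p_t_bin, p_s_bin) * T * T

    # NCKD: compare the distributions renormalized over the non-target classes only
    p_s_nt = (p_s * ~mask) / (p_s * ~mask).sum(-1, keepdim=True)
    p_t_nt = (p_t * ~mask) / (p_t * ~mask).sum(-1, keepdim=True)
    nckd = _kl(p_t_nt, p_s_nt) * T * T

    return alpha * tckd + beta * nckd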

2. Feature-Based Methods

FitNets

Match intermediate representations:

\[L_{FitNet} = ||W_s \cdot F_s - F_t||^2\]

Where \(W_s\) is a learnable projection matrix.
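A minimal sketch of the hint loss, assuming 4-D convolutional feature maps; the 1x1 convolution plays the role of the learnable projection \(W_s\), and the class and parameter names are illustrative.

import torch.nn as nn
import torch.nn.functional as F

class HintLoss(nn.Module):
    """FitNets-style hint loss: project student features, then match the teacher's."""
    def __init__(self, student_channels, teacher_channels):
        super().__init__()
        # Learnable projection W_s implemented as a 1x1 convolution
        self.proj = nn.Conv2d(student_channels, teacher_channels, kernel_size=1)

    def forward(self, f_s, f_t):
        # f_s: (B, C_s, H, W) student feature map, f_t: (B, C_t, H, W) teacher feature map
        return F.mse_loss(self.proj(f_s), f_t)

If the matched feature maps differ in spatial size, an interpolation or pooling step is needed before the MSE.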

Attention Transfer (AT)

Transfer spatial attention maps, where \(Q^j\) denotes the vectorized attention map of the \(j\)-th matched layer pair (typically the channel-wise sum or mean of squared activations):

\[L_{AT} = \sum_j ||\frac{Q_s^j}{||Q_s^j||_2} - \frac{Q_t^j}{||Q_t^j||_2}||_2\]
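A sketch of the attention-transfer loss under the assumption that \(Q^j\) is the mean of squared activations over channels and that matched student/teacher maps share spatial size; function names are illustrative.

import torch.nn.functional as F

def attention_map(feature):
    # Channel-wise mean of squared activations, flattened and L2-normalized
    # feature: (B, C, H, W) -> (B, H*W)
    att = feature.pow(2).mean(dim=1).flatten(1)
    return F.normalize(att, p=2, dim=1)

def at_loss(feats_s, feats_t):
    # feats_s / feats_t: lists of feature maps from matched student/teacher layers
    return sum((attention_map(fs) - attention_map(ft)).norm(p=2, dim=1).mean()
               for fs, ft in zip(feats_s, feats_t))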

3. Relation-Based Methods

RKD (Relational KD)

Transfers relationships between samples rather than individual outputs: a distance term matches pairwise distances between embeddings, and an angle term matches the angles formed by embedding triplets:

\[L_{RKD} = L_{distance} + L_{angle}\]
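A sketch of the distance-wise term, assuming f_s and f_t are (batch, dim) embeddings; pairwise distances are normalized by their mean so the scales of the two networks are comparable. The angle term, omitted here for brevity, applies an analogous penalty to the cosines of angles formed by embedding triplets.

import torch
import torch.nn.functional as F

def rkd_distance_loss(f_s, f_t, eps=1e-8):
    # Pairwise Euclidean distances within the batch, for student and teacher embeddings
    d_s = torch.cdist(f_s, f_s, p=2)
    d_t = torch.cdist(f_t, f_t, p=2)

    # Normalize by the mean non-zero distance so the relational structures are scale-free
    d_s = d_s / (d_s[d_s > 0].mean() + eps)
    d_t = d_t / (d_t[d_t > 0].mean() + eps)

    # Huber (smooth L1) penalty on the mismatch between the two relational structures
    return F.smooth_l1_loss(d_s, d_t)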

CRD (Contrastive Representation Distillation)

Uses an InfoNCE-style contrastive objective that pulls each student embedding toward its matching teacher embedding \(f_t^+\) and pushes it away from the teacher embeddings \(f_t^i\) of other samples:

\[L_{CRD} = -\log \frac{\exp(f_s \cdot f_t^+ / \tau)}{\sum_i \exp(f_s \cdot f_t^i / \tau)}\]
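The original CRD maintains a memory bank of negatives; the sketch below uses the other samples in the batch as negatives instead, which keeps the contrastive objective of the formula but is a simplification of the full method.

import torch
import torch.nn.functional as F

def crd_inbatch_loss(f_s, f_t, tau=0.07):
    # L2-normalize student and teacher embeddings
    f_s = F.normalize(f_s, dim=1)
    f_t = F.normalize(f_t, dim=1)

    # Similarity of each student embedding to every teacher embedding in the batch
    logits = f_s @ f_t.t() / tau          # (B, B)

    # The positive pair is the matching teacher embedding (the diagonal)
    targets = torch.arange(f_s.size(0), device=f_s.device)
    return F.cross_entropy(logits, targets)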

4. Fisher-Based Methods

Fisher KD

Weights parameters by Fisher Information:

\[L_{Fisher} = \sum_i \mathcal{F}_i (\theta_s^i - \theta_t^i)^2\]

The Fisher Information \(\mathcal{F}_i\) (commonly estimated from the squared gradients of the log-likelihood) measures how sensitive the model's output is to parameter \(i\), i.e., its importance.
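A rough sketch under the assumption that student and teacher share the same architecture (so parameters align by name) and that the diagonal Fisher estimate is precomputed from squared gradients; function and argument names are illustrative.

import torch

def diagonal_fisher(model, data_loader, loss_fn):
    # Accumulate squared gradients as a diagonal approximation of the Fisher Information
    fisher = {n: torch.zeros_like(p) for n, p in model.named_parameters()}
    n_batches = 0
    for x, y in data_loader:
        model.zero_grad()
        loss_fn(model(x), y).backward()
        for n, p in model.named_parameters():
            if p.grad is not None:
                fisher[n] += p.grad.detach().pow(2)
        n_batches += 1
    return {n: f / max(n_batches, 1) for n, f in fisher.items()}

def fisher_kd_loss(student, teacher, fisher):
    # Fisher-weighted squared distance between aligned student and teacher parameters
    teacher_params = dict(teacher.named_parameters())
    loss = 0.0
    for name, p_s in student.named_parameters():
        p_t = teacher_params[name].detach()
        loss = loss + (fisher[name] * (p_s - p_t).pow(2)).sum()
    return loss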

Comparison Table

Method | Type | Pros | Cons | Best For
Standard KD | Logit | Simple, effective | Requires same output dim | Classification
DKD | Logit | Better separation | More hyperparameters | Fine-grained tasks
FitNets | Feature | Works across architectures | Needs layer mapping | Different archs
AT | Feature | Preserves attention | Limited to attention-based | Transformers
RKD | Relation | Architecture agnostic | Batch size sensitive | Few-shot learning
CRD | Relation | Strong performance | Computationally heavy | Representation learning
Fisher KD | Parameter | Principled selection | Expensive to compute | Critical applications

Implementation Tips

  1. Temperature Selection
     - Start with T=4, tune in range [1, 20]
     - Higher T → softer distributions → more knowledge transfer

  2. Layer Mapping
     - For different architectures, use projections
     - Match layers with similar semantic roles

  3. Loss Balancing
     - Use validation set to tune α
     - Task-specific vs. distillation trade-off

  4. Multi-Method Combination (a usage sketch follows this list)

     total_loss = ce_loss + 0.3 * kd_loss + 0.2 * feature_loss + 0.1 * relation_loss
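As a usage sketch of the combination above: the weights, the assumption that both networks return (logits, features, embeddings), and the hint_loss/relation_loss callables are all illustrative; note that the KL term here is the pure distillation term, with the CE loss added separately (unlike kd_loss earlier, which already folds it in).

import torch
import torch.nn.functional as F

def train_step(student, teacher, x, labels, optimizer, hint_loss, relation_loss, T=4.0):
    # Teacher is frozen; it only provides targets
    with torch.no_grad():
        t_logits, t_feat, t_emb = teacher(x)
    s_logits, s_feat, s_emb = student(x)

    # Temperature-scaled KL term (logit distillation only; CE is added below)
    soft_kl = F.kl_div(F.log_softmax(s_logits / T, dim=-1),
                       F.softmax(t_logits / T, dim=-1),
                       reduction='batchmean') * T * T

    total_loss = (F.cross_entropy(s_logits, labels)
                  + 0.3 * soft_kl
                  + 0.2 * hint_loss(s_feat, t_feat)
                  + 0.1 * relation_loss(s_emb, t_emb))

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()
    return total_loss.item()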

