Improving Knowledge Distillation via Regularizing Feature Norm and Direction

NickTop 2024. 8. 16. 22:50

Overview

This method improves performance by adding an ND loss ($\mathcal{L}_{nd}$) on the penultimate layer (the layer right before the logits).

Motivation

Applying an L2 loss to the penultimate-layer features does not directly improve model performance.

Work on model pruning and domain adaptation points to the importance of large-norm features (norm = magnitude).

For example, model pruning has shown that setting values below a certain threshold to zero does not reduce performance much.

So making feature values larger can be expected to help the model more than making them smaller.

Regularization

In which direction should the values grow?

Toward the class mean.

ND loss = Norm (make the norm larger) and Direction (toward the class mean) loss

$c_k = \frac{1}{|I_k|}\sum_{j \in I_k} f^t_j$

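As defined, the class mean $c_k$ is the average of the teacher features $f^t_j$ over the samples of class $k$.

$I_k$ is the set of indices of the samples that belong to class $k$, and $|I_k|$ is the number of those samples.

The formula could be evaluated per batch, but the authors' code does not recompute the class means on every batch; it computes them in advance, saves them to a file, and reuses them.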
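
The class-mean formula above can be sketched in plain Python. This is a minimal illustration, not the authors' code: the function names `class_means` and `unit_direction` and the toy feature values are hypothetical, and the teacher features are assumed to be already extracted as lists of floats.

```python
from collections import defaultdict
import math


def class_means(features, labels):
    """Average the teacher features of each class: c_k = mean of f^t_j for j in I_k."""
    sums = {}
    counts = defaultdict(int)
    for feat, label in zip(features, labels):
        if label not in sums:
            sums[label] = [0.0] * len(feat)
        # Accumulate the feature vector element-wise for this class.
        sums[label] = [s + x for s, x in zip(sums[label], feat)]
        counts[label] += 1
    return {k: [s / counts[k] for s in total] for k, total in sums.items()}


def unit_direction(c):
    """Normalize a class mean to unit length: e = c / ||c||_2."""
    norm = math.sqrt(sum(x * x for x in c))
    return [x / norm for x in c]


# Toy teacher features (hypothetical values) for two classes.
teacher_features = [[2.0, 0.0], [4.0, 0.0], [0.0, 3.0]]
labels = [0, 0, 1]

means = class_means(teacher_features, labels)
directions = {k: unit_direction(c) for k, c in means.items()}
print(means)
print(directions)
```

Because the teacher is fixed during distillation, a result like `means` only needs to be computed once, which fits the precompute-and-save approach described above.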

If $\| p^t \|_2 > \| p^s \|_2$

Train so that $\| p^t - p^s \|_2$ is minimized.

On its own this value can become too large, so the loss is bounded by dividing by $\| f^t \|_2$.

$\mathcal{L}_{nd} = \frac{\|\mathbf{p}^t - \mathbf{p}^s\|_2}{\|\mathbf{f}^t\|_2} = \frac{\|\mathbf{p}^t\|_2 - \|\mathbf{p}^s\|_2}{\|\mathbf{f}^t\|_2} = 1 - \frac{\mathbf{f}^s \cdot \mathbf{e}}{\|\mathbf{f}^t\|_2} $

$e = \frac{c}{|| c ||_2}$
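
The chain of equalities above can be unpacked step by step. This is only a sketch under assumptions the post does not state explicitly: $p^s$ and $p^t$ are taken to be the projections of $f^s$ and $f^t$ onto $e$, both dot products are taken to be non-negative, and $\theta_t$ (a symbol introduced here) is the angle between $f^t$ and $e$.

$p^s = (f^s \cdot e)\, e, \quad p^t = (f^t \cdot e)\, e$

$\| p^t - p^s \|_2 = | f^t \cdot e - f^s \cdot e | = \| p^t \|_2 - \| p^s \|_2 \quad \text{when } \| p^t \|_2 > \| p^s \|_2$

$\frac{\| p^t \|_2 - \| p^s \|_2}{\| f^t \|_2} = \cos\theta_t - \frac{f^s \cdot e}{\| f^t \|_2}$

Under these assumptions, the last expression equals $1 - \frac{f^s \cdot e}{\| f^t \|_2}$ exactly when $\cos\theta_t = 1$, that is, when the teacher feature points along the class mean. Otherwise the two differ by $1 - \cos\theta_t$, which does not depend on the student, so the gradient with respect to $f^s$ is the same either way.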

If $\| p^t \|_2 \le \| p^s \|_2$

There is no need for the teacher's help to make the norm larger.

Train only toward the class direction, i.e. minimize $\| f^s - p^s \|_2$.

In this case the loss is bounded by dividing by $\| f^s \|_2$.

$\mathcal{L}_{nd} = 1 - \frac{\mathbf{f}^s \cdot \mathbf{e}}{\|\mathbf{f}^s\|_2} $

Combining the two cases:

$\mathcal{L}_{nd} = - \frac{1}{C} \sum_{k=1}^{C} \frac{1}{|I_k|} \sum_{i \in I_k} \frac{\mathbf{f}^s_i \cdot \mathbf{e}_k}{\max \{ \|\mathbf{f}^s_i\|_2, \|\mathbf{f}^t_i\|_2 \}} $

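The 1 is a constant, so its gradient is zero and it can be dropped.

(In the final formula, the loss should grow when the norms of $f^t$ and $f^s$ differ a lot, but it seems to move in the decreasing direction instead, so it does not seem to match the explanation.)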
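
To see how the combined formula behaves numerically, here is a minimal plain-Python sketch. It is not the authors' implementation: the names `nd_loss`, `l2_norm`, and `dot` are hypothetical, `directions` is assumed to map each class label to its unit vector $e_k$, and the average is taken over the classes present in the batch rather than over all $C$ classes.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def dot(a, b):
    """Dot product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def nd_loss(student_feats, teacher_feats, labels, directions):
    """Combined ND loss: negative class-average of
    (f^s . e_k) / max(||f^s||_2, ||f^t||_2), averaged per class first."""
    per_class = {}
    for f_s, f_t, k in zip(student_feats, teacher_feats, labels):
        # The max picks the larger of the two feature norms as the divisor.
        scale = max(l2_norm(f_s), l2_norm(f_t))
        per_class.setdefault(k, []).append(dot(f_s, directions[k]) / scale)
    # Assumption: average over classes that appear in this batch.
    class_terms = [sum(vals) / len(vals) for vals in per_class.values()]
    return -sum(class_terms) / len(class_terms)


# Toy setup (hypothetical values): one class whose unit direction is the x-axis.
directions = {0: [1.0, 0.0]}
teacher = [4.0, 0.0]

for student in ([1.0, 0.0], [2.0, 0.0], [4.0, 0.0], [8.0, 0.0], [3.0, 4.0]):
    loss = nd_loss([student], [teacher], [0], directions)
    print(student, round(loss, 4))
```

In this toy run the loss is -0.25, -0.5, and -1.0 as the student feature grows along $e$ up to the teacher's norm, stays at -1.0 once the student norm exceeds it, and is -0.6 for the misaligned student `[3.0, 4.0]`, where only the direction is penalized.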

Experiment

This loss can be added on top of other distillation methods. Accuracy rises substantially with vanilla KD and slightly with the other methods.
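
Adding the ND term to an existing method can be sketched as below, using the soft-target term of vanilla KD as the base loss. This is an illustration under assumptions: the weight `nd_weight`, the temperature default, and the function names are hypothetical, the hard-label cross-entropy term is left out for brevity, and the ND value is passed in as a plain number.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, shifted by the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, temperature=4.0):
    """Vanilla KD soft-target term: T^2 * KL(teacher || student)."""
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = sum(t * math.log(t / s) for t, s in zip(p_t, p_s) if t > 0)
    return temperature ** 2 * kl


def total_loss(base_loss, nd_loss_value, nd_weight=1.0):
    """Add the ND term on top of any base distillation loss."""
    return base_loss + nd_weight * nd_loss_value


# Toy logits and ND value (hypothetical numbers).
base = kd_loss([2.0, 1.0, 0.1], [3.0, 0.5, 0.2])
print(total_loss(base, nd_loss_value=-0.6, nd_weight=2.0))
```

Keeping the ND term as a separate additive value means the base method does not need to change; only the weight has to be tuned.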

https://arxiv.org/pdf/2305.17007
