[IEEE TMM] DCRP: Class-Aware Feature Diffusion Constraint and Reliable Pseudo-labeling for Imbalanced Semi-Supervised Learning

Yejin Kim 2024. 3. 17. 11:35

This paper was published in IEEE Transactions on Multimedia, and its authors are Xiaoyu Guo, Xiang Wei, Shunli Zhang, Wei Lu, and Weiwei Xing.

Motivation

Imbalanced Semi-supervised Learning (ISSL) faces two major problems.

1. It is difficult to generate reliable pseudo-labels.

2. It is difficult to learn features that are balanced across classes.

This paper proposes a new framework to address both problems.

Method

This is an overview of the DCRP model proposed by the authors. Let us go through it one component at a time.

A. “K+1” for Outlier Elimination and Pseudo-labeling

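Let us first look at the PL loss.

The paper points out the over-confidence problem that the softmax function causes in a model. Outliers typically have low values before softmax. After softmax, however, they often end up with arbitrarily high confidence, because the output depends only on the relative values of the logits.

Figure (a) above visualizes the pseudo-labels of unlabeled training data under conventional K-class classification. The horizontal axis is the confidence and the vertical axis is the logit value before softmax. The data inside the rectangular box can be read as samples that have a low maximum logit but receive a high confidence through softmax. Figure (a) shows that unlabeled data with small logit values are more vulnerable to misclassification.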
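
The effect is easy to reproduce in a few lines of Python. This is only a minimal sketch with made-up 10-class logits (the values are assumptions, not data from the paper). Because softmax depends only on the differences between logits, a sample whose largest logit is 2 receives exactly the same confidence as one whose largest logit is 12.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits: every value is small, but one is relatively larger.
low_logits = [2.0] + [-4.0] * 9
# Hypothetical logits for a sample the model responds to strongly.
high_logits = [12.0] + [6.0] * 9

low_conf = max(softmax(low_logits))
high_conf = max(softmax(high_logits))

print(f"max logit {max(low_logits):4.1f} -> confidence {low_conf:.3f}")
print(f"max logit {max(high_logits):4.1f} -> confidence {high_conf:.3f}")
```

Both samples clear a 0.95 confidence threshold, so a threshold on confidence alone cannot tell them apart.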

The InPL paper also looks closely at logit values to detect outliers. The energy-based approach proposed there can remove outliers effectively as well, but because it is non-probabilistic, it has the drawback that choosing an appropriate threshold for outlier detection is difficult.

To solve this problem, the authors convert the K-class classification problem into a (K+1)-class classification problem. This is implemented by appending a single non-parametric scalar to the logits.

$\xi$ is a hyperparameter that controls the sensitivity to outliers. When $\xi$ is large, a sample is more likely to be assigned to the outlier class (the (K+1)-th class), so more data are detected as outliers.

From figure (a) discussed earlier, the authors reasoned that filtering out data whose maximum logit is 5 or below would lower the chance of misclassification. Given FixMatch's default confidence threshold of 0.95, they state that the $\xi$ needed to filter out such data can be computed as roughly $2$, that is, $\log(0.05 \times \exp(5))$.
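
As a quick check of this arithmetic, the sketch below evaluates the quoted expression. It also solves for the value of $\xi$ at which a sample with maximum logit 5 lands exactly on the 0.95 threshold, under the simplifying assumption (mine, not the paper's) that all other class logits are negligible.

```python
import math

max_logit = 5.0    # cutoff on the maximum logit read off figure (a)
threshold = 0.95   # default FixMatch confidence threshold

# Expression quoted in the text: log(0.05 * exp(5)).
xi_quoted = math.log((1.0 - threshold) * math.exp(max_logit))

# Assumption: only the top logit and xi contribute to the softmax sum.
# Solving exp(z) / (exp(z) + exp(xi)) = threshold for xi gives:
xi_boundary = max_logit + math.log((1.0 - threshold) / threshold)

print(f"quoted expression:                   {xi_quoted:.3f}")
print(f"exact boundary under the assumption: {xi_boundary:.3f}")
```

Both values come out close to 2, which is consistent with the figure quoted above.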

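Softmax is applied to the resulting (K+1)-class logits,

and the (K+1)-th dimension is then removed to return to a K-class classification problem.

The loss is then reformulated as above using the pseudo-labels generated in this way.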
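
The steps described in this part (append $\xi$, apply softmax, drop the last dimension) can be sketched as follows. The function names are hypothetical, and the sketch assumes that the remaining K probabilities are not renormalized after the outlier dimension is dropped, which the text does not state explicitly. The default $\xi = 1.0$ is the value reported in the experimental setup.

```python
import math

def k_plus_one_confidences(logits, xi):
    """Append xi as the (K+1)-th logit, apply softmax, drop the last entry."""
    extended = list(logits) + [xi]
    m = max(extended)
    exps = [math.exp(z - m) for z in extended]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Assumption: no renormalization, so the K values sum to less than 1.
    return probs[:-1]

def pseudo_label(logits, xi=1.0, threshold=0.95):
    """Return the predicted class index, or None if the sample is rejected."""
    confs = k_plus_one_confidences(logits, xi)
    best = max(range(len(confs)), key=confs.__getitem__)
    return best if confs[best] >= threshold else None

# Hypothetical logits with the same pairwise gaps but different magnitudes.
low_logits = [2.0] + [-4.0] * 9
high_logits = [12.0] + [6.0] * 9

print(pseudo_label(low_logits))   # low maximum logit: rejected
print(pseudo_label(high_logits))  # high maximum logit: kept as class 0
```

Under plain K-class softmax these two samples would get identical confidence. With the extra logit, the sample whose maximum logit is close to $\xi$ loses a large share of its probability mass to the outlier class and falls below the threshold.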

B. Class-Aware Feature Diffusion Constraint

Next is the FDC loss.

As in the figure above, the FDC loss addresses the problem that majority classes occupy far too large a region of the feature space compared with minority classes. The authors aim to build a balanced classifier by constraining the majority classes to occupy a relatively smaller region of the feature space.

The form of the FDC loss is shown above. The loss reduces the distance between the feature obtained by passing strongly augmented unlabeled data through the MLP $\omega$ and the feature of the weakly augmented data. The distance is computed based on cosine similarity. $\eta_i$ is a class-specific variable that controls how far the data of each class may diffuse. Its value is determined as follows.

$\mu$ is a hyperparameter and $\mathbb{P}$ is the reversed normalized class-number vector, meaning that majority classes get smaller values of $\mathbb{P}$. As a result, $\eta_i$ becomes smaller when $u_i$ belongs to a majority class, which constrains the distance between the weak and strong augmentations to be smaller.
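
To make the role of $\eta_i$ concrete, here is a rough sketch of a cosine-based distance combined with a class-dependent allowed range. The hinge form `max(0, distance - eta)`, the function names, and the numeric values of `eta` are all illustrative assumptions. They are not the paper's equation, which is not reproduced in this post.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def range_penalty(weak_feat, strong_feat, eta):
    """Hypothetical hinge: only distance beyond the allowed range eta is penalized."""
    return max(0.0, cosine_distance(weak_feat, strong_feat) - eta)

# Toy 2-D features: the strong view sits 45 degrees away from the weak view.
weak = [1.0, 0.0]
strong = [1.0, 1.0]

d = cosine_distance(weak, strong)
print(f"cosine distance: {d:.3f}")
print(f"penalty with small eta (majority-like): {range_penalty(weak, strong, 0.1):.3f}")
print(f"penalty with large eta (minority-like): {range_penalty(weak, strong, 0.4):.3f}")
```

In this toy setup the same pair of features is penalized when the allowed range is small and left alone when it is large, which mirrors the intuition that majority classes are held to a tighter diffusion range.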

C. Additional Helpful Techniques

The framework also uses a simple re-weighting technique to build a balanced branch.

$\mathbb{P}$ is also applied to the PL loss described in part A to increase the influence of minority classes on the loss, and it is applied in the same way to the labeled loss and the unlabeled loss.

In addition to the pseudo-label loss, the framework uses an extra unlabeled loss. Even for data that do not receive a pseudo-label, a consistency loss requiring the predictions for the weakly and strongly augmented versions of a sample to agree can still be applied. Reliable data that do receive a pseudo-label contribute twice, once through the PL loss and once through the U loss, which preserves the emphasis on reliable unlabeled data.

An adaptive class-specific threshold technique is also used.

$\overleftarrow{\mathbb{P}}$ is the reversed version of $\mathbb{P}$ defined above; that is, it takes larger values for majority classes. $\hat{\tau}$ is the base confidence threshold. With this technique, the more dominant a class is, the higher the confidence threshold for pseudo-labeling.

D. The overall Workflow of DCRP

The final loss is composed as follows.

Experiments

A. Experimental Setup

Experiments were run on CIFAR10/100-LT, SVHN-LT, and Small ImageNet-127 with two codebases. The hyperparameter settings are as follows.

Unlike the optimal value calculated above, the authors say they set $\xi$ to 1.0 for simplicity. They also say $\hat{\tau}$ was made to increase linearly over the course of training.
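
A linearly increasing base threshold can be sketched as below. The start and end values (0.90 and 0.95) and the function name are placeholders chosen for illustration. The actual schedule endpoints are not given in this post.

```python
def linear_threshold(step, total_steps, start=0.90, end=0.95):
    """Linearly interpolate the base threshold from start to end over training.

    start and end are hypothetical placeholder values.
    """
    frac = min(max(step / total_steps, 0.0), 1.0)  # clamp progress to [0, 1]
    return start + (end - start) * frac

for step in (0, 250, 500, 750, 1000):
    print(step, round(linear_threshold(step, total_steps=1000), 4))
```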

B. Experiments on the Codebase of ABC

Following ABC, part B reports the accuracy at the last epoch.

On CIFAR10-LT, DCRP achieves good results in both overall accuracy and minority-class accuracy across all settings.

On CIFAR100-LT, DCRP again shows the best overall accuracy in every setting. This time, however, TRAS has a higher minority-class accuracy than DCRP. The authors argue that TRAS corrects the imbalance more aggressively, which costs it accuracy on the majority classes.

On SVHN as well, DCRP performs best in most settings. Because it also performs well on minority classes, the authors argue that the DCRP algorithm is effective for classifier balancing.

Small ImageNet-127 is a resolution-reduced version of the ImageNet-127 dataset. It consists of 127 coarse classes, which are said to be generated from ImageNet-1K following the top-down hierarchy of WordNet. Because the test set is itself class-imbalanced, the average class recall rate was measured.
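
The reason for preferring average class recall on an imbalanced test set can be seen in a small example. The sketch below uses a made-up two-class test set (the labels and predictions are assumptions for illustration), and the helper name is hypothetical.

```python
from collections import defaultdict

def average_class_recall(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy imbalanced test set: 8 samples of class 0, 2 samples of class 1.
y_true = [0] * 8 + [1] * 2
# The classifier gets every class-0 sample right but only half of class 1.
y_pred = [0] * 8 + [0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
acr = average_class_recall(y_true, y_pred)

print(f"plain accuracy:       {accuracy:.2f}")
print(f"average class recall: {acr:.2f}")
```

Plain accuracy is dominated by the majority class, while average class recall exposes the weak minority-class performance.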

Table Ⅵ shows that DCRP outperforms the other methods by a large margin at both 32 x 32 and 64 x 64 resolution. In particular, its 52.5% accuracy at 32 x 32 is comparable to the performance of the other methods at 64 x 64.

C. Experiments on the Codebase of ACR

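To show the generality of the proposed method, experiments were also run on top of ACR, the most recently published method at the time. Following the ACR paper, the best accuracy is measured, and results are reported for overall accuracy.

The key components of DCRP, "K+1" and FDC, can be integrated into ACR without much difficulty. The DCRP results here are therefore measured with these two key innovations integrated into ACR.

As Table Ⅶ shows, DCRP improves ACR by about 1% under a variety of ISSL conditions.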

D. Systematic Analysis

To examine the effectiveness of the proposed method, the authors show confusion matrices of the predictions together with t-SNE representations. The experiments were run on CIFAR10.

Fig 4 (f) and Fig 5 (f) show that DCRP achieves high test accuracy on minority classes. The corresponding visualizations can be seen in panel (e) of both figures.

The authors also argue that, comparing Fig 5 (c) and (e), the diffusion of the green class, which is a majority class, is considerably reduced. They therefore state that the method can improve performance on minority classes by mitigating the over-representation of majority classes in the feature space.

E. Ablation study

w/ $\mathbb{P} = \mathbb{1}$ in Eq. (7):
Eq. (7) is the expression for $\eta_i$; setting $\mathbb{P} = \mathbb{1}$ in it creates a uniform feature diffusion constraint setting.
w/o $\mathbb{P}$ in Eq. (10):
Eq. (10) is the expression for the PL loss. This setting applies no re-weighting to the PL loss.
w/o Eq. (11):
Eq. (11) is the expression for the U loss. This setting does not use the unsupervised re-weighting loss.

Table Ⅷ leads to the following conclusions.

1. The "K+1" strategy yields a substantial performance gain when combined with the class-aware feature diffusion constraint.

2. Applying the diffusion constraint with the same strength to every class is not effective at improving the model's balanced performance.

3. The simple re-weighting loss effectively improves the performance of the dual-branch model.

4. Using Eq. (11) without the "K+1" and FDC components causes a substantial performance drop.

5. The adaptive class-specific threshold helps improve performance further.

6. The two core innovations of DCRP consistently improve performance even when applied to the original FixMatch.

F. Feature Space Analysis of DCRP

As Fig 7 shows, when DCRP is applied to ACR, the divergence between the original data and its strongly augmented version is clearly constrained.

Conclusion

contribution

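Transforms the K-class classification problem into a (K+1)-class classification problem to address the misclassification of outliers. This mitigates the problems caused by misclassified outliers.
Imposes a class-aware feature diffusion constraint on the model's feature extractor. This effectively balances the diversity of the features obtained from strong augmentation.
Demonstrates the effectiveness of the DCRP framework in semi-supervised settings with various imbalance ratios on several ISSL benchmarks.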

limitation

The study assumes that the labeled and unlabeled data follow the same imbalance ratio.
A more refined diffusion range constraint (i.e., $\eta_i$) could make the algorithm more effective, but for simplicity the authors did not explore it extensively.

🎓 Master's student, Department of Artificial Intelligence, Sogang University https://sites.google.com/view/yejin-c-kim/