A journal of IEEE and CAA, publishing high-quality papers in English on original theoretical/experimental research and development in all areas of automation
Volume 11, Issue 11, Nov. 2024

IEEE/CAA Journal of Automatica Sinica

  • JCR Impact Factor: 15.3, Top 1 (SCI Q1)
  • CiteScore: 23.5, Top 2% (Q1)
  • Google Scholar h5-index: 77, Top 5
Citation: Z. Yin, J. Pu, Y. Zhou, and X. Xue, “Two-stage approach for targeted knowledge transfer in self-knowledge distillation,” IEEE/CAA J. Autom. Sinica, vol. 11, no. 11, pp. 2270–2283, Nov. 2024. doi: 10.1109/JAS.2024.124629

Two-Stage Approach for Targeted Knowledge Transfer in Self-Knowledge Distillation

doi: 10.1109/JAS.2024.124629
Funds: This work was supported by the National Natural Science Foundation of China (62176061).
  • Abstract: Knowledge distillation (KD) enhances the generalization of a student network by transferring dark knowledge from a complex teacher network. To reduce computational cost and memory usage, self-knowledge distillation (SKD) extracts dark knowledge from the model itself rather than from an external teacher network. However, previous SKD methods perform distillation indiscriminately on the full dataset, overlooking the analysis of representative samples. In this work, we present a novel two-stage approach that provides targeted knowledge on specific samples, named two-stage approach self-knowledge distillation (TOAST). We first soften the hard targets using class medoids generated from the logit vectors of each class. Then, we iteratively distill the under-trained data with past predictions of half the batch size (see the sketch below). The two-stage knowledge is linearly combined, efficiently enhancing model performance. Extensive experiments on five backbone architectures show that our method is model-agnostic and achieves the best generalization performance. Moreover, TOAST is strongly compatible with existing augmentation-based regularization methods and achieves a speedup of up to 2.95× over a recent state-of-the-art method.
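
The two stages described above can be pictured with a minimal PyTorch-style sketch. Everything named here (class_medoids, toast_like_loss, the temperature tau, and the mixing weights alpha and beta) is an illustrative assumption rather than the paper's implementation: one medoid logit vector is selected per class, the hard targets are softened toward that medoid's distribution, and the highest-loss (under-trained) half of the batch is distilled against past predictions, with the two terms combined linearly.

```python
# Illustrative sketch only; hyper-parameters and selection rules are assumptions.
import torch
import torch.nn.functional as F


def class_medoids(logits: torch.Tensor, labels: torch.Tensor, num_classes: int) -> torch.Tensor:
    """Pick one medoid logit vector per class: the sample whose logits have the
    smallest total L2 distance to every other sample of the same class."""
    medoids = torch.zeros(num_classes, logits.size(1), device=logits.device)
    for c in range(num_classes):
        members = logits[labels == c]                    # (n_c, C)
        if members.numel() == 0:
            continue
        dist_sums = torch.cdist(members, members).sum(dim=1)
        medoids[c] = members[dist_sums.argmin()]
    return medoids


def toast_like_loss(logits, labels, medoids, past_logits, tau=4.0, alpha=0.3, beta=0.3):
    """Linearly combine (1) cross-entropy to targets softened by class medoids
    and (2) distillation of the hardest half of the batch from past predictions."""
    ce = F.cross_entropy(logits, labels, reduction="none")

    # Stage 1: soften the one-hot targets with each label's medoid distribution.
    soft = F.softmax(medoids[labels] / tau, dim=1)
    onehot = F.one_hot(labels, num_classes=medoids.size(0)).float()
    targets = (1.0 - alpha) * onehot + alpha * soft
    stage1 = -(targets * F.log_softmax(logits, dim=1)).sum(dim=1).mean()

    # Stage 2: distill the under-trained (highest-loss) half of the batch
    # against the cached predictions from a previous iteration.
    k = max(1, logits.size(0) // 2)
    idx = ce.topk(k).indices
    stage2 = F.kl_div(
        F.log_softmax(logits[idx] / tau, dim=1),
        F.softmax(past_logits[idx] / tau, dim=1),
        reduction="batchmean",
    ) * tau ** 2

    return stage1 + beta * stage2
```

In a training loop, past_logits would be the detached outputs cached from a previous iteration, and the medoids could be refreshed periodically as the logit space evolves; both details are left out here for brevity.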

     

  • [1]
    P. Chen, S. Liu, H. Zhao, and J. Jia, “Distilling knowledge via knowledge review,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 5008–5017.
    [2]
    J. Gou, B. Yu, S. J. Maybank, and D. Tao, “Knowledge distillation: A survey,” Int. J. Comput. Vis., vol. 129, pp. 1789–1819, 2021. doi: 10.1007/s11263-021-01453-z
    [3]
    G. Hinton, O. Vinyals, and J. Dean, “Distilling the knowledge in a neural network,” arXiv preprint arXiv: 1503.02531, 2015.
    [4]
    A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio, “Fitnets: Hints for thin deep nets,” arXiv preprint arXiv: 1412.6550, 2014.
    [5]
    N. Komodakis and S. Zagoruyko, “Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer,” in Proc. Int. Conf. Learn. Represent., 2017.
    [6]
    Y. Shen, L. Xu, Y. Yang, Y. Li, and Y. Guo, “Self-distillation from the last mini-batch for consistency regularization,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 11943–11952.
    [7]
    P. Dong, L. Li, and Z. Wei, “DisWOT: Student architecture search for distillation without training,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2023, pp. 11898–11908.
    [8]
    Z. Huang, X. Shen, J. Xing, T. Liu, X. Tian, H. Li, B. Deng, J. Huang, and X.-S. Hua, “Revisiting knowledge distillation: An inheritance and exploration framework,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2021, pp. 3579–3588.
    [9]
    L. Li, “Self-regulated feature learning via teacher-free feature distillation,” in Proc. Eur. Conf. Comput. Vis. Springer, 2022, pp. 347–363.
    [10]
    S. Yun, J. Park, K. Lee, and J. Shin, “Regularizing class-wise predictions via self-knowledge distillation,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 13876–13885.
    [11]
    C. Yang, Z. An, H. Zhou, L. Cai, X. Zhi, J. Wu, Y. Xu, and Q. Zhang, “MixSKD: Self-knowledge distillation from mixup for image recognition,” in Proc. Eur. Conf. Comput. Vis. Springer, 2022, pp. 534–551.
    [12]
    L. Zhang, J. Song, A. Gao, J. Chen, C. Bao, and K. Ma, “Be your own teacher: Improve the performance of convolutional neural networks via self distillation,” in Proc. IEEE/CVF Int. Conf. Comput. Vis., 2019, pp. 3713–3722.
    [13]
    K. Kim, B. Ji, D. Yoon, and S. Hwang, “Self-knowledge distillation with progressive refinement of targets,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2021, pp. 6567–6576.
    [14]
    C. Buciluǎ, R. Caruana, and A. Niculescu-Mizil, “Model compression,” in Proc. 12th ACM SIGKDD Int. Conf. Knowl. Discov. Data Min., 2006, pp. 535–541.
    [15]
    J. Gou, L. Sun, B. Yu, S. Wan, W. Ou, and Z. Yi, “Multilevel attentionbased sample correlations for knowledge distillation,” IEEE Trans. Ind. Informat., vol. 19, no. 5, pp. 7099–7109, 2022.
    [16]
    Y. Zhang, T. Xiang, T. M. Hospedales, and H. Lu, “Deep mutual learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2018, pp. 4320–4328.
    [17]
    J. Kim, M. Hyun, I. Chung, and N. Kwak, “Feature fusion for online mutual knowledge distillation,” in Proc. IEEE 25th Int. Conf. Pattern Recognit., 2021, pp. 4619–4625.
    [18]
    J. Gou, L. Sun, B. Yu, L. Du, K. Ramamohanarao, and D. Tao, “Collaborative knowledge distillation via multiknowledge transfer,” IEEE Trans. Neural Netw. Learn. Syst., vol. 35, no. 5, pp. 6718–6730, 2022.
    [19]
    L. Yuan, F. E. Tay, G. Li, T. Wang, and J. Feng, “Revisiting knowledge distillation via label smoothing regularization,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 3903–3911.
    [20]
    J. Wang, P. Zhang, and Y. Li, “Memory-replay knowledge distillation,” Sensors, vol. 21, no. 8, p. 2792, 2021. doi: 10.3390/s21082792
    [21]
    M. Caron, P. Bojanowski, A. Joulin, and M. Douze, “Deep clustering for unsupervised learning of visual features,” in Proc. Eur. Conf. Comput. Vis., 2018, pp. 132–149.
    [22]
    X. Zhan, J. Xie, Z. Liu, Y.-S. Ong, and C. C. Loy, “Online deep clustering for unsupervised representation learning,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2020, pp. 6688–6697.
    [23]
    R. K. Keser, A. Ayanzadeh, O. A. Aghdam, C. Kilcioglu, B. U. Toreyin, and N. K. Ure, “Pursuhint: In search of informative hint points based on layer clustering for knowledge distillation,” Expert Syst. Appl., vol. 213, p. 119040, 2023. doi: 10.1016/j.eswa.2022.119040
    [24]
    J. O’Neill and S. Dutta, “Improved vector quantization for dense retrieval with contrastive distillation,” in Proc. 46th Int. ACM SIGIR Conf. Res. Dev. Inf. Retrieval, 2023, pp. 2072–2076.
    [25]
    C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 2818–2826.
    [26]
    M. Mohri, A. Rostamizadeh, and A. Talwalkar, Foundations of Machine Learning. MIT press, 2018.
    [27]
    E. Hoffer, I. Hubara, and D. Soudry, “Train longer, generalize better: Closing the generalization gap in large batch training of neural networks,” in Proc. AAAI Conf. Artif. Intell., vol. 30, 2017.
    [28]
    P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He, “Accurate, large minibatch SGD: Training imagenet in 1 hour,” arXiv preprint arXiv: 1706.02677, 2017.
    [29]
    Y. Fan, S. Lyu, Y. Ying, and B. Hu, “Learning with average top-k loss,” in Proc. Adv. Neural Inf. Process. Syst., vol. 30, 2017.
    [30]
    A. Krizhevsky and G. Hinton, “Learning multiple layers of features from tiny images,” Tech. Rep., 2009.
    [31]
    N. Sharma, V. Jain, and A. Mishra, “An analysis of convolutional neural networks for image classification,” Procedia Comput. Sci., vol. 132, pp. 377–384, 2018. doi: 10.1016/j.procs.2018.05.198
    [32]
    S. Yun, D. Han, S. J. Oh, S. Chun, J. Choe, and Y. Yoo, “Cutmix: Regularization strategy to train strong classifiers with localizable features,” in Proc. IEEE/CVF Int. Conf. Comput. Vis, 2019, pp. 6023–6032.
    [33]
    G. Xu, Z. Liu, X. Li, and C. C. Loy, “Knowledge distillation meets self-supervision,” in Proc. Eur. Conf. Comput. Vis. Springer, 2020, pp. 588–604.
    [34]
    C.-B. Zhang, P.-T. Jiang, Q. Hou, Y. Wei, Q. Han, Z. Li, and M.-M. Cheng, “Delving deep into label smoothing,” IEEE Trans. Image Process., vol. 30, pp. 5984–5996, 2021. doi: 10.1109/TIP.2021.3089942
    [35]
    K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2016, pp. 770–778.
    [36]
    S. Zagoruyko and N. Komodakis, “Wide residual networks,” arXiv preprint arXiv: 1605.07146, 2016.
    [37]
    G. Huang, Z. Liu, L. Van Der Maaten, and K. Q. Weinberger, “Densely connected convolutional networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 4700–4708.
    [38]
    S. Xie, R. Girshick, P. Dollár, Z. Tu, and K. He, “Aggregated residual transformations for deep neural networks,” in Proc. IEEE Conf. Comput. Vis. Pattern Recognit., 2017, pp. 1492–1500.
    [39]
    M. Liu, Y. Yu, Z. Ji, J. Han, and Z. Zhang, “Tolerant self-distillation for image classification,” Neural Netw., p. 106215, 2024.
    [40]
    P. Liang, W. Zhang, J. Wang, and Y. Guo, “Neighbor self-knowledge distillation,” Inf. Sci., vol. 654, p. 119859, 2024. doi: 10.1016/j.ins.2023.119859
    [41]
    M. P. Naeini, G. Cooper, and M. Hauskrecht, “Obtaining well calibrated probabilities using bayesian binning,” in Proc. AAAI Conf. Artif. Intell., 2015.
    [42]
    Y. Geifman, G. Uziel, and R. El-Yaniv, “Bias-reduced uncertainty estimation for deep neural classifiers,” arXiv preprint arXiv: 1805.08206, 2018.
    [43]
    T. DeVries and G. W. Taylor, “Improved regularization of convolutional neural networks with cutout,” arXiv preprint arXiv: 1708.04552, 2017.
    [44]
    S. Irandoust, T. Durand, Y. Rakhmangulova, W. Zi, and H. Hajimirsadeghi, “Training a vision transformer from scratch in less than 24 hours with 1 GPU,” in Has It Trained Yet? NeurIPS 2022 Workshop, 2022. [Online]. Available: https://openreview.net/forum?id=sG0I6AIvq3
    [45]
    C. Yang, H. Zhou, Z. An, X. Jiang, Y. Xu, and Q. Zhang, “Cross-image relational knowledge distillation for semantic segmentation,” in Proc. IEEE/CVF Conf. Comput. Vis. Pattern Recognit., 2022, pp. 12319–12328.
    [46]
    L. Van der Maaten and G. Hinton, “Visualizing data using t-SNE,” J. Mach. Learn. Res., vol. 9, no. 11, 2008.
    [47]
    J. Xie, W. Kong, S. Xia, G. Wang, and X. Gao, “An efficient spectral clustering algorithm based on granular-ball,” IEEE Trans. Knowl. Data Eng., vol. 35, no. 9, pp. 9743–9753, 2023.
    [48]
    R. Zhang, H. Peng, Y. Dou, J. Wu, Q. Sun, Y. Li, J. Zhang, and P. S. Yu, “Automating dbscan via deep reinforcement learning,” in Proc. 31st ACM Int. Conf. Inf. Knowl. Manag., 2022, pp. 2620–2630.
    [49]
    L. Manduchi, M. Vandenhirtz, A. Ryser, and J. Vogt, “Tree variational autoencoders,” in Proc. Adv. Neural Inf. Process. Syst., vol. 36, 2023, pp. 54952–54986.

    Highlights

    • Propose a novel two-stage self-knowledge distillation approach for selective dark knowledge transfer
    • Generate class medoids from logit vectors to represent typical samples per class
    • Distill under-trained data using past predictions of half the batch size
    • Experiments show wide compatibility and state-of-the-art generalization
