Improved Knowledge Distillation with Dynamic Network Pruning

Eren Şener; Emre Akbaş

doi:10.29109/gujsc.1141648


Abstract

Deploying convolutional neural networks on mobile or embedded devices is often hindered by limited memory and computational resources. This is particularly problematic for the most successful networks, which tend to be very large and have long inference times. Many approaches have been developed for compressing neural networks, based on pruning, regularization, quantization or distillation. In this paper, we propose “Knowledge Distillation with Dynamic Pruning” (KDDP), which trains a dynamically pruned, compact student network under the guidance of a large teacher network. In KDDP, we train the student network with supervision from the teacher network while applying L1 regularization to the neuron activations in a fully-connected layer. Subsequently, we prune the inactive neurons. Our method thus determines the final size of the student model automatically. We evaluate the compression rate and accuracy of the resulting networks on an image classification dataset and compare them with results obtained by Knowledge Distillation (KD). Compared to KD, our method yields higher accuracy and more compact models.
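The training signal and the pruning step described in the abstract can be illustrated with a minimal, framework-free sketch. It is not the authors' implementation: the loss weighting (`alpha`, `l1_weight`), the temperature, the pruning `threshold`, and all function names are assumptions introduced here for illustration, since the abstract does not specify them.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kddp_loss(student_logits, teacher_logits, label, fc_activations,
              temperature=4.0, alpha=0.5, l1_weight=1e-3):
    """Assumed combined loss: hard-label cross-entropy, soft-target
    cross-entropy against the teacher, and an L1 penalty on the
    activations of one fully-connected layer of the student."""
    hard = -math.log(softmax(student_logits)[label])
    q_teacher = softmax(teacher_logits, temperature)
    q_student = softmax(student_logits, temperature)
    soft = -sum(t * math.log(s) for t, s in zip(q_teacher, q_student))
    l1 = sum(abs(a) for a in fc_activations)
    return (1 - alpha) * hard + alpha * soft + l1_weight * l1

def inactive_neurons(activation_batch, threshold=1e-3):
    """Indices of neurons whose mean |activation| over a batch is below
    the (assumed) threshold."""
    count = len(activation_batch)
    width = len(activation_batch[0])
    means = [sum(abs(row[j]) for row in activation_batch) / count
             for j in range(width)]
    return [j for j, m in enumerate(means) if m < threshold]

def prune_layer(weights_in, weights_out, drop):
    """Remove pruned neurons from a fully-connected layer.
    weights_in is (inputs x neurons); weights_out is (neurons x outputs)."""
    dropped = set(drop)
    keep = [j for j in range(len(weights_out)) if j not in dropped]
    new_in = [[row[j] for j in keep] for row in weights_in]
    new_out = [weights_out[j] for j in keep]
    return new_in, new_out

# Usage: neurons 0 and 2 are (almost) never active on this toy batch.
batch = [[0.0, 1.0, 0.0], [0.0, 2.0, 0.0005]]
drop = inactive_neurons(batch)  # [0, 2]
w_in, w_out = prune_layer([[1, 2, 3], [4, 5, 6]],
                          [[7, 8], [9, 10], [11, 12]], drop)
```

Because the L1 term drives many activations toward zero during training, the number of neurons that survive `inactive_neurons` is not fixed in advance, which is one way to read the claim that the student's final size is determined automatically.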

Details

Primary Language

English

Subjects

Engineering

Journal Section

Research Article

Authors

Eren Şener
0000-0002-0612-6451
Türkiye

Emre Akbaş*
0000-0002-3760-6722
Türkiye

Publication Date

September 30, 2022

Submission Date

July 6, 2022

Acceptance Date

August 31, 2022

Published in Issue

Year 2022 Volume: 10 Number: 3

DOI

https://doi.org/10.29109/gujsc.1141648

IZ

https://izlik.org/JA27JY43GC

Cite

APA

Şener, E., & Akbaş, E. (2022). Improved Knowledge Distillation with Dynamic Network Pruning. Gazi Üniversitesi Fen Bilimleri Dergisi Part C: Tasarım Ve Teknoloji, 10(3), 650-665. https://doi.org/10.29109/gujsc.1141648
