Improved Knowledge Distillation with Dynamic Network Pruning

Deploying convolutional neural networks to mobile or embedded devices is often prohibited by limited memory and computational resources. This is particularly problematic for the most successful networks, which tend to be very large and require long inference times. Many approaches have been developed for compressing neural networks based on pruning, regularization, quantization or distillation. In this paper, we propose "Knowledge Distillation with Dynamic Pruning" (KDDP), a method that trains a dynamically pruned compact student network under the guidance of a large teacher network. In KDDP, we train the student network with supervision from the teacher network while applying L1 regularization on the neuron activations in a fully-connected layer, and subsequently prune inactive neurons. Our method automatically determines the final size of the student model. We evaluate the compression rate and accuracy of the resulting networks on an image classification dataset, and compare them to results obtained by Knowledge Distillation (KD). Compared to KD, our method produces more accurate and more compact models.


INTRODUCTION
Deep neural networks have enabled many applications in a diverse set of domains including vision, language, medicine and robotics. However, these models require large amounts of processing power and memory, which severely limits their deployability on resource-constrained computers. New possibilities would emerge if such models could be deployed on embedded platforms, mobile and edge devices. Therefore, research on "neural network compression", that is, reducing the processing and memory requirements of neural networks, is important.
Early work on neural network compression aimed to make a large network smaller by removing redundant structures. These can be weights, neurons, blocks, etc. LeCun et al. proposed one of the pioneering network compression methods, Optimal Brain Damage [1], which was followed by many magnitude-based network pruning methods [2,3]. These approaches work by removing weights that are close to zero. To prune more structures and make the models smaller, regularization can be used to enforce sparsity. Han et al. proposed one of the first regularization-based model compression methods [4]. Following this work, other methods [5][6][7] that apply regularization to different structures were also proposed. For convolutional networks, researchers have designed novel convolutional filters that save parameters and decrease redundancy [8][9][10]. Research on low-rank factorization methods [11][12][13] tries to find informative parameters by using matrix or tensor decomposition. Another major body of work [14][15][16] reduces the number of bits used to represent each weight.
A prominent approach to network compression is the "knowledge distillation" (KD) method [17], where a large, cumbersome model called the teacher guides the training of a much smaller model called the student (Figure 1). The student network is trained with a linear combination of two losses: (i) the usual cross-entropy loss coming from the one-hot (or hard) labels of the training set, and (ii) a loss on the "softened" class probabilities output by the teacher, computed via a hyper-parameter called temperature. The aim of "softened" probabilities is to increase the information about the target class by introducing uncertainty into the probability distribution: softened probabilities contain similarity information about different classes, which is absent in the one-hot labels coming from the training set [17]. KD-based methods [18,19] yield good performance on computer vision tasks and have had a significant impact on model compression. However, a major disadvantage of KD is that the user has to specify the student network architecture, and this architecture does not change during or after training. While KD can successfully distill the knowledge of the teacher into the student, we do not know whether the student model is unnecessarily large or smaller than it should be. In this paper, we address this disadvantage by dynamically pruning the student network based on neuron activations, to obtain a more compact student model whose size is determined dynamically and automatically. For pruning, we target the largest fully-connected layer of the student model, which typically contains the largest percentage of neurons in the student model. Specifically, during the KD training, we apply L1 regularization on the activations of the neurons in a selected fully-connected layer of the student model to impose sparsity. Then, we calculate the average activation, over training examples, of each neuron in this layer. We prune neurons having an average activation below a certain threshold by directly removing them from the network. To the best of our knowledge, our compression technique is the first method that combines KD and L1 regularization in this way. We name our method Knowledge Distillation with Dynamic Pruning, or KDDP for short. Since our method can only prune fully-connected layers, its typical targets are Multilayer Perceptrons (MLP) or CNNs with large fully-connected (fc) layers.

Figure 1. Illustration of the standard Knowledge Distillation method
We extensively analyze and compare standard training from scratch, knowledge distillation and our proposed method, KDDP, on the CIFAR10 dataset [20]. Experiments show that our method performs better than the baselines (standard training and KD), while also significantly compressing the student model. Furthermore, we find that setting hyper-parameters is crucial for KD-based methods: the temperature T, the distillation weight α, and the L1 regularization penalty should be tuned to find a good balance between the model size and the classification performance. In summary, when the hyper-parameters are chosen carefully, our method works well.
Our contributions in this paper can be summarized as follows.
• We propose a new dynamic compression method based on KD. It dynamically prunes inactive neurons in selected fully-connected layers of the student network. Unlike KD, our method does not require the final size of the compressed model as input; it is determined dynamically.
• We experimentally analyze our method and compare it against standard training from scratch and KD. We perform extensive experiments on our hyper-parameters to find meaningful relations with the accuracy of the compressed model.
• We test our method on the CIFAR10 dataset. We get better accuracies than both standard training and KD with far fewer parameters.
In the rest of the paper, we first summarize the neural network model compression literature in Section 2. We describe our proposed method and its implementation details in Section 3. Then, we analyze the effectiveness of our approach with experiments performed on the CIFAR10 dataset in Section 4, and finally conclude in Section 5.

BACKGROUND AND RELATED WORK
Here we review the literature on model compression in deep neural networks in two main categories: (i) parameter pruning and sharing, (ii) knowledge distillation. We give more detailed information about parameter pruning and sharing due to its direct relation to our method.

Parameter Pruning And Sharing
Parameter pruning has attracted many researchers since the early development of neural networks due to its effectiveness in reducing model complexity and over-fitting. It has also been shown that pruning redundant parameters from a network improves generalization, which is an important side-effect.
Early works on parameter pruning are Optimal Brain Damage (OBD) [1] and Optimal Brain Surgeon [2]. In these works, the authors remove redundant parameters after sorting them by their saliencies, where saliency is measured using the Hessian of the objective function. More recently, Srinivas and Babu showed that similar neurons, i.e., neurons with similar weight sets, are redundant [3]. Since Hessian computation is expensive, they propose a data-free method that removes such neurons more systematically than OBD.
Most of the follow-up works use sparsity constraints (L0-norm, L1-norm, etc.) in the optimization problem to expose redundancy. Researchers apply these constraints to different elements (e.g., weights, blocks, etc.). Han et al. were among the first to propose a regularization-based model compression method [4]. They apply L2 regularization during the training phase in order to obtain near zero-valued parameters, and then prune all low-weight connections from the network. The deep compression method [15] uses the same procedure as [4] for removing redundant connections; the authors also add quantization and Huffman coding on top of the pruned network to obtain a more compact one.
Recently, redundancy in convolutional networks has also been explored. Lebedev and Lempitsky [21] apply the idea of Optimal Brain Damage [1] to convolutional filters: they impose L2,1-norm regularization on convolution filters and remove, in a group-wise fashion, entries that fall below a threshold. Similarly, Zhou et al. [5] enforce low-rank constraints on tensors and L2,1-norm regularization on the objective function during the training stage to achieve compact CNNs with fewer neurons. Another study which uses the L2,1-norm is the work of Wen et al. [6]: they apply regularization to large baseline models to learn more compact CNNs, and with their structured sparsity method, they regularize filters, channels, filter shapes and layer depth of CNNs. Huang and Wang [22] improve the method of Wen et al. [6] and propose a more general end-to-end method for network pruning. Their method contains a factor that scales the output of specific neurons, groups or blocks, and they apply L1-norm sparsity regularization to these scaling factors.
Structures whose scaling factors fall below a threshold are removed from the network during training. Unlike the previous works, Ullrich et al. [23] base their regularization on the soft weight-sharing method [24]. They compress the weights of the pre-trained model into clusters by fitting mixtures of Gaussians. After retraining the model with new weights concentrated on the cluster means, they obtain a layer-wise-pruned compact network.
There are also methods which focus on sparsity in batch-normalization (BN) layers. Liu et al. [7] add a scaling factor after BN layers; L1 regularization is applied on these scaling factors during training in order to identify redundant filters, and channels with near-zero scaling factors are then pruned. Another recent study [25] uses the method proposed by Beck and Teboulle [26] to enforce sparsity on the γ parameter of the BN operator. During training, this method drives some γ values to zero, which blocks sample-wise (for each sample in the training set) information flow through the corresponding channels. After training is completed, these constant-valued channels are removed from the original network. MorphNet [27] uses a combination of the three ideas above: first, an L1-norm-based regularization of the neurons; second, the multipliers of Howard et al. [28] for reducing floating point operations and model size; and third, the paradigm introduced by Han et al. [4] for retraining the pruned network.
There has also been research on measuring redundancy in networks. Guo et al. present a feedback mechanism named splicing which re-establishes mistakenly removed parameters after the pruning operation [29]; with this work, they show that measuring the redundancy of parameters is an extremely difficult task. Researchers use different techniques for measuring redundancy. In [30], the L1-norm of each kernel is calculated; after sorting kernels by their L1-norm values, kernels with small values and the corresponding feature maps are pruned. ThiNet performs filter-level pruning based on statistics computed from the following layer rather than the current layer [31]. Despite their success, the compression rate of the filters has to be predefined, which is itself a difficult problem for pruning methods. He et al. exploit feature maps to identify redundancy [32]: they select the most representative channels of the feature maps and prune the redundant ones. After pruning, in order not to damage accuracy, they reconstruct the outputs from the remaining channels using linear least squares.
Recently, several methods have been proposed to measure the importance of structures. Yu et al. show that layer-by-layer network pruning leads to significant reconstruction error propagation [33]. They introduce a global neuron importance measuring algorithm which uses information at the Final Response Layer (FRL, the second-to-last layer before classification). The algorithm obtains the importance of all neurons in the network with a single backward pass after a feature ranking operation on the FRL. Subsequently, the whole network is trimmed, with the pruning ratio per layer treated as a pre-defined hyper-parameter. Prakash et al. propose a novel inter-filter orthogonality metric for ranking filter importance together with a new training strategy [34]. Their method consists of temporarily dropping (some of) the least important convolutional filters (ranked by their metric) and reintroducing the dropped filters with new weights; this process is repeated cyclically. With this strategy, they improve generalization and reduce the overlap of learned features. Unlike traditional deterministic methods, Wang et al. approach pruning the weights of convolutional layers in a probabilistic manner [35]. They specify a pruning probability for each weight group; at each iteration, these probabilities are updated using the L1 norm as an importance criterion for each weight group, and pruning is guided by sampling from the pruning probabilities. He et al. use a novel pruning method instead of norm-based pruning approaches [36]: they calculate the geometric median [37] of the filters within the same layer, and prune the filter(s) closest to the geometric median. In addition to the above works, Dong et al. [38] improve the idea in earlier works [1,2]; their pruning method is based on second-order derivatives of a layer-wise error function.
There are also other recent and novel compression techniques used for pruning. In SplitNet [39], the goal is to find a tree-structured network that contains a set or a hierarchy of sub-networks, where the leaf-level sub-networks are associated with specific groups of classes. Since each group uses a subset of features that is completely disjoint from the ones used by other groups, the splitting algorithm prunes inter-group connections while optimizing the cross-entropy loss and the group regularization. At the end, the weight matrix can be explicitly split into block-diagonal matrices to reduce the number of parameters. Yang et al. approach network compression from the perspective of the network's energy consumption [40]: they sort layers by their energy consumption, and prune small-magnitude weights of the most energy-consuming layers first. Similar to the idea of "network slimming" [7], Zhao et al. modify the BN layer and add a new parameter called channel saliency [41]. They fit approximate Gamma distributions over these channel saliency parameters, and then remove redundant channels whose Gamma distributions have mean and variance below predefined thresholds.

Knowledge Distillation
Knowledge Distillation is a simple way to obtain compact deep learning models. In this method, a large (i.e., cumbersome) network or an ensemble model is trained first. This model is called the "teacher", and it typically produces accurate predictions. Then, a smaller network, called the "student", is trained using guidance from the teacher model. This guidance is obtained by applying a "temperature softmax" to the logits of the teacher. The goal is to provide better training signals for the student model than using only the labels from the dataset. Trained in this way, the final student network was shown to produce results comparable to the teacher's [18].
Ideas similar to knowledge distillation have been explored before. Bucilua et al. approach knowledge transfer from a different point of view [42]: instead of training a neural network on the original small dataset, they use an ensemble of base-level classifiers to label a large unlabeled dataset and then train the network on this much larger dataset. Ba and Caruana propose using an L2 loss on the logits to mimic the teacher network [43].
FitNets [18] use knowledge distillation to train deep and thin student networks that perform on par with or better than the teacher. They achieve this by training some student layers under the teacher's supervision for better initialization. Luo et al. show that using an L2 loss to match the features of the top hidden layers of the teacher and the student is effective [19]. Yim et al. distill knowledge from the teacher by generating a matrix from the feature maps at each layer [44]; they then transfer the knowledge from teacher to student, which has the same depth as the teacher, by applying an L2 loss to these matrices.

Other Approaches
There are also other approaches to neural network compression and pruning that are orthogonal to our method.These include low-rank factorization methods, quantization and binarization methods and methods that aim to obtain compact convolutional filters.

Summary
Given the context of existing work, although L1 regularization is commonly used to enforce sparsity for pruning and compression, it has not previously been applied in the context of KD to obtain a student model whose size is determined dynamically and automatically.

METHOD
Before we present our method in detail, we first describe the knowledge distillation (KD) method [17] for completeness. In KD, there are two models: a teacher and a student. Given a supervised dataset, the teacher model is trained first. Then, the student model is trained using a linear combination of two losses: (i) the regular cross-entropy loss coming from the supervised dataset, and (ii) a "softened" cross-entropy loss coming from the teacher's predictions. To better explain these two losses, let us consider an example input image x with its ground-truth label y, which is a one-hot vector. A neural network outputs a raw score, or logit, z_i quantifying the degree to which the input x belongs to class i. These logits are normalized using the "softmax" function so that the resulting vector can be considered a probability distribution:

$$q_i = \frac{\exp(z_i / T)}{\sum_{j=1}^{C} \exp(z_j / T)}, \qquad (1)$$

where C is the number of classes and T is the "temperature" parameter, which by default equals 1. Then, the cross-entropy between q = [q_1, q_2, ..., q_C] and y is computed as

$$H(y, q) = -\sum_{i=1}^{C} y_i \log q_i. \qquad (2)$$

When T > 1, we call this loss the "softened" cross-entropy, as larger values of T soften the effect of the exponential function in the softmax.
Let $q^{tch,T}$ denote the softmax output of the teacher model with temperature T, and let $q^{std,T}$ be the student's softmax output with temperature T. First, the teacher model is trained to minimize the regular cross-entropy loss:

$$\mathcal{L}_{tch} = H(y, q^{tch,1}). \qquad (3)$$

The knowledge-distillation loss, with which the student model is trained, can then be written as

$$\mathcal{L}_{KD} = (1 - \alpha)\, H(y, q^{std,1}) + \alpha\, H(q^{tch,T}, q^{std,T}). \qquad (4)$$

The first term is the regular cross-entropy loss with the one-hot ground-truth labels. The second term is the cross-entropy between the temperature-softmax outputs; it is this second term that brings in information about class similarities predicted by the teacher. α is a hyper-parameter that adjusts the contribution of the two terms. Figure 1 illustrates the KD method.
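To make Eqs. (1)-(4) concrete, the following is a minimal sketch of how the distillation loss could be computed in Keras/TensorFlow. The function name, default hyper-parameter values and exact weighting are illustrative assumptions, not the paper's reference implementation.

```python
import tensorflow as tf

def kd_loss(y_true, student_logits, teacher_logits, alpha=0.5, temperature=4.0):
    """Sketch of the KD loss in Eq. (4): a weighted sum of the hard-label
    cross-entropy and the soft-label cross-entropy at temperature T."""
    # (i) Regular cross-entropy with the one-hot ground-truth labels (T = 1).
    hard_loss = tf.keras.losses.categorical_crossentropy(
        y_true, student_logits, from_logits=True)

    # (ii) Cross-entropy between the teacher's and the student's
    #      temperature-softened distributions (Eq. (1) with T > 1).
    q_teacher = tf.nn.softmax(teacher_logits / temperature)
    q_student = tf.nn.softmax(student_logits / temperature)
    soft_loss = tf.keras.losses.categorical_crossentropy(q_teacher, q_student)

    # alpha balances the two terms; alpha = 0.5 is the setting used in our experiments.
    return (1.0 - alpha) * hard_loss + alpha * soft_loss
```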
In KD, both the teacher and the student model architectures are determined before training and remain fixed during and after training. Essentially, one has to decide on the size of the student beforehand, and KD attempts to distill the knowledge of the teacher into this student. However, there is no way of knowing the optimal size of the student architecture beforehand. Our method addresses this problem by dynamically pruning (removing) neurons from the student. By doing so, our method both finds an appropriate size for the student model and slightly improves its final accuracy. We describe our method in the following.

Knowledge Distillation With Dynamic Pruning (KDDP)
As in standard KD, we first train the teacher model (or it is provided as an already trained model). Then, we add L1 regularization to the activations of the largest fully-connected layer of the student; let us call this layer fc1. The rationale behind this choice is that this layer typically contains a large percentage (up to 83% in our experiments) of all parameters in the model (Table 1). Next, we train the student using the KD loss defined in Equation (4). After the student is trained, we run it on the training set to calculate the average activation (i.e., output) of each neuron in fc1. If the average activation of a neuron is below 10^-6, we prune (i.e., remove) that neuron and delete the corresponding weights in the next layer. After testing all neurons in fc1, we re-train the pruned student network using Equation (4), this time without any L1 regularization on fc1.
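The pruning step can be sketched as follows. The snippet assumes a Keras student model whose fc1 layer (including its ReLU activation) is named "fc1" and whose next layer is named "fc2"; these names, and the idea of returning sliced weight matrices to initialize a smaller copy of the student, are our own illustrative choices.

```python
import numpy as np
import tensorflow as tf

def prune_fc1(model, x_train, threshold=1e-6):
    """Sketch: average the fc1 activations over the training set and keep only
    the neurons whose mean activation is at least the threshold."""
    # Sub-model that outputs the activations of fc1.
    fc1_model = tf.keras.Model(model.input, model.get_layer("fc1").output)
    mean_act = fc1_model.predict(x_train, batch_size=64).mean(axis=0)
    keep = np.where(mean_act >= threshold)[0]

    # Slice fc1's weights (keep surviving output units) and fc2's weights
    # (keep the corresponding input rows); the bias of fc2 is unchanged.
    w1, b1 = model.get_layer("fc1").get_weights()
    w2, b2 = model.get_layer("fc2").get_weights()
    pruned = {"fc1": [w1[:, keep], b1[keep]], "fc2": [w2[keep, :], b2]}

    # A smaller student with len(keep) fc1 neurons is then built, initialized
    # from these sliced weights, and re-trained with Eq. (4) without the L1 penalty.
    return keep, pruned
```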

Teacher And Student Models
As the teacher, we use a ResNet model [45]; ResNet and its variants have proven successful on many computer vision tasks. Specifically, our teacher model is a ResNet-56, which achieves a 6.97% error rate on CIFAR10 and has 850K learnable parameters. Details of ResNet-56 can be found in the original ResNet paper [45]. We choose our student network to have a very simple architecture in order to analyze the performance of our method efficiently. The student network starts with an input layer for 32×32×3 images, followed by a convolutional layer with 64 filters of size 7×7 and a stride of 1, which produces an output of size 16×16×64. The convolutional layer is followed by a Batch Normalization (BN) layer and a ReLU non-linearity. Next, a max-pooling layer with a 3×3 window and stride 2 produces an output of size 7×7×64. This layer is followed by an identity block of the ResNet architecture [45]. ResNet's identity block is composed of 3 convolutional layers, each followed by a BN and a ReLU layer. The first and third convolutional layers have a kernel size of 1×1, and the middle layer has a kernel size of 3×3. The stride of all convolutions in the identity block is 1, and the number of filters in each layer is 64. Before the ReLU layer of the last convolutional layer inside the block, a skip connection allows information to flow from the initial layers to the later layers by adding the input of the identity block to the output of the last convolutional layer. The identity block is followed by an average pooling layer which outputs a 3×3×64-dimensional tensor. This layer is followed by a fully-connected layer, fc1, and a ReLU layer. Finally, the ReLU layer is followed by another fully-connected layer, fc2, which feeds into a softmax layer at the end.
In our experiments, we create three different variants of this student model. The only difference between these variants is the number of neurons in the first fully-connected layer, fc1. We use 50, 100 and 500 neurons for this layer to explore the effect of an increasing number of neurons. We set the number of neurons in the second fully-connected layer, fc2, to the number of classes in the classification task at hand. The total number of parameters and the percentage of parameters in fc1 for these networks are presented in Table 1. Figure 2 illustrates the architecture of our student network.
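A Keras sketch of this student architecture is given below. Where the text is ambiguous (e.g., the stride needed to obtain the stated 16×16×64 output, padding choices), the values below are our assumptions.

```python
from tensorflow.keras import layers, models, regularizers

def build_student(fc1_units=100, num_classes=10, l1_penalty=1e-4):
    """Sketch of the student network; exact padding/stride settings are assumptions."""
    inp = layers.Input(shape=(32, 32, 3))
    # 7x7 convolution with 64 filters; stride 2 is assumed to match the 16x16x64 output.
    x = layers.Conv2D(64, 7, strides=2, padding="same")(inp)
    x = layers.BatchNormalization()(x)
    x = layers.ReLU()(x)
    x = layers.MaxPooling2D(pool_size=3, strides=2)(x)        # -> 7x7x64

    # ResNet identity block: 1x1 -> 3x3 -> 1x1 convolutions, 64 filters each,
    # with a skip connection adding the block's input to its output.
    shortcut = x
    y = layers.Conv2D(64, 1, padding="same")(x)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(64, 3, padding="same")(y)
    y = layers.BatchNormalization()(y)
    y = layers.ReLU()(y)
    y = layers.Conv2D(64, 1, padding="same")(y)
    y = layers.BatchNormalization()(y)
    x = layers.ReLU()(layers.Add()([shortcut, y]))

    x = layers.AveragePooling2D(pool_size=3, strides=2)(x)     # -> 3x3x64
    x = layers.Flatten()(x)
    # fc1: L1 activity regularization is applied here during KDDP training.
    x = layers.Dense(fc1_units, activation="relu", name="fc1",
                     activity_regularizer=regularizers.l1(l1_penalty))(x)
    out = layers.Dense(num_classes, activation="softmax", name="fc2")(x)
    return models.Model(inp, out)
```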

Baseline Methods
We compare our method with the following models.

Vanilla SN:
We train the student network from scratch without any teacher guidance or regularization penalty. We use this model to find out the baseline performance of our student networks.

Vanilla-KD SN:
We train the student network with standard Knowledge Distillation [17] at different temperature values (T) but without regularization penalty.

Implementation Details
Teacher Network (TN): We train a ResNet-56 model from scratch. The learning rate is 10^-4, the mini-batch size is 64, and the optimization algorithm is Adam [46].

Student Networks (SN):
We use the same hyper-parameters while training all student models. All models are trained from scratch. Weights and biases are initialized with Xavier's initialization [47]. Network architectures are implemented using the Keras framework [48]. Adam [46] is used for training; the learning rate is set to 10^-4, and the mini-batch size is 64. An L1 regularization penalty is applied on fc1 during the training of the KDDP student networks. Training is stopped early if there is no improvement in accuracy on the validation set for 50 epochs.
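Under these settings, a sketch of the training configuration could look as follows. The builder function and data variables refer to the hypothetical sketches given earlier; note that Keras' default Dense/Conv initializer is Glorot (Xavier), matching the stated initialization.

```python
import tensorflow as tf

# Sketch of the training setup; for Vanilla SN the loss is the plain hard-label
# cross-entropy shown here, while KDDP students are trained with the KD loss of
# Eq. (4) and carry the L1 activity regularizer on fc1.
model = build_student(fc1_units=100)          # hypothetical builder from the earlier sketch
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

early_stop = tf.keras.callbacks.EarlyStopping(monitor="val_accuracy", patience=50)

model.fit(x_train, y_train,                   # data splits as described in Section 4
          batch_size=64,
          epochs=1000,                        # effectively bounded by early stopping
          validation_data=(x_val, y_val),
          callbacks=[early_stop])
```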

EXPERIMENTS
In this section, we describe the experimental evaluation and validation of our method. We evaluate it by comparing against the two baselines, and then provide extensive experiments on the hyper-parameters in Section 4.3.

Figure 3.
Example images from the CIFAR10 dataset.

Dataset
We use the CIFAR10 dataset [20] in our experiments. It contains 60000 32×32 color images in 10 classes, with 6000 images per class. Example images can be seen in Figure 3. There are 50000 training images and 10000 test images. We randomly sample (using stratified sampling, i.e., by preserving class frequencies) 10000 images from the training set to form a validation set. We report our results after observing no improvement on the validation set for 50 epochs during training. As data augmentation, we only use horizontal flipping. We use this setup in all experiments.
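The data setup can be sketched as follows; pixel scaling to [0, 1] and the specific split utility are our assumptions, while the stratified 10000-image validation split and horizontal-flip augmentation follow the description above.

```python
import tensorflow as tf
from sklearn.model_selection import train_test_split

# CIFAR10: 50000 training and 10000 test images of size 32x32x3.
(x_train, y_train), (x_test, y_test) = tf.keras.datasets.cifar10.load_data()
x_train, x_test = x_train / 255.0, x_test / 255.0             # scaling is an assumption

# Hold out 10000 training images as a stratified validation set.
x_train, x_val, y_train, y_val = train_test_split(
    x_train, y_train, test_size=10000, stratify=y_train, random_state=0)

y_train = tf.keras.utils.to_categorical(y_train, 10)
y_val = tf.keras.utils.to_categorical(y_val, 10)
y_test = tf.keras.utils.to_categorical(y_test, 10)

# The only data augmentation used is horizontal flipping.
augment = tf.keras.preprocessing.image.ImageDataGenerator(horizontal_flip=True)
train_iter = augment.flow(x_train, y_train, batch_size=64)
```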

Analysis of the Proposed Method
We present our main results in Table 2, where we compare the performance and parameter counts of the teacher model, the Vanilla SN model, the Vanilla-KD SN model and our KDDP model. We use the same teacher logits (i.e., z in Eq. (1)) for all experiments; the teacher is trained once. We use the same initial weights for all student network trainings, with hyper-parameters L1 = 1e-4 and α = 0.5. We also train Vanilla SNs and Vanilla-KD SNs for each model to explore the capacity of these networks and compare them with our model.
The teacher network has 1.7M parameters and yields an accuracy of 88.08% on the test set. This score is lower than other ResNet results on the same dataset, e.g., Cai et al. achieve 97.92% [49]. This is because we hold out 10K examples from the training set as validation data in order to have a solid early-stopping criterion, and we only use horizontal-flip augmentation.
The "Vanilla SN" (student network), which is the SN trained from scratch without any teacher guidance or regularization, has three versions.These versions differ only in the number of neurons in the fc1 fully connected layer.For 50, 100 and 500 neurons, Vanilla SN achieves 80.48%, 80.75%, 81.28% test set accuracy, respectively.Increasing the number of neurons in fc1 has a positive effect on model performance.However, this causes an increase in the number of parameters, as well.When the student model is trained using standard knowledge distillation method, we obtain the "Vanilla KD SN" models.Compared to the Vanilla SNs, they achieve around 1% better accuracy for all models.
We conduct further experiments to provide fair comparisons against KD based on the total number of neurons in the network. We record the number of neurons remaining in KDDP and train smaller Vanilla KD SNs that have the same neuron counts at fc1 as the final KDDP SNs. We denote these models with "Vanilla KD SNn", where n is 45, 83 or 264. We use the same softmax temperature for both cases. We observe that, with the same fc1 size, KDDP outperforms Vanilla KD.
From these results, we conclude that our dynamic pruning method both improves accuracy and reduces the computational cost of inference. In the following, we analyze the sensitivity of our method to its hyper-parameters, and also conduct a statistical significance analysis.

L1 Regularization Penalty Analysis
We use L1 regularization on the activations of fc1 neurons to increase sparsity; L1 regularization penalizes the absolute value of the neuron activations. We present results for different L1 penalties in Table 3, with α set to 0.5 in these experiments. We observe that larger values of the L1 penalty result in fewer active neurons at the fc1 layer and therefore decrease the performance of the models. For example, when L1 = 1e-3 and T = 32 in the KDDP SN100 experiment, the model gets stuck at a local minimum and cannot even reach the vanilla model's performance. However, a strong penalty can help when the model has fewer parameters: for L1 = 1e-3 and T = 2, our KDDP SN50 model achieves better performance than the other SN50 models. We also observe that smaller values of L1, e.g., L1 = 1e-5, do not work for our pruning method in any of the student models. Therefore, the L1 penalty should be tuned to strike a good balance between model size and classification performance. In our experiments, we set the L1 regularization penalty to 1e-4.
Table 3. Effect of the L1 regularization penalty. Results for models SN50, SN100 and SN500; hyper-parameter α = 0.5.

α Analysis
The hyper-parameter α in Eq. (4) sets the contribution of the two objective functions (i.e., the weight of distillation); using bigger α values means giving more importance to the soft targets in the objective function. We present results for different α values in Table 4. We can see that neither too small nor too large α values lead to good performance. α should also be tuned carefully to achieve a good balance between model size and classification performance. In our experiments, we observe that setting α to 0.5 gives the best results.

T Analysis
Setting the value of T, which "softens" the softmax output, is not a trivial task. To find a good value, we perform a grid search over the temperature values [2, 4, 8, 12, 16, 20, 32, 64, 100, 200, 1000, 5000]. If we keep increasing T, at some point the softened outputs saturate and no information flows from the teacher to the student network. We present our results in Table 4. When we train the network with 100 neurons solely with the loss coming from the soft targets (the second term in Eq. (4)) with a temperature of 5000, we get an accuracy of 10%, which equals random guessing on CIFAR10. For all models, we observe that the accuracy fluctuates depending on T. Therefore, we conclude that the temperature parameter should also be tuned carefully.
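To illustrate the saturation effect, the small numerical example below (with made-up logits) shows how the softened distribution of Eq. (1) approaches the uniform distribution as T grows, leaving almost no class-similarity information.

```python
import numpy as np

def soft_targets(logits, T):
    """Temperature softmax of Eq. (1)."""
    z = np.asarray(logits, dtype=float) / T
    e = np.exp(z - z.max())                  # subtract max for numerical stability
    return e / e.sum()

logits = [8.0, 2.0, -1.0]                    # illustrative teacher logits
print(soft_targets(logits, 1))               # ~[0.997, 0.002, 0.000]  nearly one-hot
print(soft_targets(logits, 4))               # ~[0.75, 0.17, 0.08]     similarities visible
print(soft_targets(logits, 5000))            # ~[0.334, 0.333, 0.333]  nearly uniform
```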

Statistical Analysis of the Results
We use Welch's t-test to measure the significance of our method's results. We train the Vanilla SN, Vanilla KD and KDDP models with 100 fc1 neurons 11 times, starting from different initial weights each time. We set the hyper-parameters to L1 = 1e-4 and α = 0.5. We present these results in Table 5.
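For reference, the t-score referred to as Eq. (5) below is, we assume, the standard Welch statistic for two samples with means $\bar{x}_1, \bar{x}_2$, variances $s_1^2, s_2^2$ and sizes $n_1 = n_2 = 11$:

$$t = \frac{\bar{x}_1 - \bar{x}_2}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} \qquad (5)$$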

KDDP & Vanilla Analysis:
We start with the null hypothesis that the mean of the KDDP results is equal to the mean of the Vanilla SN results. Then, we calculate the t-score of these two sets of classification results using Eq. (5) and obtain a t-score of 9.91. For a two-tailed hypothesis with 10 degrees of freedom, this t-score corresponds to p < .00001, which indicates statistical significance. Therefore, it is safe to reject the null hypothesis that there is no difference between the means of the results.
KDDP & Vanilla-KD Analysis: We follow the same computations for comparing our KDDP with the Vanilla-KD.We get a T-score of 2.9730, which corresponds to p = .013974for two-tailed hypothesis with 10 degrees of freedom.Since p <.05, it is again safe to reject the null hypothesis.We conclude that our KDDP model's performance has intrinsic differences from Vanilla SN and Vanilla KD results, and they are strong and are not by chance.

CONCLUSION
In this paper, we propose a new method based on Knowledge Distillation (KD) [17]. We apply L1 regularization to the activations of the neurons in a fully-connected layer of the student and remove the inactive neurons. There is no need to provide the final size of the student model as input; our method determines it automatically. Our method performs better than the standard KD method with far fewer parameters.
In our extensive experiments, we show that KD-based methods, including ours, are highly dependent on hyper-parameters. The choice of temperature T and distillation weight α determines the performance of the trained model: we observe that accuracy varies significantly between low and high T values, and α determines to what extent we rely on the teacher network's outputs. However, when the hyper-parameters are chosen carefully, our method works well and performs better than the baselines.
In conclusion, our method can be used when a much smaller network with comparable performance is needed, keeping in mind that careful hyper-parameter selection is vital for obtaining that performance.
Although we did not explore the use of our method for convolutional layers, we expect that similar gains (higher accuracy with fewer parameters) would be obtained.We leave this as future work.

Figure 2. Student Network overview. The network takes an image and outputs a class label. It is composed of an input layer followed by a convolutional layer, max pooling, an identity block of ResNet [45], average pooling, two fully-connected layers, and a softmax layer. ResNet's identity block is highlighted with the yellow rectangle. This block is composed of three convolutional layers and a skip connection which adds the input of the identity block and the output of the last convolutional layer in the identity block.

Table 1.
Student networks differ only in the number of neurons in the fc1 layer. Percentages in parentheses indicate the ratio of the parameters in fc1 to the total number of parameters.

Table 2.
Main results on the CIFAR10 test set.

Table 5.
Results of 11 different trainings of the Vanilla SN, Vanilla KD and KDDP models with 100 neurons in fc1. Hyper-parameters for KDDP are L1 = 1e-4 and α = 0.5. Although the differences in mean accuracies are small, they are statistically significant.