# Knowledge distillation

In machine learning, knowledge distillation is the process of transferring knowledge from a large model to a smaller one. While large models (such as very deep neural networks or ensembles of many models) have a higher knowledge capacity than small models, this capacity might not be fully utilized. Evaluating a model can be just as computationally expensive even when it uses little of its knowledge capacity. Knowledge distillation aims to perform this transfer without loss of validity. As smaller models are less expensive to evaluate, they can be deployed on less powerful hardware (such as a mobile device).[1]

Knowledge distillation has been successfully used in several applications of machine learning such as object detection,[2] acoustic models,[3] and natural language processing.[4] Recently, it has also been introduced to graph neural networks applicable to non-grid data.[5]

## Concept of distillation

Transferring knowledge from a large model to a small one requires teaching the latter without loss of validity. If both models are trained on the same data, the small model may have insufficient capacity to learn a concise knowledge representation given the same computational resources and the same data as the large model. However, some information about a concise knowledge representation is encoded in the pseudolikelihoods that the large model assigns to its outputs: when a model correctly predicts a class, it assigns a large value to the output variable corresponding to that class, and smaller values to the other output variables. The distribution of values among the outputs for a record provides information on how the large model represents knowledge. Therefore, the goal of economical deployment of a valid model can be achieved by training only the large model on the data, exploiting its better ability to learn concise knowledge representations, and then distilling that knowledge into the smaller model, which would not be able to learn it on its own, by training it to reproduce the soft output of the large model.[1]

Model compression, a methodology to compress the knowledge of multiple models into a single neural network, was introduced in 2006. Compression was achieved by training a smaller model on large amounts of pseudo-data labelled by a higher-performing ensemble, optimising to match the logits of the compressed model to the logits of the ensemble.[6] Knowledge distillation is a generalisation of this approach, introduced by Geoffrey Hinton et al. in 2015,[1] in a preprint that formulated the concept and showed some results achieved in the task of image classification.

## Formulation

Consider a large model, viewed as a function of the vector variable ${\displaystyle \mathbf {x} }$ and trained for a specific classification task. Typically, the final layer of the network is a softmax of the form

${\displaystyle y_{i}(\mathbf {x} |t)={\frac {e^{\frac {z_{i}(\mathbf {x} )}{t}}}{\sum _{j}e^{\frac {z_{j}(\mathbf {x} )}{t}}}}}$

where ${\displaystyle t}$ is a parameter called temperature, which for a standard softmax is normally set to 1. The softmax operator converts the logit values ${\displaystyle z_{i}(\mathbf {x} )}$ to pseudo-probabilities, and higher values of temperature generate a softer distribution of pseudo-probabilities among the output classes. Knowledge distillation consists of training a smaller network, called the distilled model, on a dataset called the transfer set (which may differ from the dataset used to train the large model), using as loss function the cross-entropy between the output of the distilled model ${\displaystyle \mathbf {y} (\mathbf {x} |t)}$ and the output ${\displaystyle {\hat {\mathbf {y} }}(\mathbf {x} |t)}$ produced by the large model on the same record (or the average of the individual outputs, if the large model is an ensemble), with a high value of softmax temperature ${\displaystyle t}$ for both models:[1]

${\displaystyle E(\mathbf {x} |t)=-\sum _{i}{\hat {y}}_{i}(\mathbf {x} |t)\log y_{i}(\mathbf {x} |t).}$

In this context, a high temperature increases the entropy of the output and therefore gives the distilled model more information to learn from than hard targets would, while also reducing the variance of the gradient between different records and thereby allowing higher learning rates.[1]
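
The effect of temperature can be illustrated with a minimal sketch in plain Python. The logit values and the function names `softmax`, `entropy` and `distillation_loss` are assumptions chosen for illustration and do not come from the cited work. The sketch evaluates the temperature softmax and the cross-entropy ${\displaystyle E(\mathbf {x} |t)}$ defined above for a single record.

```python
import math

def softmax(logits, t=1.0):
    """Temperature softmax: y_i = exp(z_i / t) / sum_j exp(z_j / t)."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)

def distillation_loss(student_logits, teacher_logits, t):
    """Cross-entropy E = -sum_i y_hat_i(t) * log y_i(t) for one record."""
    y_hat = softmax(teacher_logits, t)  # soft targets from the large model
    y = softmax(student_logits, t)      # output of the distilled model
    return -sum(yh * math.log(yi) for yh, yi in zip(y_hat, y))

# Hypothetical logits for a three-class problem (illustrative values only).
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [3.0, 1.5, 0.0]

for t in (1.0, 4.0, 10.0):
    p = softmax(teacher_logits, t)
    print("t =", t, "soft targets =", [round(v, 3) for v in p],
          "entropy =", round(entropy(p), 3))

print("distillation loss at t = 4:",
      distillation_loss(student_logits, teacher_logits, t=4.0))
```

As the temperature grows, the printed soft targets move toward a uniform distribution and their entropy increases. The loss reaches its minimum, equal to the entropy of the soft targets, when the distilled model reproduces them exactly.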

If ground truth is available for the transfer set, the process can be strengthened by adding to the loss the cross-entropy between the output of the distilled model (computed with ${\displaystyle t=1}$) and the known label ${\displaystyle {\bar {y}}}$

${\displaystyle E(\mathbf {x} |t)=-t^{2}\sum _{i}{\hat {y}}_{i}(\mathbf {x} |t)\log y_{i}(\mathbf {x} |t)-\sum _{i}{\bar {y}}_{i}\log y_{i}(\mathbf {x} |1)}$

where the component of the loss with respect to the large model is weighted by a factor of ${\displaystyle t^{2}}$ since, as the temperature increases, the gradient of the loss with respect to the model weights scales by a factor of ${\displaystyle {\frac {1}{t^{2}}}}$.[1]
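
A minimal sketch of this combined loss for a single record, in plain Python, might look as follows. The function name `combined_loss`, the integer encoding of the label and the example logits are assumptions made for illustration, not part of the cited formulation.

```python
import math

def softmax(logits, t=1.0):
    """Temperature softmax: y_i = exp(z_i / t) / sum_j exp(z_j / t)."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def combined_loss(student_logits, teacher_logits, label, t):
    """E = -t^2 * sum_i y_hat_i(t) log y_i(t) - sum_i y_bar_i log y_i(1).

    `label` is the index of the true class, i.e. y_bar is assumed one-hot.
    """
    y_hat_t = softmax(teacher_logits, t)  # soft targets at temperature t
    y_t = softmax(student_logits, t)      # distilled model at temperature t
    y_1 = softmax(student_logits, 1.0)    # distilled model at t = 1
    soft_term = -sum(yh * math.log(yi) for yh, yi in zip(y_hat_t, y_t))
    hard_term = -math.log(y_1[label])     # only the true class contributes
    return t * t * soft_term + hard_term

# Hypothetical logits and label (illustrative values only).
teacher_logits = [6.0, 2.0, -1.0]
student_logits = [3.0, 1.5, 0.0]
print(combined_loss(student_logits, teacher_logits, label=0, t=4.0))
```

The soft term is multiplied by ${\displaystyle t^{2}}$ so that its gradient stays on a scale comparable to that of the hard-label term when the temperature is changed.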

## Relationship with model compression

Under the assumption that the logits have zero mean, it is possible to show that model compression is a special case of knowledge distillation. The gradient of the knowledge distillation loss ${\displaystyle E}$ with respect to the logit of the distilled model ${\displaystyle z_{i}}$ is given by

${\displaystyle {\begin{aligned}{\frac {\partial }{\partial z_{i}}}E&=-{\frac {\partial }{\partial z_{i}}}\sum _{j}{\hat {y}}_{j}\log y_{j}\\&=-\sum _{j}{\hat {y}}_{j}{\frac {1}{y_{j}}}{\frac {\partial y_{j}}{\partial z_{i}}}\\&=-\sum _{j}{\hat {y}}_{j}{\frac {1}{y_{j}}}\cdot {\frac {1}{t}}y_{j}\left(\delta _{ij}-y_{i}\right)\\&=-{\frac {1}{t}}\left({\hat {y}}_{i}-y_{i}\sum _{j}{\hat {y}}_{j}\right)\\&={\frac {1}{t}}\left(y_{i}-{\hat {y}}_{i}\right)\\&={\frac {1}{t}}\left({\frac {e^{\frac {z_{i}}{t}}}{\sum _{j}e^{\frac {z_{j}}{t}}}}-{\frac {e^{\frac {{\hat {z}}_{i}}{t}}}{\sum _{j}e^{\frac {{\hat {z}}_{j}}{t}}}}\right)\end{aligned}}}$

where ${\displaystyle \delta _{ij}}$ is the Kronecker delta, ${\displaystyle \sum _{j}{\hat {y}}_{j}=1}$ because the outputs of the large model form a probability distribution, and ${\displaystyle {\hat {z}}_{i}}$ are the logits of the large model. For large values of ${\displaystyle t}$, using the first-order approximation ${\displaystyle e^{x}\approx 1+x}$ and writing ${\displaystyle N}$ for the number of classes, this can be approximated as

${\displaystyle {\frac {1}{t}}\left({\frac {1+{\frac {z_{i}}{t}}}{N+\sum _{j}{\frac {z_{j}}{t}}}}-{\frac {1+{\frac {{\hat {z}}_{i}}{t}}}{N+\sum _{j}{\frac {{\hat {z}}_{j}}{t}}}}\right)}$

and under the zero-mean hypothesis ${\displaystyle \sum _{j}z_{j}=\sum _{j}{\hat {z}}_{j}=0}$ it becomes ${\displaystyle {\frac {z_{i}-{\hat {z}}_{i}}{Nt^{2}}}}$, which is, up to the constant factor ${\displaystyle {\frac {1}{Nt^{2}}}}$, the derivative of ${\displaystyle {\frac {1}{2}}\left(z_{i}-{\hat {z}}_{i}\right)^{2}}$. In other words, the loss is equivalent to matching the logits of the two models, as done in model compression.[1]
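
The gradient expression and its high-temperature limit can be checked numerically. The following plain-Python sketch is illustrative only: the logit values and helper names are assumptions. It compares the closed form ${\displaystyle {\frac {1}{t}}\left(y_{i}-{\hat {y}}_{i}\right)}$ with a central finite difference of the loss, and with the logit-matching approximation for zero-mean logits at a large temperature.

```python
import math

def softmax(logits, t=1.0):
    """Temperature softmax: y_i = exp(z_i / t) / sum_j exp(z_j / t)."""
    scaled = [z / t for z in logits]
    m = max(scaled)  # subtracting the maximum avoids overflow in exp
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def loss(z, z_hat, t):
    """Distillation cross-entropy E = -sum_j y_hat_j log y_j."""
    y_hat, y = softmax(z_hat, t), softmax(z, t)
    return -sum(yh * math.log(yi) for yh, yi in zip(y_hat, y))

def analytic_grad(z, z_hat, t):
    """Closed form dE/dz_i = (y_i - y_hat_i) / t."""
    y, y_hat = softmax(z, t), softmax(z_hat, t)
    return [(yi - yh) / t for yi, yh in zip(y, y_hat)]

def numeric_grad(z, z_hat, t, eps=1e-6):
    """Central finite-difference estimate of dE/dz_i."""
    grad = []
    for i in range(len(z)):
        up, down = list(z), list(z)
        up[i] += eps
        down[i] -= eps
        grad.append((loss(up, z_hat, t) - loss(down, z_hat, t)) / (2 * eps))
    return grad

def high_temperature_grad(z, z_hat, t):
    """Approximation (z_i - z_hat_i) / (N t^2), valid for zero-mean logits."""
    n = len(z)
    return [(zi - zh) / (n * t * t) for zi, zh in zip(z, z_hat)]

# Hypothetical zero-mean logits (illustrative values only).
z = [1.0, -0.5, -0.5]      # distilled model
z_hat = [2.0, 0.0, -2.0]   # large model

print(analytic_grad(z, z_hat, t=2.0))
print(numeric_grad(z, z_hat, t=2.0))
print(analytic_grad(z, z_hat, t=100.0))
print(high_temperature_grad(z, z_hat, t=100.0))
```

The first two printed vectors should agree closely at any temperature, while the last two should agree only approximately, with a discrepancy that shrinks as ${\displaystyle t}$ grows.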

## References

1. Hinton, Geoffrey; Vinyals, Oriol; Dean, Jeff (2015). "Distilling the knowledge in a neural network". arXiv:1503.02531 [stat.ML].
2. Chen, Guobin; Choi, Wongun; Yu, Xiang; Han, Tony; Chandraker, Manmohan (2017). "Learning efficient object detection models with knowledge distillation". Advances in Neural Information Processing Systems: 742–751.
3. Asami, Taichi; Masumura, Ryo; Yamaguchi, Yoshikazu; Masataki, Hirokazu; Aono, Yushi (2017). "Domain adaptation of DNN acoustic models using knowledge distillation". IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 5185–5189.
4. Cui, Jia; Kingsbury, Brian; Ramabhadran, Bhuvana; Saon, George; Sercu, Tom; Audhkhasi, Kartik; Sethy, Abhinav; Nussbaum-Thom, Markus; Rosenberg, Andrew (2017). "Knowledge distillation across ensembles of multilingual models for low-resource languages". IEEE International Conference on Acoustics, Speech and Signal Processing. pp. 4825–4829.
5. Yang, Yiding; Qiu, Jiayan; Song, Mingli; Tao, Dacheng; Wang, Xinchao (2020). "Distilling Knowledge from Graph Convolutional Networks". Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition: 7072–7081. arXiv:2003.10477. Bibcode:2020arXiv200310477Y.
6. Buciluǎ, Cristian; Caruana, Rich; Niculescu-Mizil, Alexandru (2006). "Model compression". Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining.