How knowledge distillation compresses neural networks
If you’ve ever used a neural network to solve a complex problem, you know they can be enormous, containing millions of parameters. The famous BERT model, for instance, has roughly 110 million. To illustrate the point, here are the parameter counts for the most common architectures in natural language processing (NLP), as summarized…
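If you want to check a number like this yourself, here is a minimal sketch (assuming the Hugging Face `transformers` library and PyTorch are installed) that counts the parameters of `bert-base-uncased`:

```python
# Count the parameters of a pretrained model
# (assumes `transformers` and `torch` are installed).
from transformers import AutoModel

model = AutoModel.from_pretrained("bert-base-uncased")

# Sum the number of elements in every weight tensor of the model.
num_params = sum(p.numel() for p in model.parameters())
print(f"{num_params:,}")  # ~110 million for bert-base-uncased
```

The same two lines work for any model on the Hugging Face Hub, which makes it easy to compare architectures by size.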