This Keras code example shows you how to implement knowledge distillation! Knowledge distillation has led to advances in model compression, training state-of-the-art models, and stabilizing Transformers for computer vision. To build on this example, all you need to do is swap out the teacher and student architectures. I think the demonstration of how to override keras.Model and combine two loss functions, weighted by an alpha hyperparameter, is very useful as well.
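To make the alpha-weighted two-loss setup concrete, here is a minimal NumPy sketch of the kind of objective the Keras example combines: a standard cross-entropy on the hard labels plus a temperature-softened KL divergence between teacher and student logits. The function names, the default `alpha` and `temperature` values, and the `T^2` scaling convention are illustrative assumptions, not a transcription of the video's code.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Numerically stable softmax with optional temperature softening
    z = logits / temperature
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels,
                      alpha=0.1, temperature=3.0):
    # Hard-label term: cross-entropy between student predictions and labels
    p_student = softmax(student_logits)
    n = len(labels)
    student_loss = -np.mean(np.log(p_student[np.arange(n), labels] + 1e-12))

    # Soft-label term: KL divergence between temperature-softened
    # teacher and student distributions
    p_t = softmax(teacher_logits, temperature)
    p_s = softmax(student_logits, temperature)
    kl = np.mean(np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)),
                        axis=-1))

    # alpha weights the hard-label loss; (1 - alpha) weights distillation.
    # The T^2 factor (from Hinton et al.) keeps gradient scales comparable.
    return alpha * student_loss + (1 - alpha) * (temperature ** 2) * kl
```

In the Keras example this same weighting lives inside a custom `train_step`, so swapping teacher/student architectures leaves the loss logic untouched.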
Content Links
Knowledge Distillation (Keras Code Examples): keras.io/examples/vision/know...
DistilBERT: arxiv.org/pdf/1910.01108.pdf
Self-Training with Noisy Student: arxiv.org/pdf/1911.04252.pdf
Data-efficient Image Transformers: / data-efficient-image-t...
KL Divergence: en.wikipedia.org/wiki/Kullbac...
0:00 Beginning
0:44 Motivation, Success Stories
2:47 Custom keras.Model
11:18 Teacher and Student models
12:17 Data Loading, Train the Teacher
14:05 Distill Teacher to Student