Question about tips and tricks #1

Open
szm-R opened this issue Aug 24, 2017 · 47 comments

Comments

@szm-R

szm-R commented Aug 24, 2017

Hi wentianli,

I've been experimenting with knowledge distillation for a while using Caffe's built-in layers, and I was able to get reasonably good results with some simple models. A couple of days ago I came across your layer; I examined the source code and it looks like a solid implementation. Now I'm trying to use your layer to improve the accuracy of GoogleNet, using a ResNet model as the teacher. Could you share any tips you might have about this process, for example how to tune hyperparameters such as the loss weights, the solver type, the learning rate, etc.?

I appreciate any help greatly.

@wentianli
Owner

wentianli commented Sep 7, 2017

For hyperparameters, I usually set the loss weight to 1; temperatures between 2 and 10 often give similar results, but infinite temperature (i.e., distilling the raw logits) is quite different. A large number of experiments is needed in any case.
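For context, the distillation paper (Hinton et al.) shows that in the high-temperature limit, assuming zero-mean logits, the distillation gradient reduces to logit matching; this is the paper's derivation, not something specific to this layer:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \frac{1}{T}\,(q_i - p_i) \;\approx\; \frac{1}{N T^2}\,(z_i - v_i),$$

where $z$ and $v$ are the student's and teacher's logits, $q$ and $p$ the corresponding softened probabilities, and $N$ the number of classes. So infinite temperature amounts to an L2 loss on the logits, which is why it behaves differently from moderate temperatures.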

The teacher should be good enough, especially for difficult tasks like ImageNet. However, a better-performing teacher doesn't always lead to better distillation results, so you may need to try several teachers.

Remember to freeze all the parameters of the teacher: set lr_mult and decay_mult to 0, set use_global_stats to true for BatchNorm layers, replace Dropout layers with fixed Scale layers, etc.
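For illustration, a minimal sketch of a frozen BatchNorm layer in the teacher branch (the layer and blob names here are invented for the example):

layer {
  name: "bn1_teacher"
  type: "BatchNorm"
  bottom: "conv1_teacher"
  top: "conv1_teacher"
  # all three internal blobs (mean, variance, moving-average factor) are frozen
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  batch_norm_param {
    use_global_stats: true  # use the stored statistics, not the current batch
  }
}

The point with Dropout is the same: the teacher should behave exactly as it does at test time, so its Dropout layers must be made deterministic (removed, or replaced by a fixed Scale layer that reproduces the test-time behaviour of the Dropout implementation the teacher was trained with).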

@szm-R
Author

szm-R commented Sep 10, 2017

Hello again, and thanks for your answer.
Could you perhaps tell me which teachers and students you have tried so far? (If they are well-known models like AlexNet, GoogleNet, SqueezeNet, etc.)

@wentianli
Owner

I haven't tried many models myself. I advise you to take a look at section 4 of this paper. I think various ResNets are useful for validating the training methods.

@adapt-image-models

@wentianli Hi, I'm having some trouble using this layer. Could you release an example .prototxt? Thanks a lot.

@Coderx7

Coderx7 commented Nov 23, 2017

@wentianli Could you please provide an example of how this can be implemented and trained?

@wentianli
Owner

For example, the prototxt for CIFAR10 goes like this...

First, there is a Data layer.

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 32
    mean_value: 125.30691805
    mean_value: 122.95039414
    mean_value: 113.86538318
  }
  data_param {
    source: "/home/cifar10_pad4_train_lmdb"
    batch_size: 128
    backend: LMDB
  }
  image_data_param {
    shuffle: true  # note: this block is ignored by a "Data" (LMDB) layer
  }
}

Then, blob 'data' is fed into the first layer of the student network.

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 16
    pad: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "msra"
    }
    bias_filler {
      type: "constant"
    }
  }
}
...

For a classification task, there is usually an InnerProduct layer that outputs the score.

layer {
  name: "score"
  type: "InnerProduct"
  bottom: "pool_global"
  top: "score"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}

In most cases, we use a SoftmaxWithLoss layer to compute the cross entropy loss between score and ground truth label. For knowledge distillation, you can keep it and use a smaller loss_weight.

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  include {
    phase: TRAIN
  }
  bottom: "score"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}

Similarly, we feed the data into the teacher network. Remember to freeze its weights.

layer {
  name: "conv1_teacher"
  type: "Convolution"
  bottom: "data"
  top: "conv1_teacher"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 3
    stride: 1
  }
}
...

The teacher network also produces a score for classification. Here, we name the blob score_teacher. It corresponds to the term "soft label" or "soft target" in the reference paper.

layer {
  name: "score_teacher"
  type: "InnerProduct"
  bottom: "pool_global_teacher"
  top: "score_teacher"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
  }
}

Finally, a KnowledgeDistillation layer computes the KL loss between score and score_teacher.

layer {
  name: "KD"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher"
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4}
  loss_weight: 1
}
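For reference, in the standard formulation of the reference paper (a sketch; whether this layer additionally scales the loss by T² is something to verify in its source), the softened probabilities and the loss are:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad p_i = \frac{\exp(v_i / T)}{\sum_j \exp(v_j / T)}, \qquad \mathcal{L}_{KD} = \sum_i p_i \log \frac{p_i}{q_i},$$

where $z$ is the student's score, $v$ is the teacher's score_teacher, and $T$ is the temperature.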

@wentianli
Owner

Here is another way to implement it. Since the teacher network is fixed, we can compute and save the aforementioned score_teacher (in HDF5 format) beforehand. When we train the student network, we simply load score_teacher with an HDF5Data layer, and the teacher network is no longer included in the prototxt. This is more efficient. However, the result is slightly different if data augmentation is used, because the pre-computed scores no longer correspond exactly to the augmented inputs.
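A minimal sketch of that input layer (the file names are placeholders; the dataset name inside the .h5 files must match the top blob name, and the batch size must match the Data layer's):

layer {
  name: "teacher_scores"
  type: "HDF5Data"
  top: "score_teacher"
  include { phase: TRAIN }
  hdf5_data_param {
    source: "teacher_scores_list.txt"  # text file listing the .h5 files
    batch_size: 128
  }
}

Note that this only lines up if the LMDB and the HDF5 file store the examples in the same order and neither input is shuffled.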

@Coderx7

Coderx7 commented Nov 23, 2017

Thank you very much. I had the impression that we first train the teacher network on a dataset and compute all of its logits, then train the student model on the same dataset and, for distillation, match the student's logits against those pre-computed teacher logits.
Looking at your example and explanation, it seems the teacher and student networks are run in parallel and pre-training them is not necessary, right?

Edit:
I see your second comment, which clears everything up now. Thank you very much :)

@zhanglaplace

zhanglaplace commented Dec 28, 2017

@wentianli Thanks for your training prototxt. I have also been thinking about this training problem: if the teacher network is run during training, its complexity forces a small batch size. If we instead save the teacher's outputs first, the student network can be trained with a larger batch size, although this requires changing the input layers. Thanks for sharing; I will give it a try.

@wentianli
Owner

@zhanglaplace You can use iter_size in solver.prototxt to get a large effective batch size with limited GPU memory. Besides, training the teacher and student networks simultaneously is usually called mutual learning, which is very tricky.
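For example, a fragment of solver.prototxt (the other values are placeholders); the effective batch size is batch_size × iter_size:

net: "train_val.prototxt"
base_lr: 0.1
iter_size: 4   # accumulate gradients over 4 forward/backward passes,
               # e.g. batch_size 32 then behaves like a batch size of 128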

@zhanglaplace

@wentianli thanks

@pasxalinamed

@wentianli you mention earlier that in the teacher network we should

change Dropout layer to Scale layer, etc

Why should this be changed?

@dawuchen

dawuchen commented Jan 5, 2018

@wentianli Hi, I saw that you use two loss layers, "SoftmaxWithLoss" and "KnowledgeDistillation", which both take score as a bottom. But in Caffe's InnerProduct layer, the backward pass only uses the diff from a single top blob, unlike the Convolution layer, which accumulates diffs. So wouldn't the network effectively be trained with only one of the losses? Could you share the training results using the prototxt above; does the network work well?

@wentianli
Owner

@dawuchen Caffe automatically splits a blob when it is used twice. The diffs are thus accumulated.
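Concretely, when a blob is consumed by two layers, Caffe's net initialization inserts a Split layer whose backward pass sums the diffs coming from its top blobs; conceptually it looks like this (the auto-generated names shown are only illustrative):

layer {
  name: "score_score_0_split"   # inserted automatically, never written by hand
  type: "Split"
  bottom: "score"
  top: "score_score_0_split_0"  # consumed by SoftmaxWithLoss
  top: "score_score_0_split_1"  # consumed by KnowledgeDistillation
}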

@dawuchen

dawuchen commented Jan 5, 2018

@wentianli You are right. I made a mistake about the accumulation operation in the conv layer; it is for different kernels. Thanks.

@qinxianyuzi

My prototxt is like this:
input: "data"
layer{
name: "data"
type: "ImageData"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
mean_value: 127.5
mean_value: 127.5
mean_value: 127.5
scale: 0.0078125
}
image_data_param {
source: "cifarlist.txt"
batch_size: 32
new_width: 112
new_height: 112
is_color: true
shuffle: true
}
}
layer {
bottom: "data"
top: "conv1"
name: "conv1"
type: "Convolution"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 32
kernel_size: 3
pad: 1
stride: 1
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
.
.
.
layer {
bottom: "pool_avg"
top: "classifier"
name: "classifier"
type: "InnerProduct"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
inner_product_param {
num_output: 10
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "softmax_loss1"
type: "SoftmaxWithLoss"
bottom: "classifier"
bottom: "label"
top: "softmax_loss1"
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "classifier"
bottom: "label"
top: "accuracy"
include: { phase: TRAIN }
}

layer {
bottom: "data"
top: "conv1s"
name: "conv1s"
type: "Convolution"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 16
kernel_size: 3
pad: 1
stride: 1
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
.
.
.
layer {
bottom: "pool_avgs"
top: "classifiers"
name: "classifiers"
type: "InnerProduct"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
inner_product_param {
num_output: 10
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "softmax_loss2"
type: "SoftmaxWithLoss"
bottom: "classifiers"
bottom: "label"
top: "softmax_loss2"
loss_weight: 0.2
}
layer {
name: "accuracys"
type: "Accuracy"
bottom: "classifiers"
bottom: "label"
top: "accuracys"
include: { phase: TRAIN }
}

layer {
name: "KL_loss"
type: "KnowledgeDistillation"
bottom: "classifiers" #student
bottom: "classifier" #teacher
top: "KL_loss"
include { phase: TRAIN }
knowledge_distillation_param {
temperature: 4
}
loss_weight: 1
}
When I train it, the log shows the warning "KnowledgeDistillation Layer cannot backpropagate to soft label nor label inputs".

@wentianli
Owner

@qinxianyuzi The warning occurs because the second bottom doesn't receive any gradients.
You can use propagate_down to stop backprop explicitly:
layer {
  name: "KL_loss"
  type: "KnowledgeDistillation"
  bottom: "classifiers" #student
  bottom: "classifier" #teacher
  propagate_down: 1
  propagate_down: 0
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param {
    temperature: 4
  }
  loss_weight: 1
}

or freeze the teacher network as said in #2

@qinxianyuzi

@wentianli Thanks very much. Is it harder for the student to learn with a higher temperature?

@wentianli
Owner

@qinxianyuzi the optimal temperature is often between 2 and 10

@qinxianyuzi

@wentianli Thank you! Sometimes we should try different temperatures according to the training task.

@liangzimei

Hello @wentianli, when I have two teacher models (i.e., an ensemble), how should I arrange the logits of each teacher? If I choose an averaging strategy to combine the two teachers, can I simply take the mean of their logits when training the student model?

@wentianli
Owner

wentianli commented Jun 7, 2018

@liangzimei Averaging logits is incorrect. The KL loss sums -p_i * log(q_i) (plus a constant) over every class i, where p_i is the probability the teacher assigns to class i. With two teachers this term becomes -0.5 * p1_i * log(q_i) - 0.5 * p2_i * log(q_i) (plus a constant), which means you need two KnowledgeDistillation layers.
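A sketch of what this could look like in the prototxt (the blob names score_teacher1 and score_teacher2 are assumptions); averaging the two teachers' probabilities is equivalent, up to a constant, to two KnowledgeDistillation layers whose loss weights sum to the single-teacher weight:

layer {
  name: "KD_teacher1"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher1"
  top: "KL_loss1"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 0.5
}
layer {
  name: "KD_teacher2"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher2"
  top: "KL_loss2"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 0.5
}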

@liangzimei

liangzimei commented Jun 12, 2018

@wentianli Thank you so much, I will give it a try. Did you implement a softmax loss layer with temperature to use when training a teacher, or do you just use a Power layer? Thanks in advance.

@wentianli
Owner

@liangzimei I didn't use temperature when training a teacher.
A Scale layer with fixed weights could handle that:
layer {
  name: "XXX"
  type: "Scale"
  bottom: "XXX"
  top: "XXX"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  scale_param {
    filler { value: 0.5 } # here temperature = 2
    bias_term: false
  }
}

@liangzimei

@wentianli OK, so you mean that when training a teacher, temperature = 1 is used in most cases (including Hinton's paper)?

@liuqunzhong

Can this be used for a regression model?
For example, face alignment.

@wentianli
Owner

@liuqunzhong L1 or L2 loss is used for regression; the KnowledgeDistillation layer is implemented for classification.
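If you do want to distill a regression model, a common workaround (a sketch, not something this repository provides; the blob names are invented) is to add an L2 loss between the student's prediction and the frozen teacher's prediction, alongside the usual ground-truth loss:

layer {
  name: "distill_l2"
  type: "EuclideanLoss"
  bottom: "landmarks_student"   # student prediction
  bottom: "landmarks_teacher"   # frozen teacher prediction
  top: "distill_l2_loss"
  include { phase: TRAIN }
  loss_weight: 1
}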

@liangzimei

liangzimei commented Jul 10, 2018

Hello @wentianli, when I train a student model (MobileNet-v1) taught by an ensemble model (two models, one of which is also MobileNet-v1), the student's accuracy always ends up between the two teachers. Any suggestions? Thanks in advance...

@wentianli
Owner

@liangzimei You mean the student model outperforms its stand-alone counterpart but underperforms the teacher? That is to be expected. To obtain better accuracy, you probably need to replace the MobileNet-v1 in the ensemble with a better model.

@WormCoder

Thanks for sharing.
Why is KL divergence adopted as the loss instead of cross-entropy?

@wentianli
Owner

@WormCoder The only difference between KL divergence and cross-entropy is a constant term, which doesn't affect backprop at all. When the student and the teacher have exactly the same outputs (which is our goal for training), the KL divergence becomes zero.
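In symbols, with p the teacher's softened probabilities and q the student's:

$$\mathrm{KL}(p \,\|\, q) = \underbrace{-\sum_i p_i \log q_i}_{\text{cross-entropy}} \;-\; \underbrace{\Big(-\sum_i p_i \log p_i\Big)}_{\text{entropy of } p,\ \text{constant w.r.t. the student}}$$

The gradients with respect to the student are therefore identical; the KL divergence is simply the more convenient quantity to monitor because it reaches zero when q = p.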

@liuqunzhong

The student's input is 24x24 and the teacher's is 48x48, using two datasets with the same data order. Is that OK?

@liangzimei

@WormCoder yeah, it drops fast in the beginning.

@westnight

@liangzimei Hello, I want to know about freezing the teacher model: if I prefer to set propagate_down: false, do I also need to set the weight decay to 0 in solver.prototxt? I mean, is the single propagate_down parameter enough to freeze a model?

@liangzimei

liangzimei commented Aug 11, 2018

@westnight When training the student model, we should freeze the teacher. We can set lr_mult: 0 and decay_mult: 0 in the teacher's conv layers to avoid updating its parameters. BN layers may be different; you can refer to the previous replies.

@westnight

@liangzimei Thank you. I know that setting the learning rate and weight decay to zero freezes the teacher model, but I wonder whether using propagate_down is another way to do it.

@liangzimei

@westnight According to my understanding, it is. When I train the student model, I use both propagate_down: false and lr_mult: 0 to be safe.

@iamweiweishi

Thank you for your great work.
To compose the prototxt for training, should I write the teacher prototxt and the student prototxt into a single file? If so, how do I initialize the 'teacher' with a pretrained caffemodel? And after training, how do I save only the 'student' part of the model?
Could you please send me a copy of your prototxt? Thank you so much.

@wentianli
Owner

@iamweiweishi Into a single file? Yes. Because the 'student' is randomly initialized, it is best to give the 'teacher' layers the same names as in the pretrained caffemodel, which lets Caffe load the pretrained weights directly (weights are matched by layer name). To save a partial model, a convenient way is to save the model in HDF5 format and then rename or delete some of the blobs. For an example prototxt, please see the comment above.
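A sketch of how the pieces could fit together (the file names are placeholders): the pretrained teacher is passed as the initial weights when launching training, e.g. via the caffe tool's --weights option, and since weights are copied only into layers whose names match, the student's layers remain randomly initialized. Snapshots can be written in HDF5 so the teacher's parameters can later be stripped with any HDF5 tool:

# solver.prototxt (fragment)
net: "kd_train_val.prototxt"       # contains both the student and the frozen teacher
snapshot_prefix: "snapshots/kd"
snapshot_format: HDF5              # snapshots are written as .caffemodel.h5 files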

@iamweiweishi

Thank you. @wentianli It works now.

@hito0512

@wentianli I'm sorry to bother you. I still don't understand how to initialize the 'teacher' with a pretrained caffemodel. I have written the teacher prototxt and the student prototxt into a single file, but I don't know how to use teacher.caffemodel when I start training the student model. You said "it is better to name the layers of 'teacher' identical with the pretrained caffemodel, which allows you to directly load the pretrained model." I have already named the teacher's layers identically to the pretrained caffemodel (teacher.caffemodel), but it doesn't load the pretrained model. Should I change the name of teacher.caffemodel, and where should I put it? Should teacher.caffemodel and the final student caffemodel go in the same place? Thank you so much.

@hito0512

@iamweiweishi I'm sorry to bother you as well. I have the same question as in my comment above: I've written the teacher and student prototxts into a single file and named the teacher's layers identically to teacher.caffemodel, but the pretrained model still isn't loaded when I start training the student. Did you have to rename teacher.caffemodel or put it in a particular place? Thank you so much.

@Wenzhiqiang16

@hito0512 Sorry to bother you. Did you solve this problem in the end? I have the same trouble.

@wuzuowuyou

L1 or L2 loss is used for regression
What about distilling with an L1 or L2 loss?


@Aviator99999

@hito0512 (regarding your question above about loading teacher.caffemodel) Sorry for disturbing you now; I need to train in the same fashion. Could you please help me with the procedure if you found it?
Thank you

