Question about tips and tricks #1

Open
szm-R opened this issue Aug 24, 2017 · 47 comments

Comments

@szm-R

szm-R commented Aug 24, 2017

Hi wentianli,

I've been experimenting with knowledge distillation for a while using Caffe's built-in layers, and I was able to get reasonably good results with some simple models. A couple of days ago I came across your layer; I examined the source code and it looks like a solid implementation. Now I'm trying to use your layer to improve the accuracy of GoogleNet, using a ResNet model as the teacher. Could you share any tips you might have about this process, for example how to tune hyperparameters such as the loss weights, the solver type, the learning rate, etc.?

I appreciate any help greatly.

@wentianli
Owner

wentianli commented Sep 7, 2017

For hyperparameters, I usually set the loss weight to 1; temperatures between 2 and 10 often give similar results, but infinite temperature (i.e., distilling the raw logits) is quite different. A large number of experiments is needed in any case.
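For context, the distillation paper (Hinton et al.) shows that in the high-temperature limit, assuming zero-mean logits, the distillation gradient reduces to logit matching; this is the paper's derivation, not something specific to this layer:

$$\frac{\partial \mathcal{L}}{\partial z_i} = \frac{1}{T}\,(q_i - p_i) \;\approx\; \frac{1}{N T^2}\,(z_i - v_i),$$

where $z$ and $v$ are the student's and teacher's logits, $q$ and $p$ the corresponding softened probabilities, and $N$ the number of classes. So infinite temperature amounts to an L2 loss on the logits, which is why it behaves differently from moderate temperatures.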

The teacher should be good enough, especially for difficult tasks like ImageNet. However, a better-performing teacher doesn't always lead to better distillation results, so you may need to try several teachers.

Remember to freeze all the parameters of the teacher: set lr_mult and decay_mult to 0, set use_global_stats to true for BatchNorm layers, replace Dropout layers with fixed Scale layers, etc.
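For illustration, a minimal sketch of a frozen BatchNorm layer in the teacher branch (the layer and blob names here are invented for the example):

layer {
  name: "bn1_teacher"
  type: "BatchNorm"
  bottom: "conv1_teacher"
  top: "conv1_teacher"
  # all three internal blobs (mean, variance, moving-average factor) are frozen
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  param { lr_mult: 0 decay_mult: 0 }
  batch_norm_param {
    use_global_stats: true  # use the stored statistics, not the current batch
  }
}

The point with Dropout is the same: the teacher should behave exactly as it does at test time, so its Dropout layers must be made deterministic (removed, or replaced by a fixed Scale layer that reproduces the test-time behaviour of the Dropout implementation the teacher was trained with).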

@szm-R
Author

szm-R commented Sep 10, 2017

Hello again, and thanks for your answer.
Could you perhaps tell me which teachers and students you have tried so far? (If they are well-known models like AlexNet, GoogleNet, SqueezeNet, etc.)

@wentianli
Owner

I haven't tried many models myself. I advise you to take a look at section 4 of this paper. I think various ResNets are useful for validating the training methods.

@adapt-image-models

@wentianli Hi, I'm having some trouble using this layer. Could you release an example .prototxt? Thanks a lot.

@Coderx7

Coderx7 commented Nov 23, 2017

@wentianli Could you please provide an example of how this can be implemented and trained?

@wentianli
Owner

For example, the prototxt for CIFAR10 goes like this...

First, there is a Data layer.

layer {
  name: "data"
  type: "Data"
  top: "data"
  top: "label"
  include {
    phase: TRAIN
  }
  transform_param {
    mirror: true
    crop_size: 32
    mean_value: 125.30691805
    mean_value: 122.95039414
    mean_value: 113.86538318
  }
  data_param {
    source: "/home/cifar10_pad4_train_lmdb"
    batch_size: 128
    backend: LMDB
  }
  image_data_param {
    shuffle: true  # note: this block is ignored by a "Data" (LMDB) layer
  }
}

Then, blob 'data' is fed into the first layer of the student network.

layer {
  name: "conv1"
  type: "Convolution"
  bottom: "data"
  top: "conv1"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  convolution_param {
    num_output: 16
    pad: 1
    kernel_size: 3
    stride: 1
    weight_filler {
      type: "msra"
    }
    bias_filler {
      type: "constant"
    }
  }
}
...

For a classification task, there is usually an InnerProduct layer that outputs the score.

layer {
  name: "score"
  type: "InnerProduct"
  bottom: "pool_global"
  top: "score"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  param {
    lr_mult: 2
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
    weight_filler {
      type: "gaussian"
      std: 0.01
    }
    bias_filler {
      type: "constant"
      value: 0
    }
  }
}

In most cases, we use a SoftmaxWithLoss layer to compute the cross entropy loss between score and ground truth label. For knowledge distillation, you can keep it and use a smaller loss_weight.

layer {
  name: "loss"
  type: "SoftmaxWithLoss"
  include {
    phase: TRAIN
  }
  bottom: "score"
  bottom: "label"
  top: "loss"
  loss_weight: 1
}

Similarly, we feed the data into the teacher network. Remember to freeze its weights.

layer {
  name: "conv1_teacher"
  type: "Convolution"
  bottom: "data"
  top: "conv1_teacher"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  convolution_param {
    num_output: 32
    pad: 1
    kernel_size: 3
    stride: 1
  }
}
...

The teacher network also produces a score for classification. Here, we name the blob score_teacher. It corresponds to the term "soft label" or "soft target" in the reference paper.

layer {
  name: "score_teacher"
  type: "InnerProduct"
  bottom: "pool_global_teacher"
  top: "score_teacher"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  param {
    lr_mult: 0
    decay_mult: 0
  }
  inner_product_param {
    num_output: 10
  }
}

Finally, a KnowledgeDistillation layer computes the KL loss between score and score_teacher.

layer {
  name: "KD"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher"
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4}
  loss_weight: 1
}
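For reference, in the standard formulation of the reference paper (a sketch; whether this layer additionally scales the loss by T² is something to verify in its source), the softened probabilities and the loss are:

$$q_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}, \qquad p_i = \frac{\exp(v_i / T)}{\sum_j \exp(v_j / T)}, \qquad \mathcal{L}_{KD} = \sum_i p_i \log \frac{p_i}{q_i},$$

where $z$ is the student's score, $v$ is the teacher's score_teacher, and $T$ is the temperature.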

@wentianli
Owner

Here is another way to implement it. Since the teacher network is fixed, we can compute and save the aforementioned score_teacher (in HDF5 format) beforehand. When we train the student network, we simply load score_teacher with an HDF5Data layer, and the teacher network is no longer included in the prototxt. This is more efficient. However, the result is slightly different if data augmentation is used, because the pre-computed scores no longer correspond exactly to the augmented inputs.
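A minimal sketch of that input layer (the file names are placeholders; the dataset name inside the .h5 files must match the top blob name, and the batch size must match the Data layer's):

layer {
  name: "teacher_scores"
  type: "HDF5Data"
  top: "score_teacher"
  include { phase: TRAIN }
  hdf5_data_param {
    source: "teacher_scores_list.txt"  # text file listing the .h5 files
    batch_size: 128
  }
}

Note that this only lines up if the LMDB and the HDF5 file store the examples in the same order and neither input is shuffled.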

@Coderx7

Coderx7 commented Nov 23, 2017

Thank you very much. I had the impression that we first train the teacher network on a dataset and compute all of its logits, then train the student model on the same dataset and, for distillation, match the student's logits against those pre-computed teacher logits.
Looking at your example and explanation, it seems the teacher and student networks are run in parallel and pre-training them is not necessary, right?

Edit:
I see your second comment, which clears everything up now. Thank you very much :)

@zhanglaplace

zhanglaplace commented Dec 28, 2017

@wentianli Thanks for your training prototxt. I have also been thinking about this training problem: if the teacher network is run during training, its complexity forces a small batch size. If we instead save the teacher's outputs first, the student network can be trained with a larger batch size, although this requires changing the input layers. Thanks for sharing; I will give it a try.

@wentianli
Owner

@zhanglaplace You can use iter_size in solver.prototxt to get a large effective batch size with limited GPU memory. Besides, training the teacher and student networks simultaneously is usually called mutual learning, which is very tricky.
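For example, a fragment of solver.prototxt (the other values are placeholders); the effective batch size is batch_size × iter_size:

net: "train_val.prototxt"
base_lr: 0.1
iter_size: 4   # accumulate gradients over 4 forward/backward passes,
               # e.g. batch_size 32 then behaves like a batch size of 128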

@zhanglaplace

@wentianli thanks

@pasxalinamed

@wentianli you mention earlier that in the teacher network we should

change Dropout layer to Scale layer, etc

Why should this be changed?

@dawuchen

dawuchen commented Jan 5, 2018

@wentianli Hi, I saw that you use two loss layers, "SoftmaxWithLoss" and "KnowledgeDistillation", which both take score as a bottom. But in Caffe's InnerProduct layer, the backward pass only uses the diff from a single top blob, unlike the Convolution layer, which accumulates diffs. So wouldn't the network effectively be trained with only one of the losses? Could you share the training results using the prototxt above; does the network work well?

@wentianli
Owner

@dawuchen Caffe automatically splits a blob when it is used twice. The diffs are thus accumulated.
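Concretely, when a blob is consumed by two layers, Caffe's net initialization inserts a Split layer whose backward pass sums the diffs coming from its top blobs; conceptually it looks like this (the auto-generated names shown are only illustrative):

layer {
  name: "score_score_0_split"   # inserted automatically, never written by hand
  type: "Split"
  bottom: "score"
  top: "score_score_0_split_0"  # consumed by SoftmaxWithLoss
  top: "score_score_0_split_1"  # consumed by KnowledgeDistillation
}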

@dawuchen

dawuchen commented Jan 5, 2018

@wentianli You are right. I made a mistake about the accumulation operation in the conv layer; it is for different kernels. Thanks.

@qinxianyuzi

My prototxt is like this:
input: "data"
layer{
name: "data"
type: "ImageData"
top: "data"
top: "label"
include {
phase: TRAIN
}
transform_param {
mirror: true
mean_value: 127.5
mean_value: 127.5
mean_value: 127.5
scale: 0.0078125
}
image_data_param {
source: "cifarlist.txt"
batch_size: 32
new_width: 112
new_height: 112
is_color: true
shuffle: true
}
}
layer {
bottom: "data"
top: "conv1"
name: "conv1"
type: "Convolution"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 32
kernel_size: 3
pad: 1
stride: 1
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
.
.
.
layer {
bottom: "pool_avg"
top: "classifier"
name: "classifier"
type: "InnerProduct"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
inner_product_param {
num_output: 10
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "softmax_loss1"
type: "SoftmaxWithLoss"
bottom: "classifier"
bottom: "label"
top: "softmax_loss1"
}
layer {
name: "accuracy"
type: "Accuracy"
bottom: "classifier"
bottom: "label"
top: "accuracy"
include: { phase: TRAIN }
}

layer {
bottom: "data"
top: "conv1s"
name: "conv1s"
type: "Convolution"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
convolution_param {
num_output: 16
kernel_size: 3
pad: 1
stride: 1
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
.
.
.
layer {
bottom: "pool_avgs"
top: "classifiers"
name: "classifiers"
type: "InnerProduct"
param {
lr_mult: 1
decay_mult: 1
}
param {
lr_mult: 2
decay_mult: 0
}
inner_product_param {
num_output: 10
weight_filler {
type: "msra"
}
bias_filler {
type: "constant"
value: 0
}
}
}
layer {
name: "softmax_loss2"
type: "SoftmaxWithLoss"
bottom: "classifiers"
bottom: "label"
top: "softmax_loss2"
loss_weight: 0.2
}
layer {
name: "accuracys"
type: "Accuracy"
bottom: "classifiers"
bottom: "label"
top: "accuracys"
include: { phase: TRAIN }
}

layer {
name: "KL_loss"
type: "KnowledgeDistillation"
bottom: "classifiers" #student
bottom: "classifier" #teacher
top: "KL_loss"
include { phase: TRAIN }
knowledge_distillation_param {
temperature: 4
}
loss_weight: 1
}
When I train it, the log shows the warning "KnowledgeDistillation Layer cannot backpropagate to soft label nor label inputs".

@wentianli
Owner

@qinxianyuzi The warning occurs because the second bottom doesn't receive any gradients.
You can use propagate_down to stop backprop explicitly:
layer {
  name: "KL_loss"
  type: "KnowledgeDistillation"
  bottom: "classifiers" #student
  bottom: "classifier" #teacher
  propagate_down: 1
  propagate_down: 0
  top: "KL_loss"
  include { phase: TRAIN }
  knowledge_distillation_param {
    temperature: 4
  }
  loss_weight: 1
}

or freeze the teacher network as said in #2

@qinxianyuzi

@wentianli Thanks very much. Is it harder for the student to learn with a higher temperature?

@wentianli
Owner

@qinxianyuzi the optimal temperature is often between 2 and 10

@qinxianyuzi

@wentianli Thank you! Sometimes we should try different temperatures according to the training task.

@liangzimei

Hello @wentianli, when I have two teacher models (i.e., an ensemble), how should I arrange the logits of each teacher? If I choose an averaging strategy to combine the two teachers, can I simply take the mean of their logits when training the student model?

@wentianli
Owner

wentianli commented Jun 7, 2018

@liangzimei Averaging logits is incorrect. The KL loss sums -p_i * log(q_i) (plus a constant) over every class i, where p_i is the probability the teacher assigns to class i. With two teachers this term becomes -0.5 * p1_i * log(q_i) - 0.5 * p2_i * log(q_i) (plus a constant), which means you need two KnowledgeDistillation layers.
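A sketch of what this could look like in the prototxt (the blob names score_teacher1 and score_teacher2 are assumptions); averaging the two teachers' probabilities is equivalent, up to a constant, to two KnowledgeDistillation layers whose loss weights sum to the single-teacher weight:

layer {
  name: "KD_teacher1"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher1"
  top: "KL_loss1"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 0.5
}
layer {
  name: "KD_teacher2"
  type: "KnowledgeDistillation"
  bottom: "score"
  bottom: "score_teacher2"
  top: "KL_loss2"
  include { phase: TRAIN }
  knowledge_distillation_param { temperature: 4 }
  loss_weight: 0.5
}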

@liangzimei

liangzimei commented Jun 12, 2018

@wentianli Thank you so much, I will give it a try. Did you implement a softmax loss layer with temperature to use when training a teacher, or do you just use a Power layer? Thanks in advance.

@wentianli
Owner

@liangzimei I didn't use temperature when training a teacher.
A Scale layer with fixed weights could handle that:
layer {
  name: "XXX"
  type: "Scale"
  bottom: "XXX"
  top: "XXX"
  param {
    lr_mult: 0
    decay_mult: 0
  }
  scale_param {
    filler { value: 0.5 } # here temperature = 2
    bias_term: false
  }
}

@liangzimei

@wentianli OK, so you mean that when training a teacher, temperature = 1 is used in most cases (including Hinton's paper)?

@liuqunzhong

Can this be used for a regression model?
For example, face alignment.

@wentianli
Owner

@liuqunzhong L1 or L2 loss is used for regression; the KnowledgeDistillation layer is implemented for classification.
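If you do want to distill a regression model, a common workaround (a sketch, not something this repository provides; the blob names are invented) is to add an L2 loss between the student's prediction and the frozen teacher's prediction, alongside the usual ground-truth loss:

layer {
  name: "distill_l2"
  type: "EuclideanLoss"
  bottom: "landmarks_student"   # student prediction
  bottom: "landmarks_teacher"   # frozen teacher prediction
  top: "distill_l2_loss"
  include { phase: TRAIN }
  loss_weight: 1
}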

@liangzimei

liangzimei commented Jul 10, 2018

Hello @wentianli, when I train a student model (MobileNet-v1) taught by an ensemble model (two models, one of which is also MobileNet-v1), the student's accuracy always ends up between the two teachers. Any suggestions? Thanks in advance...

@wentianli
Owner

@liangzimei You mean the student model outperforms its stand-alone counterpart but underperforms the teacher? That is to be expected. To obtain better accuracy, you probably need to replace the MobileNet-v1 in the ensemble with a better model.

@WormCoder

Thanks for sharing.
Why is KL divergence adopted as the loss instead of cross-entropy?

@wentianli
Owner

@WormCoder The only difference between KL divergence and cross-entropy is a constant term, which doesn't affect backprop at all. When the student and the teacher have exactly the same outputs (which is our goal for training), the KL divergence becomes zero.
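In symbols, with p the teacher's softened probabilities and q the student's:

$$\mathrm{KL}(p \,\|\, q) = \underbrace{-\sum_i p_i \log q_i}_{\text{cross-entropy}} \;-\; \underbrace{\Big(-\sum_i p_i \log p_i\Big)}_{\text{entropy of } p,\ \text{constant w.r.t. the student}}$$

The gradients with respect to the student are therefore identical; the KL divergence is simply the more convenient quantity to monitor because it reaches zero when q = p.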

@liuqunzhong

The student's input is 24x24 and the teacher's is 48x48, using two datasets with the same data order. Is that OK?

@liangzimei

@WormCoder yeah, it drops fast in the beginning.

@westnight

@liangzimei Hello, I want to know about freezing the teacher model: if I prefer to set propagate_down: false, do I also need to set the weight decay to 0 in solver.prototxt? I mean, is the single propagate_down parameter enough to freeze a model?

@liangzimei

liangzimei commented Aug 11, 2018

@westnight When training the student model, we should freeze the teacher. We can set lr_mult: 0 and decay_mult: 0 in the teacher's conv layers to avoid updating its parameters. BN layers may be different; you can refer to the previous replies.

@westnight

@liangzimei Thank you. I know that setting the learning rate and weight decay to zero freezes the teacher model, but I wonder whether using propagate_down is another way to do it.

@liangzimei

@westnight According to my understanding, it is. When I train the student model, I use both propagate_down: false and lr_mult: 0 to be safe.

@iamweiweishi

Thank you for your great work.
To compose the prototxt for training, should I write the teacher prototxt and the student prototxt into a single file? If so, how do I initialize the 'teacher' with a pretrained caffemodel? And after training, how do I save only the 'student' part of the model?
Could you please send me a copy of your prototxt? Thank you so much.

@wentianli
Owner

@iamweiweishi Into a single file? Yes. Because the 'student' is randomly initialized, it is best to give the 'teacher' layers the same names as in the pretrained caffemodel, which lets Caffe load the pretrained weights directly (weights are matched by layer name). To save a partial model, a convenient way is to save the model in HDF5 format and then rename or delete some of the blobs. For an example prototxt, please see the comment above.
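A sketch of how the pieces could fit together (the file names are placeholders): the pretrained teacher is passed as the initial weights when launching training, e.g. via the caffe tool's --weights option, and since weights are copied only into layers whose names match, the student's layers remain randomly initialized. Snapshots can be written in HDF5 so the teacher's parameters can later be stripped with any HDF5 tool:

# solver.prototxt (fragment)
net: "kd_train_val.prototxt"       # contains both the student and the frozen teacher
snapshot_prefix: "snapshots/kd"
snapshot_format: HDF5              # snapshots are written as .caffemodel.h5 files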

@iamweiweishi

Thank you. @wentianli It works now.

@hito0512

@wentianli I'm sorry to bother you. I still don't understand how to initialize the 'teacher' with a pretrained caffemodel. I have written the teacher prototxt and the student prototxt into a single file, but I don't know how to use teacher.caffemodel when I start training the student model. You said "it is better to name the layers of 'teacher' identical with the pretrained caffemodel, which allows you to directly load the pretrained model." I have already named the teacher's layers identically to the pretrained caffemodel (teacher.caffemodel), but it doesn't load the pretrained model. Should I change the name of teacher.caffemodel, and where should I put it? Should teacher.caffemodel and the final student caffemodel go in the same place? Thank you so much.

@hito0512

@iamweiweishi I'm sorry to bother you as well. I have the same question as in my comment above: I've written the teacher and student prototxts into a single file and named the teacher's layers identically to teacher.caffemodel, but the pretrained model still isn't loaded when I start training the student. Did you have to rename teacher.caffemodel or put it in a particular place? Thank you so much.

@Wenzhiqiang16

@hito0512 Sorry to bother you. Did you solve this problem in the end? I have the same trouble.

@wuzuowuyou

L1 or L2 loss is used for regression
What about distilling with an L1 or L2 loss?


@Aviator99999

@hito0512 (regarding your question above about loading teacher.caffemodel) Sorry for disturbing you now; I need to train in the same fashion. Could you please help me with the procedure if you found it?
Thank you

