"Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint" error with multi-clone quantization-aware training

------------------------

### System information
- **What is the top-level directory of the model you are using**: models/research/object_detection
- **Have I written custom code (as opposed to using a stock example script provided in TensorFlow)**: No
- **OS Platform and Distribution (e.g., Linux Ubuntu 16.04)**: Linux Ubuntu 16.04
- **TensorFlow installed from (source or binary)**: Source
- **TensorFlow version (use command below)**: r1.15
- **Bazel version (if compiling from source)**: 
- **CUDA/cuDNN version**: 10.0
- **GPU model and memory**: K80/12GB
- **Exact command to reproduce**:

### Describe the problem
I am using the [ssd_mobilenet_v2_quantized_300x300_coco.config](https://github.com/tensorflow/models/blob/master/research/object_detection/samples/configs/ssd_mobilenet_v2_quantized_300x300_coco.config) file from the Object Detection API for quantization-aware training, using the `train.py` script from `legacy`.


When I set `num_clones` to the number of GPUs and run quantization-aware training, training goes fine, but [export_tflite_ssd_graph.py](https://github.com/tensorflow/models/blob/master/research/object_detection/export_tflite_ssd_graph.py) fails with the following error:

```
I1104 20:35:37.946473 139919232743168 saver.py:1284] Restoring parameters from /home/ubuntu/dvs_human_detection/q_train_mc/model.ckpt-1000
2019-11-04 20:35:38.959815: W tensorflow/core/framework/op_kernel.cc:1651] OP_REQUIRES failed at save_restore_v2_ops.cc:184 : Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1350, in _run_fn
    target_list, run_metadata)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1443, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
	 [[{{node save/RestoreV2}}]]
  (1) Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
	 [[{{node save/RestoreV2}}]]
	 [[save/RestoreV2/_711]]
0 successful operations.
0 derived errors ignored.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 1290, in restore
    {self.saver_def.filename_tensor_name: save_path})
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.NotFoundError: 2 root error(s) found.
  (0) Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
	 [[node save/RestoreV2 (defined at /.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
	 [[node save/RestoreV2 (defined at /.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[save/RestoreV2/_711]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
  File "/models/research/object_detection/export_tflite_ssd_graph.py", line 143, in <module>
    tf.app.run(main)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/.local/lib/python3.5/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/.local/lib/python3.5/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/models/research/object_detection/export_tflite_ssd_graph.py", line 139, in main
    FLAGS.max_classes_per_detection, use_regular_nms=FLAGS.use_regular_nms)
  File "/models/research/object_detection/export_tflite_ssd_graph_lib.py", line 287, in export_tflite_graph
    moving_average_checkpoint.name)
  File "/models/research/object_detection/exporter.py", line 111, in replace_variable_values_with_moving_averages
    read_saver = tf.train.Saver(ema_variables_to_restore)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()


During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 1300, in restore
    names_to_keys = object_graph_key_mapping(save_path)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 1618, in object_graph_key_mapping
    object_graph_string = reader.get_tensor(trackable.OBJECT_GRAPH_PROTO_KEY)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py", line 915, in get_tensor
    return CheckpointReader_GetTensor(self, compat.as_bytes(tensor_str))
tensorflow.python.framework.errors_impl.NotFoundError: Key _CHECKPOINTABLE_OBJECT_GRAPH not found in checkpoint

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/ubuntu/models/research/object_detection/export_tflite_ssd_graph.py", line 143, in <module>
    tf.app.run(main)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/home/ubuntu/models/research/object_detection/export_tflite_ssd_graph.py", line 139, in main
    FLAGS.max_classes_per_detection, use_regular_nms=FLAGS.use_regular_nms)
  File "/home/ubuntu/models/research/object_detection/export_tflite_ssd_graph_lib.py", line 287, in export_tflite_graph
    moving_average_checkpoint.name)
  File "/home/ubuntu/models/research/object_detection/exporter.py", line 112, in replace_variable_values_with_moving_averages
    read_saver.restore(sess, current_checkpoint_file)
  File "/home/ubuntu/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 1306, in restore
    err, "a Variable name or other graph key that is missing")
tensorflow.python.framework.errors_impl.NotFoundError: Restoring from checkpoint failed. This is most likely due to a Variable name or other graph key that is missing from the checkpoint. Please ensure that you have not altered the graph expected based on the checkpoint. Original error:

2 root error(s) found.
  (0) Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
	 [[node save/RestoreV2 (defined at /.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
  (1) Not found: Key BoxPredictor_0/BoxEncodingPredictor/act_quant/max not found in checkpoint
	 [[node save/RestoreV2 (defined at /.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
	 [[save/RestoreV2/_711]]
0 successful operations.
0 derived errors ignored.

Original stack trace for 'save/RestoreV2':
  File "/models/research/object_detection/export_tflite_ssd_graph.py", line 143, in <module>
    tf.app.run(main)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/.local/lib/python3.5/site-packages/absl/app.py", line 300, in run
    _run_main(main, args)
  File "/.local/lib/python3.5/site-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/models/research/object_detection/export_tflite_ssd_graph.py", line 139, in main
    FLAGS.max_classes_per_detection, use_regular_nms=FLAGS.use_regular_nms)
  File "/models/research/object_detection/export_tflite_ssd_graph_lib.py", line 287, in export_tflite_graph
    moving_average_checkpoint.name)
  File "/models/research/object_detection/exporter.py", line 111, in replace_variable_values_with_moving_averages
    read_saver = tf.train.Saver(ema_variables_to_restore)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 828, in __init__
    self.build()
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 840, in build
    self._build(self._filename, build_save=True, build_restore=True)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 878, in _build
    build_restore=build_restore)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 508, in _build_internal
    restore_sequentially, reshape)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 328, in _AddRestoreOps
    restore_sequentially)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/training/saver.py", line 575, in bulk_restore
    return io_ops.restore_v2(filename_tensor, names, slices, dtypes)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/ops/gen_io_ops.py", line 1696, in restore_v2
    name=name)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/op_def_library.py", line 794, in _apply_op_helper
    op_def=op_def)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 3357, in create_op
    attrs, op_def, compute_device)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 3426, in _create_op_internal
    op_def=op_def)
  File "/.local/lib/python3.5/site-packages/tensorflow_core/python/framework/ops.py", line 1748, in __init__
    self._traceback = tf_stack.extract_stack()

```

But when I set `num_clones=1`, both training and export work.
However, single-GPU training is very slow.
How can I perform quantization-aware training on multiple GPUs?
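
One way to narrow this down is to check under which names the activation-quantization variables were actually saved in the multi-clone checkpoint. The following is a minimal diagnostic sketch, not a fix: it assumes TensorFlow 1.15, where `tf.train.list_variables` returns `(name, shape)` pairs, and the helper names `quant_keys` and `list_checkpoint_variables` are hypothetical.

```python
def quant_keys(variables, marker="act_quant"):
    """Return the sorted variable names that contain the quantization marker.

    `variables` is an iterable of (name, shape) pairs, as returned by
    tf.train.list_variables.
    """
    return sorted(name for name, _shape in variables if marker in name)


def list_checkpoint_variables(checkpoint_path):
    """Read (name, shape) pairs from a checkpoint; requires TensorFlow."""
    import tensorflow as tf  # imported lazily so quant_keys works without TF
    return tf.train.list_variables(checkpoint_path)


# Example usage (the path is a placeholder for your own checkpoint prefix):
#   for key in quant_keys(list_checkpoint_variables("model.ckpt-1000")):
#       print(key)
```

If the listing shows no `act_quant` entries, or shows them only under a different scope prefix, that would suggest the exporter is asking for names the trainer never wrote under that key.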

The config file I used:
```
model {
  ssd {
    num_classes: 1
    image_resizer {
      fixed_shape_resizer {
        height: 300
        width: 300
      }
    }
    feature_extractor {
      type: "ssd_mobilenet_v2"
      depth_multiplier: 0.6
      min_depth: 16
      conv_hyperparams {
        regularizer {
          l2_regularizer {
            weight: 3.99999989895e-05
          }
        }
        initializer {
          truncated_normal_initializer {
            mean: 0.0
            stddev: 0.0299999993294
          }
        }
        activation: RELU_6
        batch_norm {
          decay: 0.999700009823
          center: true
          scale: true
          epsilon: 0.0010000000475
          train: true
        }
      }
    }
    box_coder {
      faster_rcnn_box_coder {
        y_scale: 10.0
        x_scale: 10.0
        height_scale: 5.0
        width_scale: 5.0
      }
    }
    matcher {
      argmax_matcher {
        matched_threshold: 0.5
        unmatched_threshold: 0.5
        ignore_thresholds: false
        negatives_lower_than_unmatched: true
        force_match_for_each_row: true
      }
    }
    similarity_calculator {
      iou_similarity {
      }
    }
    box_predictor {
      convolutional_box_predictor {
        conv_hyperparams {
          regularizer {
            l2_regularizer {
              weight: 3.99999989895e-05
            }
          }
          initializer {
            truncated_normal_initializer {
              mean: 0.0
              stddev: 0.0299999993294
            }
          }
          activation: RELU_6
          batch_norm {
            decay: 0.999700009823
            center: true
            scale: true
            epsilon: 0.0010000000475
            train: true
          }
        }
        min_depth: 0
        max_depth: 0
        num_layers_before_predictor: 0
        use_dropout: false
        dropout_keep_probability: 0.800000011921
        kernel_size: 1
        box_code_size: 4
        apply_sigmoid_to_scores: false
      }
    }
    anchor_generator {
      ssd_anchor_generator {
        num_layers: 6
        min_scale: 0.20000000298
        max_scale: 0.949999988079
        aspect_ratios: 1.0
        aspect_ratios: 2.0
        aspect_ratios: 0.5
        aspect_ratios: 3.0
        aspect_ratios: 0.333299994469
      }
    }
    post_processing {
      batch_non_max_suppression {
        score_threshold: 9.99999993923e-09
        iou_threshold: 0.600000023842
        max_detections_per_class: 100
        max_total_detections: 100
      }
      score_converter: SIGMOID
    }
    normalize_loss_by_num_matches: true
    loss {
      localization_loss {
        weighted_smooth_l1 {
        }
      }
      classification_loss {
        weighted_sigmoid {
        }
      }
      hard_example_miner {
        num_hard_examples: 3000
        iou_threshold: 0.990000009537
        loss_type: CLASSIFICATION
        max_negatives_per_positive: 3
        min_negatives_per_image: 3
      }
      classification_weight: 1.0
      localization_weight: 1.0
    }
  }
}
train_config {
  batch_size: 24
  data_augmentation_options {
    random_horizontal_flip {
    }
  }
  data_augmentation_options {
    ssd_random_crop {
    }
  }
  optimizer {
    rms_prop_optimizer {
      learning_rate {
        exponential_decay_learning_rate {
          initial_learning_rate: 0.00400000018999
          decay_steps: 800720
          decay_factor: 0.949999988079
        }
      }
      momentum_optimizer_value: 0.899999976158
      decay: 0.899999976158
      epsilon: 1.0
    }
  }
  #fine_tune_checkpoint: "PATH_TO_BE_CONFIGURED/model.ckpt"
  #from_detection_checkpoint: true
  num_steps: 1000
}
train_input_reader {
  label_map_path: "./label_map.pbtxt"
  tf_record_input_reader {
    input_path: "./train_pos_neg_v2/train_dataset.record-00000-of-00100"
  }
}
eval_config {
  num_examples: 6200
  metrics_set: "coco_detection_metrics"
  use_moving_averages: true
  include_metrics_per_category: true
}
eval_input_reader {
  label_map_path: "./label_map.pbtxt"
  shuffle: false
  num_readers: 1
  tf_record_input_reader {
    input_path: "./test_pos_neg_v2/test_dataset.record-00000-of-00010"
  }
}

graph_rewriter {
  quantization {
    delay: 500
    weight_bits: 8
    activation_bits: 8
  }
}
```
NOTE: When I download a pretrained model from the model zoo, conversion works.
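
Since export works for some checkpoints (the pretrained one from the zoo, or one trained with `num_clones=1`), comparing the variable names of a working checkpoint against the failing multi-clone one may show which keys are missing or renamed. This is a rough sketch under the same assumption of TensorFlow 1.15 and `tf.train.list_variables`; `diff_keys` and `checkpoint_names` are hypothetical helpers, and both paths in the usage comment are placeholders.

```python
def diff_keys(working, failing):
    """Compare two collections of variable names.

    Returns (missing, extra): names present only in `working`, and names
    present only in `failing`, each as a sorted list.
    """
    working, failing = set(working), set(failing)
    return sorted(working - failing), sorted(failing - working)


def checkpoint_names(checkpoint_path):
    """Return the variable names stored in a checkpoint; requires TensorFlow."""
    import tensorflow as tf  # imported lazily so diff_keys works without TF
    return [name for name, _shape in tf.train.list_variables(checkpoint_path)]


# Example usage (placeholder paths):
#   missing, extra = diff_keys(checkpoint_names("single_clone/model.ckpt-1000"),
#                              checkpoint_names("multi_clone/model.ckpt-1000"))
#   print("only in working checkpoint:", missing)
#   print("only in failing checkpoint:", extra)
```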

