
Using reinforce_loss #251

Open
fabrahman opened this issue Nov 23, 2019 · 19 comments
Labels: question (Further information is requested)

fabrahman commented Nov 23, 2019

Hi,

I was trying to write a function for computing the REINFORCE loss (as below) when I realized you already have one here.
In this regard, how can I use the TransformerDecoder with the 'infer_sample' decoding strategy as the sample_fn? Your reinforce_loss says the sample_fn should return [ids, probabilities, sequence_length], but the TransformerDecoder returns logits instead of probabilities. Can you guide me on how to use TransformerDecoder with your reinforce_loss function? It should be a lot cleaner than my approach.

The way I was doing it was to call TransformerDecoder with the 'infer_sample' decoding strategy and then take the log_softmax, as follows. However, I am stuck on some steps, such as gathering the log-probabilities according to the sample_ids; I could do it with numpy but not with TensorFlow.

        sample_output, sample_len = decoder(
            decoding_strategy='infer_sample',
            embedding = _embedding_fn,
            context=context_ids,
            context_sequence_length=context_len,
            max_decoding_length=max_decoding_length,
            end_token=end_token)


        logprobs = tf.nn.log_softmax(sample_output.logits , axis=-1) # shape [bs,sl, vocab]
        ids = sample_output.sample_id #shape [bs, sl]
#        sampleLogprobs = np.take_along_axis(logprobs, ids, axis=-1).squeeze(-1) #shape [bs, sl]
        sampleLogprobs = tf.gather(logprobs, ids, axis=-1) # ----> This doesn't work similar to torch.gather or np.take_along_axis. Doesn't produce a tensor with shape [bs, sl]

        ids = tx.utils.varlength_roll(ids, -context_len)  #final sample ids rolled
        ids_len = sample_len - context_len
        ids = ids[:, :tf.reduce_max(ids_len)]

        sampleLogprobs = tx.utils.varlength_roll(sampleLogprobs, -context_len) # final sample log_probs rolled
        sampleLogprobs = sampleLogprobs[:, :tf.reduce_max(ids_len)]

        reward = reward_fn() # some reward function
        mask = np.ones_like(sampleLogprobs, dtype=float)
        padded_indices = np.arange(mask[1]) >= ids_len[:, None] # mask all probs which are pad indices
        mask[padded_indices] = 0
        loss_rl = - sampleLogprobs * reward * mask # shape [bs, sl]

        loss_rl = np.sum(loss_rl) / np.sum(mask) # loss per BPE token

Also, I am not sure whether this is the right approach to compute the loss, so if you could either guide me on how to use reinforce_loss with TransformerDecoder sampling, or help me with my own script, that would be highly appreciated.
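
For reference, one TensorFlow-only way to do the per-token gather step (a rough equivalent of np.take_along_axis / torch.gather) might look like the sketch below; it assumes logprobs has shape [bs, sl, vocab] and ids is an int32 tensor of shape [bs, sl], as in the snippet above:

    # Sketch: gather the log-probs of the sampled ids with tf.gather_nd,
    # which plays the role of np.take_along_axis / torch.gather here.
    batch_size_t = tf.shape(ids)[0]
    seq_len_t = tf.shape(ids)[1]
    batch_idx = tf.tile(tf.range(batch_size_t)[:, None], [1, seq_len_t])  # [bs, sl]
    time_idx = tf.tile(tf.range(seq_len_t)[None, :], [batch_size_t, 1])   # [bs, sl]
    gather_idx = tf.stack([batch_idx, time_idx, ids], axis=-1)            # [bs, sl, 3]
    sampleLogprobs = tf.gather_nd(logprobs, gather_idx)                   # [bs, sl]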

ZhitingHu (Member) commented:

To "produce a tensor with shape [bs, sl]" from logits and sample_id, you may use sequence_sparse_softmax_cross_entropy and set

average_across_batch=False, 
average_across_timesteps=False, 
sum_over_batch=False, 
sum_over_timesteps=False

--
Another way of doing RL is to use SeqPGAgent; see examples/seq2seq_rl.

Or refer to examples/seqgan to write your own.

ZhitingHu added the question (Further information is requested) label on Nov 23, 2019

fabrahman commented Nov 24, 2019

To "produce a tensor with shape [bs, sl]" from logits and sample_id, you may use sequence_sparse_softmax_cross_entropy and set

average_across_batch=False, 
average_across_timesteps=False, 
sum_over_batch=False, 
sum_over_timesteps=False

--
Another way of doing RL is to use SeqPGAgent, see examples/seq2seq_rl

Or refer to examples/seqgan to write by your own

Thank you @ZhitingHu.
In this regard, I just want to double check whether the following code is the right approach: first get the log-prob tensor ([bs, sl]), and then mask out the prefix and the padded indices (indices beyond sample_length) to get a batch_loss of shape (bs,):

        sample_output, sample_len = decoder(
            decoding_strategy='infer_sample',
            embedding = _embedding_fn,
            context=context_ids,
            context_sequence_length=context_len,
            max_decoding_length=max_decoding_length,
            end_token=end_token)

        ids = sample_output.sample_id
        logits = sample_output.logits
        max_full_len = tf.reduce_max(sample_len)
        sampleLogprobs = tx.losses.sequence_sparse_softmax_cross_entropy(
            labels=ids[:,1:],
            logits=logits,
            sequence_length=sample_len - 1,  ## question: I am assuming this should mask the right-paddings of the sample, right?
            average_across_timesteps=False,
            sum_over_timesteps=False,
            average_across_batch=False,
            sum_over_batch=False)

        mask = tf.sequence_mask(
            sample_len-1,
            dtype=tf.float32)
        mask_prefix = 1 - tf.sequence_mask(
            context_len-1,
            maxlen=max_full_len-1, #max_decoding_length-1,
            dtype=tf.float32)
        mask = mask * mask_prefix

        batch_loss = tx.utils.reduce_with_weights(
             tensor=sampleLogprobs,
             weights=mask,
             average_across_batch=False,
             average_across_remaining=True,
             sum_over_remaining=False)

So my questions are:
1- Is this the right way to mask both the prefix and the indices beyond sample_length?
2- I should pass sample_length to the 'sequence_length' argument of sequence_sparse_softmax_cross_entropy, right?
I would appreciate it if you could let me know whether there is any mistake in this code.
Thank you so much in advance.

ZhitingHu (Member) commented:

The code looks good. Reference code (basically the same as what you wrote): #147 (comment)

2- It's not really necessary, because you do the masking with reduce_with_weights anyway.


fabrahman commented Nov 26, 2019

@ZhitingHu Actually, I am getting an OOM error when I add this RL loss (computed the way I showed earlier) to the MLE loss.
The MLE loss works fine, and the parts of the code that generate text (both sampled and greedy, for self-critical RL) also work fine: the texts are generated, and I can pass them to my classifier and get the reward. However, when I fetch the loss optimization, it throws the following error, while this does not happen when I have multiple MLE losses, like here. It is strange, since for computing the RL loss I use the same sequence_sparse_softmax_cross_entropy call. Can you help me with that?
I attached part of my code and the error log below.
NOTE: I have a 1080 Ti GPU and tried both batch size 2 and 1.

# For RL fine-tuning
def _get_sample_text(context_ids, context_len):
   sample_output, sample_len = decoder(
       decoding_strategy='infer_sample',
       embedding = _embedding_fn,
       context=context_ids,
       context_sequence_length=context_len,
       max_decoding_length=max_decoding_length,
       end_token=end_token)

   return sample_output, sample_len

def _get_sample_rolled(output, length, context_len):

   ids = output.sample_id
   ids = tx.utils.varlength_roll(ids, -context_len)  # final sample ids rolled
   ids_len = length - context_len
   ids = ids[:, :tf.reduce_max(ids_len)]

   return ids, ids_len

def _get_greedy_text(context_ids, context_len):

    greedy_res, greedy_len = decoder(
        decoding_strategy='infer_greedy',
        embedding=_embedding_fn,
        context=context_ids,
        context_sequence_length=context_len,
        max_decoding_length=max_decoding_length,
        end_token=end_token)
    greedy_ids = tx.utils.varlength_roll(greedy_res.sample_id, -context_len)
    greedy_ids_len = greedy_len - context_len
    greedy_ids = greedy_ids[:, :tf.reduce_max(greedy_ids_len)]

    return greedy_ids, greedy_ids_len

def compute_batch_loss(output, sample_len, context_len):
   max_full_len = tf.reduce_max(sample_len)
   ids = output.sample_id[:, :max_full_len]
   logits = output.logits[:, :max_full_len] #(bs, sl, vocab)

   sampleLogprobs = tx.losses.sequence_sparse_softmax_cross_entropy(
       labels=ids[:,1:],
       logits=logits[:,:-1,:],
       sequence_length=sample_len - 1, 
       average_across_timesteps=False,
       sum_over_timesteps=False,
       average_across_batch=False,
       sum_over_batch=False)

   mask = tf.sequence_mask(
       sample_len-1,
       dtype=tf.float32)
   mask_prefix = 1 - tf.sequence_mask(
       context_len-1,
       maxlen=max_full_len-1, #max_decoding_length-1,
       dtype=tf.float32)
   mask = mask * mask_prefix

   batch_loss = tx.utils.reduce_with_weights(
        tensor=sampleLogprobs,
        weights=mask,
        average_across_batch=False,
        average_across_remaining=True,
        sum_over_remaining=False)

   return batch_loss

    ## Loss MLE
    x1_len = tf.placeholder(tf.int32, shape=[None], name='x1_len')
    x1x4_ids = tf.placeholder(tf.int32, shape=[None, None], name='x1x4_ids')
    x1x4_len = tf.placeholder(tf.int32, shape=[None], name='x1x4_len')

    loss_mle = _get_recon_loss(x1x4_ids, x1x4_len, x1_len) # similar to the repo I mentioned

    ## Loss RL
    x1_ids = tf.placeholder(tf.int32, shape=[None, None], name='x1_ids')
    reward = tf.placeholder_with_default(tf.ones([batch_size]), shape=(config_train.train_batch_size,), name="reward")

    symbols_output, symbols_len = _get_sample_text(x1_ids, x1_len) # this works fine and I can run
    symbols_rl, len_rl = _get_sample_rolled(symbols_output, symbols_len, x1_len) # this works fine
    symbols_gr, len_gr = _get_greedy_text(x1_ids, x1_len) # this works fine

    batch_loss_rl = compute_batch_loss(symbols_output, symbols_len, x1_len) # I think adding this to my loss make the problem, but not sure exactly
    rl_loss = tf.reduce_mean(batch_loss_rl * reward)

    loss = (1 - config_train.w_rl) * loss_mle + config_train.w_rl * rl_loss

error log:

    sys.exit(main(argv))
  File "roc_rl_main_refacored.py", line 1001, in main
    _train_epoch(sess, epoch==0)
  File "roc_rl_main_refacored.py", line 724, in _train_epoch
    rets = sess.run(fetches, feed_dict, options=run_opts)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node swap_in_transformer_decoder_1/layer_15/self_attention/multihead_attention/multihead_attention/value/Tensordot_1/MatMul_1}}]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  196.32MiB from transpose
  196.32MiB from OptimizeLoss/gradients/transformer_decoder_1/MatMul_grad/MatMul_1
  31.65MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_19/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul_1
  30.59MiB from swap_in_transformer_decoder_1/layer_23/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
  30.27MiB from swap_in_transformer_decoder_1/layer_17/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
  28.67MiB from swap_in_transformer_decoder_1/layer_19/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
  26.20MiB from swap_in_transformer_decoder_1/layer_16/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
  24.44MiB from swap_in_transformer_decoder_1/layer_16/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
  24.00MiB from swap_in_transformer_decoder_1/layer_14/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
  24.00MiB from swap_in_transformer_decoder_1/layer_14/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
  24.00MiB from swap_in_transformer_decoder_1/layer_15/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
  24.00MiB from swap_in_transformer_decoder_1/layer_15/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
  23.38MiB from swap_in_transformer_decoder_1/layer_13/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
  22.34MiB from swap_in_transformer_decoder_1/layer_13/past_poswise_ln/ffn/conv1/Tensordot_1/MatMul_1
  20.75MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_12/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatMul_1
  20.00MiB from swap_in_transformer_decoder_1/layer_20/past_poswise_ln/ffn/conv2/Tensordot_1/MatMul_1
  17.51MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_4/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatMul_1

ZhitingHu (Member) commented:

I can't see the cause from this alone. What's in fetches here?

  File "roc_rl_main_refacored.py", line 724, in _train_epoch
    rets = sess.run(fetches, feed_dict, options=run_opts)

If optimization (e.g., train_op) is included: would the OOM still happen if you exclude train_op from fetches? This is to see whether it's the loss computation that causes the OOM. Similarly, would you try omitting loss_mle altogether and see if it still OOMs?
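
For example (a sketch only; the names follow the snippets in this thread), a debugging run might fetch just the loss values:

    # Debugging sketch: fetch only the losses, without train_op, so that no
    # gradient tensors are allocated. If this already OOMs, the forward pass /
    # loss computation is the culprit; otherwise it is the backward pass.
    debug_fetches = {
        'step': global_step,
        'loss_mle': loss_mle,
        'rl_loss': rl_loss,
    }
    rets = sess.run(debug_fetches, feed_dict, options=run_opts)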

fabrahman (Author) commented:

Actually, the code works fine if I have only loss_mle; it even works when I have multiple MLE losses (which means sequence_sparse_softmax_cross_entropy is called several times, once for each loss_mle, similar to this code). However, once I add rl_loss to the train_op, it gives the OOM error. So the question is what I am doing wrong with rl_loss, since it is basically a call to the same sequence_sparse_softmax_cross_entropy function (more details on how I compute rl_loss are in my previous post).
Here are the fetches:

        loss = (1 - config_train.w_rl) * loss_mle + config_train.w_rl * rl_loss
        train_op = tf.contrib.layers.optimize_loss(
            loss=loss,
            global_step=global_step,
            learning_rate=None,
            optimizer=opt,
            variables=trainable_variables)

        while training:
                reward_fetches = {
                    'sample_rl': symbols_rl,
                    'sample_len': len_rl,
                    'greedy_sym': symbols_gr,
                    'greedy_len': len_gr
                }
                reward_rets = sess.run(reward_fetches, feed_dict={
                    x1_ids: rets_data['batch']['x1_ids'], x1_len: rets_data['batch']['x1_len']
                })

                # prepare sample for classification
                sample_rl = format_generated_samples_for_clf(proc, reward_rets['sample_rl'], reward_rets['sample_len'])
                sample_base = format_generated_samples_for_clf(proc, reward_rets['greedy_sym'], reward_rets['greedy_len'])

                # add reward calculation here
                reward_rl = get_reward(rets_data['batch']['x4_emo'], sample_rl)
                reward_base = get_reward(rets_data['batch']['x4_emo'], sample_base)

                # self-critical reward
                reward_sc = [rr - rb for rr, rb in zip(reward_rl, reward_base)]
                print(reward_rl, reward_base, reward_sc) # just to see if reward is being computed correctly. 

                # (2) Optimize loss
                feed_dict = {
                    x1_ids: rets_data['batch']['x1_ids'],
                    x1_len: rets_data['batch']['x1_len'],
                    x1x4_ids: rets_data['batch']['x1x4_ids'],
                    x1x4_len: rets_data['batch']['x1x4_len'],
                    tau: config_train.tau,
                    tx.global_mode(): tf.estimator.ModeKeys.TRAIN,
                    reward: reward_sc
                }

                fetches = {
                    'train_op': train_op,
                    'step': global_step,
                }
                fetches.update(loss_dict)

                rets = sess.run(fetches, feed_dict, options=run_opts)
                step = rets['step']

ZhitingHu (Member) commented:

Running train_op (in fetches) will consume GPU memory for gradient tensors. A quick test is to remove train_op from fetches and see if the OOM is gone. If so, the OOM probably occurs because rl_loss results in more gradient tensors when running train_op. I personally use tf.stop_gradient to locate the back-propagation path(s) that lead to these extra, OOM-causing gradient tensors.
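
As a concrete (hypothetical) example of the tf.stop_gradient approach, one could temporarily cut the RL branch out of the backward pass and check whether the OOM disappears; names follow the earlier snippets:

    # Debugging sketch: stop gradients through the RL branch so that train_op
    # only back-propagates through the MLE loss. If the OOM disappears, the
    # extra gradient tensors come from the rl_loss back-propagation path.
    loss = (1 - config_train.w_rl) * loss_mle \
           + config_train.w_rl * tf.stop_gradient(rl_loss)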

fabrahman (Author) commented:

@ZhitingHu Here is the error when I remove train_op from fetches.
But I am not quite sure why we want to do that: when loss = loss_mle and I pass this to train_op and then run fetches with this train_op, everything is okay. Besides, what would the program be optimizing without any train_op?
Error log when removing train_op from fetches:

Traceback (most recent call last):
  File "roc_rl_main_refacored.py", line 1012, in <module>
    tf.app.run()
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "roc_rl_main_refacored.py", line 1001, in main
    _train_epoch(sess, epoch==0)
  File "roc_rl_main_refacored.py", line 724, in _train_epoch
    rets = sess.run(fetches, feed_dict, options=run_opts)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: 2 root error(s) found.
  (0) Invalid argument: assertion failed: [] [Condition x == y did not hold element-wise:] [x (sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/Shape_1:0) = ] [2 200] [y (sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/strided_slice:0) = ] [2 199]
         [[node sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/assert_equal/Assert/Assert (defined at /home/hannah/Counterfactual-StoryRW/third_party/texar/texar/losses/mle_losses.py:196) ]]
         [[mul_9/_5791]]
  (1) Invalid argument: assertion failed: [] [Condition x == y did not hold element-wise:] [x (sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/Shape_1:0) = ] [2 200] [y (sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/strided_slice:0) = ] [2 199]
         [[node sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/assert_equal/Assert/Assert (defined at /home/hannah/Counterfactual-StoryRW/third_party/texar/texar/losses/mle_losses.py:196) ]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'sequence_sparse_softmax_cross_entropy_1/SparseSoftmaxCrossEntropyWithLogits/assert_equal/Assert':
  File "roc_rl_main_refacored.py", line 1012, in <module>
    tf.app.run()
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "roc_rl_main_refacored.py", line 416, in main
    batch_loss_rl = compute_batch_loss(symbols_output, symbols_len, x1_len)
  File "roc_rl_main_refacored.py", line 324, in compute_batch_loss
    sum_over_batch=False)
  File "/home/hannah/Counterfactual-StoryRW/third_party/texar/texar/losses/mle_losses.py", line 196, in sequence_sparse_softmax_cross_entropy
    labels=labels, logits=logits)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/ops/nn_ops.py", line 3355, in sparse_softmax_cross_entropy_with_logits
    array_ops.shape(logits)[:-1]))
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/ops/check_ops.py", line 557, in assert_equal
    return control_flow_ops.Assert(condition, data, summarize=summarize)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/util/tf_should_use.py", line 193, in wrapped
    return _add_should_use_warning(fn(*args, **kwargs))
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/ops/control_flow_ops.py", line 163, in Assert
    return gen_logging_ops._assert(condition, data, summarize, name="Assert")
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/ops/gen_logging_ops.py", line 74, in _assert
    name=name)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 3616, in create_op
    op_def=op_def)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/framework/ops.py", line 2005, in __init__
    self._traceback = tf_stack.extract_stack()

Also, I got the exact same error as above when I tried using rl_loss_fine = tf.stop_gradient(rl_loss_fine).

I am sorry for the inconvenience, but I have no idea what's happening or what I am doing wrong with rl_loss.


ZhitingHu commented Dec 6, 2019

Removing train_op or using tf.stop_gradient is for debugging purposes -- to locate which portion of the code causes the OOM. Once it's located and fixed, you do need to add train_op back for training.

Based on the error message after removing train_op, it looks like there is another bug related to sequence_sparse_softmax_cross_entropy in compute_batch_loss. It's necessary to fix that bug first.
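
For what it's worth, the assertion says the labels tensor has 200 time steps while the logits have 199, i.e., labels and logits are off by one step. A consistent slicing (a sketch only, using the names from compute_batch_loss above; this may or may not be the actual bug in your graph) would be:

    # Align labels and logits: the logits at step t predict the token at t+1.
    sampleLogprobs = tx.losses.sequence_sparse_softmax_cross_entropy(
        labels=ids[:, 1:],                # [bs, sl-1]
        logits=logits[:, :-1, :],         # [bs, sl-1, vocab]
        sequence_length=sample_len - 1,
        average_across_timesteps=False,
        sum_over_timesteps=False,
        average_across_batch=False,
        sum_over_batch=False)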


fabrahman commented Dec 7, 2019

> Removing train_op or using tf.stop_gradient is for debugging purposes -- to locate which portion of the code causes the OOM. Once it's located and fixed, you do need to add train_op back for training.
>
> Based on the error message after removing train_op, it looks like there is another bug related to sequence_sparse_softmax_cross_entropy in compute_batch_loss. It's necessary to fix that bug first.

@ZhitingHu I was able to fix that bug, and now removing train_op or using tf.stop_gradient works without error.
When I add train_op back, I get the following error. How do I figure out which part is causing the OOM?
What I am doing is: I trained a classifier beforehand and I am using it to compute rewards for my RL. The classifier is built in PyTorch, and during RL training I call that pretrained classifier. I have two GPUs and I let the model use both. At first I thought that sharing the GPUs between TensorFlow and PyTorch might cause the error, but I then forced the pretrained classifier to run on the CPU and I still get the following error:

2019-12-06 18:35:42.580893: W tensorflow/core/common_runtime/bfc_allocator.cc:319] ****************************************************************************************************
2019-12-06 18:35:42.580930: W tensorflow/core/framework/op_kernel.cc:1502] OP_REQUIRES failed at gpu_swapping_kernels.cc:72 : Resource exhausted: OOM when allocating tensor with shape[1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
Traceback (most recent call last):
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1356, in _do_call
    return fn(*args)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1341, in _run_fn
    options, feed_dict, fetch_list, target_list, run_metadata)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1429, in _call_tf_sessionrun
    run_metadata)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node swap_in_transformer_decoder_1/layer_15/self_attention/multihead_attention/multihead_attention/key/Tensordot_1/MatMul_1}}]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  196.32MiB from transpose
  196.32MiB from OptimizeLoss/gradients/transformer_decoder_1/MatMul_grad/MatMul_1
  21.88MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_19/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul_1
  16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_23/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatMul_1
  16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_23/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul_1
  16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_22/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatMul_1
  16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_22/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul_1
  16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_21/past_poswise_ln/ffn/conv2/Tensordot/MatMul_grad/MatMul_1
  16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_21/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul_1
  16.00MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_0/past_poswise_ln/ffn/conv1/Tensordot/MatMul_grad/MatMul_1
  7.75MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_21/self_attention/multihead_attention/multihead_attention/key/Tensordot/MatMul_grad/MatMul_1
  7.49MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_22/self_attention/multihead_attention/multihead_attention/query/Tensordot/MatMul_grad/MatMul_1
  7.49MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_22/self_attention/multihead_attention/multihead_attention/key/Tensordot/MatMul_grad/MatMul_1
  6.60MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_14/self_attention/multihead_attention/multihead_attention/key/Tensordot/MatMul_grad/MatMul_1
  6.48MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_18/self_attention/multihead_attention/multihead_attention/value/Tensordot/MatMul_grad/MatMul_1
  6.48MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_15/self_attention/multihead_attention/multihead_attention/output/Tensordot/MatMul_grad/MatMul_1
  6.40MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_19/self_attention/multihead_attention/multihead_attention/query/Tensordot/MatMul_grad/MatMul_1
  6.38MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_21/self_attention/multihead_attention/multihead_attention/output/Tensordot/MatMul_grad/MatMul_1
  6.36MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_7/self_attention/multihead_attention/multihead_attention/output/Tensordot/MatMul_grad/MatMul_1
  6.26MiB from OptimizeLoss/gradients/transformer_decoder_1/layer_20/self_attention/multihead_attention/multihead_attention/value/Tensordot/MatMul_g

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "roc_rl_main_refacored.py", line 1005, in <module>
    tf.app.run()
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/platform/app.py", line 40, in run
    _run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 299, in run
    _run_main(main, args)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/absl/app.py", line 250, in _run_main
    sys.exit(main(argv))
  File "roc_rl_main_refacored.py", line 994, in main
    _train_epoch(sess, epoch==0)
  File "roc_rl_main_refacored.py", line 715, in _train_epoch
    rets = sess.run(fetches, feed_dict, options=run_opts)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 950, in run
    run_metadata_ptr)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1173, in _run
    feed_dict_tensor, options, run_metadata)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1350, in _do_run
    run_metadata)
  File "/home/hannahbrahman/anaconda3/envs/py36/lib/python3.7/site-packages/tensorflow/python/client/session.py", line 1370, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: 2 root error(s) found.
  (0) Resource exhausted: OOM when allocating tensor with shape[1024,1024] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
         [[{{node swap_in_transformer_decoder_1/layer_15/self_attention/multihead_attention/multihead_attention/key/Tensordot_1/MatMul_1}}]]

Current usage from device: /job:localhost/replica:0/task:0/device:GPU:0, allocator: GPU_0_bfc
  196.32MiB from transpose
  196.32MiB from OptimizeLoss/gradients/transformer_decoder_1/MatMul_grad/MatMul_1

ZhitingHu (Member) commented:

Hmm... the OOM is caused by the optimization (backward pass). The gradients of rl_loss_fine and of loss_mle should each consume about the same amount of memory. To verify this -- since you've tried loss = loss_mle, passed it to train_op, and it worked -- does setting loss = rl_loss_fine also work (i.e., no OOM)?

You may use tf.device to partition the model across different GPUs, e.g., place the forward pass on one GPU and train_op (the backward pass) on the other.
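
A rough sketch of that kind of partitioning (assuming two visible GPUs and the variable names from the snippets above; exact placement also depends on how gradients are colocated in your graph):

    # Sketch: build the sampling / RL-loss ops on GPU 0 and the optimization
    # (gradient) ops on GPU 1.
    with tf.device('/device:GPU:0'):
        symbols_output, symbols_len = _get_sample_text(x1_ids, x1_len)
        batch_loss_rl = compute_batch_loss(symbols_output, symbols_len, x1_len)
        rl_loss = tf.reduce_mean(batch_loss_rl * reward)

    with tf.device('/device:GPU:1'):
        train_op = tf.contrib.layers.optimize_loss(
            loss=(1 - config_train.w_rl) * loss_mle + config_train.w_rl * rl_loss,
            global_step=global_step,
            learning_rate=None,
            optimizer=opt,
            variables=trainable_variables)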

Another effective way to reduce memory consumption is to use a smaller max_seq_length.


fabrahman commented Dec 7, 2019

@ZhitingHu I really appreciate your help.
Yes, that is a good test; I tried with just loss = rl_loss_fine and it threw the same error. Note that I used batch_size=1 for this test. On the other hand, the model with loss = loss_mle worked with batch_size=2 as well. Is it really an OOM error?

How do I make sure that, during infer_sample and infer_greedy for RL, the model reuses the parameters defined for the train_greedy decoder? Should I use tf.variable_scope(..., reuse=True) somehow? Do you think that might be the reason for the error?
I am trying sequence-length reduction and device partitioning as well, but I want to make sure it is really an OOM.

This is where I sample the two outputs for RL, this is where I compute rl_loss_fine, and this is where I compute the reward.


fabrahman commented Dec 8, 2019

@ZhitingHu I changed max_seq_len from 200 to 128 and still get the same error for rl_loss_fine.
Technically, since both loss_mle and rl_loss_fine use a cross-entropy loss with respect to the same parameters, they should consume the same amount of memory in the backward pass, but these tests show that this is not the case.
Also, when I have multiple calls to mle_loss (I mean a weighted sum of MLE losses), it still works.

fabrahman (Author) commented:

I just wanted to check whether a negative loss (which may happen when the reward of the greedy output (r_base) is greater than the reward of the sampled output (r_sample)) or a very small loss (most of the time the difference between these two rewards is very small, and multiplying it by the log-prob results in small values) could cause problems in the backward pass.

ZhitingHu (Member) commented:

> @ZhitingHu I really appreciate your help.
> Yes, that is a good test; I tried with just loss = rl_loss_fine and it threw the same error. Note that I used batch_size=1 for this test. On the other hand, the model with loss = loss_mle worked with batch_size=2 as well. Is it really an OOM error?
>
> How do I make sure that, during infer_sample and infer_greedy for RL, the model reuses the parameters defined for the train_greedy decoder? Should I use tf.variable_scope(..., reuse=True) somehow? Do you think that might be the reason for the error?
> I am trying sequence-length reduction and device partitioning as well, but I want to make sure it is really an OOM.
>
> This is where I sample the two outputs for RL, this is where I compute rl_loss_fine, and this is where I compute the reward.

Texar automatically reuses variables; there is no need to add things like tf.variable_scope(..., reuse=True).

FYI, here is example code that uses Texar for self-critic learning, where

  • L.356 is calculating (reward_sample - reward_greedy)
  • L.392 is calculating log p_theta(sample)

fabrahman (Author) commented:

> > @ZhitingHu I really appreciate your help.
> > Yes, that is a good test; I tried with just loss = rl_loss_fine and it threw the same error. Note that I used batch_size=1 for this test. On the other hand, the model with loss = loss_mle worked with batch_size=2 as well. Is it really an OOM error?
> > How do I make sure that, during infer_sample and infer_greedy for RL, the model reuses the parameters defined for the train_greedy decoder? Should I use tf.variable_scope(..., reuse=True) somehow? Do you think that might be the reason for the error?
> > I am trying sequence-length reduction and device partitioning as well, but I want to make sure it is really an OOM.
> > This is where I sample the two outputs for RL, this is where I compute rl_loss_fine, and this is where I compute the reward.
>
> Texar automatically reuses variables; there is no need to add things like tf.variable_scope(..., reuse=True).
>
> FYI, here is example code that uses Texar for self-critic learning, where
>
> * L.356 is calculating `(reward_sample - reward_greedy)`
> * L.392 is calculating `log p_theta(sample)`

Thank you so much @ZhitingHu. This was really helpful; I was able to figure out what I was doing wrong, and now the OOM error is gone.

ZhitingHu (Member) commented:

Glad to hear that! :) Could you briefly explain the cause of OOM, for future reference? Thanks

fabrahman (Author) commented:

> Glad to hear that! :) Could you briefly explain the cause of OOM, for future reference? Thanks

Sure, what I was doing wrong was:
I was taking the sample_id and logits of the decoder under the infer_sample decoding strategy and passing them to sequence_sparse_softmax_cross_entropy to compute log p.

However, I should have taken the fixed sample_id (EOS stripped and padded to the same size), fed it as input to the decoder under the train_greedy decoding strategy, and then used that output's logits to compute log p, the same way I compute mle_loss.
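
A rough sketch of that fix, with names borrowed from the snippets earlier in the thread (the teacher-forcing pass is represented by a hypothetical helper, _teacher_forcing_logits, which should mirror however _get_recon_loss feeds ground-truth ids through the decoder for the MLE loss):

    # (1) Sample once and treat the sampled ids as fixed data (no gradients
    #     flow through the sampling pass itself).
    sample_ids = tf.stop_gradient(symbols_output.sample_id)        # [bs, sl]
    sample_ids = sample_ids[:, :tf.reduce_max(symbols_len)]

    # (2) Hypothetical helper: run the decoder in 'train_greedy' (teacher-
    #     forcing) mode on sample_ids, exactly as in the MLE branch, and
    #     return logits of shape [bs, sl, vocab].
    tf_logits = _teacher_forcing_logits(sample_ids, symbols_len)

    # (3) Per-token log-probabilities of the sample under the current model.
    logp = -tx.losses.sequence_sparse_softmax_cross_entropy(
        labels=sample_ids[:, 1:],
        logits=tf_logits[:, :-1, :],
        sequence_length=symbols_len - 1,
        average_across_timesteps=False,
        sum_over_timesteps=False,
        average_across_batch=False,
        sum_over_batch=False)                                      # [bs, sl-1]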

fabrahman (Author) commented:

I know this is somewhat unrelated to the implementation details, but I wanted to ask: during training, when we periodically evaluate on a dev set, is it more common to compute the reward on the greedy output or on the most probable beam-search output (of a specific width)? Or are both approaches common?

Thanks
