-
Notifications
You must be signed in to change notification settings - Fork 328
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Canary test] GridMask: Vectorized function call on the batch dimension #146
Conversation
Thanks for bring up this issue.
@fchollet and @LukeWood for more inputs as well. @joker-eph, do u have any suggestion about tf.vectorized_map that reads a RNG within the fn? |
Probably this could be true when this is done on CPU with enough computational margin and parallel overlapping with the GPU forward/backward kernels on the GPU stream. But we need to be performant enough cause if the GPU will start to be hungry of data sometime is faster to move some preprocessing directly on the GPU device. E.g. see the rationale of Nvidia DALI project. So we need to be reasonable with performance and at least we need to rely with Other then having a green light for More in general see our endless thread on the forum /cc @jpienaar
In the documentation we currently have:
|
Just an extra note about 3) see my 2018 thread at NVIDIA/DALI#247 (comment) |
Re 3: I share the similar concern of performance and code readability for augment single image and a batch of images when I worked on keras-team/keras@9628af8. We could add |
What do you mean here? Vectorized with |
I mean if the native vectorized op support randomness within the batch, then we should use them. Otherwise, |
But this point is exactly the beginning of this thread. Can you just approve the CI run so you can see the failing output? If you need to vectorize but we need to have a mandatory "within the batch" randomization we could currently have a systematic limit with At the same time we have also some generic compile limits (jit_compile) for many transformations (like rotation) for the |
I don't the numerical result that the randomness within batch will have advantage of convergence, but from math point of view, the image random preprocess shouldn't have any dependency of the batch size. In the extreme case, if you have a very large batch size that is closer to your size of training data, then it means your model is overfitting to your specific augmentation within the batch. With distribution strategy and most of the data parallel solution, we are having larger batch size overall, and it could potential be a problem. |
How many real cases we have batch_size=whole dataset? About the distributed training it is a different topic as we can find a way with TF to be random x device/node batch. |
An alternative approach could be: https://arxiv.org/abs/1901.09335 Edit: 4.4
|
The batch_size == len(dataset) is an extreme case, but even without that assumption, with large enough batch size, the regularization and normalization we expected to get from the random augmentation is reduced if the same batch get the same random behavior. From the math point of view, you would like your model to learn from different look of inputs, so that it can generalize, and not overfitting to specific input it has already seen. Having the same randomness within the batch just add some extra to this problem, which we don't know how large the impact would be. The question here to me here is that whether we want to trade off between computation performance against accuracy. From the paper you provided, the need of creating different augmentation within the batch is clear. |
If you want to have a trade off with large batches you could probably augment not per single sample as map_fn + random but few data samples/images with the same random sample and then repeat the same (or eventually) another sub-batch with a new random value augmentation. So you could vectorize x sub-batches to compose a single batch. |
I mentioned that for the distributed setting in 4.4 with a different not syncronized random augmentation seed for every node. |
Also we have just merged this commit in the Doc/website tensorflow/docs@f292e1c#diff-8e0fa0ce3afd5d39976905d770a157069fc2716bb47b449ec191027193656ac6R688 |
We are almost discuraging to use Keras in-model preprocessing with distributed strategies. |
I don't know if @w-xinyi has some specific feedback for this distributed and preprocessing claim that he committed on the doc/website. |
Btw, in this change, you are using Note that from the API doc:
Since only the RNG part will be in a tf.while_loop, I guess it will still achieve some performance gain since the main workload is in the image augmentation ops. |
Yes I know but it was already failing also with the fallback, I switch it to true so you can see the CI output. Other then this, the doc claim is a little bit ambiguous, so who is going to analyze the HLO graph in details with or without the fallback if we don't have a non-fallback reference performance? This will not help up to select our batch assembly policy. Also if we are not going to consider the instance repetition (duplication) with augmentation we are going to increase also the IO pressure on the filesystem. |
Humm, it is unclear to me why the intermediate result in the while loop produces different shape of results and cause the pfor to fail. @wangpengmit and @mdanatg from tf core team. |
@qlzh727 Also if the fallback was working correctly I doubt that we could have too much speed improvement with this approach. If you check the CI output it is not only the
|
@qlzh727 Do you have any feedback from the XLA team members that you have mentioned here? |
Sorry for the late reply, We haven't heard any updates from core team yet for the pfor issue. I think we need to take a closer look for the implementation details about the grid_mask. It might miss some shape information somewhere, so that each step of the iteration might get different shape result based on the input value. |
Or these could be cause by our randomization policy. |
vectorized_map . Try using a constant maxval (e.g. 1.0 ) and then transforming the result of tf.random.uniform using gridblock .
|
@wangpengmit As the We have tried to open a ticket in the TF core repository when we found something similar like in tensorflow/tensorflow#54479 |
Unsupported ops will cause For u = tf.random.uniform(shape=shape, minval=0.0, maxval=1.0, dtype=tf.float32)
length = tf.cast(u * gridblock + 1, dtype=tf.int32) ? |
We could refactor something to workaround the
|
Yeah, this is a |
So I suppose that we will not have a sensible performance gain with all these fallback. I have few extra points:
|
I don't know too much about
Yes please. Either a bug per op or a bug for all is fine. Please provide a minimal reproduce with just the op of interest in it to show that
I don't know of such a list. It's a nice thing to have, ideally a matrix with ops and the supports they enjoy. Feel free to file a bug for this too.
Yes please. |
Do you know who is the API/Codeowner there? As It is really hard to interact with the right owner in TF What I think it is not helping in the current
|
The XLA ticket is at tensorflow/tensorflow#55194 Fallback loop "interplay/improved debugging/hints" at tensorflow/tensorflow#55192 |
|
@wangpengmit I've added this case in tensorflow/community#412. Do we need to wait for a new Codeowner to review tensorflow/tensorflow#55192 ? Instead, do we have a Codeowner for tensorflow/tensorflow#55194? |
I've reviewed it.
Added an assignee. |
@bhack closing this as we are using BaseImageAugmentationLayer now. If there is a reason to, please feel free to re-open this PR to discuss further. Thanks for the contribution and thanks for following up with the tensorflow team on the operations supported in vmap. |
CCing @ishark . |
* Add separable conv layer * Fix comments
As we have discussed in #143 (comment) this is just a canary (failing) test (check the CI):
As I've mentioned in the thread we really need to understand if we want to have randomness inside the batch or between the batches and what kind of impact we have between the computing overhead, contributing speed/code readability and network convergence.
Also I don't know if @joker-eph or @qlzh727 could expose us a little bit the pro and cons of
jit_compile
a function vs using thevectorized_map
or if they are orthogonal.With many CV transformations we cannot compile the function as the underline
tf.raw_ops.ImageProjectiveTransformV3
op isn't supported by XLA./cc @chjort