Rewrite augmentation pipeline #81
Comments
note that this has been discussed: #60 |
https://albumentations.ai/docs/api_reference/augmentations/ seems best, especially because we are concerned with environmental imagery, and the functional augs include sun glint, snow, and fog https://albumentations.ai/docs/api_reference/augmentations/functional/ |
It's 2024 and this is still a Christmas wish. I think I could take this on this year and would base it around
|
Question: so I am guessing these augmentations get done at the time of training, and new images are not actually saved? |
Correct. Gym works by preparing your dataset for you and making batched tensors of augmented data. This is deliberately done so you always know what data is used for training and what for validation. Importantly, only the training data is augmented. I would recommend we eventually modify the make_dataset.py script to use an albumentations-based workflow. For now you could trial model training by augmenting the imagery first, but note that would be suboptimal in the long term because it needlessly duplicates image files. So let's put a basic workflow together and then ideally wrap that into the existing Gym workflow. |
Just so we are all on the same page - make_dataset actually creates the augmented images, which are saved as npz files. Then train_model uses those (augmented) npz images to train the model. So images are not augmented 'on the fly' like in many workflows (i.e., preprocessing layers in the model, data generators, etc.), but rather pre-augmented. I recall the biggest reason we did this was efficiency (GPU utilization is always near 100% for me, compared with many 'on the fly' augmentation strategies where GPU utilization is lower and CPU load is higher). @mlundine - I agree that albumentations is the correct way to go. |
Yes that's a good summary. Pre-augmentation (as opposed to on the fly) has reproducibility benefits too, in the sense that the augmented data are saved in the "gpu ready" npz format, and it would in theory be possible to assess the distributions of augmented data post hoc, which non-reproducible ad hoc augmentation does not allow. I think we're all interested in albumentations and I'm keen to get it at least as an option in the Gym workflow |
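To make the pre-augmentation idea concrete, here is a minimal sketch of writing seeded augmentations to npz files and inspecting them afterwards. It is not Gym's actual code: `augment`, `write_augmented_npz`, `mean_brightness`, the `image` key, and the file naming are all hypothetical, and the augmentation itself is a simple numpy stand-in.

```python
import os
import tempfile

import numpy as np


def augment(image, rng):
    """Stand-in augmentation: random horizontal flip plus brightness jitter."""
    if rng.random() < 0.5:
        image = image[:, ::-1, :]
    gain = rng.uniform(0.8, 1.2)
    return np.clip(image * gain, 0, 255).astype(np.uint8)


def write_augmented_npz(images, out_dir, seed=42):
    """Augment each image once and save it to its own compressed npz file."""
    rng = np.random.default_rng(seed)  # a fixed seed makes the run repeatable
    paths = []
    for i, image in enumerate(images):
        path = os.path.join(out_dir, f"aug_{i:04d}.npz")
        # The "image" key is a hypothetical choice for this sketch.
        np.savez_compressed(path, image=augment(image, rng))
        paths.append(path)
    return paths


def mean_brightness(paths):
    """Post-hoc check: mean pixel value of every saved augmented image."""
    values = []
    for path in paths:
        with np.load(path) as data:
            values.append(float(data["image"].mean()))
    return values


out_dir = tempfile.mkdtemp()
images = [np.full((8, 8, 3), 100, dtype=np.uint8) for _ in range(3)]
paths = write_augmented_npz(images, out_dir, seed=42)
print(mean_brightness(paths))
```

Because the augmented arrays live on disk, the same statistics can be recomputed at any later time, which is the post-hoc assessment described above.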
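@mlundine - just looping back to getting Albumentations working w/o rewriting the augmentation pipeline: Since we use the deprecated/old-style keras generators, the easiest method is to add a preprocessing function (https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator) in 3 easy steps: (1) add the import, import albumentations as A; (2) define a preprocessing function (e.g. albumentize) with your chosen albumentations augs; (3) add a call to the preprocessing function on lines 719-739 of make_dataset.py,
so add preprocessing_function = albumentize, under fill_mode='reflect',
for both generators. Hope this helps as a quick way to get Albumentations working! segmentation_gym/make_dataset.py Lines 719 to 749 in cb13c70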
|
Yes I understand the way you guys were doing this now and why.
For just the training set, we have a set of augmentations we can perform.
We randomize which augmentation to perform and on which image from the
training set, correct?
…On Wed, Apr 24, 2024 at 3:18 PM Evan B. Goldstein ***@***.***> wrote:
@mlundine <https://github.com/mlundine> - just looping back to getting
Albumentations working w/o rewriting the augmentation pipeline:
Since we use the deprecated/old-style keras generators, the easiest method
is to add a preprocessing function (
https://www.tensorflow.org/api_docs/python/tf/keras/preprocessing/image/ImageDataGenerator)
in 3 easy steps:
1. of course, adding an import:
    # import albumentations
    import albumentations as A
2. defining a preprocessing function with your chosen albumentations
augs:
    # preprocessing function with albumentations; example with channel shuffle
    def albumentize(image):
        aug = A.Compose([
            A.ChannelShuffle(),
        ])
        AugI = aug(image=image)['image']
        return AugI
3. adding a call to the preprocessing function on lines 719-739 of
make_dataset.py
so add
preprocessing_function = albumentize, under fill_mode='reflect',
for both generators
hope this helps as a quick way to get Albumentations working!
https://github.com/Doodleverse/segmentation_gym/blob/cb13c70d98bc9fe91b51ee5937d2b5cd3c516e6c/make_dataset.py#L719-L749
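One practical wrinkle with the preprocessing-function route: as far as I know, ImageDataGenerator hands the function a rank-3 float array (0-255 unless rescaled upstream) and expects an array of the same shape back, while many albumentations transforms are written for uint8 input. Below is a minimal sketch of an adapter under that assumption. `as_preprocessing_function` and `reverse_channels` are hypothetical names, and `reverse_channels` is only a numpy stand-in that mimics the albumentations calling convention so the snippet runs without albumentations installed.

```python
import numpy as np


def as_preprocessing_function(transform):
    """Wrap an albumentations-style callable for use as preprocessing_function.

    `transform` is called as transform(image=array) and must return a dict
    with an "image" entry (the albumentations calling convention).
    """
    def preprocess(image):
        # Assumption: the generator passes a rank-3 float array in 0-255.
        as_uint8 = np.clip(image, 0, 255).astype(np.uint8)
        augmented = np.asarray(transform(image=as_uint8)["image"])
        if augmented.shape != image.shape:
            raise ValueError("transform must not change the image shape")
        # Return the same dtype the generator handed in.
        return augmented.astype(image.dtype)
    return preprocess


def reverse_channels(image):
    """Stand-in transform: reverse channel order, albumentations-style output."""
    return {"image": image[..., ::-1]}


preprocess = as_preprocessing_function(reverse_channels)
example = np.arange(4 * 4 * 3, dtype=np.float32).reshape(4, 4, 3)
result = preprocess(example)
print(result.shape, result.dtype)
```

With albumentations installed, the idea would be to pass `as_preprocessing_function(A.Compose([...]))` as the `preprocessing_function` argument. If one of the two generators reads label masks, I'd double-check that colour-space transforms are not applied to it.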
|
Clarifying that more: we don't want duplicates (original image and
augmented) in the training set? Or do we want a big training set with all
original images plus each augmentation?
Just from experimenting a bit with albumentations, I think the ones we want
(at least for satellite imagery) are the color-space alterations and the
snow transform (just adding white pixels). The haze transform is kind of
dumb, it's just circular blobs of haze. The other one that might be useful
is the elastic transform (see attached images for original, color swapping,
elastic, and snow). These would be in addition to the more standard
augmentations you guys already have (rotations, flips, zooms, etc.).
[image: 2022-07-09-22-25-01_RGB_L9.jpg]
[image: 2022-07-09-22-25-01_RGB_L9augment2.jpg][image:
2022-07-09-22-25-01_RGB_L9augment99.jpg][image:
2022-07-09-22-25-01_RGB_L9snow.jpg]
|
The way we wrote it, the training split will be all augmentations, and the validation split is all non-augmented images. That being said, all the augmentations are random, so there is a possibility of getting non-augmented (or weakly augmented) images in training. Note also that in the config, |
I suggest if you want an albumentation version of Gym, feel free to create a branch (locally or on GH)... you could hard code it all in for your personal needs, but it would be awesome if you added variables to the config so that they can be turned on/off globally for everyone eventually |
I agree with Evan. It seems the change he is suggesting here #81 (comment) is simple enough that it could be incorporated into the existing workflow easily (on a new branch). Doodleverse is definitely designed with a broad range of users and use-cases in mind. Perhaps it could be passed a list of albumentations-style augmentations you'd like, and if the list is empty (default), it just defaults to the status quo. And yes, I have noticed that models tend to train better when presented with original plus augmented training data. There is no data leakage because the validation files are stored in a separate folder and are not augmented. If you wish to test this yourself,
If you wish, you could add a config file parameter that suppresses the use of original imagery in training, but I recommend keeping original+augmentation by default |
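A rough sketch of how such config switches might behave, purely for discussion. The config keys `AUGMENTATIONS` and `KEEP_ORIGINALS`, the `AUGMENTERS` registry, and `build_training_set` are all hypothetical and not part of Gym; the two augmenters are numpy stand-ins for albumentations transforms, and an empty list here simply leaves the images untouched as a placeholder for the status quo.

```python
import numpy as np

# Hypothetical registry mapping config names to functions of (image, rng).
AUGMENTERS = {
    "flip_lr": lambda image, rng: image[:, ::-1, :],
    "channel_shuffle": lambda image, rng: image[..., rng.permutation(image.shape[-1])],
}


def build_training_set(images, config, seed=0):
    """Return training images according to two hypothetical config entries."""
    names = config.get("AUGMENTATIONS", [])              # default: empty list
    keep_originals = config.get("KEEP_ORIGINALS", True)  # default: originals kept
    if not names:
        # Empty list: this step changes nothing.
        return list(images)
    rng = np.random.default_rng(seed)
    augmented = []
    for image in images:
        # Pick one of the requested augmentations at random for each image.
        name = names[rng.integers(len(names))]
        augmented.append(AUGMENTERS[name](image, rng))
    originals = list(images) if keep_originals else []
    return originals + augmented


images = [np.zeros((4, 4, 3), dtype=np.uint8), np.ones((4, 4, 3), dtype=np.uint8)]
config = {"AUGMENTATIONS": ["flip_lr", "channel_shuffle"]}
print(len(build_training_set(images, config)))
```

Keeping the originals by default matches the recommendation above, and the flag only needs to be set by users who want augmented-only training.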
In the TF docs from 2.9 on, tf.keras.preprocessing has a deprecation warning: https://www.tensorflow.org/versions/r2.10/api_docs/python/tf/keras/preprocessing
This will impact the make_data script, which relies on this suite of tools (i.e., tf.keras.preprocessing.image.ImageDataGenerator) to make the augmented imagery. See here: segmentation_gym/make_nd_dataset.py Lines 578 to 800 in c1669a0
In light of this, it seems wise to think/plan/prepare for the moment when we need to convert the augmentation routines to the recommended workflow using tf.keras.utils. The relevant links in the TF documentation can be found in the link above.