MAIA encounter problems about python package requirements in gpu-worker dockerfile #533

cywhale · 2023-01-12T06:10:18Z

Errors when running MAIA novelty detection job:

Error while executing python command '/var/www/vendor/biigle/maia/src/config/../resources/scripts/novelty-detection/DetectionRunner.py /var/www/storage/maia_jobs/maia-12-novelty-detection/input.json':
Traceback (most recent call last):
  File "/var/www/vendor/biigle/maia/src/config/../resources/scripts/novelty-detection/DetectionRunner.py", line 8, in <module>
    from AutoencoderSaliencyDetector import AutoencoderSaliencyDetector
  File "/var/www/vendor/biigle/maia/src/resources/scripts/novelty-detection/AutoencoderSaliencyDetector.py", line 6, in <module>
    from scipy.misc import imsave

I googled this message and found that "imsave is deprecated in SciPy 1.2.x". I changed requirements.txt from scipy==1.7.2 to scipy==1.2.1 and rebuilt. The job then continued, but I noticed other warnings during the build:

tensorboard 2.7.0 requires requests<3,>=2.21.0, which is not installed.
tensorflow 2.6.1 requires numpy~=1.19.2, but you'll have numpy 1.22.4 which is incompatible
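
The second warning comes from the `~=` (compatible release) specifier: `numpy~=1.19.2` accepts 1.19.2 and later 1.19.x releases, but nothing from 1.20 onwards. Below is a minimal sketch of that rule, assuming plain dotted numeric versions only. The function names are hypothetical, and a real check would be better done with the `packaging` library.

```python
def parse_version(version):
    """Split a plain numeric version such as '1.19.2' into a tuple of ints."""
    return tuple(int(part) for part in version.split("."))


def satisfies_compatible_release(installed, spec):
    """Check `installed` against a `~=spec` requirement.

    `~=1.19.2` means: at least 1.19.2, and the same leading components
    (1.19) as the spec. Pre-releases and local versions are not handled.
    """
    inst, base = parse_version(installed), parse_version(spec)
    if len(base) < 2:
        raise ValueError("~= needs at least two version components")
    prefix = base[:-1]  # every component except the last one
    return inst >= base and inst[:len(prefix)] == prefix


# The combination reported in the build warning above:
print(satisfies_compatible_release("1.22.4", "1.19.2"))
# A later patch release of the 1.19 series:
print(satisfies_compatible_release("1.19.5", "1.19.2"))
```

The first call prints `False`, which is why pip reports numpy 1.22.4 as incompatible with tensorflow 2.6.1; the second prints `True`.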

Note that docker pull fails for tensorflow 2.5.3, so I changed gpu-worker.dockerfile to:

FROM tensorflow/tensorflow:2.6.1-gpu

The novelty detection job started 2.5 hours ago and is still running, but it is far slower than in my previous experience.
(My test volume has 16 photos, each 2048 x 1536 pixels and about 990 KB in size.)
I found a temporary log in the subdirectory maia-xx-novelty-detection under maia_jobs. It tells me "To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags."
Is there anything wrong, or any prerequisite I missed, for using TensorFlow with the GPU in this MAIA job?
Thanks.

2023-01-12 11:38:30.711563: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-12 11:38:30.764367: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 11:38:30.862558: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 11:38:30.862772: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 11:42:08.427739: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 11:42:08.427972: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 11:42:08.428083: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 11:42:08.428535: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1510] Created device /job:localhost/replica:0/task:0/device:GPU:0 with 5948 MB memory:  -> device: 0, name: NVIDIA A16-8Q, pci bus id: 0000:02:00.0, compute capability: 8.6
Cluster 1 of 5
  Training
2023-01-12 11:45:51.175607: I tensorflow/stream_executor/cuda/cuda_blas.cc:1760] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
Epoch: 0001 (23.94529)
Epoch: 0002 (1.68409)
Epoch: 0003 (1.39512)
Epoch: 0004 (1.31752)
Epoch: 0005 (1.27525)
Epoch: 0006 (1.21033)
Epoch: 0007 (1.62979)
Epoch: 0008 (1.20538)
Epoch: 0009 (1.19575)
Epoch: 0010 (1.17660)
Epoch: 0011 (1.14732)
Epoch: 0012 (1.13520)
Epoch: 0013 (1.13744)
Epoch: 0014 (1.11700)
Epoch: 0015 (1.11114)
Epoch: 0016 (1.11006)
Epoch: 0017 (1.09480)
Epoch: 0018 (1.09669)
Epoch: 0019 (1.08922)
Epoch: 0020 (1.08655)
Epoch: 0021 (1.08373)
Epoch: 0022 (1.06501)

Note: according to nvidia-smi, the MAIA novelty detection process keeps using 607 MiB of GPU memory (8 GB total):

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|    0   N/A  N/A      1472      G   /usr/lib/xorg/Xorg                  7MiB |
|    0   N/A  N/A    205635      C   /usr/bin/python3                  607MiB |
+-----------------------------------------------------------------------------+

mzur · 2023-01-12T08:39:06Z

Maintainer

I found TensorFlow >2.5 to be incompatible with the current MAIA code. The novelty detection could work but the instance segmentation produces incorrect results. For biigle.de we build a TensorFlow 2.5.3 Docker image ourselves. You can clone biigle/tensorflow and check out the v2.5.3-biigle branch. Then build the Docker image with:

docker build --build-arg TF_PACKAGE_VERSION=2.5.3 -f ./dockerfiles/gpu.Dockerfile -t tensorflow/tensorflow:2.5.3-gpu .

The older TensorFlow version also accumulated lots of security vulnerabilities. Our way forward is to port MAIA to PyTorch and MMDetection (biigle/maia#96). We hope that this will make future improvements easier. I'm currently working on a PyTorch implementation for the novelty detection but it may take a while until it is finished.

cywhale · 2023-01-13T03:14:33Z

Author

Following your suggestion, I built tensorflow:2.5.3-gpu and re-ran the MAIA novelty detection.
It seems to work well, with no errors or warnings in the log.txt of the MAIA job, but it is extremely time-consuming to run 100 epochs sequentially for each of clusters 1-5. After running overnight it has only reached cluster 2. I'm afraid it will have to run for three days on this test volume (16 photos, each 2048 x 1536 pixels and about 990 KB in size).
I don't think this is normal, but I have no idea how to find out what I did wrong. The log.txt is shown below.
Another problem is that I cannot delete the MAIA jobs by any means, whether through the UI (the delete button cannot be clicked) or through the API:

curl --insecure -I -X DELETE -u SU@email:TOKEN -H "Accept: application/json" https://localhost:8008/api/v1/maia-jobs/ID

or by killing the python3 processes, restarting Docker Compose, or even rebooting my machine. The MAIA job always keeps running.
Is there any way to permanently delete the running MAIA job?
Thanks a lot!

2023-01-12 19:26:49.995464: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2023-01-12 19:26:52.760122: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-12 19:26:52.766344: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2023-01-12 19:26:52.811290: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 19:26:52.811472: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: NVIDIA A16-8Q computeCapability: 8.6
coreClock: 1.755GHz coreCount: 10 deviceMemorySize: 7.83GiB deviceMemoryBandwidth: 186.29GiB/s
2023-01-12 19:26:52.811526: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2023-01-12 19:26:52.852287: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2023-01-12 19:26:52.852425: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2023-01-12 19:26:52.874003: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcufft.so.10
2023-01-12 19:26:52.883789: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcurand.so.10
2023-01-12 19:26:52.894023: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusolver.so.11
2023-01-12 19:26:52.905110: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcusparse.so.11
2023-01-12 19:26:52.905402: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2023-01-12 19:26:52.905537: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 19:26:52.905709: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 19:26:52.905781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2023-01-12 19:26:52.907538: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2023-01-12 19:30:29.716408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-01-12 19:30:29.716477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2023-01-12 19:30:29.716486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
2023-01-12 19:30:29.737406: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 19:30:29.737625: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 19:30:29.737737: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-12 19:30:29.737839: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5946 MB memory) -> physical GPU (device: 0, name: NVIDIA A16-8Q, pci bus id: 0000:02:00.0, compute capability: 8.6)
2023-01-12 19:30:29.918711: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2095074999 Hz
Cluster 1 of 5
  Training
2023-01-12 19:30:35.636502: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2023-01-12 19:34:12.819266: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2023-01-12 19:34:12.819365: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.
Epoch: 0001 (21.29344)
Epoch: 0002 (1.75034)
Epoch: 0003 (1.55710)
Epoch: 0004 (1.37036)
.... (skip)
Epoch: 0099 (1.04996)
Epoch: 0100 (1.03017)
  Image 1 of 3 (#30437)
  Image 2 of 3 (#30432)
  Image 3 of 3 (#30428)
Cluster 2 of 5
  Training
Epoch: 0001 (22.62573)
Epoch: 0002 (2.04780)
Epoch: 0003 (2.01229)
Epoch: 0004 (2.00678)
Epoch: 0005 (2.61376)
Epoch: 0006 (1.99424)
Epoch: 0007 (2.15851)
Epoch: 0008 (1.70630)
Epoch: 0009 (1.64481)
Epoch: 0010 (1.61440)
Epoch: 0011 (1.64031)
Epoch: 0012 (1.55534)
Epoch: 0013 (1.55000)
Epoch: 0014 (12.97591)
Epoch: 0015 (2.16757)
...(skip)
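
One way to see where the time goes in a log like the one above is to look for large gaps between consecutive TensorFlow timestamps. In this log there is a pause of roughly 3.5 minutes between 19:26:52 and 19:30:29, and another between 19:30:35 and 19:34:12. The following is a minimal sketch under the assumption that timestamped lines start with `YYYY-MM-DD HH:MM:SS.ffffff:` as they do here. The function name and the 60-second threshold are illustrative choices, not part of MAIA.

```python
import re
from datetime import datetime

# TensorFlow log lines start with a timestamp such as "2023-01-12 19:26:52.907538:".
TIMESTAMP = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}\.\d{6}):")


def find_gaps(lines, min_seconds=60.0):
    """Return (seconds, previous_line, next_line) for each gap >= min_seconds."""
    gaps = []
    prev_time, prev_line = None, None
    for line in lines:
        match = TIMESTAMP.match(line)
        if not match:
            continue  # epoch output and similar lines carry no timestamp
        current = datetime.strptime(match.group(1), "%Y-%m-%d %H:%M:%S.%f")
        if prev_time is not None:
            delta = (current - prev_time).total_seconds()
            if delta >= min_seconds:
                gaps.append((delta, prev_line, line))
        prev_time, prev_line = current, line
    return gaps


# Three lines copied from the log above.
sample = [
    "2023-01-12 19:26:52.907538: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0",
    "2023-01-12 19:30:29.716408: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:",
    "2023-01-12 19:30:29.716477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 ",
]

for seconds, before, after in find_gaps(sample):
    print(f"{seconds:.1f} s pause, then: {after[:70]}")
```

To scan a whole file, pass the open file object instead of `sample`, for example `find_gaps(open("log.txt"))`. The gaps only show where the process waited; they do not say why.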

3 replies

mzur Jan 13, 2023
Maintainer

Killing the Python process or rebooting your machine should definitely cancel the MAIA job. You say it restarts automatically every time? Or does it run on a remote machine?

The slow execution you describe usually means that processing is done on the CPU. However, your log output indicates that the GPU was detected successfully. Did it work faster with a previous TensorFlow version? You could try to downgrade the version.

cywhale Jan 13, 2023
Author

It runs on our virtual machine, but even after I reboot that virtual machine, the spinning icon is still there when I open BIIGLE, so I cannot create a new MAIA job. I'll reproduce this problem and give more details after debugging. I'll first check the slow execution on the GPU with our IT. (Any comments would be appreciated.) Thanks.

mzur Jan 13, 2023
Maintainer

I see, so the Python process is probably killed but this was not propagated back into BIIGLE. In this case, the worker queue timeout will cancel the job after 24 hours. If you want to speed this up, you could delete the job through the interactive shell:

$ docker compose exec worker php artisan tinker
> $job = Biigle\Modules\Maia\MaiaJob::find(JOB_ID)
> $job->delete()

cywhale · 2023-01-17T08:20:01Z

Author

I decided to reboot, and the novelty detection then ran fast, finishing within 5 minutes. So I think the guess is right: the earlier slow execution was probably because it ran on the CPU (somehow it could not use the GPU, even though nvidia-smi showed the python3 process as active and GPU memory as occupied).
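
A process can hold GPU memory while no kernels are running, so occupied memory alone does not show that the GPU is doing the work. A minimal sketch for watching GPU utilization next to memory use is shown below. It assumes `nvidia-smi` is on the PATH and uses its `--query-gpu` CSV output. The function names are hypothetical, and on virtual GPUs some fields may be reported as not available, which the parser turns into `None`.

```python
import shutil
import subprocess

QUERY = [
    "nvidia-smi",
    "--query-gpu=utilization.gpu,memory.used,memory.total",
    "--format=csv,noheader,nounits",
]


def parse_gpu_rows(text):
    """Parse 'util, used, total' CSV rows; non-numeric fields become None."""
    rows = []
    for line in text.strip().splitlines():
        fields = [field.strip() for field in line.split(",")]
        if len(fields) != 3:
            continue  # skip anything that is not a three-column row
        values = [
            float(field) if field.replace(".", "", 1).isdigit() else None
            for field in fields
        ]
        rows.append({
            "util_percent": values[0],
            "mem_used_mib": values[1],
            "mem_total_mib": values[2],
        })
    return rows


def query_gpus():
    """Run nvidia-smi once and return one dict per GPU ([] if it is missing)."""
    if shutil.which("nvidia-smi") is None:
        return []
    result = subprocess.run(QUERY, capture_output=True, text=True, check=True)
    return parse_gpu_rows(result.stdout)


for index, gpu in enumerate(query_gpus()):
    print(index, gpu)
```

If utilization stays near zero while a training epoch is running, that would be consistent with the work happening on the CPU, though it is not proof on its own.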

However, I then started the instance segmentation stage on these novelty detection results. It produced some errors, ran very slowly (again) and finally crashed with a core dump after running for a whole day. The logs are below. Any comments or tips for debugging this would be appreciated. Thanks a lot!!

The errors include (extracted from the full log):

Start cannot spawn child process: No such file or directory
tensorflow/stream_executor/gpu/asm_compiler.cc:56] Couldn't invoke ptxas --version
Internal: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
....
Floating point exception (core dumped)
....
python3[15676] trap divide error ip:7fed9f365988 sp:7fef2e7b23c0 error:0 in libcudnn_cnn_infer.so.8.1.0
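
The first messages say that TensorFlow could not launch `ptxas` and fell back to the driver for PTX compilation. Whether that is related to the slowdown or to the crash is not established here, but it is easy to check if the binary is visible inside the container. This is a minimal sketch; `/usr/local/cuda/bin` is only an assumed common install location, and the function name is hypothetical.

```python
import os
import shutil

# Assumption: a common CUDA toolkit location; adjust for the actual image.
ASSUMED_CUDA_BIN_DIRS = ("/usr/local/cuda/bin",)


def find_ptxas(search_path=None, extra_dirs=ASSUMED_CUDA_BIN_DIRS):
    """Return the path of an executable ptxas, or None if none is found.

    search_path=None means the current $PATH is searched first.
    """
    found = shutil.which("ptxas", path=search_path)
    if found:
        return found
    for directory in extra_dirs:
        candidate = os.path.join(directory, "ptxas")
        if os.path.isfile(candidate) and os.access(candidate, os.X_OK):
            return candidate  # installed, but its directory is not on $PATH
    return None


location = find_ptxas()
if location is None:
    print("ptxas not found on PATH or in the assumed CUDA directories")
else:
    print("ptxas found at", location)
```

If `ptxas` turns up only in one of the extra directories, adding that directory to `$PATH` is what the log message itself suggests.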

Total GPU memory is 8 GB, of which 7.x GB was consumed.
The full log:

[2023-01-17 11:20:42] production.ERROR: Error while executing python command '/var/www/vendor/biigle/maia/src/config/../resources/scripts/instance-segmentation/TrainingRunner.py /var/www/storage/maia_jobs/maia-19-instance-segmentation/input-training.json /var/www/storage/maia_jobs/maia-19-instance-segmentation/output-dataset.json':
2023-01-16 11:33:12.281522: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
Matplotlib created a temporary config/cache directory at /tmp/matplotlib-_nx7yqrm because the default path (/.config/matplotlib) is not a writable directory; it is highly recommended to set the MPLCONFIGDIR environment variable to a writable directory, in particular to speed up the import of Matplotlib and to better support multiprocessing.

Configurations:
BACKBONE                       resnet101
BACKBONE_STRIDES               [4, 8, 16, 32, 64]
BATCH_SIZE                     1
BBOX_STD_DEV                   [0.1 0.1 0.2 0.2]
COMPUTE_BACKBONE_SHAPE         None
DETECTION_MAX_INSTANCES        100
DETECTION_MIN_CONFIDENCE       0.7
DETECTION_NMS_THRESHOLD        0.3
FPN_CLASSIF_FC_LAYERS_SIZE     1024
GPU_COUNT                      1
GRADIENT_CLIP_NORM             5.0
IMAGES_PER_GPU                 1
IMAGE_CHANNEL_COUNT            3
IMAGE_MAX_DIM                  512
IMAGE_META_SIZE                14
IMAGE_MIN_DIM                  800
IMAGE_MIN_SCALE                0
IMAGE_RESIZE_MODE              square
IMAGE_SHAPE                    [512 512   3]
LEARNING_MOMENTUM              0.9
LEARNING_RATE                  0.001
LOSS_WEIGHTS                   {'rpn_class_loss': 1.0, 'rpn_bbox_loss': 1.0, 'mrcnn_class_loss': 1.0, 'mrcnn_bbox_loss': 1.0, 'mrcnn_mask_loss': 1.0}
MASK_POOL_SIZE                 14
MASK_SHAPE                     [28, 28]
MAX_GT_INSTANCES               100
MEAN_PIXEL                     [154.55678101 174.08623919 175.05516608]
MINI_MASK_SHAPE                (56, 56)
NAME                           maia_training
NUM_CLASSES                    2
POOL_SIZE                      7
POST_NMS_ROIS_INFERENCE        1000
POST_NMS_ROIS_TRAINING         2000
PRE_NMS_LIMIT                  6000
ROI_POSITIVE_RATIO             0.33
RPN_ANCHOR_RATIOS              [0.5, 1, 2]
RPN_ANCHOR_SCALES              (32, 64, 128, 256, 512)
RPN_ANCHOR_STRIDE              1
RPN_BBOX_STD_DEV               [0.1 0.1 0.2 0.2]
RPN_NMS_THRESHOLD              0.85
RPN_TRAIN_ANCHORS_PER_IMAGE    256
STEPS_PER_EPOCH                2000
TOP_DOWN_PYRAMID_SIZE          256
TRAIN_BN                       False
TRAIN_ROIS_PER_IMAGE           200
USE_MINI_MASK                  True
USE_RPN_ROIS                   True
VALIDATION_STEPS               0
WEIGHT_DECAY                   0.0001


WARNING:tensorflow:From /usr/local/lib/python3.8/dist-packages/tensorflow/python/ops/array_ops.py:5043: calling gather (from tensorflow.python.ops.array_ops) with validate_indices is deprecated and will be removed in a future version.
Instructions for updating:
The `validate_indices` argument has no effect. Indices are always validated on CPU and never validated on GPU.
2023-01-16 11:33:19.091053: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcuda.so.1
2023-01-16 11:33:19.101151: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-16 11:33:19.101293: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: NVIDIA A16-8Q computeCapability: 8.6
coreClock: 1.755GHz coreCount: 10 deviceMemorySize: 7.83GiB deviceMemoryBandwidth: 186.29GiB/s
2023-01-16 11:33:19.101333: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
#(...skip)
2023-01-16 11:33:19.107243: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-16 11:33:19.107313: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2023-01-16 11:33:21.482472: I tensorflow/core/platform/cpu_feature_guard.cc:142] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX512F FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-01-16 11:33:21.483477: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:937] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2023-01-16 11:33:21.483624: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1733] Found device 0 with properties: 
pciBusID: 0000:02:00.0 name: NVIDIA A16-8Q computeCapability: 8.6
coreClock: 1.755GHz coreCount: 10 deviceMemorySize: 7.83GiB deviceMemoryBandwidth: 186.29GiB/s
#(...skip)
2023-01-16 11:33:21.483874: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1871] Adding visible gpu devices: 0
2023-01-16 11:33:21.483975: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudart.so.11.0
2023-01-16 11:33:21.968183: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1258] Device interconnect StreamExecutor with strength 1 edge matrix:
2023-01-16 11:33:21.968246: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1264]      0 
2023-01-16 11:33:21.968256: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1277] 0:   N 
#(...skip)
2023-01-16 11:33:21.968824: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1418] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 5946 MB memory) -> physical GPU (device: 0, name: NVIDIA A16-8Q, pci bus id: 0000:02:00.0, compute capability: 8.6)
2023-01-16 11:33:22.353675: I tensorflow/core/platform/profile_utils/cpu_utils.cc:114] CPU Frequency: 2095074999 Hz
Train step:  {'layers': 'heads', 'epochs': '10', 'learning_rate': '0.001'}

Starting at epoch 0. LR=0.001

Checkpoint Path: /var/www/storage/maia_jobs/maia-19-instance-segmentation/models/maia_training20230116T1133/mask_rcnn_maia_training_{epoch:04d}.h5
Selecting layers to train
fpn_c5p5               (Conv2D)
fpn_c4p4               (Conv2D)
fpn_c3p3               (Conv2D)
fpn_c2p2               (Conv2D)
fpn_p5                 (Conv2D)
fpn_p2                 (Conv2D)
fpn_p3                 (Conv2D)
fpn_p4                 (Conv2D)
rpn_model              (Functional)
mrcnn_mask_conv1       (TimeDistributed)
mrcnn_mask_bn1         (TimeDistributed)
mrcnn_mask_conv2       (TimeDistributed)
mrcnn_mask_bn2         (TimeDistributed)
mrcnn_class_conv1      (TimeDistributed)
mrcnn_class_bn1        (TimeDistributed)
mrcnn_mask_conv3       (TimeDistributed)
mrcnn_mask_bn3         (TimeDistributed)
mrcnn_class_conv2      (TimeDistributed)
mrcnn_class_bn2        (TimeDistributed)
mrcnn_mask_conv4       (TimeDistributed)
mrcnn_mask_bn4         (TimeDistributed)
mrcnn_bbox_fc          (TimeDistributed)
mrcnn_mask_deconv      (TimeDistributed)
mrcnn_class_logits     (TimeDistributed)
mrcnn_mask             (TimeDistributed)
/usr/local/lib/python3.8/dist-packages/tensorflow/python/keras/optimizer_v2/optimizer_v2.py:374: UserWarning: The `lr` argument is deprecated, use `learning_rate` instead.
  warnings.warn(
Epoch 1/10
/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/indexed_slices.py:447: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/sub:0", shape=(None,), dtype=int32), values=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/GatherV2_2:0", shape=(None, 7, 7, 256), dtype=float32), dense_shape=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/Shape:0", shape=(4,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/indexed_slices.py:447: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/sub_1:0", shape=(None,), dtype=int32), values=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/GatherV2_5:0", shape=(None, 7, 7, 256), dtype=float32), dense_shape=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/Shape_1:0", shape=(4,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
  warnings.warn(
/usr/local/lib/python3.8/dist-packages/tensorflow/python/framework/indexed_slices.py:447: UserWarning: Converting sparse IndexedSlices(IndexedSlices(indices=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/sub_2:0", shape=(None,), dtype=int32), values=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/GatherV2_8:0", shape=(None, 7, 7, 256), dtype=float32), dense_shape=Tensor("training/SGD/gradients/gradients/roi_align_classifier/concat_grad/Shape_2:0", shape=(4,), dtype=int32))) to a dense Tensor of unknown shape. This may consume a large amount of memory.
#(...skip) #Lots of warning about consuming memory ................................................... 
warnings.warn(
2023-01-16 11:33:48.603922: W tensorflow/core/grappler/costs/op_level_cost_estimator.cc:690] Error in PredictCost() for the op: op: "CropAndResize" attr { key: "T" value { type: DT_FLOAT } } attr { key: "extrapolation_value" value { f: 0 } } attr { key: "method" value { s: "bilinear" } } inputs { dtype: DT_FLOAT shape { dim { size: -538 } dim { size: 56 } dim { size: 56 } dim { size: 1 } } } inputs { dtype: DT_FLOAT shape { dim { size: -25 } dim { size: 4 } } } inputs { dtype: DT_INT32 shape { dim { size: -25 } } } inputs { dtype: DT_INT32 shape { dim { size: 2 } } value { dtype: DT_INT32 tensor_shape { dim { size: 2 } } tensor_content: "\034\000\000\000\034\000\000\000" } } device { type: "GPU" vendor: "NVIDIA" model: "NVIDIA A16-8Q" frequency: 1755 num_cores: 10 environment { key: "architecture" value: "8.6" } environment { key: "cuda" value: "11020" } environment { key: "cudnn" value: "8100" } num_registers: 65536 l1_cache_size: 24576 l2_cache_size: 2097152 shared_memory_size_per_multiprocessor: 102400 memory_size: 6235095040 bandwidth: 200032000 } outputs { dtype: DT_FLOAT shape { dim { size: -25 } dim { size: 28 } dim { size: 28 } dim { size: 1 } } }

2023-01-16 11:33:49.805924: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcudnn.so.8
2023-01-16 11:33:51.366393: I tensorflow/stream_executor/cuda/cuda_dnn.cc:359] Loaded cuDNN version 8100
2023-01-16 11:33:53.904733: E tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-01-16 11:33:53.904790: W tensorflow/stream_executor/gpu/asm_compiler.cc:56] Couldn't invoke ptxas --version
2023-01-16 11:33:53.905217: E tensorflow/core/platform/default/subprocess.cc:304] Start cannot spawn child process: No such file or directory
2023-01-16 11:33:53.905278: W tensorflow/stream_executor/gpu/redzone_allocator.cc:314] Internal: Failed to launch ptxas
Relying on driver to perform ptx compilation. 
Modify $PATH to customize ptxas location.
This message will be only logged once.
2023-01-16 11:33:54.586180: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublas.so.11
2023-01-16 11:33:55.234035: I tensorflow/stream_executor/platform/default/dso_loader.cc:53] Successfully opened dynamic library libcublasLt.so.11
2023-01-16 11:34:31.487847: I tensorflow/stream_executor/cuda/cuda_blas.cc:1838] TensorFloat-32 will be used for the matrix multiplication. This will only be logged once.

   1/2000 [..............................] - ETA: 41:48:14 - batch: 0.0000e+00 - size: 1.0000 - loss: 2.3971 - rpn_class_loss: 0.0316 - rpn_bbox_loss: 0.3011 - mrcnn_class_loss: 0.5049 - mrcnn_bbox_loss: 0.9590 - mrcnn_mask_loss: 0.6005
#(skip>>>)
1165/2000 [================>.............] - ETA: 16:55:24 - batch: 582.0000 - size: 1.0000 - loss: 1.7742 - rpn_class_loss: 0.0929 - rpn_bbox_loss: 0.7447 - mrcnn_class_loss: 0.1698 - mrcnn_bbox_loss: 0.3682 - mrcnn_mask_loss:

1166/2000 [================>.............] - ETA: 16:58:40 - batch: 582.5000 - size: 1.0000 - loss: 1.7733 - rpn_class_loss: 0.0928 - rpn_bbox_loss: 0.7440 - mrcnn_class_loss: 0.1697 - mrcnn_bbox_loss: 0.3681 - mrcnn_mask_loss: 0.3986Floating point exception (core dumped)

System log on Ubuntu (/var/log/kern.log):

Jan 17 11:20:41 seaimage kernel: [86417.907979] traps: python3[15676] trap divide error ip:7fed9f365988 sp:7fef2e7b23c0 error:0 in libcudnn_cnn_infer.so.8.1.0[7fed9462f000+2d112000]

5 replies

mzur Jan 18, 2023
Maintainer

Sorry, I've got no clue. Our way forward is to migrate to PyTorch to leave the old versions of TensorFlow behind. Hopefully we can ship this in the next few weeks.

cywhale Jan 18, 2023
Author

Thanks. That's great and looking forward to this solution^^

mzur Jan 18, 2023
Maintainer

Keep an eye on biigle/maia#96 and you'll be notified when this is ready 😉

mzur Feb 2, 2023
Maintainer

MAIA v2.0.0 is now released, which replaces TensorFlow with PyTorch. Take a look at the release notes.

If you want to upgrade, be sure to pull the latest changes from biigle/biigle:gpu and also update the build/.env file.

cywhale Feb 2, 2023
Author

Such a milestone! I'll definitely upgrade to this release and try it. Thanks for sharing this.
