test_gradient_based_solver fails #22
Comments
I just found out that clinfo shows that I have two devices (GPUs?) although physically I have one. Maybe that is the reason for the FAILED tests above. But how do I force the tests to use only one device?
|
The AMD platform includes both CPU and GPU devices. It's up to the application to choose the appropriate device, but I imagine it should default to the GPU device.

From: Marcin Pękalski
I just found out that clinfo shows that I have two devices (GPUs?) although physically I have one. Maybe that is the reason for the FAILED tests above. But how to force the tests to use only one device?

$ clinfo
Number of platforms: 1
Platform Profile: FULL_PROFILE
Platform Version: OpenCL 2.0 AMD-APP (1800.8)
Platform Name: AMD Accelerated Parallel Processing
Platform Vendor: Advanced Micro Devices, Inc.
Platform Extensions: cl_khr_icd cl_amd_event_callback cl_amd_offline_devices
Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
Device Type: CL_DEVICE_TYPE_GPU
|
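To check which device types a platform actually reports, the "Device Type:" lines of the clinfo output can be scanned. This is only a rough sketch: the sample text is hypothetical (the second device type is assumed, since the output quoted above is cut off after the first device), and the helper names `device_types` and `gpu_positions` are made up for illustration. The positions it returns are listing order only and may not match the device IDs an application uses.

```python
import re

# Hypothetical sample in the style of the clinfo output quoted above.
# The second "Device Type" line is an assumption for illustration.
SAMPLE_CLINFO = """\
Number of platforms: 1
  Platform Name: AMD Accelerated Parallel Processing
Number of devices: 2
  Device Type: CL_DEVICE_TYPE_GPU
  Device Type: CL_DEVICE_TYPE_CPU
"""


def device_types(clinfo_text):
    """Return the device types in the order clinfo lists them."""
    return re.findall(r"Device Type:\s*(CL_DEVICE_TYPE_\w+)", clinfo_text)


def gpu_positions(clinfo_text):
    """Return the 0-based listing positions of the GPU devices."""
    return [
        position
        for position, kind in enumerate(device_types(clinfo_text))
        if kind == "CL_DEVICE_TYPE_GPU"
    ]


if __name__ == "__main__":
    print(device_types(SAMPLE_CLINFO))
    print(gpu_positions(SAMPLE_CLINFO))
```

If the list contains one GPU entry and one CPU entry, that would be consistent with a single physical GPU plus the CPU exposed as a second OpenCL device.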
Is it possible to limit the number of visible devices by setting some environment variable? |
Bystander observation: I would think that it would be better to choose specifically which GPU to use, rather than to choose how many GPUs to use. |
I suspect that Marcin only has a single GPU, but there are two devices: CPU + GPU.

Marcin, if you want to disable the CPU device, you can set the environment variable CPU_MAX_COMPUTE_UNITS to 0, but I don't think it will fix your problem. Can you make sure you are running the latest drivers from AMD? Version 1800 seems to be about 6 months old.

Jeff

From: Hugh Perkins
Bystander observation: I would think that it would be better to choose specifically which GPU to use, rather than to choose how many GPUs to use.
|
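For anyone who wants to try the CPU_MAX_COMPUTE_UNITS suggestion without changing their shell profile, here is a minimal sketch that sets the variable for a single child process only. The helper names are hypothetical, and whether the variable actually hides the CPU device depends on the AMD runtime, as noted above.

```python
import os
import subprocess


def env_without_cpu_device(base_env=None):
    """Copy an environment and set CPU_MAX_COMPUTE_UNITS=0 in the copy."""
    env = dict(os.environ if base_env is None else base_env)
    env["CPU_MAX_COMPUTE_UNITS"] = "0"
    return env


def run_with_cpu_hidden(command):
    """Run a command (a list of arguments) with the modified environment."""
    return subprocess.run(
        command,
        env=env_without_cpu_device(),
        capture_output=True,
        text=True,
    )
```

For example, `run_with_cpu_hidden(["clinfo"]).stdout` could be compared against the normal clinfo output to see whether the device count changes. The calling process's own environment is left untouched.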
I also have this issue. |
OK guys, hold on. We will look into this soon.
|
Could this be related to some failing tests in clBLAS? |
Well, this time I made sure clBLAS had passed test-functional and test-short before I tried installing caffe, but I just tried running some of those tests again and they don't work anymore. I'm honestly not really sure if this means caffe is breaking clBLAS or if there is some user error on my part here.

Initialize OpenCL and clblas...

I have another system with a more minimal installation of clBLAS and caffe. I'm going to switch to that and see if clBLAS is still working there. |
The clBLAS error you have there from test-functional has been fixed in the latest develop branch (PR #214) by Timmy Liu. It breaks further on, but the issue is much the same: the method returns instead of throwing an error or something like that. |
Right, I think I was trying to test the wrong version of clBLAS on that system. Where I am sitting now, I only have the current develop version, and I get this output:

./test-functional
[----------] 136 tests from QUEUE (67714 ms total)
[----------] Global test environment tear-down

I don't recall any of these failing the first time I ran this test. I'm trying to reinstall clBLAS and now I'm getting this issue:

Linking Fortran executable ../staging/test-correctness

So, I think what I have now is similar to these issues:

I remember the clBLAS test-functional and test-short worked before I installed caffe, but I installed more BLAS libraries (ATLAS, for example) between when the clBLAS tests worked and when I did the runtest for caffe. This clBLAS issue seems to be caused by a conflict between different BLAS libraries, so I think getting those dependencies together for caffe after installing clBLAS might have introduced some sort of conflict.

Right now my plan is to go into the clBLAS cmake files and try to make sure it's getting the same libblas.so it originally used when installed the first time. The fact that introducing new BLAS libraries after installing clBLAS seems to have broken it retroactively spells trouble, though.

I think what really needs to be done is to minimize the number of different BLAS libraries needed to install OpenCL-caffe and its dependencies. Given that OpenCL caffe needs a BLAS external to clBLAS, I suppose I should have tried to use the same one I used to install clBLAS, and then maybe this could have been avoided. |
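One way to see which BLAS libraries a binary resolves at load time is to look at its `ldd` output. The sketch below only parses text in the usual `name => path (address)` form; the sample lines, library names, and paths are hypothetical and not taken from this thread, and the helper name `blas_libraries` is made up for illustration.

```python
import re

# Hypothetical `ldd` output; names and paths are assumptions for illustration.
SAMPLE_LDD = """\
\tlibclBLAS.so.2 => /usr/local/lib/libclBLAS.so.2 (0x00007f0000000000)
\tlibblas.so.3 => /usr/lib/libblas.so.3 (0x00007f0000100000)
\tlibc.so.6 => /lib/x86_64-linux-gnu/libc.so.6 (0x00007f0000200000)
"""


def blas_libraries(ldd_text):
    """Map each BLAS-like library name to the path the loader resolved."""
    found = {}
    for line in ldd_text.splitlines():
        match = re.match(r"\s*(\S+)\s+=>\s+(\S+)", line)
        if match and re.search(r"blas|atlas|mkl", match.group(1), re.IGNORECASE):
            found[match.group(1)] = match.group(2)
    return found


if __name__ == "__main__":
    for name, path in blas_libraries(SAMPLE_LDD).items():
        print(name, "->", path)
```

Running this on the real `ldd` output of the clBLAS tests before and after installing another BLAS might show whether the resolved libblas path changed, which is the kind of conflict suspected above.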
Same thing here. R9 270X with Xeon 1241 |
Just installed Ubuntu 15.10 |
There is also a user, @doonny, in the original caffe issue who has the same problem running on a W9100. |
I'm having a similar issue running on a W9100. I'm able to run the built-in lenet training script but am unable to run anything like 'caffe train -solver etc' |
Uhh? Maybe fix? No? |
Is your setup still not working? |
I haven't tested since last time. I don't think anything changed. |
Thanks for letting us know about this issue. The past two weeks were my holidays.
Junli
On Fri, Feb 12, 2016 at 12:02 AM, sliterok wrote:
Junli Gu--谷俊丽 |
I managed to get past the following errors:
My solution might be way too hacky, but it works.

Context:
I installed blas using:
I observed that new directory -
I opened the

Please confirm if this is reproducible. I would like to make my first ever PR :)

PS: I do not understand the code inside out. I have no idea why there are different signatures for different platforms. That's why I described the solution as a hack.

Regards, |
Bump? |
I have a problem with make runtest failing on SGDSolver and NesterovSolver. I looked at the git repository of BVLC/caffe (BVLC/caffe#3109), where somebody referred to a problem coming from the same file, test_gradient_based_solver.cpp. In the comments, people wrote that it was caused either by multiple GPUs being present in the system or by the fact that Intel MKL's floating-point operations (such as matrix multiplication) are non-deterministic by default.
Regarding my system, I am running Caffe cloned from GitHub on the 22nd of December 2015, on Ubuntu 15.10 with a Radeon R9 290 (4GB), an i7-4770K CPU @ 3.50GHz, and AMDAPPSDK-3.0. Four tests failed.
If anybody knows how to make them pass or what causes the problem it would be great.
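On the non-determinism point: floating-point addition is not associative, so a library that sums the same numbers in a different order can return slightly different results. The snippet below is a generic Python illustration of that effect, not code from Caffe, clBLAS, or MKL; the function name and the tolerance are assumptions chosen for the example.

```python
import math


def sum_in_order(values):
    """Add the values one by one, in the order given."""
    total = 0.0
    for value in values:
        total += value
    return total


values = [0.1, 0.2, 0.3]
forward = sum_in_order(values)             # (0.1 + 0.2) + 0.3
backward = sum_in_order(reversed(values))  # (0.3 + 0.2) + 0.1

# The two sums differ in the last bits but agree within a small tolerance.
exactly_equal = forward == backward
close_enough = math.isclose(forward, backward, rel_tol=1e-9)

if __name__ == "__main__":
    print(forward, backward, exactly_equal, close_enough)
```

A test that compares results exactly, or with a very tight tolerance, could therefore fail when the order of operations changes between runs or devices, which may be part of what the linked Caffe issue describes.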