DARTS scheduler assert failure #40
I have broken this problem down to a simple example:
Creating
An even simpler example where DARTS fails with asserts:
@MaximeGonthier: could you have a look? The problem is not calling |
(@Muxas' testcases can be really interesting for DARTS, and most StarPU testsuite failures are currently due to this issue)
Thank you for the examples. I'll work on it as soon as possible. I'll keep you updated.
@Muxas The |
Trying the latest commit (882aba682cec925bf6bd226c210641bd80b0795d) shows there is still a problem. I tried this example:
Here is a config.log. The error itself:
Full backtrace:
Ah, thanks! I'm working on it.
@MaximeGonthier does DARTS pay attention to |
@MaximeGonthier probably the case that DARTS was never tried against was |
DARTS uses STARPU_TASK_GET_MODE only to ignore data of type STARPU_SCRATCH and STARPU_REDUX, nothing else.
Yes, OK, I see. I'll work on that, thanks.
See the backtrace: DARTS is calling |
Yes, I figured out just now that
Yes, I agree. Do you think I should check |
It depends what you are doing with the data. If e.g. you record where a piece of data will be (to tend to put tasks that will read it there), you'll still want to see that even if the mode is |
@Muxas the issue is fixed in the last commit from master (eedd54ac). The simple example presented above now works.
bb2b4a18 is a better version of the fix.
I checked it on my side. The examples provided above in this issue work, but my software still fails with assert errors. I tried the master branch of the GitLab remote, commit bb2b4a186b9c612aac499f6e9fcdf93f8c906d76. Here is config.log. The error:
Backtrace:
Is there a larger working example of your software, or can I easily try your software directly?
Well, I was going to create a |
The software is NNTile. You can try to follow the instructions in its Dockerfile to build it. After NNTile is built, I will provide an example script to run it.
Alternatively, you can follow the README of NNTile:
@MaximeGonthier for a start you can run |
Hi, sorry for leaving your issue pending, @Muxas. I was pretty busy the last few months but have more time now :) Would you be able to try your code with the latest commit from StarPU? In the meantime I'll also try to set up your software on our cluster to test it with DARTS. Thank you
Hi! I tried the latest commit 23fd79193ceccd1ad5223e510110d3e320ebf160 of https://gitlab.inria.fr/starpu/starpu and get the same error:
The fact that your fixes did not solve the problem may sound depressing, but the first stage (forward pass) of training NNs now works! During the first phase the neural network generates lots of temporary data that is used only in the second stage (backward pass).
That is good news indeed :)
Do you have the trace/full error message? I would like to check if this is also caused by |
@MaximeGonthier the same kind of issue is visible with the
It indeed seems doubtful to call |
I agree with not associating the task with these data; however, it did not solve the issue. The list of data we are looping over is
So maybe what's happening is that the data is used in RW mode, the tasks reading it have completed, and we are left only with tasks using it in W mode; thus we don't need to check this data at all, and that would probably fix the issue.
Hi @Muxas @sthibaul, I've pushed a fix in the latest commit (530b50da) that allows DARTS to pass the tests from
Hopefully it should also fix the |
@MaximeGonthier Hi! My first impression is that the error is gone. Now the second phase (the so-called backward pass) of training a neural network also works! But the last phase (updating the weights of the neural network) returns an error like:
It happens at the end of execution, during cleanup. Here is a backtrace:
Running it another time I get:
It seems that DARTS sets something to NULL and the NULL value is then freed. Here is a backtrace:
Hi, DARTS is not yet very good at balancing work between GPUs and CPUs, and is thus often used in homogeneous settings. Disabling the CPUs and OpenCL gives you that. You can use STARPU_NCUDA to adjust how many GPUs you will be using.
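A homogeneous run as suggested above can be set up from the shell. STARPU_SCHED, STARPU_NCPU, STARPU_NOPENCL, and STARPU_NCUDA are real StarPU environment variables; `./my_app` is a placeholder for your own StarPU program:

```shell
# Select the DARTS scheduler and restrict execution to 4 CUDA GPUs,
# disabling CPU and OpenCL workers for a homogeneous setup.
STARPU_SCHED=darts \
STARPU_NCPU=0 \
STARPU_NOPENCL=0 \
STARPU_NCUDA=4 \
./my_app
```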
Fully disabling CPU workers is not possible yet, as the data-preparation code is not yet ported to GPU. However, once the data is set up, the training loop only uses GPUs while the CPU workers do nothing.
This setup did not solve the problem with 4 GPUs: forward passes reach 6 Tflop/s (up from 5 Tflop/s) and backward passes stall.
Steps to reproduce
I am trying to use the DARTS scheduler from the latest master branch (commit 4131e05d441f6aa3004632c61e982c63f2496cb9 on GitLab) and get the following error:
Full backtrace and config.log are here
At the same time, other schedulers, e.g., DMDASD, work without a problem.