-
Notifications
You must be signed in to change notification settings - Fork 48
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OpenCL verbose output and documentation, improved auto-tuning scripts, minor fixes after #419 #425
Conversation
…ing scripts. Minor fixes after cp2k#419. * Introduced (runtime-)verbosity level. Print device name (non-zero verbosity). * Fixed issue (cp2k#419 (comment)). * Renamed ACC_OPENCL_VERBOSE to ACC_OPENCL_DEBUG. * ACC benchmark drivers: inform if no device was found. * Improved documentation and documented ACC_OPENCL_VERBOSE. * Introduced verbose output (time needed for kernel compilation, etc). * tune_multiply.py: option to only rely on primary objective. * tune_multiply.py: catch CTRL-C and save configuration. * tune_multiply.sh: relay result code of failing script. * tune_multiply.sh: continuation with wrapper script.
…(accommodate changes from cp2k#419).
…ze in a parallel region which makes this code ineffective.
There will be one more change for this PR, which disables ACC_OPENCL_THREADLOCAL_CONTEXT feature in OpenCL backend. |
I will also try enabling runtime tests for the OpenCL backend on Daint-CI. |
@haampie as a carry-forward from #419, it seems like we have some (unwanted) debug output like:
... which may cause tests to time-out. I have not noticed the "OK m x n x k" output before. |
For some reason, Daint-CI cannot run a test in 900s. @hfp could you try to increase the time limit (1200s?) ? |
Ah, good spot! |
Let me try fixing it as part of this PR. |
It seems to come from tests/libsmm_acc_unittest_multiply.cpp.template, i.e., subsequent calls contained there like |
Not sure if relevant, but these particular tests are executed without the |
Well, no idea.. But I would suggest commenting the output (leave it only for Debug) if it makes the test running faster... Daint-CI has a limited budget for us... |
Really? This is a GPU test, cannot run on the frontend node... We provide srun via cmake, i.e.:
|
Yeah, I was aware, earlier this morning I learned that |
Given that the Maybe add a wrapper for |
Codecov Report
@@ Coverage Diff @@
## develop #425 +/- ##
=======================================
Coverage 63.1% 63.1%
=======================================
Files 86 86
Lines 25625 25625
=======================================
Hits 16190 16190
Misses 9435 9435
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
If my "hack" (reducing output) goes through with Daint-CI, I will leave this to another PR. |
sure, I just opened #427 to track it |
I wonder why the test cannot complete in 900s anymore. This is the last week output: https://object.cscs.ch/v1/AUTH_40b5d92b316940098ceb15cf46fb815e/dbcsr-artifacts/logs/build-679/ The infamous test complete in 629.62s. |
Perhaps a diff between the build-logs can help if this is caused by compiler flags or such. |
Well the only changes since last week and now are due to the HIP PR... |
I just had a look at |
This reverts commit 9598017.
Wait, libnvToolsExt is used? This is the profiler API that can explain the reason for the slowness... Update: the macro __CUDA_PROFILING is not set... it needs a more appropriate analysis... |
I experimented a bit and maybe CUDA MPS is related. At least |
Only CodeCov "fails", i.e., OpenCL runtime tests are now passing with one rank only (see here for some early reasoning). It seems Daint runs the GPUs in "exclusive" mode which seems to only permit multiple ranks via MPS, i.e., creating a second OpenCL context on the same device (single node!) always fails like "device not available". ( As a side-note, it seems running single-ranks omits some of the tests perhaps of too much CMake magic? ) |
wrt CodeCov: the total coverage consists of different runs/uploads (with, w/o mpi; with w/o omp), and the PR entry often happens too early. We could tune it to wait for all uploads before posting or create new posts instead of partially updating the existing one. wrt single-ranks: there is no extra CMake magic involved when it comes to deciding when to run tests, but the following determines the number of ranks at build time:
As mentioned in #427, the |
Thank you for following up, Tiziano!
I wonder if we should drop CodeCov? I think it produces mostly noise (nothing actionable). Primarily, we are happy if the project receives a contribution, secondly, we only take PRs and then review/guide the code and format. I believe this process would never pass something unreasonable into the code base. Overall I have not seen anyone of us judging a contribution by the percentage of coverage. With the current code base, it seems a coverage of ~60% is intrinsic to the project or code base. Perhaps it is even possible to just calculate the percentage (but no threshold making things red and sending emails).
I just noticed that the OpenCL runtime test not only passes (see above for the root cause), but unfortunately ran less tests (19) compared to other (23) runtime tests. The only difference seems to be the rank count. I understand that some tests are currently not running per |
# Conflicts: # .ci/daint.cscs.ch/ocl.build.sh # .ci/daint.cscs.ch/ocl.test.sh
@hfp I assume it is ready for review, I will do it today...
We should keep it, actually I plan to introduce new tests (long project) and add the coverage for the GPU. I like @dev-zero idea...
Should we open a ticket with CSCS? I can try on some other machines, I wonder what's the rationale behind that... I can assume not so many people are using OpenCL.
Interesting... I actually see the same number of tests... |
The "sending emails" part is done automatically by GH when the CodeCov-bot adds the comment. So, I would propose that I change it to wait for all reports before sending it first and changing the relative threshold to something which allows to add some uncovered fixes without error but when adding larger parts they have to be covered, and a lower bound of 60% because that's what we established now.
Are you sure it's not this line here: Line 216 in dcbc5f6
... which should probably be just |
Regarding multiple ranks on a single card, I guess this can be adjusted with the SMI utility and there is likely no performance regression associated with toggling it from exclusive to shared (and MPS can do whatever it does). However, this is reconfiguring Daint and may not be a realistic request. MPS on the other does not support OpenCL (as per some answer in some Nvidia forum). Another option could be to test multiple ranks using a single rank per node and multiple nodes, which should work. |
Thanks for finding, I overlooked this completely! I guess it should stay since these tests are specific to CUDA/HIP and the implementation auto-tuning for this backend. If that was written using the ACC interface it would have been easier, but on the other hand we can play with the OpenTuner based auto-tuning and perhaps it is fruitful to have something else. |
If this passes, I probably want to add an error message explaining a failure when creating the OpenCL context or at least hint the MPS/SMI exclusive thing. |
OK, please let me know when it is time to review... |
@alazzaro the PR is ready for review. |
@hfp For my understanding is the solution proposed here https://gerrit.gromacs.org/c/gromacs/+/5729/ I cannot find anything else on the topic (OpenCL with multiprocesses)... |
I found different stuff including to prefer |
OK, then let's keep a single rank per device and open another issue for further investigation, probably employing the intranode broadcast. It seems that this is a common issue, other codes (CENAERO and SPECFEM3D) have similar problems, but Gromas seems to have found a solution... |
I checked on a Daint/GPU node like
|
They did a broadcast initially and the fix was to discover devices on a per rank basis. We already do the right thing then and it seems to be all about the exclusive mode, but let's investigate over the course of today... |
LGTM, Daint-CI with Intel failed to submit the job. I've restarted the job. So, I suggest opening an issue for the OpenCL and multiple ranks per device, as we discussed (unless you have a very last-minute solution, I would keep the discussion for a different PR). I can have some naive questions (apologies in advance):
|
New tests on Daint-CI wenf all fine... |
I will merge asap.
ACK Thank you for the good questions! I will try answer them one by one (see below).
Of course, it's doable, but likely quite some effort since we never accounted for multiple backends.
Joost quite some time ago (obviously) confirmed SP at least for CPU/LIBXSMM (out of the box), but I cannot remember what/how he tested. Since a lot of code in CP2K is hard-coding DP, it was probably just plain/straight-forward QS test (maybe Water ;-). For SP in general, we probably want to gather workforce and identify if/where this is beneficial and what to enable first. This also touches enabling low/mixed-precision with SP as viable special case. DBCSR stand-alone "workloads" and all (unit-)tests already work with SP. On DBCSR's side, we can think if something else should offered like "internal type" w/ copy-in/out from higher precision, etc. Maybe like the feature about reduced MPI-traffic based on SP.
I have implemented SVM support already, i.e., LIBSMM can leverage calls to some other implementation like MKL. This might be similar to the CUDA/HIP/SMM_ACC (except they need this host-side stack; not sure if other GEMM-args are fed from the host or if this also relies on device-pointers like SVM). So far, this might be useful for (very) large kernels, but otherwise (CUDA/HIP) this is inefficient since the stack is processed on the host with all singular SMMs queued on the device-side. I would rather like to call
At the moment, I believe HIP/OpenCL should deliver the best insight with respect to what this backend is capable of. @haampie wanted to look at some AMD GPUs; perhaps I can help/do this as well (if accessible per my CSCS account). Otherwise (Nvidia devices), the gap using OpenCL vs CUDA is reaching up to 2x with current kernels. I will improve kernels, which are by no means sophisticated yet on the OpenCL side. To get to your question, you can send me an email; I have numbers for P100, V100, A100, and perhaps others. |
Please merge and thanks for the replies! |
OpenCL-BE/LIBSMM: verbose output and documentation. Improved auto-tuning scripts. Minor fixes after #419.