Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ENH: kernels for random.vonmisses; part 2 #681

Draft
wants to merge 17 commits into
base: master
Choose a base branch
from

Conversation

samir-nasibli
Copy link
Contributor

@samir-nasibli samir-nasibli commented Apr 14, 2021

Description

Enable computations on devices [CPU/GPU].

Tests

  • DPNP own:
tests/test_random.py::TestDistributionsVonmises::test_moments[large_kappa] PASSED
tests/test_random.py::TestDistributionsVonmises::test_moments[small_kappa] PASSED
tests/test_random.py::TestDistributionsVonmises::test_invalid_args PASSED
tests/test_random.py::TestDistributionsVonmises::test_seed[large_kappa] FAILED
tests/test_random.py::TestDistributionsVonmises::test_seed[small_kappa] FAILED
  • + numpy external

TODO

tests/test_random.py::TestDistributionsVonmises::test_seed failed on both devices. Bug.

@samir-nasibli samir-nasibli added the in progress Please do not merge. Work is in progress. label Apr 14, 2021
@samir-nasibli samir-nasibli removed the in progress Please do not merge. Work is in progress. label Apr 14, 2021
@samir-nasibli samir-nasibli added the in progress Please do not merge. Work is in progress. label Apr 17, 2021
@@ -1242,65 +1243,70 @@ void dpnp_rng_vonmises_large_kappa_c(void* result, const _DataType mu, const _Da

Uvec = reinterpret_cast<_DataType*>(dpnp_memory_alloc_c(size * sizeof(_DataType)));
Vvec = reinterpret_cast<_DataType*>(dpnp_memory_alloc_c(size * sizeof(_DataType)));

for (size_t n = 0; n < size;)
n = reinterpret_cast<size_t*>(dpnp_memory_alloc_c(sizeof(size_t)));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is quite strange (Make scalar as a array with one element).
I think it should be a scalar, not an array.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

scalar and this is the same. You just can not pass n to sycl region in other way.

Y = 0.0;
else if (Y > 1.0)
Y = 1.0;
n[0] = n[0] + 1;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a mistake. This is parallel environment (SYCL kernel). Writing inside the kernel into same memory cause https://en.wikipedia.org/wiki/Race_condition

V = Vvec[i];
sn2 = sn * sn;
cn2 = cn * cn;
auto paral_kernel_some = [&](cl::sycl::handler& cgh) {
Copy link
Contributor

@shssf shssf May 13, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Kernel inside the loop with bigger trip count. It would be more efficient to parallelize (make kernel) the algorithm by bigger value size instead size-n. So, it will require a loop inside the kernel.
It is questionable what will be more performant

  1. loop with a kernels queue (data dependent)
  2. kernel with a loop

It is hard to predict it with no perf measurements but I would vote that parallelization with bigger number of threads should be better.

@shssf
Copy link
Contributor

shssf commented Jun 22, 2021

@samir-nasibli Is this PR ready to review or still in development stage?

@samir-nasibli
Copy link
Contributor Author

@samir-nasibli Is this PR ready to review or still in development stage?

I will update this PR or move some part of this changes to another PR with closing this.

@samir-nasibli samir-nasibli changed the title ENH: kernels for random.vonmisses ENH: kernels for random.vonmisses; part 2 Jul 12, 2021
Comment on lines 1314 to 1315
*n = *n + 1;
result1[*n] = cl::sycl::asin(cl::sycl::sqrt(Y));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like here we are getting race condition, that is why we are getting wrong results. To prevent it we should calculate n (index of result) from i.

Comment on lines 1400 to 1401
*n = *n + 1;
result1[*n] = cl::sycl::acos(W);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like here we are getting race condition, that is why we are getting wrong results. To prevent it we should calculate n (index of result) from i.

* Fix race condition in dpnp_rng_vonmises_small_kappa_c and dpnp_rng_vonmises_large_kappa_c

* Rename arrays and change if condition from kernels in dpnp_rng_vonmises_large_kappa_c and dpnp_rng_vonmises_small_kappa_c

* Add space

* Fix indices in dpnp_rng_vonmises_small_kappa_c and dpnp_rng_vonmises_large_kappa_c
@samir-nasibli
Copy link
Contributor Author

samir-nasibli commented Oct 7, 2021

@LukichevaPolina
The use of extra memory with the amount of data is not a good practice in optimization. We must avoid this cases.
We have to remove the possibilities for a potential race condition in the algorithm.

@densmirn
Copy link
Contributor

densmirn commented Oct 7, 2021

The use of extra memory with the amount of data is not a good practice in optimization. We must avoid this cases. - ideas?
We have to remove the possibilities for a potential race condition in the algorithm. - done

@samir-nasibli
Copy link
Contributor Author

The use of extra memory with the amount of data is not a good practice in optimization. We must avoid this cases. - ideas?
We have to remove the possibilities for a potential race condition in the algorithm. - done

  1. Any optimization with the use of additional memory can actually degrade (depending on the input data) and underestimate all the benefits from parallelism. Allocation/Deallocation/Working with additional memory is expensive.

  2. We have to remove the possibilities for a potential race condition in the algorithm. - done

using extra mem is brute force approach.

  1. - ideas?

We need to investigate it.

  1. I also recommend to use perf tests and profiler tools during optimization. Comparative analysis is important in such work.

@oleksandr-pavlyk
Copy link
Contributor

Stale PR?

@antonwolfy antonwolfy marked this pull request as draft February 20, 2024 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
in progress Please do not merge. Work is in progress.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants