feat: support indexing parray by ndarray #89
Conversation
Other than efficiency lgtm,
""" | ||
if isinstance(value, PArray): | ||
value = value.array | ||
|
||
if isinstance(slices, numpy.ndarray) or isinstance(slices, cupy.ndarray): | ||
slices = slices.tolist() |
Isn't this inefficient? Consider:
- numpy supports indexing by both a list and an np.ndarray of int32/int64/uint32/uint64 type.
- cupy supports indexing by a list, and by both cp.ndarray and np.ndarray of int32/int64/uint32/uint64 type.
Couldn't this lead to a lot of device -> host copies from the crosspy calls?
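To illustrate the point above, here is a minimal numpy-only sketch (array contents are made up for the example): fancy indexing accepts the integer ndarray directly, so the `tolist()` conversion is an extra pass over the indices that indexing itself does not need.

```python
import numpy as np

arr = np.arange(10) * 10
idx = np.array([3, 1, 7], dtype=np.int64)

# numpy fancy indexing accepts the integer ndarray as-is;
# converting to a Python list first gives the same result
# but materializes every index as a Python int.
direct = arr[idx]
via_list = arr[idx.tolist()]
assert (direct == via_list).all()
print(direct.tolist())  # [30, 10, 70]
```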
This can only be avoided by the caller, since this information has to be saved on the CPU side: it will later be converted and hashed so it can be recorded for further use within PArray. (So indexing by a cupy array is never efficient, because PArray's internal data structures still live on the CPU.)
And I think crosspy should always use a numpy array as the index; what is the benefit of using a cupy array as the index? @bozhiyou
> And I think crosspy should always use numpy array as index

No, indices can be as large as the array itself - think of the permutation example `arr[shuffle(arange(len(arr)))]` - and even larger. If you are indexing a cupy array, the index array has to be copied to the GPU as a cupy array.
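For concreteness, a numpy-only sketch of the permutation example mentioned above (the cupy case is analogous, except the permutation array would have to live on the device):

```python
import numpy as np

rng = np.random.default_rng(0)
arr = np.arange(8) * 2
perm = rng.permutation(len(arr))  # the index array is as large as arr itself

shuffled = arr[perm]
# A permutation index reorders the elements without dropping any.
assert sorted(shuffled.tolist()) == arr.tolist()
```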
I agree with Will that a list will be inefficient. Storing all the information on the CPU seems inefficient as well.
That makes sense, but currently there is no efficient way to deal with this, such as simply storing the info on the GPU or as an ndarray. (In fact, all lists are converted to hash maps in a later step, since there is no guarantee the slice mapping is sequential when dealing with local indices.) And each array access needs to query this information (e.g. comparing stored slices with the given slices). Consider the following scenarios:
- Data is initially stored on the CPU and then moved to a GPU, but the user doesn't know which device it is on. Should the user index with a cp.ndarray or an np.ndarray?
- Same as above, but now in a multi-device task (like where crosspy runs). Which slices should the user use to index into it? (And on which device should they live if cupy.ndarray is used?)
- On which device should the cupy ndarray be stored inside PArray?
  (a) If we put it on the same device as the data, then when there are multiple copies on different devices, each device holds only part of the slicing information. Every access, regardless of where the data is, has to reach all devices and compare the locally saved slice information, which will be really slow.
  (b) If we put it on the first GPU, operations on copies held by other devices have to query that GPU, which is no better than querying data on the CPU: NVLink is not guaranteed to exist between all devices, so the request may be routed via the CPU anyway. This also raises memory issues when the slices are large (you mentioned they can be as large as the array itself): GPU memory is more limited than CPU memory, and the runtime would have to track the slice size, otherwise new data moved there will OOM. It also causes imbalanced workload and memory usage across devices.
- When data is moved, should we move the slice mapping along with it? Will that increase the data-movement overhead?

Until we find a good solution to the above issues, the CPU is still the best place to store the slices. (But I could make it a numpy array instead of a Python list.)
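A minimal sketch of the "numpy array instead of a Python list" idea: the index array can be kept as an ndarray on the CPU and turned into a hashable key for the hash-map step without ever materializing a Python list. The `index_key` helper below is hypothetical, not part of PArray.

```python
import numpy as np

def index_key(idx):
    # Hypothetical helper: normalize an index array to a hashable,
    # CPU-side key without converting it to a Python list.
    a = np.ascontiguousarray(np.asarray(idx, dtype=np.int64))
    return a.tobytes()

slice_records = {}
idx = np.array([2, 0, 1])
slice_records[index_key(idx)] = "mapping for this slice"

# The same indices, arriving later as a fresh array, hit the same record.
assert slice_records[index_key(np.array([2, 0, 1]))] == "mapping for this slice"
```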
Also fixed a minor bug in parray.update()