Fix UMAP issues with large inputs #6245
base: branch-25.04
Conversation
Looking good! Let's just make sure to IWYU for the new uses of uint64_t. Using C++ types (std::uint64_t) in our non-CUDA code would be a bonus, but it shouldn't block merge. I've also called out some spots where we could use uniform initialization syntax rather than a bare cast.
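For reference, a minimal sketch of the kind of change meant here (illustrative only, not taken from the PR):

// IWYU: include the header that defines the fixed-width types rather than
// relying on transitive includes.
#include <cstdint>

// In non-CUDA code, prefer the namespaced C++ spelling.
std::uint64_t to_process = 0;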
@viclafargue That last commit was unsigned. Could you sign it and push that up?
@viclafargue and I discussed this briefly last week, but given the nature of the 64-bit hardcoded changes here, I would like to see at least a small benchmark before this is merged so that we can feel comfortable that this doesn’t have a huge impact on the runtime.
/ok to test
/ok to test
/ok to test
/ok to test
Force-pushed from 288d129 to f7ce445
Looks pretty good. The one change that I'm concerned about is with the blocks calculation. Let me know if I'm thinking about that one wrong.
@@ -73,8 +73,8 @@ endfunction()
# To use a different CUVS locally, set the CMake variable
# CPM_cuvs_SOURCE=/path/to/local/cuvs
find_and_configure_cuvs(VERSION ${CUML_MIN_VERSION_cuvs}
                        FORK rapidsai
                        PINNED_TAG branch-${CUML_BRANCH_VERSION_cuvs}
                        FORK divyegala
Just noting that we'll need to make sure to switch these back once the associated PRs have been merged so we don't forget.
@@ -167,7 +168,7 @@ class TSNE_runner {
  {
    distance_and_perplexity();

    const auto NNZ = COO_Matrix.nnz;
    const auto NNZ = (value_idx)COO_Matrix.nnz;
Suggested change:
const auto NNZ = (value_idx)COO_Matrix.nnz;
const auto NNZ = value_idx(COO_Matrix.nnz);
Let's not let it prevent merge, but if we need to make further changes, it would be good to avoid raw casts.
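For context, the spellings in play (hypothetical snippet, not from the diff; assumes value_idx is narrower than the 64-bit nnz):

auto a = (value_idx)COO_Matrix.nnz;               // C-style cast (discouraged)
auto b = value_idx(COO_Matrix.nnz);               // functional-style cast (the suggestion above)
auto c = static_cast<value_idx>(COO_Matrix.nnz);  // explicit and easy to grep
// value_idx{COO_Matrix.nnz} would be rejected here: brace initialization
// disallows narrowing conversions, which is part of the appeal of uniform
// initialization mentioned earlier.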
@@ -328,7 +331,8 @@ void launcher(int n,
   * Compute graph of membership strengths
   */

  dim3 grid_elm(raft::ceildiv(n * n_neighbors, TPB_X), 1, 1);
  uint64_t to_process = static_cast<uint64_t>(in.n_rows) * n_neighbors;
  dim3 grid_elm(raft::ceildiv(to_process, TPB_X), 1, 1);
I believe there's a chance for an overflow here. If to_process is large enough, we'd get something that would overflow a 32-bit integer, right? We also probably want to limit this to the maximum number of blocks (65535) anyway.
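One sketch of how both concerns could be addressed, assuming the kernel is (or is made) grid-stride; the names mirror the diff above but this is not the PR's code:

// Keep the element count in 64 bits, cap the number of blocks, and let each
// thread loop over multiple elements so nothing overflows dim3's 32-bit fields.
uint64_t to_process = static_cast<uint64_t>(in.n_rows) * n_neighbors;
uint64_t blocks     = raft::ceildiv(to_process, static_cast<uint64_t>(TPB_X));
uint64_t max_blocks = 65535;  // cap mentioned above; the hardware x-dimension limit is larger
dim3 grid_elm(static_cast<unsigned int>(blocks < max_blocks ? blocks : max_blocks), 1, 1);

// Inside the kernel, a grid-stride loop covers the remaining elements:
// for (uint64_t i = static_cast<uint64_t>(blockIdx.x) * blockDim.x + threadIdx.x;
//      i < to_process;
//      i += static_cast<uint64_t>(gridDim.x) * blockDim.x) { ... }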
- &treelite treelite==4.4.1
- &treelite treelite==4.3.0
Why do we have to downgrade treelite?
This is a bad merge caused by a missing commit in a forward merger when switching the target branch from 25.02 to 25.04.
FORK divyegala
PINNED_TAG raft-sparse-updates
Reminder to revert this before merge.
FORK viclafargue
PINNED_TAG fix-sparse-utilities
Reminder to revert this before merge.
find_and_configure_treelite(VERSION 4.3.0
                            PINNED_TAG 575e4208f2b18e40d818c338ecb95d7a26e69aab
What is the motivation for the downgrade?
@@ -348,16 +350,15 @@ CUML_KERNEL void optimize_batch_kernel(T const* head_embedding,
 * @param rounding: Floating rounding factor used to truncate the gradient update for
 * deterministic result.
 */
template <typename T, int TPB_X>
template <typename T, uint64_t TPB_X>
TPB_X is hard-coded to 256, no? Does it make sense to switch its type to uint64_t?
raft::sparse::convert::sorted_coo_to_csr(&graph_coo, row_ind.data(), stream);
raft::sparse::linalg::coo_degree(&graph_coo, ia.data(), stream);
So we never needed the degree here in the first place?
@@ -252,14 +252,12 @@ void optimize_layout(T* head_embedding,

  T rounding = create_gradient_rounding_factor<T>(head, nnz, head_n, alpha, stream_view);

  MLCommon::FastIntDiv tail_n_fast(tail_n);
This and similar changes might lead to a performance regression.
Couldn't agree more. This is there for a reason. If we need to, we should allow a 64-bit int div instead of removing this altogether.
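To illustrate the suggestion, a sketch only; the actual MLCommon::FastIntDiv interface and its magic-number math are not reproduced here:

// FastIntDiv exists so the divisor's magic numbers are precomputed once on the
// host and '%' / '/' in the kernel avoid slow hardware integer division.
// Widening it would look roughly like this rather than deleting it:
template <typename idx_t>  // e.g. int today, uint64_t for large inputs
struct FastDivSketch {
  idx_t divisor;
  explicit FastDivSketch(idx_t d) : divisor(d)
  {
    // real implementation: precompute multiplier/shift for 'd' here
  }
  __device__ idx_t mod(idx_t x) const
  {
    return x % divisor;  // placeholder; the real code uses the precomputed magic numbers
  }
};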
#include <stdint.h>
This isn't needed as far as I can tell.
@@ -425,7 +427,7 @@ void _transform(const raft::handle_t& handle,
 * Compute graph of membership strengths
 */

  int nnz = inputs.n * params->n_neighbors;
  uint64_t nnz = uint64_t{inputs.n} * params->n_neighbors;
Please don't hardcode these types. Create an nnz_t.
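A sketch of the requested alias (the name and where it lives are assumptions):

// Defined once (e.g. in a shared UMAP header) so the width can be changed in one place:
using nnz_t = uint64_t;

nnz_t nnz = static_cast<nnz_t>(inputs.n) * params->n_neighbors;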
@@ -44,15 +46,16 @@ using namespace ML;
 */
template <typename T>
void launcher(const raft::handle_t& handle,
              int n,
              uint64_t n,
Use idx_t
@@ -59,7 +58,8 @@ using namespace ML;
 * @param stream cuda stream
 */
template <typename T>
void make_epochs_per_sample(T* weights, int weights_n, int n_epochs, T* result, cudaStream_t stream)
void make_epochs_per_sample(
  T* weights, uint64_t weights_n, int n_epochs, T* result, cudaStream_t stream)
Use an idx_t and switch that for uint64_t.
@@ -170,7 +169,7 @@ T create_rounding_factor(T max_abs, int n)

template <typename T>
T create_gradient_rounding_factor(
  const int* head, int nnz, int n_samples, T alpha, rmm::cuda_stream_view stream)
  const int* head, uint64_t nnz, int n_samples, T alpha, rmm::cuda_stream_view stream)
Create an nnz_t, please.
@@ -195,14 +194,14 @@ T create_gradient_rounding_factor(
 * positive weights (neighbors in the 1-skeleton) and repelling
 * negative weights (non-neighbors in the 1-skeleton).
 */
template <int TPB_X, typename T>
template <uint64_t TPB_X, typename T>
nnz_t
@@ -301,7 +299,7 @@ template <int TPB_X, typename T>
void launcher(
  int m, int n, raft::sparse::COO<T>* in, UMAPParams* params, T* embedding, cudaStream_t stream)
{
  int nnz = in->nnz;
  uint64_t nnz = in->nnz;
nnz_t
@@ -96,16 +98,15 @@ DI T truncate_gradient(T const rounding_factor, T const x)
  return (rounding_factor + x) - rounding_factor;
}

template <typename T, int TPB_X, int n_components>
template <typename T, uint64_t TPB_X, uint64_t n_components>
Was this really needed?
T const* tail_embedding,
T* tail_buffer,
const MLCommon::FastIntDiv tail_n,
Please don't remove FastIntDiv. Make this support uint64_t if needed.
const int* head,
const int* tail,
int nnz,
uint64_t nnz,
Use nnz_t
gen.next(r);
int t = r % tail_n;
uint64_t t = r % tail_n;
Define this in a template, please. This is perf critical.
T const* other = tail_embedding + (k * params.n_components);
uint64_t n_components = params.n_components;

uint64_t j = head[row];
Please define a template for this.
int t = r % tail_n;
T const* negative_sample = tail_embedding + (t * params.n_components);
dist_squared = rdist<T>(current, negative_sample, params.n_components);
uint64_t t = r % tail_n;
Template
Same issues as raft, but this is hardcoding types in perf-critical code. Please use templates; that allows us to quickly switch.
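Roughly the shape being asked for, as a sketch (the kernel name, parameter list, and nnz_t are illustrative, not the real code):

// Template the index/nnz type so 32-bit and 64-bit instantiations can be
// switched (and benchmarked) without editing the kernel body.
template <typename T, typename nnz_t, int TPB_X>
CUML_KERNEL void optimize_batch_kernel_sketch(T const* head_embedding,
                                              const int* head,
                                              nnz_t nnz /* , ... */)
{
  nnz_t row = static_cast<nnz_t>(blockIdx.x) * TPB_X + threadIdx.x;
  if (row >= nnz) return;
  nnz_t j = head[row];
  // ... gradient update ...
}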
Answers #6204