Fix UMAP issues with large inputs #6245

Open

viclafargue wants to merge 24 commits into branch-25.04 from fix-umap-large-inputs

Conversation

viclafargue (Contributor)

Answers #6204

@wphicks (Contributor) left a comment:

Looking good! Let's just make sure to IWYU for the new uses of uint64_t. Using C++ types (std::uint64_t) in our non-CUDA code would be a bonus, but it shouldn't block merge. I've also called out some spots where we could use uniform initialization syntax rather than a bare cast.
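
For illustration only (not part of the PR diff; the function and variable names below are hypothetical), the pattern being suggested looks roughly like this:

#include <cstdint>  // IWYU: declares std::uint64_t for non-CUDA translation units

// Hypothetical sketch of the casting style discussed above.
std::uint64_t compute_nnz(int n_rows, int n_neighbors)
{
  // Bare C-style cast (what the review asks to avoid):
  //   auto nnz = (std::uint64_t)n_rows * n_neighbors;

  // Uniform initialization makes the compiler diagnose narrowing, e.g.
  // std::uint64_t{n_rows} is flagged because a negative int cannot be
  // represented; an explicit static_cast / functional cast is searchable
  // and keeps the widening intent obvious:
  return static_cast<std::uint64_t>(n_rows) * n_neighbors;
}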

Review comments were left on:
cpp/include/cuml/common/callback.hpp
cpp/include/cuml/manifold/common.hpp
cpp/src/umap/fuzzy_simpl_set/naive.cuh
cpp/src/tsne/tsne_runner.cuh
cpp/src/umap/knn_graph/algo.cuh
cpp/src/umap/simpl_set_embed/algo.cuh

copy-pr-bot bot commented Jan 23, 2025

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@wphicks (Contributor) commented Jan 23, 2025:

@viclafargue That last commit was unsigned. Could you sign it and push that up?

@cjnolet (Member) left a comment:

@viclafargue and I discussed this briefly last week, but given the nature of the 64-bit hardcoded changes here, I would like to see at least a small benchmark before this is merged so that we can feel comfortable that this doesn’t have a huge impact on the runtime.

@divyegala requested a review from a team as a code owner February 3, 2025 18:42
@divyegala requested a review from vyasr February 3, 2025 18:42
@github-actions bot added the CMake label Feb 3, 2025
@divyegala changed the base branch from branch-25.02 to branch-25.04 February 3, 2025 18:43
@divyegala requested a review from a team as a code owner February 3, 2025 18:43
@divyegala (Member): /ok to test

@divyegala (Member): /ok to test

@divyegala (Member): /ok to test

@divyegala (Member): /ok to test

@viclafargue requested review from a team as code owners February 5, 2025 18:30
@github-actions bot added the Cython / Python label Feb 5, 2025
@viclafargue force-pushed the fix-umap-large-inputs branch from 288d129 to f7ce445 on February 5, 2025 18:43
@github-actions bot added the conda label Feb 5, 2025
@wphicks (Contributor) left a comment:

Looks pretty good. The one change I'm concerned about is the blocks calculation. Let me know if I'm thinking about that one wrong.

@@ -73,8 +73,8 @@ endfunction()
# To use a different CUVS locally, set the CMake variable
# CPM_cuvs_SOURCE=/path/to/local/cuvs
find_and_configure_cuvs(VERSION ${CUML_MIN_VERSION_cuvs}
FORK rapidsai
PINNED_TAG branch-${CUML_BRANCH_VERSION_cuvs}
FORK divyegala
Contributor:

Just noting that we'll need to make sure to switch these back once the associated PRs have been merged so we don't forget.

@@ -167,7 +168,7 @@ class TSNE_runner {
{
distance_and_perplexity();

const auto NNZ = COO_Matrix.nnz;
const auto NNZ = (value_idx)COO_Matrix.nnz;
Contributor:

Suggested change
const auto NNZ = (value_idx)COO_Matrix.nnz;
const auto NNZ = value_idx(COO_Matrix.nnz);

Let's not let it prevent merge, but if we need to make further changes, it would be good to avoid raw casts.

@@ -328,7 +331,8 @@ void launcher(int n,
* Compute graph of membership strengths
*/

dim3 grid_elm(raft::ceildiv(n * n_neighbors, TPB_X), 1, 1);
uint64_t to_process = static_cast<uint64_t>(in.n_rows) * n_neighbors;
dim3 grid_elm(raft::ceildiv(to_process, TPB_X), 1, 1);
Contributor:

I believe there's a chance for an overflow here. If to_process is large enough, we'd get something that would overflow a 32-bit integer, right? We also probably want to limit this to the maximum number of blocks (65535) anyway.
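
For illustration only (not code from this PR): one way to address both concerns is to cap the grid size and cover the remainder with a grid-stride loop. The names in.n_rows, n_neighbors, and TPB_X mirror the diff above; MAX_BLOCKS and the loop itself are assumptions.

// Sketch (requires <algorithm> and <cstdint>): cap the grid at a safe
// block count; each thread then covers multiple elements in the kernel.
constexpr std::uint64_t TPB_X      = 256;
constexpr std::uint64_t MAX_BLOCKS = 65535;

std::uint64_t to_process = static_cast<std::uint64_t>(in.n_rows) * n_neighbors;
std::uint64_t n_blocks   = std::min(raft::ceildiv(to_process, TPB_X), MAX_BLOCKS);
dim3 grid_elm(static_cast<unsigned int>(n_blocks), 1, 1);

// Inside the kernel, a grid-stride loop keeps every element covered:
//   for (std::uint64_t i = std::uint64_t{blockIdx.x} * blockDim.x + threadIdx.x;
//        i < to_process;
//        i += std::uint64_t{gridDim.x} * blockDim.x) { ... }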

- &treelite treelite==4.4.1
- &treelite treelite==4.3.0
Contributor:

Why do we have to downgrade treelite?

Member:

This is a bad merge from a missing commit in the forward merger when switching the target branch from 25.02 to 25.04.

Comment on lines +76 to +77
FORK divyegala
PINNED_TAG raft-sparse-updates
Contributor:

Reminder to revert this before merge.

Comment on lines +75 to +76
FORK viclafargue
PINNED_TAG fix-sparse-utilities
Contributor:

Reminder to revert this before merge.

Comment on lines +82 to +83
find_and_configure_treelite(VERSION 4.3.0
PINNED_TAG 575e4208f2b18e40d818c338ecb95d7a26e69aab
Contributor:

What is the motivation for the downgrade?

@@ -348,16 +350,15 @@ CUML_KERNEL void optimize_batch_kernel(T const* head_embedding,
* @param rounding: Floating rounding factor used to truncate the gradient update for
* deterministic result.
*/
template <typename T, int TPB_X>
template <typename T, uint64_t TPB_X>
Contributor:

TPB_X is hard-coded to 256, no? Does it make sense to switch its type to uint64_t?


raft::sparse::convert::sorted_coo_to_csr(&graph_coo, row_ind.data(), stream);
raft::sparse::linalg::coo_degree(&graph_coo, ia.data(), stream);
Contributor:

So we never needed the degree here in the first place?

@@ -252,14 +252,12 @@ void optimize_layout(T* head_embedding,

T rounding = create_gradient_rounding_factor<T>(head, nnz, head_n, alpha, stream_view);

MLCommon::FastIntDiv tail_n_fast(tail_n);
Contributor:

This and similar changes might lead to a performance regression.

Member:

Couldn't agree more. This is there for a reason. If we need to, we should allow a 64-bit int div instead of removing this altogether.
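
For illustration (hypothetical names, not code from this PR): one way to keep the fast-division path is to make the divisor a template parameter, so the existing MLCommon::FastIntDiv path stays in place and a plain 64-bit divisor is only used when the input is too large for it.

// Sketch: accept either MLCommon::FastIntDiv (current fast path) or a
// plain 64-bit integer divisor through a template parameter.
template <typename index_t, typename DivT>
__host__ __device__ inline index_t wrap_index(index_t r, DivT tail_n)
{
  // With FastIntDiv this resolves to the precomputed fast-division path,
  // exactly as the existing `r % tail_n` in the kernel does today; with a
  // uint64_t divisor it falls back to an ordinary 64-bit modulo.
  return static_cast<index_t>(r % tail_n);
}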

Comment on lines +25 to +26
#include <stdint.h>

Contributor:

This isn't needed as far as I can tell.

@@ -425,7 +427,7 @@ void _transform(const raft::handle_t& handle,
* Compute graph of membership strengths
*/

int nnz = inputs.n * params->n_neighbors;
uint64_t nnz = uint64_t{inputs.n} * params->n_neighbors;
Member:

Please don't hardcode these types. Create an nnz_t.
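
A minimal sketch of that suggestion (the alias name and its location are assumptions, not part of this PR):

#include <cstdint>

// Hypothetical: define the alias once in a shared header (for example
// cpp/include/cuml/manifold/common.hpp) instead of spelling out uint64_t
// at every call site.
using nnz_t = std::uint64_t;  // size type for non-zeros in the membership-strength graph

// Call sites then read:
nnz_t nnz = static_cast<nnz_t>(inputs.n) * params->n_neighbors;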

@@ -44,15 +46,16 @@ using namespace ML;
*/
template <typename T>
void launcher(const raft::handle_t& handle,
int n,
uint64_t n,
Member:

Use idx_t

@@ -59,7 +58,8 @@ using namespace ML;
* @param stream cuda stream
*/
template <typename T>
void make_epochs_per_sample(T* weights, int weights_n, int n_epochs, T* result, cudaStream_t stream)
void make_epochs_per_sample(
T* weights, uint64_t weights_n, int n_epochs, T* result, cudaStream_t stream)
Member:

Use an idx_t and switch that for uint64_t

@@ -170,7 +169,7 @@ T create_rounding_factor(T max_abs, int n)

template <typename T>
T create_gradient_rounding_factor(
const int* head, int nnz, int n_samples, T alpha, rmm::cuda_stream_view stream)
const int* head, uint64_t nnz, int n_samples, T alpha, rmm::cuda_stream_view stream)
Member:

Create an nnz_t, please.

@@ -195,14 +194,14 @@ T create_gradient_rounding_factor(
* positive weights (neighbors in the 1-skeleton) and repelling
* negative weights (non-neighbors in the 1-skeleton).
*/
template <int TPB_X, typename T>
template <uint64_t TPB_X, typename T>
Member:

nnz_t

@@ -301,7 +299,7 @@ template <int TPB_X, typename T>
void launcher(
int m, int n, raft::sparse::COO<T>* in, UMAPParams* params, T* embedding, cudaStream_t stream)
{
int nnz = in->nnz;
uint64_t nnz = in->nnz;
Member:

nnz_t

@@ -96,16 +98,15 @@ DI T truncate_gradient(T const rounding_factor, T const x)
return (rounding_factor + x) - rounding_factor;
}

template <typename T, int TPB_X, int n_components>
template <typename T, uint64_t TPB_X, uint64_t n_components>
Member:

Was this really needed?

T const* tail_embedding,
T* tail_buffer,
const MLCommon::FastIntDiv tail_n,
Member:

Please don't remove FastIntDiv. Make this support uint64_t if needed.

const int* head,
const int* tail,
int nnz,
uint64_t nnz,
Member:

Use nnz_t

gen.next(r);
int t = r % tail_n;
uint64_t t = r % tail_n;
@cjnolet (Member), Feb 5, 2025:

Define this in a template, please. This is perf-critical.

T const* other = tail_embedding + (k * params.n_components);
uint64_t n_components = params.n_components;

uint64_t j = head[row];
Member:

Please define a template for this.
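
For illustration only (a simplified stand-alone kernel, not the PR's optimize_batch_kernel): templating the index width keeps the 32-bit instantiation as fast as it is today, while a 64-bit instantiation is selected only for very large inputs.

#include <cstdint>

// Sketch: index arithmetic uses a template parameter instead of a
// hardcoded uint64_t; the stored data stays int either way.
template <typename index_t, int TPB_X>
__global__ void gather_heads(const int* head, index_t nnz, int* out)
{
  index_t row = static_cast<index_t>(blockIdx.x) * TPB_X + threadIdx.x;
  if (row >= nnz) { return; }
  out[row] = head[row];
}

// Host-side dispatch example: pick the instantiation from the runtime size.
//   if (nnz > std::numeric_limits<std::int32_t>::max())
//     gather_heads<std::uint64_t, 256><<<grid, 256, 0, stream>>>(head, nnz, out);
//   else
//     gather_heads<std::int32_t, 256><<<grid, 256, 0, stream>>>(head, std::int32_t(nnz), out);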

int t = r % tail_n;
T const* negative_sample = tail_embedding + (t * params.n_components);
dist_squared = rdist<T>(current, negative_sample, params.n_components);
uint64_t t = r % tail_n;
Member:

Template

@cjnolet (Member) left a comment:

Same issues as raft, but this is hardcoding types in perf-critical code. Please use templates; they allow us to quickly switch.

Labels: bug (Something isn't working), CMake, conda (conda issue), CUDA/C++, Cython / Python (Cython or Python issue), non-breaking (Non-breaking change)
5 participants