Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[REVIEW] Add tfidf bm25 #2353

Closed
wants to merge 138 commits into from
Closed

Conversation

jperez999
Copy link

This PR will add support for tfidf and BM25 preprocessing of sparse matrix. It does not require the user to work within the confines of the COO or CSR matrix. It only requires the triplets of data ( row, column, value). With this information, we are able to preprocess the values accordingly. Putting this up to get eyes on this, to make sure this is going in the correct direction or if not, to adjust.

Unit tests are still required for these features.

ajschmidt8 and others added 30 commits July 14, 2020 17:05
[skip ci] Update master references for main branch
[HOTFIX] Remove `-g` from cython compile commands
Our `devel` Docker containers need to be switched to using `conda` compilers to resolve a linking error. `raft` is in those containers, but hasn't yet been built with `conda` compilers. This PR addresses that.

These changes won't cleanly merge into `branch-22.08` unfortunately due to the changes in rapidsai#641, but we can address that another time.

Authors:
   - AJ Schmidt (https://github.com/ajschmidt8)
   - Corey J. Nolet (https://github.com/cjnolet)
   - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
   - Corey J. Nolet (https://github.com/cjnolet)
@shwina I'm going to apologize ahead of time for this, but i was trying to forward merge your branch 22.10 locally to create a new PR from it and I accidentally pushed to your remote branch. I cherry-picked the commits over to a new branch for the hotfix.

Authors:
   - Bradley Dice (https://github.com/bdice)
   - Ashwin Srinath (https://github.com/shwina)

Approvers:
   - Ray Douglass (https://github.com/raydouglass)
[RELEASE] raft v22.12.01 [skip-gpuci]

raft::util::create_dataset<Index_, Type_f>(
handle, rows.view(), columns.view(), values.view(), 5, params.n_rows, params.n_cols);
int non_dupe_nnz_count = raft::util::get_dupe_mask_count<Index_, Type_f>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Declaring the non_dupe_nnz_count as int64_t might be safer since it is used as int64_t in the following code.

@cjnolet cjnolet changed the base branch from branch-24.12 to branch-25.02 December 11, 2024 22:50
@cjnolet
Copy link
Member

cjnolet commented Jan 28, 2025

/ok to test

@@ -59,7 +62,7 @@ namespace raft::sparse::neighbors {
* @param[in] metric distance metric/measure to use
* @param[in] metricArg potential argument for metric (currently unused)
*/
template <typename value_idx = int, typename value_t = float, int TPB_X = 32>
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You'll need to go through cuml and make sure this isn't going to break anything once we merge it.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

removing that code from this PR because this logic now lives in cuvs.

@jperez999
Copy link
Author

/ok to test

1 similar comment
@cjnolet
Copy link
Member

cjnolet commented Jan 29, 2025

/ok to test

Copy link
Member

@rhdong rhdong left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

int index = blockIdx.x * blockDim.x + threadIdx.x;
if ((index < nnz) && (counts[index] == 1)) {
int targetVal = cols[index];
int vocab = targetVal % vocabSize;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This line is the equivalent to hash, when we have the right function for the hash we can replace here. This allows number of features to come in. Even if the actual class was built with only a subset of features allowed.

template <typename ValueType = float, typename IndexType = int>
class SparseEncoder {
private:
int* featIdCount;
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This information needs to be saved from fit to fit. How would I propagate this if I used a wrapper function to create this class and apply the logic?

* Value that represents the number of features that exist for the matrices encoded.
* */
template <typename ValueType, typename IndexType>
SparseEncoder<ValueType, IndexType>::SparseEncoder(std::map<int, int> featIdValues,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would be good to have a serializer that could export a file that has this information so we can load the fitted class in multiple places.

* The resulting representation of the index value changes of the input. Should be
* the same size as the input (nnz)
*/
__global__ void _scan(int* rows, int nnz, int* counts)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recall this keyword will cause the error in the CI, may need to change to DI.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
Development

Successfully merging this pull request may close these issues.

10 participants