-
Notifications
You must be signed in to change notification settings - Fork 201
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[REVIEW] Add tfidf bm25 #2353
[REVIEW] Add tfidf bm25 #2353
Conversation
[skip ci] Update master references for main branch
REL Fix `21.06` Release Changelog
[HOTFIX] Remove `-g` from cython compile commands
[RELEASE] v22.04
Our `devel` Docker containers need to be switched to using `conda` compilers to resolve a linking error. `raft` is in those containers, but hasn't yet been built with `conda` compilers. This PR addresses that. These changes won't cleanly merge into `branch-22.08` unfortunately due to the changes in rapidsai#641, but we can address that another time. Authors: - AJ Schmidt (https://github.com/ajschmidt8) - Corey J. Nolet (https://github.com/cjnolet) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Corey J. Nolet (https://github.com/cjnolet)
[RELEASE] v22.06 raft
FIX update-version.sh
@shwina I'm going to apologize ahead of time for this, but i was trying to forward merge your branch 22.10 locally to create a new PR from it and I accidentally pushed to your remote branch. I cherry-picked the commits over to a new branch for the hotfix. Authors: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) Approvers: - Ray Douglass (https://github.com/raydouglass)
[RELEASE] raft v22.10.01
[RELEASE] raft v22.12.01 [skip-gpuci]
REL Update changelog v23.04
cpp/test/sparse/preprocess_csr.cu
Outdated
|
||
raft::util::create_dataset<Index_, Type_f>( | ||
handle, rows.view(), columns.view(), values.view(), 5, params.n_rows, params.n_cols); | ||
int non_dupe_nnz_count = raft::util::get_dupe_mask_count<Index_, Type_f>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Declaring the non_dupe_nnz_count
as int64_t
might be safer since it is used as int64_t
in the following code.
…to add-tfidf-bm25
/ok to test |
@@ -59,7 +62,7 @@ namespace raft::sparse::neighbors { | |||
* @param[in] metric distance metric/measure to use | |||
* @param[in] metricArg potential argument for metric (currently unused) | |||
*/ | |||
template <typename value_idx = int, typename value_t = float, int TPB_X = 32> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You'll need to go through cuml and make sure this isn't going to break anything once we merge it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
removing that code from this PR because this logic now lives in cuvs.
/ok to test |
1 similar comment
/ok to test |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
int index = blockIdx.x * blockDim.x + threadIdx.x; | ||
if ((index < nnz) && (counts[index] == 1)) { | ||
int targetVal = cols[index]; | ||
int vocab = targetVal % vocabSize; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This line is the equivalent to hash, when we have the right function for the hash we can replace here. This allows number of features to come in. Even if the actual class was built with only a subset of features allowed.
template <typename ValueType = float, typename IndexType = int> | ||
class SparseEncoder { | ||
private: | ||
int* featIdCount; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This information needs to be saved from fit to fit. How would I propagate this if I used a wrapper function to create this class and apply the logic?
* Value that represents the number of features that exist for the matrices encoded. | ||
* */ | ||
template <typename ValueType, typename IndexType> | ||
SparseEncoder<ValueType, IndexType>::SparseEncoder(std::map<int, int> featIdValues, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be good to have a serializer that could export a file that has this information so we can load the fitted class in multiple places.
* The resulting representation of the index value changes of the input. Should be | ||
* the same size as the input (nnz) | ||
*/ | ||
__global__ void _scan(int* rows, int nnz, int* counts) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I recall this keyword will cause the error in the CI, may need to change to DI
.
This PR will add support for tfidf and BM25 preprocessing of sparse matrix. It does not require the user to work within the confines of the COO or CSR matrix. It only requires the triplets of data ( row, column, value). With this information, we are able to preprocess the values accordingly. Putting this up to get eyes on this, to make sure this is going in the correct direction or if not, to adjust.
Unit tests are still required for these features.