[REVIEW] Add tfidf bm25 #2353

jperez999 · 2024-06-05T16:21:07Z

This PR adds support for TF-IDF and BM25 preprocessing of sparse matrices. It does not require the user to work within the confines of a COO or CSR matrix; it only requires the (row, column, value) triplets, which are enough to preprocess the values accordingly. I'm putting this up to get eyes on it and to confirm it is going in the right direction, or to adjust if not.
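
For readers new to the idea, here is a minimal host-side C++ sketch of preprocessing straight from triplets. It is not the PR's implementation: `Triplet`, `Stats`, `fit`, `tfidf`, and `bm25` are hypothetical names, rows are assumed to be documents and columns terms, and the `log(N / df)` IDF and the Okapi BM25 form with `k1` and `b` are common textbook choices assumed for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <unordered_map>
#include <vector>

// Hypothetical triplet type: one nonzero of a sparse document-term matrix.
struct Triplet {
  int row;      // document id
  int col;      // term (feature) id
  float value;  // raw term count
};

// Statistics that can be derived from the triplets alone.
struct Stats {
  std::unordered_map<int, int> doc_freq;   // term -> number of documents containing it
  std::unordered_map<int, float> doc_len;  // document -> sum of its term counts
  int n_docs        = 0;                   // documents with at least one nonzero
  float avg_doc_len = 0.f;
};

// Single pass over the nonzeros. Assumes no duplicate (row, col) pairs.
Stats fit(const std::vector<Triplet>& nz) {
  Stats s;
  float total = 0.f;
  for (const auto& t : nz) {
    s.doc_freq[t.col] += 1;
    s.doc_len[t.row] += t.value;
    total += t.value;
  }
  s.n_docs      = static_cast<int>(s.doc_len.size());
  s.avg_doc_len = s.n_docs > 0 ? total / static_cast<float>(s.n_docs) : 0.f;
  return s;
}

// TF-IDF: raw count scaled by log(N / df).
std::vector<float> tfidf(const std::vector<Triplet>& nz, const Stats& s) {
  std::vector<float> out;
  out.reserve(nz.size());
  for (const auto& t : nz) {
    float df = static_cast<float>(s.doc_freq.at(t.col));
    out.push_back(t.value * std::log(static_cast<float>(s.n_docs) / df));
  }
  return out;
}

// Okapi BM25 with the usual k1 (saturation) and b (length normalization) parameters.
std::vector<float> bm25(const std::vector<Triplet>& nz, const Stats& s,
                        float k1 = 1.6f, float b = 0.75f) {
  assert(s.avg_doc_len > 0.f);
  std::vector<float> out;
  out.reserve(nz.size());
  for (const auto& t : nz) {
    float df   = static_cast<float>(s.doc_freq.at(t.col));
    float idf  = std::log((static_cast<float>(s.n_docs) - df + 0.5f) / (df + 0.5f) + 1.0f);
    float norm = 1.0f - b + b * s.doc_len.at(t.row) / s.avg_doc_len;
    out.push_back(idf * t.value * (k1 + 1.0f) / (t.value + k1 * norm));
  }
  return out;
}
```

Both functions return one weight per input triplet, in input order, so the result can replace the value array of whatever sparse format the caller uses.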

Unit tests are still required for these features.

Our `devel` Docker containers need to be switched to using `conda` compilers to resolve a linking error. `raft` is in those containers, but hasn't yet been built with `conda` compilers. This PR addresses that. These changes won't cleanly merge into `branch-22.08` unfortunately due to the changes in rapidsai#641, but we can address that another time. Authors: - AJ Schmidt (https://github.com/ajschmidt8) - Corey J. Nolet (https://github.com/cjnolet) - Vyas Ramasubramani (https://github.com/vyasr) Approvers: - Corey J. Nolet (https://github.com/cjnolet)

@shwina I'm going to apologize ahead of time for this, but I was trying to forward merge your branch 22.10 locally to create a new PR from it and I accidentally pushed to your remote branch. I cherry-picked the commits over to a new branch for the hotfix. Authors: - Bradley Dice (https://github.com/bdice) - Ashwin Srinath (https://github.com/shwina) Approvers: - Ray Douglass (https://github.com/raydouglass)

rhdong · 2024-12-11T17:59:02Z

cpp/test/sparse/preprocess_csr.cu

+
+    raft::util::create_dataset<Index_, Type_f>(
+      handle, rows.view(), columns.view(), values.view(), 5, params.n_rows, params.n_cols);
+    int non_dupe_nnz_count = raft::util::get_dupe_mask_count<Index_, Type_f>(


Declaring `non_dupe_nnz_count` as `int64_t` might be safer, since the following code uses it as an `int64_t`.
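
A small C++ sketch of the concern, under stated assumptions: `count_set`, `fits_in_int`, and `narrow_checked` are hypothetical helpers, not RAFT functions. The point is that a 64-bit count should stay 64-bit end to end, and any narrowing to `int` should be checked explicitly.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <vector>

// Hypothetical counting helper: number of set entries in a mask.
// Returning int64_t keeps the count valid even when nnz exceeds the range of int.
std::int64_t count_set(const std::vector<std::uint8_t>& mask) {
  std::int64_t n = 0;
  for (std::uint8_t m : mask) { n += (m != 0) ? 1 : 0; }
  return n;
}

// True if a 64-bit count can be stored in an int without changing its value.
bool fits_in_int(std::int64_t n) {
  return n >= std::numeric_limits<int>::min() && n <= std::numeric_limits<int>::max();
}

// Narrow only after checking; an unchecked `int x = some_int64;` silently loses
// high bits on typical platforms once the count passes INT_MAX.
int narrow_checked(std::int64_t n) {
  assert(fits_in_int(n));
  return static_cast<int>(n);
}
```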

cjnolet · 2025-01-28T19:17:21Z

/ok to test

cjnolet · 2025-01-28T19:19:14Z

cpp/include/raft/sparse/neighbors/knn.cuh

@@ -59,7 +62,7 @@ namespace raft::sparse::neighbors {
 * @param[in] metric distance metric/measure to use
 * @param[in] metricArg potential argument for metric (currently unused)
 */
-template <typename value_idx = int, typename value_t = float, int TPB_X = 32>


You'll need to go through cuML and make sure this isn't going to break anything once we merge it.

Removing that code from this PR because this logic now lives in cuVS.

jperez999 · 2025-01-29T02:31:22Z

/ok to test

cjnolet · 2025-01-29T16:10:06Z

/ok to test

rhdong

LGTM

jperez999 · 2025-01-29T17:25:58Z

cpp/include/raft/sparse/matrix/detail/preprocessing.cuh

+  int index = blockIdx.x * blockDim.x + threadIdx.x;
+  if ((index < nnz) && (counts[index] == 1)) {
+    int targetVal = cols[index];
+    int vocab     = targetVal % vocabSize;


This line is the equivalent of a hash; once we have the right hash function, we can swap it in here. It allows any number of features to come in, even if the class was built with only a subset of the features allowed.
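
A host-side C++ sketch of the modulo-as-hash idea, for illustration only. `count_features` is a hypothetical name, and the per-bucket increment is an assumption about what the kernel does after computing `vocab` (the diff excerpt above stops before that point). Column ids are assumed non-negative.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host analogue of the kernel body: for each nonzero flagged with
// counts[i] == 1, fold its column id into [0, vocab_size) and bump that bucket.
std::vector<int> count_features(const std::vector<int>& cols,
                                const std::vector<int>& counts,
                                int vocab_size) {
  assert(vocab_size > 0 && cols.size() == counts.size());
  std::vector<int> feat_count(static_cast<std::size_t>(vocab_size), 0);
  for (std::size_t i = 0; i < cols.size(); ++i) {
    if (counts[i] != 1) { continue; }
    int bucket = cols[i] % vocab_size;  // stand-in for a real hash function
    feat_count[static_cast<std::size_t>(bucket)] += 1;
  }
  return feat_count;
}
```

With `vocab_size = 4`, column ids 1 and 5 land in the same bucket. That collision is the price of accepting more feature ids than the class was built with, and it is why a better hash may be worth swapping in later.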

jperez999 · 2025-01-29T17:26:47Z

cpp/include/raft/sparse/matrix/preprocessing.cuh

+template <typename ValueType = float, typename IndexType = int>
+class SparseEncoder {
+ private:
+  int* featIdCount;


This information needs to be saved from fit to fit. How would I propagate this if I used a wrapper function to create this class and apply the logic?

jperez999 · 2025-01-29T17:28:48Z

cpp/include/raft/sparse/matrix/preprocessing.cuh

+ *   Value that represents the number of features that exist for the matrices encoded.
+ * */
+template <typename ValueType, typename IndexType>
+SparseEncoder<ValueType, IndexType>::SparseEncoder(std::map<int, int> featIdValues,


It would be good to have a serializer that exports this information to a file, so that the fitted class can be loaded in multiple places.
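
One possible shape for such a serializer, as a hedged C++ sketch rather than the API this PR ends up with. `FittedState`, `save`, and `load` are hypothetical, and the plain-text layout (a header line with the feature count and entry count, then one `id count` pair per line) is an assumption chosen for readability.

```cpp
#include <cassert>
#include <cstddef>
#include <istream>
#include <map>
#include <ostream>
#include <sstream>
#include <stdexcept>

// Hypothetical fitted state: per-feature counts plus the number of features.
struct FittedState {
  std::map<int, int> feat_id_count;
  int num_features = 0;
};

// Write the state as text: "<num_features> <n_entries>" then one "id count" per line.
void save(const FittedState& s, std::ostream& out) {
  out << s.num_features << ' ' << s.feat_id_count.size() << '\n';
  for (const auto& kv : s.feat_id_count) { out << kv.first << ' ' << kv.second << '\n'; }
}

// Read the state back; throws if the stream is truncated or malformed.
FittedState load(std::istream& in) {
  FittedState s;
  std::size_t n = 0;
  if (!(in >> s.num_features >> n)) { throw std::runtime_error("missing header"); }
  for (std::size_t i = 0; i < n; ++i) {
    int id = 0, count = 0;
    if (!(in >> id >> count)) { throw std::runtime_error("truncated entry"); }
    s.feat_id_count[id] = count;
  }
  assert(s.feat_id_count.size() <= n);
  return s;
}
```

Because `save` and `load` take generic streams, the same code works with a `std::ofstream` / `std::ifstream` for files or a `std::stringstream` in tests.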

rhdong · 2025-02-03T20:43:06Z

cpp/include/raft/sparse/matrix/detail/preprocessing.cuh

+ *   The resulting representation of the index value changes of the input. Should be
+ *   the same size as the input (nnz)
+ */
+__global__ void _scan(int* rows, int nnz, int* counts)


I recall that this keyword causes an error in CI; it may need to be changed to DI.

ajschmidt8 and others added 30 commits July 14, 2020 17:05

update master references

a6677ca

REL DOC Updates for main branch switch

ad2d7d7

[skip ci] Update master references for main branch

Merge pull request rapidsai#272 from rapidsai/branch-21.06

e3c9344

REL Fix `21.06` Release Changelog

Merge pull request rapidsai#321 from rapidsai/branch-21.08

3b0a6d2

[HOTFIX] Remove `-g` from cython compile commands

REL v21.08.00 release

309ea1a

Merge pull request rapidsai#612 from rapidsai/branch-22.04

3740998

[RELEASE] v22.04

REL v22.04.00 release

e987ec8

update changelog

229b9f8

Merge pull request rapidsai#708 from rapidsai/branch-22.06

0eded98

[RELEASE] v22.06 raft

FIX update-version.sh

3e5a625

Merge pull request rapidsai#709 from rapidsai/branch-22.06

ad50a7f

FIX update-version.sh

REL v22.06.00 release

ed2c529

Merge pull request rapidsai#782 from rapidsai/branch-22.08

aae5e34

REL v22.08.00 release

87a7d16

Merge pull request rapidsai#908 from rapidsai/branch-22.10

1de93ba

REL v22.10.00 release

31ae597

Merge pull request rapidsai#988 from rapidsai/branch-22.10

c6e6ce8

[RELEASE] raft v22.10.01

REL v22.10.01 release

f7d2335

Merge pull request rapidsai#1063 from rapidsai/branch-22.12

c16fa56

REL v22.12.00 release

9a716b7

Merge pull request rapidsai#1101 from rapidsai/branch-22.12

60936ba

[RELEASE] raft v22.12.01 [skip-gpuci]

REL v22.12.01 release

a655c9a

Merge pull request rapidsai#1250 from rapidsai/branch-23.02

9a66f42

REL v23.02.00 release

69dce2d

Merge pull request rapidsai#1405 from rapidsai/branch-23.04

1467154

REL v23.04.00 release

7d1057e

REL v23.04.01 release

dc800d6

REL Merge pull request rapidsai#1486 from rapidsai/branch-23.04

520e12c

REL Update changelog v23.04

jperez999 commented Dec 11, 2024

cpp/test/sparse/preprocess_csr.cu (outdated, resolved)

rhdong reviewed Dec 11, 2024

cjnolet changed the base branch from branch-24.12 to branch-25.02 December 11, 2024 22:50

jperez999 and others added 11 commits December 20, 2024 14:29

fix all tests csr and coo

80b527e

Merge branch 'branch-25.02' into add-tfidf-bm25

1c15944

overhaul sparse encoding

c14307d

Merge branch 'add-tfidf-bm25' of https://github.com/jperez999/raft in…

7442b10

…to add-tfidf-bm25

migrate class out of detail

5e90c5f

add basic hash function to ensure stability no matter size

d1365ec

add documentation for details functions.

7eb6f76

full working sparse encoder, batchable

47c8288

merge in 25.02

a013006

remove build sh from template folder

70f5666

add comments spareencoder class

2ff4c8d

cjnolet reviewed Jan 28, 2025

jperez999 and others added 4 commits January 28, 2025 19:41

remove bruteforce changes

96e8706

fix formatting issues

7889dd9

Merge branch 'branch-25.02' into add-tfidf-bm25

a73367f

Merge branch 'add-tfidf-bm25' of https://github.com/jperez999/raft in…

efd4cf6

…to add-tfidf-bm25

rhdong approved these changes Jan 29, 2025

jperez999 commented Jan 29, 2025

add save and load to tests

d6e79e4

rhdong reviewed Feb 3, 2025

jperez999 closed this Feb 3, 2025

jperez999 mentioned this pull request Feb 3, 2025

add support for bm25 and tfidf #2567

Open
