Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement C/C++ PairwiseANN w/ Python API #268

Merged
merged 1 commit into from
Dec 4, 2023

Conversation

OctoberChang
Copy link
Contributor

Issue #, if available:

Description of changes:
Implement C/C++ PairwiseANN function with Python API.

Given a (input, label) pair, PairwiseANN finds top-K nearest queries in the indexed input-to-label graph.
The searcher return four arrays of size K:

  • Imat: top-K input indices
  • Mmat: the masking indicators. 1 indicates the kNN index is presented; otherwise 0.
  • Dmat: the kNN distances between the test input and top-K indexed input.
  • Vmat: the corresponding values stored in the input-to-label graph.

Indexing usage

train_params = PairwiseANN.TrainParams(metric_type="ip")
model = PairwiseANN.train(X_trn, Y_csr, train_params=train_params)

where

  • X_trn is the dense or sparse input feature matrix
  • Y_csr is the sparse input-to-label graph

Save/load Usage

model.save(model_folder)
del model
model = PairwiseANN.load(model_folder)

Searchers Usage

searchers = model.searchers_create(max_batch_size=100, max_only_topk=10, num_searcher=1)

where

  • max_batch_size defines the maximum number of rows in the returned array
  • max_only_topk defines the maximum number of cols in the returned array

Prediction Usage
With same input (suitable for real-time inference):

It, Mt, Dt, Vt = model.predict(query_vec, label_keys, searchers, pred_params=pred_params, is_same_input=True)
  • query_vec is a single feature vector of shape (1, feat_dim)
  • label_keys is a np.array of shape (batch_size, )

With different inputs (suitable for batch prediction):

It, Mt, Dt, Vt = model.predict(Query_mat, label_keys, searchers, pred_params=pred_params, is_same_input=False)
  • Query_mat is a batch feature matrix of shape (batch_size, feat_dim)
  • label_keys is a np.array of shape (batch_size, )

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

def train(cls, X, Y, train_params=None, pred_params=None):
"""train and return the PairwiseANN indexer
Args:
X (numpy.array or smat.csr_matrix): database matrix to be indexed. (num_item x feat_dim).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Missing Y in docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. See Revision 2.

"""constructor of PairwiseANN class
Args:
model_ptr (c_void_p): pointer to C instance pecos::ann:PairwiseANN.
num_item (int): number of item being indexed
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please fix docstring.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. See Revision 2.


struct Searcher {
typedef PairwiseANN<feat_vec_t, mat_t> pairwise_ann_t;
typedef pecos::ann::heap_t<pair_t, std::less<pair_t>> max_heap_t;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Don't need to define this again.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. See Revision 2.

typedef pecos::ann::KeyValPair<index_type, value_type> pair_t;
typedef pecos::ann::heap_t<pair_t, std::less<pair_t>> max_heap_t;

struct Searcher {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indentation

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. See Revision 2.

}
}

void train(const mat_t &X_trn, const pecos::csr_t &Y_trn) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we just assume Y_trn to be CSC here? We can support CSR/CSC at python level.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. See Revision 2.

uint32_t* ret_Mmat, \
float* ret_Dmat, \
float* ret_Vmat, \
const bool is_sample_input) { \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be is_same_input?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. See Revision 2.

"""
return copy.deepcopy(self.pred_params)

def predict(self, input_feat, label_keys, searchers, pred_params=None, is_same_input=False):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if there's a more extendable design for the predict API. Now we are basically supporting two cases (1) is_samp_input=True, a csr_vector and (2) is_same_input=False, a coo matrix.

@OctoberChang OctoberChang merged commit 51062ec into amzn:mainline Dec 4, 2023
25 checks passed
@OctoberChang OctoberChang deleted the pairwise-ann branch December 7, 2023 05:12
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants