I reproduce here a question I submitted on Stack Overflow (https://stackoverflow.com/questions/58410583/batch-transform-sparse-matrix-with-aws-sagemaker-python-sdk):
I have successfully trained a Scikit-Learn LSVC model with AWS SageMaker.
I want to make batch predictions (a.k.a. batch transform) on a relatively big dataset, which is a SciPy sparse matrix of shape 252772 x 185128. (The number of features is high because of one-hot encoding of bag-of-words and n-gram features.)
I struggle because of:
- the size of the data
- the format of the data
I did several experiments to check what was going on:
1. Predict locally on sample sparse matrix data
It works
I deserialized the model artifact locally on a SageMaker notebook and predicted on a sample of the sparse matrix.
This was just to check that the model can predict on this kind of data.
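A minimal sketch of that local check (the artifact file name, the use of joblib, and the .npz sample file are assumptions about how the model and the data were saved):

```python
import joblib
from scipy import sparse

# Load the model artifact produced by the SageMaker training job
# (assumes it was saved with joblib as "model.joblib").
model = joblib.load("model.joblib")

# Load a small sample of the sparse feature matrix and predict on it.
X_sample = sparse.load_npz("sample_features.npz")
print(model.predict(X_sample)[:10])
```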
2. Batch Transform on sample CSV data
It works
I launched a Batch Transform job on SageMaker and requested to transform a small sample in dense CSV format: it works, but obviously does not scale.
The code is:
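A minimal sketch of such a job with the SageMaker Python SDK follows; the entry-point script name, role, instance type, framework version, and S3 paths are placeholders rather than the original values:

```python
from sagemaker.sklearn.model import SKLearnModel

# Wrap the trained artifact in a SKLearnModel; the entry-point script
# defines model_fn (names and paths are placeholders).
sklearn_model = SKLearnModel(
    model_data="s3://my-bucket/model/model.tar.gz",
    role="my-sagemaker-role",
    entry_point="script.py",
    framework_version="0.20.0",
)

transformer = sklearn_model.transformer(
    instance_count=1,
    instance_type="ml.m5.xlarge",
)

batch_data = "s3://my-bucket/batch/sample.csv"
transformer.transform(batch_data, content_type="text/csv", split_type="Line")
transformer.wait()
```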
where model_fn, defined in the entry-point script, deserializes the model artifact, and batch_data is the S3 path of the CSV file.
3. Batch Transform of a sample dense NumPy dataset
It works
I prepared a sample of the data and saved it to S3 in NumPy .npy format. According to this documentation, the SageMaker Scikit-learn model server can deserialize NPY-formatted data (along with JSON and CSV data). The only difference with the previous experiment (2) is the argument content_type='application/x-npy' in transformer.transform(...).
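As a sketch, reusing transformer from the snippet above and X_sample from the local check (bucket and key names are placeholders):

```python
import boto3
import numpy as np

# Save a small dense sample as .npy and upload it to S3.
np.save("sample.npy", X_sample.toarray())
boto3.client("s3").upload_file("sample.npy", "my-bucket", "batch/sample.npy")

# Same transformer as before; only the content type changes.
transformer.transform(
    "s3://my-bucket/batch/sample.npy",
    content_type="application/x-npy",
)
transformer.wait()
```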
This solution does not scale and we would like to pass a SciPy sparse matrix:
4. Batch Transform of a big sparse matrix.
Here is the problem
SageMaker Python SDK does not support sparse matrix format out of the box.
Following this, I used write_spmatrix_to_sparse_tensor to write the data to protobuf format on S3, and then launched the batch transform job on the protobuf data.
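A minimal sketch of those two steps, assuming the sparse matrix is already in memory as X_sparse, reusing the transformer from the earlier sketch, and with placeholder bucket and key names (this illustrates the write_spmatrix_to_sparse_tensor API rather than reproducing the original code):

```python
import io

import boto3
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

def upload_sparse_matrix_as_protobuf(X, bucket, key):
    """Serialize a scipy.sparse matrix to RecordIO-protobuf and upload it to S3."""
    buf = io.BytesIO()
    write_spmatrix_to_sparse_tensor(buf, X)
    buf.seek(0)
    boto3.client("s3").put_object(Bucket=bucket, Key=key, Body=buf)
    return f"s3://{bucket}/{key}"

batch_data = upload_sparse_matrix_as_protobuf(X_sparse, "my-bucket", "batch/data.pbr")

# Launch the batch transform job on the protobuf data; this is the step
# that fails with the error below.
transformer.transform(
    batch_data,
    content_type="application/x-recordio-protobuf",
    split_type="RecordIO",
)
transformer.wait()
```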
I get the following error:
Questions:
(Reference doc for Transformer: https://sagemaker.readthedocs.io/en/stable/transformer.html)
- content_type='application/x-recordio-protobuf' is not allowed; what should I use?
- Is split_type='RecordIO' the proper setting in this context?
- Should I provide an input_fn function in my script to deserialize the data?
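On the last question, an input_fn override in the entry-point script of the SageMaker scikit-learn container has the signature input_fn(input_data, content_type). A hypothetical sketch that rebuilds a SciPy sparse matrix from a RecordIO-protobuf payload (the decoding details are assumptions and have not been verified against the container):

```python
import io

from scipy import sparse
from sagemaker.amazon.common import read_records

def input_fn(input_data, content_type):
    # Hypothetical deserializer: rebuild a scipy.sparse CSR matrix from the
    # RecordIO-protobuf records written by write_spmatrix_to_sparse_tensor.
    if content_type == "application/x-recordio-protobuf":
        records = read_records(io.BytesIO(input_data))
        rows, cols, vals, num_cols = [], [], [], 0
        for i, record in enumerate(records):
            tensor = record.features["values"].float32_tensor
            rows.extend([i] * len(tensor.keys))
            cols.extend(tensor.keys)
            vals.extend(tensor.values)
            if tensor.shape:
                num_cols = max(num_cols, int(tensor.shape[0]))
        return sparse.csr_matrix((vals, (rows, cols)), shape=(len(records), num_cols))
    raise ValueError(f"Unsupported content type: {content_type}")
```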