This issue was moved to a discussion.

You can continue the conversation there. Go to discussion →


Batch transform sparse matrix on Scikit-learn model #1093

Closed
ivankeller opened this issue Oct 16, 2019 · 1 comment

Comments

@ivankeller

ivankeller commented Oct 16, 2019

Reference: 0414058987

I reproduce here a question I submitted on Stack Overflow (https://stackoverflow.com/questions/58410583/batch-transform-sparse-matrix-with-aws-sagemaker-python-sdk):

I have successfully trained a Scikit-Learn LSVC model with AWS SageMaker.
I want to make batch predictions (a.k.a. batch transform) on a relatively big dataset: a SciPy sparse matrix of shape 252772 x 185128. (The number of features is high because of one-hot encoding of bag-of-words and n-gram features.)

I struggle because of:

  • the size of the data

  • the format of the data

I did several experiments to check what was going on:

1. predict locally on sample sparse matrix data

It works
Deserialize the model artifact locally on a SageMaker notebook and predict on a sample of the sparse matrix.
This was just to check that the model can predict on this kind of data.
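The local check above can be reproduced end to end without the real artifact. The sketch below trains a throwaway LinearSVC on random sparse data as a stand-in for the trained model, then round-trips it through joblib exactly as the model_fn in my_script.py does (the "model.joblib" filename comes from that script; everything else here is illustrative):

```python
import os
import tempfile

import joblib
import numpy as np
from scipy import sparse
from sklearn.svm import LinearSVC

# Stand-in for the real artifact: train a tiny LinearSVC on sparse data.
X = sparse.random(100, 50, density=0.1, format="csr", random_state=0)
y = np.random.RandomState(0).randint(0, 2, size=100)
clf = LinearSVC().fit(X, y)

# Round-trip through joblib, mirroring model_fn in my_script.py.
with tempfile.TemporaryDirectory() as model_dir:
    joblib.dump(clf, os.path.join(model_dir, "model.joblib"))
    restored = joblib.load(os.path.join(model_dir, "model.joblib"))

# The restored model predicts on a sparse sample without densifying it.
preds = restored.predict(X[:10])
print(preds.shape)  # (10,)
```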

2. Batch Transform on a sample csv data

It works
Launch a Batch Transform job on SageMaker and request to transform a small sample in dense CSV format: it works but obviously does not scale.
The code is:

sklearn_model = SKLearnModel(
    model_data=model_artifact_location_on_s3,
    entry_point='my_script.py',
    role=role,
    sagemaker_session=sagemaker_session)

transformer = sklearn_model.transformer(
   instance_count=1, 
   instance_type='ml.m4.xlarge', 
   max_payload=100)

transformer.transform(
   data=batch_data, 
   content_type='text/csv',
   split_type=None)   

print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

where:

  • 'my_script.py' implements a simple model_fn to deserialize the model artifact:

import os
import joblib

def model_fn(model_dir):
    """Load the fitted model saved under model_dir."""
    clf = joblib.load(os.path.join(model_dir, "model.joblib"))
    return clf
  • batch_data is the s3 path for the csv file.

3. Batch Transform of a sample dense numpy dataset.

It works
I prepared a sample of the data and saved it to S3 in NumPy .npy format. According to this documentation, the SageMaker Scikit-learn model server can deserialize NPY-formatted data (along with JSON and CSV data).
The only difference with the previous experiment (2) is the argument content_type='application/x-npy' in transformer.transform(...).
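For scale intuition, here is a minimal, self-contained sketch of the .npy round trip used in this experiment (the S3 upload line is commented out and the sample array is a placeholder; the real matrix would have to be densified first, which is exactly why this approach cannot scale):

```python
import io
import numpy as np

# Densify only a small sample -- the full 252772 x 185128 matrix would need
# roughly 252772 * 185128 * 8 bytes (~374 GB) as a dense float64 array.
sample = np.random.rand(100, 50)  # stand-in for X_sparse[:100].toarray()

buf = io.BytesIO()
np.save(buf, sample)
buf.seek(0)
restored = np.load(buf)
print(restored.shape)  # (100, 50)

# Upload would then follow, and transformer.transform(...) is called with
# content_type='application/x-npy':
# boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)
```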

This solution does not scale either, and we would like to pass a SciPy sparse matrix:

4. Batch Transform of a big sparse matrix.

Here is the problem
The SageMaker Python SDK does not support sparse matrix input out of the box.
Following this:

I used write_spmatrix_to_sparse_tensor to write the data to protobuf format on s3. The function I used is:

import io

import boto3
from sagemaker.amazon.common import write_spmatrix_to_sparse_tensor

def write_protobuf(X_sparse, bucket, prefix, obj):
    """Write sparse matrix to protobuf format at location bucket/prefix/obj."""
    buf = io.BytesIO()
    write_spmatrix_to_sparse_tensor(file=buf, array=X_sparse, labels=None)
    buf.seek(0)
    key = '{}/{}'.format(prefix, obj)
    boto3.resource('s3').Bucket(bucket).Object(key).upload_fileobj(buf)
    return 's3://{}/{}'.format(bucket, key)

Then the code used for launching the batch transform job is:

sklearn_model = SKLearnModel(
    model_data=model_artifact_location_on_s3,
    entry_point='my_script.py',
    role=role,
    sagemaker_session=sagemaker_session)

transformer = sklearn_model.transformer(
   instance_count=1, 
   instance_type='ml.m4.xlarge', 
   max_payload=100)

transformer.transform(
   data=batch_data, 
   content_type='application/x-recordio-protobuf',
   split_type='RecordIO')   

print('Waiting for transform job: ' + transformer.latest_transform_job.job_name)
transformer.wait()

I get the following error:

sagemaker_containers._errors.ClientError: Content type application/x-recordio-protobuf is not supported by this framework.

Questions:
(Reference doc for Transformer: https://sagemaker.readthedocs.io/en/stable/transformer.html)

  • If content_type='application/x-recordio-protobuf' is not allowed, what should I use?
  • Is split_type='RecordIO' the proper setting in this context?
  • Should I provide an input_fn function in my script to deserialize the data?
  • Is there another better approach to tackle this problem?
@ChoiByungWook
Contributor

ChoiByungWook commented Oct 16, 2019

Reference: 0414058987

Hi @ivankeller,

Thank you for the clear and concise description!

If content_type='application/x-recordio-protobuf' is not allowed, what should I use?

Unfortunately it looks like there isn't support for 'application/x-recordio-protobuf' within the predefined scikit-learn container.

There are a few issues that reference this as well.

I'll reach out to the corresponding team to give guidance.

Is split_type='RecordIO' the proper setting in this context?

That is the recommended split_type for 'application/x-recordio-protobuf'. https://docs.aws.amazon.com/sagemaker/latest/dg/cdf-inference.html#cm-batch

Unfortunately, it seems that container doesn't support that input yet :(.

Should I provide an input_fn function in my script to deserialize the data?

I recommend providing an input_fn in your script if the default_input_fn doesn't fit your use case.
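Since the container rejects recordio-protobuf, one hedged way to use a custom input_fn would be to serialize the sparse matrix with SciPy's own save_npz and register a made-up content type for it. Everything below is a sketch, not an official SageMaker format: 'application/x-npz' is invented for this example, and only the input_fn(input_data, content_type) signature follows the Scikit-learn container's convention:

```python
import io

from scipy import sparse

# Hypothetical input_fn for my_script.py: accept SciPy's native .npz sparse
# serialization under an invented content type instead of recordio-protobuf.
def input_fn(input_data, content_type):
    if content_type == "application/x-npz":
        return sparse.load_npz(io.BytesIO(input_data))
    raise ValueError("Unsupported content type: " + content_type)

# Client side: serialize the sparse matrix the same way before uploading to S3.
X_sparse = sparse.random(10, 20, density=0.2, format="csr", random_state=0)
buf = io.BytesIO()
sparse.save_npz(buf, X_sparse)

restored = input_fn(buf.getvalue(), "application/x-npz")
print((restored != X_sparse).nnz)  # 0 -- lossless round trip
```

The payload stays sparse end to end, so it avoids the memory blow-up of the dense CSV/NPY experiments; whether the batch transform service accepts this content type still needs to be verified against the container.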

Is there another better approach to tackle this problem?

I am not a scikit-learn expert, so I am not too sure. I'll reach out to the corresponding team for guidance.

Based on the details provided, it looks like this isn't an issue with the Python SDK or the batch transform service itself.

