
A library of additional estimators and SageMaker tools based on scikit-learn



 
 


SageMaker Scikit-Learn Extension


SageMaker Scikit-Learn Extension is a Python module for machine learning built on top of scikit-learn.

This project contains standalone scikit-learn estimators and additional tools to support SageMaker Autopilot. Many of the additional estimators are based on existing scikit-learn estimators.

User Installation

To install the latest release from PyPI:

# install from pip
pip install sagemaker-scikit-learn-extension

To use the I/O functionality in the sagemaker_sklearn_extension.externals module, you also need to install mlio version 0.2.7 via conda. The mlio package is currently available only through conda.

To install mlio:

# install mlio
conda install -c mlio -c conda-forge mlio-py==0.2.7

You can also install from source by cloning this repository and running a pip install command in the root directory of the repository:

# install from source
git clone https://github.com/aws/sagemaker-scikit-learn-extension.git
cd sagemaker-scikit-learn-extension
pip install -e .
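After installing, you may want to confirm which pieces are importable. The following is a minimal sketch using only the Python standard library; `is_available` is a hypothetical helper written for this example, and the import name `mlio` for the mlio-py conda package is an assumption.

```python
import importlib.util


def is_available(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


# "mlio" as the import name of the mlio-py package is an assumption.
for name in ("sagemaker_sklearn_extension", "mlio"):
    status = "installed" if is_available(name) else "missing"
    print(f"{name}: {status}")
```

If mlio is reported as missing, the rest of the package may still be usable; per the note above, only the I/O functionality in sagemaker_sklearn_extension.externals depends on it.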

Supported Operating Systems

SageMaker scikit-learn extension supports Unix/Linux and Mac.

Supported Python Versions

SageMaker scikit-learn extension is tested on:

  • Python 3.7

License

This library is licensed under the Apache 2.0 License.

Development

We welcome contributions from developers of all experience levels.

The SageMaker scikit-learn extension is meant to be a repository for scikit-learn estimators that don't meet scikit-learn's stringent inclusion criteria.

Setup

We recommend using conda for development and testing.

To download conda, go to the conda installation guide.

Running Tests

SageMaker scikit-learn extension contains an extensive suite of unit tests.

To install the libraries needed to run the tests, run pip install --upgrade .[test] or, for Zsh users, pip install --upgrade .\[test\]

tox uses pytest to run the unit tests in a Python 3.7 interpreter. tox also runs flake8 and pylint for style checks.

conda is needed because of the dependency on mlio 0.2.7.

To run the tests with tox, run:

tox

Running on SageMaker

To use sagemaker-scikit-learn-extension on SageMaker, you can build the sagemaker-scikit-learn-extension-container.

Overview of Submodules

  • sagemaker_sklearn_extension.decomposition
    • RobustPCA dimension reduction for dense and sparse inputs
  • sagemaker_sklearn_extension.externals
    • AutoMLTransformer utility class encapsulating feature and target transformation functionality used in SageMaker Autopilot
    • Header utility class to manage the header and target columns in tabular data
    • read_csv_data reads comma separated data and returns a numpy array (uses mlio)
  • sagemaker_sklearn_extension.feature_extraction.date_time
    • DateTimeVectorizer convert datetime objects or strings into numeric features
  • sagemaker_sklearn_extension.feature_extraction.text
    • MultiColumnTfidfVectorizer convert collections of raw documents to a matrix of TF-IDF features
  • sagemaker_sklearn_extension.impute
    • RobustImputer imputer for missing values with customizable mask_function and multi-column constant imputation
    • RobustMissingIndicator binary indicator for missing values with customizable mask_function
  • sagemaker_sklearn_extension.preprocessing
    • BaseExtremeValuesTransformer customizable transformer for columns that contain "extreme" values (columns that are heavy tailed)
    • LogExtremeValuesTransformer stateful log transformer for columns that contain "extreme" values (columns that are heavy tailed)
    • NALabelEncoder encoder for transforming labels to NA values
    • QuadraticFeatures generate and add quadratic features to feature matrix
    • QuantileExtremeValuesTransformer stateful quantiles transformer for columns that contain "extreme" values (columns that are heavy tailed)
    • ThresholdOneHotEncoder encode categorical integer features as a one-hot numeric array, with optional restrictions on feature encoding
    • RemoveConstantColumnsTransformer removes constant columns
    • RobustLabelEncoder encode labels for seen and unseen labels
    • RobustStandardScaler standardization for dense and sparse inputs
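
To illustrate the idea behind encoding "seen and unseen labels" mentioned for RobustLabelEncoder, here is a plain-Python sketch of one common approach: labels seen during fitting get consecutive integer codes, and any label not seen during fitting is mapped to one extra code. `UnseenAwareLabelEncoder` is a hypothetical class written only for this example. It is not the library's implementation, and the library's actual parameters and defaults may differ.

```python
class UnseenAwareLabelEncoder:
    """Illustration only; NOT the library's RobustLabelEncoder."""

    def fit(self, labels):
        # Sort the distinct labels so the assigned codes are deterministic.
        self.classes_ = sorted(set(labels))
        self._index = {label: i for i, label in enumerate(self.classes_)}
        return self

    def transform(self, labels):
        # Assumption for this sketch: unseen labels share one extra code,
        # equal to the number of classes seen during fit.
        unseen_code = len(self.classes_)
        return [self._index.get(label, unseen_code) for label in labels]


encoder = UnseenAwareLabelEncoder().fit(["cat", "dog", "cat"])
print(encoder.transform(["dog", "bird", "cat"]))  # [1, 2, 0]
```

Here "bird" was not present during fitting, so it receives the extra code 2 instead of raising an error, as a strict encoder would.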
