Welcome to the comprehensive guide on hosting your own inference environment for Hugging Face models using Amazon SageMaker! Whether you're a seasoned developer or just getting started, this tutorial will walk you through the steps to effortlessly deploy and serve your models with confidence. By the end, you'll have a robust setup on AWS SageMaker, empowering you to seamlessly deploy your models for real-world applications.
For our application, we'll be hosting our own instance of DistilGPT2, a smaller distilled version of GPT-2, for text generation!
SageMaker offers managed solutions for the entire machine learning lifecycle. It provides a hassle-free environment to deploy, manage, and scale your models. If you're looking for ways to get your models into production without needing to worry too much about the underlying complexities of model hosting, SageMaker is your tool.
As for Hugging Face - you and I don't have millions of dollars to throw at training advanced neural networks with billions of parameters (at least not yet 😉). Luckily for us, Hugging Face has changed the game by making open-source models, built with cutting-edge methods, widely available. We can download these models, fine-tune them, and host them either in Hugging Face Spaces or on AWS. I believe expertise in using pre-trained models will grow in demand for MLOps practitioners in the near future, as the approach is simply too efficient and cost-effective to ignore.
- Setting up an Amazon SageMaker instance
- Preparing your Hugging Face model for deployment
- Deploying your model on SageMaker with ease
- Accessing your deployed model
Before getting started, ensure you have:
- An AWS Account
- A trained Hugging Face model you'd like to host
- Basic knowledge of Python and Linux
- Basic knowledge of AWS services (specifically SageMaker) will be helpful but not required
- For this step, we need an IAM user and an IAM role
- Skip to step 2 if you already have both set up
To authenticate a command-line tool like this one, we need AWS access key credentials. Head over to your AWS console and go to IAM Users. Click on the yellow "Create User" button on the top right:
Make sure you check the box that says "Provide Access to the Management Console". For our purposes, we'll tick the second box to create an IAM User.
To keep things simple, we won't require a new password for the new user's first sign-in.
As for permissions, let's attach policies directly and look for "AmazonSageMakerFullAccess":
Navigate through the remainder of user creation. Once AWS takes you back to your list of all IAM Users, select the IAM User you just created. Navigate to the security credentials tab:
Scroll down a bit to create access keys. When prompted for a use case, select command line interface (should be the first option).
Now, we have an access key and a secret key.
We can now authenticate our environment using these keys. Run aws configure in your terminal, paste in your access key and secret key when prompted, and select your desired region (we will use us-east-1).
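If you'd like to confirm the keys were picked up before moving on, a quick identity check can help. This is a minimal sketch, not part of the original project files: it assumes boto3 is installed, and the helper name describe_identity is my own. The client is passed in as an argument so the function itself has no AWS dependency.

```python
def describe_identity(sts_client):
    """Return a one-line summary of the identity behind the active credentials."""
    # get_caller_identity needs no special permissions, so it works as a
    # simple "are my credentials valid?" probe.
    identity = sts_client.get_caller_identity()
    return f"Authenticated as {identity['Arn']} (account {identity['Account']})"


# Typical use, assuming boto3 is installed and `aws configure` has been run:
#   import boto3
#   print(describe_identity(boto3.client("sts")))
```

If the printed ARN ends with the IAM user you just created, your command line is talking to AWS as that user.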
Head over to your AWS console and go to IAM Roles. Click on the yellow "Create Role" button on the top right:
Select AWS service, and SageMaker when prompted. It should automatically attach the SageMaker Full Access policy. Name your role, and continue with the default settings.
Select the role you just made and copy its ARN. Export the ARN to an environment variable so that host.py can read it.
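On the Python side, reading the role back could look roughly like the sketch below. The variable name SAGEMAKER_ROLE_ARN is a placeholder I picked for illustration (use whatever name your host.py expects), and the format check is only a loose sanity test, not an official ARN validator.

```python
import os
import re

# Hypothetical variable name, chosen for this example only.
ROLE_ENV_VAR = "SAGEMAKER_ROLE_ARN"

# Loose shape check: arn:<partition>:iam::<12-digit account>:role/<name>
_ROLE_ARN_PATTERN = re.compile(r"^arn:aws[a-z-]*:iam::\d{12}:role/.+$")


def load_role_arn(env=os.environ):
    """Read the role ARN from the environment and fail early if it looks wrong."""
    role_arn = env.get(ROLE_ENV_VAR, "").strip()
    if not role_arn:
        raise RuntimeError(f"Set {ROLE_ENV_VAR} to your SageMaker role ARN first")
    if not _ROLE_ARN_PATTERN.match(role_arn):
        raise ValueError(f"{ROLE_ENV_VAR} does not look like an IAM role ARN: {role_arn!r}")
    return role_arn
```

Failing early here is a small convenience: a typo in the ARN otherwise only surfaces a few minutes into the deployment.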
Choose the Hugging Face model you'd like to be hosted. For this project, we'll host our own version of DistilGPT2.
I've split up this process into two files. I've made the following changes to the default Python deployment script you can find on Hugging Face:
- host.py
- changed the top few lines to load in Role from environment variable
- added some print statements
- removed model querying - this script should only deploy the model to SageMaker
- query.py
- this takes command line input that will be passed to our model
- we're using a text generation model, so this will tell the model what to generate
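To give a feel for what host.py does, here is a rough sketch modeled on the Hugging Face boilerplate rather than a copy of the actual file. Treat the specifics as assumptions: the hub model ID "distilgpt2", the ml.m5.xlarge instance type, and the version pins should all be replaced with the values shown on your model's page, and the function names are my own.

```python
def build_hub_env(model_id="distilgpt2", task="text-generation"):
    """Environment variables that tell the Hugging Face container what to load."""
    return {"HF_MODEL_ID": model_id, "HF_TASK": task}


def deploy(role_arn, instance_type="ml.m5.xlarge"):
    """Create the model object and deploy it to a real-time endpoint."""
    # Imported lazily so build_hub_env can be used without the SDK installed.
    from sagemaker.huggingface import HuggingFaceModel

    model = HuggingFaceModel(
        env=build_hub_env(),
        role=role_arn,
        # Version pins are assumptions; copy the ones from your model's page.
        transformers_version="4.26.0",
        pytorch_version="1.13.1",
        py_version="py39",
    )
    print("Deploying model; this can take a few minutes...")
    predictor = model.deploy(initial_instance_count=1, instance_type=instance_type)
    print(f"Endpoint ready: {predictor.endpoint_name}")
    return predictor


# Typical use, with the role ARN read from an environment variable
# (SAGEMAKER_ROLE_ARN is a hypothetical name):
#   import os
#   deploy(os.environ["SAGEMAKER_ROLE_ARN"])
```

Note that nothing here queries the model; that job is left to query.py.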
I recommend using these two Python files, but if you'd like to create your own, you can find the boilerplate on your model's Hugging Face page:
On the right side of the page, look for Deploy->Amazon SageMaker. From here, you can copy and paste the Python code into your own file and make changes as you see fit.
Now comes the fun part. Run python host.py.
This will take a few minutes - this is the step that deploys your model to SageMaker for inference.
You may need to open query.py and change endpoint_name to your model endpoint's name. To find this, go to your AWS Management Console and head over to SageMaker. On the left side, look for Inference->Endpoints, and copy the name of the endpoint you just created via the host.py script.
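If you'd rather not click through the console, the endpoint name can also be looked up programmatically. The sketch below assumes boto3 is installed and that the endpoint you just deployed is the most recently created one in your region; the helper name latest_endpoint_name is my own.

```python
def latest_endpoint_name(sagemaker_client):
    """Return the name of the most recently created SageMaker endpoint."""
    response = sagemaker_client.list_endpoints(
        SortBy="CreationTime",
        SortOrder="Descending",
        MaxResults=1,
    )
    endpoints = response["Endpoints"]
    if not endpoints:
        raise LookupError("No SageMaker endpoints found in this region")
    return endpoints[0]["EndpointName"]


# Typical use, assuming boto3 is installed and credentials are configured:
#   import boto3
#   print(latest_endpoint_name(boto3.client("sagemaker")))
```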
Then, run query.py and see your deployed text generation model in action:
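For reference, a query script along these lines could be sketched as follows. This is an illustration under assumptions, not the actual query.py: it assumes boto3 is installed, that the endpoint accepts JSON of the form {"inputs": ...}, and that a text generation endpoint answers with a list of objects carrying a generated_text field. The function names and the endpoint name in the usage comment are placeholders.

```python
import argparse
import json


def parse_prompt(argv=None):
    """Read the prompt from the command line (argv=None means sys.argv)."""
    parser = argparse.ArgumentParser(description="Query a text generation endpoint")
    parser.add_argument("prompt", help="text the model should continue")
    return parser.parse_args(argv).prompt


def generate(runtime_client, endpoint_name, prompt):
    """Send the prompt to the endpoint and return the generated text."""
    response = runtime_client.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=json.dumps({"inputs": prompt}),
    )
    # The response body is a stream; read it fully before decoding the JSON.
    result = json.loads(response["Body"].read())
    return result[0]["generated_text"]


# Typical use, assuming boto3 is installed ("my-endpoint-name" is a placeholder):
#   import boto3
#   runtime = boto3.client("sagemaker-runtime")
#   print(generate(runtime, "my-endpoint-name", parse_prompt()))
```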
As can be seen, the output is truncated due to model constraints, and it isn't what we'd consider ideal given our input. This is a drawback of using a smaller model. For larger models, we'd either have to increase the instance size, or look to another inference solution altogether.
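One thing that may be worth experimenting with is the request payload itself. As far as I know, the Hugging Face inference container forwards an optional "parameters" object to the underlying pipeline, so a generation setting such as max_new_tokens can be passed along with the prompt. The sketch below is an assumption-laden illustration (the helper name and the default of 50 tokens are mine), and it won't make a small model any smarter; it only asks for a longer continuation.

```python
def build_payload(prompt, max_new_tokens=50):
    """Build a request body that asks for up to max_new_tokens generated tokens."""
    if max_new_tokens < 1:
        raise ValueError("max_new_tokens must be at least 1")
    return {
        "inputs": prompt,
        # Assumed to be forwarded to the text generation pipeline as keyword arguments.
        "parameters": {"max_new_tokens": max_new_tokens},
    }
```

The resulting dictionary would be serialized to JSON and sent as the request body in place of a bare {"inputs": ...} payload.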
Once you're done using your model, you don't want to be billed for idle resources. Navigate to SageMaker on the AWS Management Console and scroll down the left-hand side until you find "Inference". From there, delete the Model, the Endpoint, and the Endpoint Configuration.
Fortunately, the Python files in this project lend themselves to high reproducibility, and this solution can be spun up again in minutes.
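The same cleanup can be scripted if you tear things down often. This is a minimal sketch, assuming boto3 is installed and that you know the endpoint's name; the helper name delete_endpoint_resources is my own. It looks up the endpoint configuration and model names first, then deletes all three kinds of resource.

```python
def delete_endpoint_resources(sagemaker_client, endpoint_name):
    """Delete an endpoint plus the endpoint configuration and models behind it."""
    # Look everything up first; the names are harder to recover after deletion.
    endpoint = sagemaker_client.describe_endpoint(EndpointName=endpoint_name)
    config_name = endpoint["EndpointConfigName"]
    config = sagemaker_client.describe_endpoint_config(EndpointConfigName=config_name)
    model_names = [variant["ModelName"] for variant in config["ProductionVariants"]]

    sagemaker_client.delete_endpoint(EndpointName=endpoint_name)
    sagemaker_client.delete_endpoint_config(EndpointConfigName=config_name)
    for name in model_names:
        sagemaker_client.delete_model(ModelName=name)

    return {"endpoint": endpoint_name, "endpoint_config": config_name, "models": model_names}


# Typical use, assuming boto3 is installed ("my-endpoint-name" is a placeholder):
#   import boto3
#   delete_endpoint_resources(boto3.client("sagemaker"), "my-endpoint-name")
```

It is still worth glancing at the console afterwards to confirm nothing is left running.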