Group Members:
- Arjun Gupta
- Himanshu
- Aeshna Singh
- Palash Baranwal
ASSIGNMENT
- Go through this notebook, change the dataset (to anything other than "Amazon Reviews Polarity Dataset").
- Record the whole process using any screen-capture software (such as OBS).
- Upload the video to YouTube (you can make it unlisted, but allow embedding).
- Share the link to your YouTube video, and the GitHub link where I can see the code you used for training (move the notebook with logs from Amazon SageMaker to GitHub).
- Amazon SageMaker is a fully-managed service that enables data scientists and developers to quickly and easily build, train, and deploy machine learning models at any scale.
- Amazon SageMaker includes modules that can be used together or independently to build, train, and deploy your machine learning models.
Build
- Amazon SageMaker makes it easy to build ML models and get them ready for training by providing everything you need to quickly connect to your training data, and to select and optimize the best algorithm and framework for your application.
Train
- You can begin training your model with a single click in the Amazon SageMaker console. Amazon SageMaker manages all of the underlying infrastructure for you and can easily scale to train models at petabyte scale.
Deploy
- Once your model is trained and tuned, Amazon SageMaker makes it easy to deploy it in production so you can start generating predictions on new data (a process called inference).
Amazon SageMaker takes away the heavy lifting of machine learning, so you can build, train, and deploy machine learning models quickly and easily.
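As a rough sketch of the deploy step (following the Hugging Face inference docs linked at the end), a model can be hosted on a SageMaker endpoint and queried for predictions. The model id, task, instance type, and DLC versions below are illustrative assumptions, not the exact values from our notebook:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel

# Hub model configuration: which checkpoint to serve and for which task
# (placeholder values, not our fine-tuned model).
hub = {
    "HF_MODEL_ID": "distilbert-base-multilingual-cased",
    "HF_TASK": "text-classification",
}

huggingface_model = HuggingFaceModel(
    env=hub,
    role=sagemaker.get_execution_role(),  # SageMaker execution role
    transformers_version="4.12",          # example DLC versions
    pytorch_version="1.9",
    py_version="py38",
)

# Deploy to a real-time endpoint and run inference.
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type="ml.m5.xlarge",
)
print(predictor.predict({"inputs": "I really like this product!"}))

predictor.delete_endpoint()  # clean up to stop incurring charges
```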
To scale and accelerate our training we will use Amazon SageMaker, which provides two strategies for distributed training, data parallelism and model parallelism.
- Data parallelism splits a training set across several GPUs
- Model parallelism splits a model across several GPUs.
- We are going to use SageMaker data parallelism, which is built into the Trainer API. To use data parallelism, we only need to define the distribution parameter in our HuggingFace estimator, as sketched below.
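A minimal sketch of such an estimator, based on the Hugging Face SageMaker training docs; the entry point, script directory, hyperparameters, S3 paths, and DLC versions are assumptions, not the exact values from our run:

```python
import sagemaker
from sagemaker.huggingface import HuggingFace

role = sagemaker.get_execution_role()  # IAM role used by the training job

# Enable SageMaker's data-parallel library via the distribution parameter.
distribution = {"smdistributed": {"dataparallel": {"enabled": True}}}

# Hyperparameters forwarded to the training script (illustrative values).
hyperparameters = {
    "model_name": "distilbert-base-multilingual-cased",
    "epochs": 3,
    "train_batch_size": 32,
}

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # training script name (assumed)
    source_dir="./scripts",          # script directory (assumed)
    instance_type="ml.p3.16xlarge",  # 8x V100 GPUs, as used for this job
    instance_count=1,
    role=role,
    transformers_version="4.12",     # example DLC versions
    pytorch_version="1.9",
    py_version="py38",
    hyperparameters=hyperparameters,
    distribution=distribution,
)

# S3 locations of the tokenized train/test splits (placeholders).
huggingface_estimator.fit({
    "train": "s3://<bucket>/train",
    "test": "s3://<bucket>/test",
})
```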
amazon_reviews_multi
- It is an Amazon product reviews dataset for multilingual text classification.
- The dataset contains reviews in English, Japanese, German, French, Chinese and Spanish, collected between November 1, 2015 and November 1, 2019. Each record in the dataset contains the review text, the review title, the star rating, an anonymized reviewer ID, an anonymized product ID and the coarse-grained product category (e.g. ‘books’, ‘appliances’, etc.)
- Dataset Link: https://huggingface.co/datasets/amazon_reviews_multi
Data Fields
- review_id: A string identifier of the review.
- product_id: A string identifier of the product being reviewed.
- reviewer_id: A string identifier of the reviewer.
- stars: An int between 1-5 indicating the number of stars.
- review_body: The text body of the review.
- review_title: The text title of the review.
- language: The string identifier of the review language.
- product_category: String representation of the product's category.
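As a quick illustration (not the exact notebook code), a record with these fields can be loaded with the datasets library; the config and split names follow the dataset card, and printing one record gives output like the sample below:

```python
from datasets import load_dataset

# Load the German subset of the multilingual Amazon reviews dataset
# and print a single record with the fields described above.
dataset = load_dataset("amazon_reviews_multi", "de", split="train")
print(dataset[0])
```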
DATA SAMPLE
{ "review_id": "de_0784695", "product_id": "product_de_0572654", "reviewer_id": "reviewer_de_0645436", "stars": "1", "review_body": "Leider, leider nach einmal waschen ausgeblichen . Es sieht super h\u00fcbsch aus , nur leider stinkt es ganz schrecklich und ein Waschgang in der Maschine ist notwendig ! Nach einem mal waschen sah es aus als w\u00e4re es 10 Jahre alt und hatte 1000 e von Waschg\u00e4ngen hinter sich :( echt schade !", "review_title": "Leider nicht zu empfehlen", "language": "de", "product_category": "home" }
The amazon_reviews_multi dataset has 5 classes (star ratings). To turn it into a sentiment-analysis task, we map the star ratings to the following class labels (see the sketch after this list):
- (1-2): Negative
- (3): Neutral
- (4-5): Positive
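A minimal sketch of this relabeling with datasets.map; the label ids and helper-function name are our own choices for illustration, not taken from the original notebook:

```python
from datasets import load_dataset

dataset = load_dataset("amazon_reviews_multi", "all_languages", split="train")

def star_to_label(example):
    """Map 1-5 star ratings to 0 = negative, 1 = neutral, 2 = positive."""
    stars = int(example["stars"])
    if stars <= 2:
        example["label"] = 0
    elif stars == 3:
        example["label"] = 1
    else:
        example["label"] = 2
    return example

dataset = dataset.map(star_to_label)
```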
DistilBERT base multilingual model (cased)
- The model was trained on the concatenation of Wikipedia in 104 different languages (listed on the model card). The model has 6 layers, a hidden size of 768, and 12 attention heads, for a total of 134M parameters (compared to 177M parameters for mBERT-base).
- On average DistilmBERT is twice as fast as mBERT-base.
- Model Link: https://huggingface.co/distilbert-base-multilingual-cased
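For fine-tuning, the checkpoint is loaded with a sequence-classification head sized to the three sentiment labels; a minimal sketch (not the exact training script):

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

model_name = "distilbert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Attach a randomly initialized 3-class classification head
# (negative / neutral / positive) on top of the pretrained encoder.
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=3)
```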
- Name: finetune-distilbert-base-multilingual-cased-2022-02-02-14-58-39
- Training time: 17 minutes and 33 seconds
- Billable time: 5 minutes and 16 seconds
- Managed spot training savings: 70%
- Instance Type: ml.p3.16xlarge
- Instance count: 1
- epoch = 3.0
- eval_accuracy = 0.7614
- eval_f1 = 0.7614
- eval_loss = 0.5882487297058105
- eval_runtime = 1.5266
- eval_samples_per_second = 3275.246
- eval_steps_per_second = 26.202
HUGGING FACE HUB MODEL: https://huggingface.co/arjuntheprogrammer/distilbert-base-multilingual-cased-sentiment-2
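The published model can be tried directly with the transformers pipeline; a small usage sketch (the exact label names returned depend on the id2label mapping saved with the model):

```python
from transformers import pipeline

classifier = pipeline(
    "sentiment-analysis",
    model="arjuntheprogrammer/distilbert-base-multilingual-cased-sentiment-2",
)
print(classifier("This product is amazing, highly recommend it!"))
```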
- Hugging Face on Amazon SageMaker: https://huggingface.co/docs/sagemaker/main
- Deploy models to Amazon SageMaker: https://huggingface.co/docs/sagemaker/inference
- Run training on Amazon SageMaker: https://huggingface.co/docs/sagemaker/train
- Distributed training on multilingual BERT with Hugging Face Transformers & Amazon SageMaker: https://www.philschmid.de/pytorch-distributed-training-transformers
- Available SageMaker Studio Instance Types: https://docs.aws.amazon.com/sagemaker/latest/dg/notebooks-available-instance-types.html
- Example Notebook: https://github.com/aws-samples/finetune-deploy-bert-with-amazon-sagemaker-for-hugging-face/blob/main/finetune-distilbert.ipynb
- Transformers: https://github.com/huggingface/transformers