From 7434a08471a0199839e5b4245ffe22036b345836 Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Tue, 5 Nov 2024 16:20:16 -0600
Subject: [PATCH 01/16] Update index.md

---
 index.md | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/index.md b/index.md
index 8e23b13..620c9d9 100644
--- a/index.md
+++ b/index.md
@@ -4,7 +4,7 @@ site: sandpaper::sandpaper_site

## Workshop Overview

-This workshop introduces you to foundational workflows in **Aamazon SageMaker**, covering data setup, code repo setup, model training, and hyperparameter tuning within AWS's managed environment. You’ll learn how to use SageMaker notebooks to control data pipelines, manage training jobs, and evaluate model performance effectively. We’ll also cover strategies to help you scale training and tuning efficiently, with guidance on choosing between CPUs and GPUs, as well as when to consider parallelized training.
+This workshop introduces you to foundational workflows in **Amazon SageMaker**, covering data setup, code repo setup, model training, and hyperparameter tuning within AWS's managed environment. You’ll learn how to use SageMaker notebooks to control data pipelines, manage training and tuning jobs, and evaluate model performance effectively. We’ll also cover strategies to help you scale training and tuning efficiently, with guidance on choosing between CPUs and GPUs, as well as when to consider parallelized workflows (i.e., using multiple instances).

To keep costs manageable, this workshop provides tips for tracking and monitoring AWS expenses, so your experiments remain affordable. While AWS isn’t entirely free, it's very cost-effective for typical ML workflows—training roughly 100 models on a small dataset (under 10GB) can cost under $20, making it accessible for many research projects.

@@ -12,10 +12,9 @@ To keep costs manageable, this workshop provides tips for tracking and monitorin

Currently, this workshop does not include:

-- Model deployment via endpoints
- **AWS Lambda** for serverless function deployment,
- **MLFlow** or other MLOps tools for experiment tracking,
- Additional AWS services beyond the core SageMaker ML workflows.

-If there’s a specific ML workflow or AWS service you’d like to see included in this curriculum, we’re open to developing more content to meet the needs of researchers and ML practitioners at UW–Madison. Please contact [endemann@wisc.edu](mailto:endemann@wisc.edu) with suggestions or requests.
+If there’s a specific ML workflow or AWS service you’d like to see included in this curriculum, we’re open to developing more content to meet the needs of researchers and ML practitioners at UW–Madison (and at other research institutions). Please contact [endemann@wisc.edu](mailto:endemann@wisc.edu) with suggestions or requests.

From 98aa851b2fd0e2c21d1c81a465db75c140c1a13a Mon Sep 17 00:00:00 2001
From: Chris Endemann
Date: Tue, 5 Nov 2024 16:21:24 -0600
Subject: [PATCH 02/16] Update SageMaker-overview.md

---
 episodes/SageMaker-overview.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/episodes/SageMaker-overview.md b/episodes/SageMaker-overview.md
index 1366412..eb623c6 100644
--- a/episodes/SageMaker-overview.md
+++ b/episodes/SageMaker-overview.md
@@ -6,18 +6,18 @@ exercises: 0

Amazon SageMaker is a comprehensive machine learning platform that empowers users to build, train, tune, and deploy models at scale.
Designed to streamline the ML workflow, SageMaker supports data scientists and researchers in tackling complex machine learning problems without needing to manage underlying infrastructure. This allows you to focus on developing and refining your models while leveraging AWS’s robust computing resources for efficient training and deployment. -### Why Use SageMaker for Machine Learning? +### Why use SageMaker for machine learning? SageMaker provides several features that make it an ideal choice for researchers and ML practitioners: -- **End-to-End Workflow**: SageMaker covers the entire ML pipeline, from data preprocessing to model deployment. This unified environment reduces the need to switch between platforms or tools, enabling users to set up, train, tune, and deploy models seamlessly. +- **End-to-end workflow**: SageMaker covers the entire ML pipeline, from data preprocessing to model deployment. This unified environment reduces the need to switch between platforms or tools, enabling users to set up, train, tune, and deploy models seamlessly. -- **Flexible Compute Options**: SageMaker lets you easily select instance types tailored to your project needs. For compute-intensive tasks, such as training deep learning models, you can switch to GPU instances for faster processing. SageMaker’s scalability also supports parallelized training, enabling you to distribute large training jobs across multiple instances, which can significantly speed up training time for large datasets and complex models. +- **Flexible compute options**: SageMaker lets you easily select instance types tailored to your project needs. For compute-intensive tasks, such as training deep learning models, you can switch to GPU instances for faster processing. SageMaker’s scalability also supports parallelized training, enabling you to distribute large training jobs across multiple instances, which can significantly speed up training time for large datasets and complex models. -- **Efficient Hyperparameter Tuning**: SageMaker provides powerful tools for automated hyperparameter tuning, allowing users to perform complex cross-validation (CV) searches with a single chunk of code. This feature enables you to explore a wide range of parameters and configurations efficiently, helping you find optimal models without manually managing multiple training runs. +- **Efficient hyperparameter tuning**: SageMaker provides powerful tools for automated hyperparameter tuning, allowing users to perform complex cross-validation (CV) searches with a single chunk of code. This feature enables you to explore a wide range of parameters and configurations efficiently, helping you find optimal models without manually managing multiple training runs. - **Support for Custom Scripts**: While SageMaker offers built-in algorithms, it also allows users to bring their own customized scripts. This flexibility is crucial for researchers developing unique models or custom algorithms. SageMaker’s support for Docker containers allows you to deploy fully customized code for training, tuning, and inference on scalable AWS infrastructure. -- **Cost Management and Monitoring**: SageMaker includes built-in monitoring tools to help you track and manage costs, ensuring you can scale up efficiently without unnecessary expenses. With thoughtful usage, SageMaker can be very affordable—for example, training roughly 100 models on a small dataset (under 1GB) can cost less than $20, making it accessible for many research projects. 
+- **Cost management and monitoring**: SageMaker includes built-in monitoring tools to help you track and manage costs, ensuring you can scale up efficiently without unnecessary expenses. With thoughtful usage, SageMaker can be very affordable—for example, training roughly 100 models on a small dataset (under 1GB) can cost less than $20, making it accessible for many research projects. SageMaker is designed to support machine learning at any scale, making it a strong choice for projects ranging from small experiments to large research deployments. With robust tools for every step of the ML process, it empowers researchers and practitioners to bring their models from development to production efficiently and effectively. From 80bac0ae453d22f7c0380fad26d8edd7e6b8f9cb Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 16:29:04 -0600 Subject: [PATCH 03/16] Update Data-storage-setting-up-S3.md --- episodes/Data-storage-setting-up-S3.md | 30 +++++++++++--------------- 1 file changed, 12 insertions(+), 18 deletions(-) diff --git a/episodes/Data-storage-setting-up-S3.md b/episodes/Data-storage-setting-up-S3.md index 253d563..089cfb2 100644 --- a/episodes/Data-storage-setting-up-S3.md +++ b/episodes/Data-storage-setting-up-S3.md @@ -24,24 +24,27 @@ exercises: 5 > **Hackathon Attendees**: All data uploaded to AWS must relate to your specific Kaggle challenge, except for auxiliary datasets for transfer learning or pretraining. **DO NOT upload any restricted or sensitive data to AWS.** ## Options for storage: EC2 Instance or S3 +When working with SageMaker and other AWS services, you have options for data storage, primarily **EC2 instances** or **S3**. -### When to store data directly on EC2 (e.g., in Jupyter Notebook instance) +#### What is an EC2 instance? +An Amazon EC2 (Elastic Compute Cloud) instance is a virtual server environment where you can run applications, process data, and store data temporarily. EC2 instances come in various types and sizes to meet different computing and memory needs, making them versatile for tasks ranging from light web servers to intensive machine learning workloads. In SageMaker, the notebook instance itself is an EC2 instance configured to run Jupyter notebooks, enabling direct data processing. -Using EC2 for data storage can be a quick solution for certain temporary needs. An EC2 instance provides a virtual server environment with its own local storage, which can be used to store and process data directly on the instance. This method is suitable for temporary or small datasets and for one-off experiments that don’t require long-term data storage or frequent access from multiple services. +#### When to store data directly on EC2 +Using an EC2 instance for data storage can be useful for temporary or small datasets, especially during processing within a Jupyter notebook. However, this storage is not persistent; if the instance is stopped or terminated, the data is erased. Therefore, EC2 is ideal for one-off experiments or intermediate steps in data processing. + +**Limitations of EC2 storage** -#### Limitations of EC2 storage: - **Scalability**: EC2 storage is limited to the instance’s disk capacity, so it may not be ideal for very large datasets. - **Cost**: EC2 storage can be more costly for long-term use compared to S3. - **Data Persistence**: EC2 data may be lost if the instance is stopped or terminated, unless using Elastic Block Store (EBS) for persistent storage. - ### What is an S3 bucket? 
Storing data in an **S3 bucket** is generally preferred for machine learning workflows on AWS, especially when using SageMaker. An S3 bucket is a container in Amazon S3 (Simple Storage Service) where you can store, organize, and manage data files. Buckets act as the top-level directory within S3 and can hold a virtually unlimited number of files and folders, making them ideal for storing large datasets, backups, logs, or any files needed for your project. You access objects in a bucket via a unique **S3 URI** (e.g., `s3://your-bucket-name/your-file.csv`), which you can use to reference data across various AWS services like EC2 and SageMaker. ::::::::::::::::::::::::::::::::::::: callout ### Benefits of using S3 (recommended for SageMaker and ML workflows) -The benefits will become more clear as you progress through these materials. However, to point out the most important benefits briefly... +For flexibility, scalability, and cost efficiency, store data in S3 and load it into EC2 as needed. This setup allows: - **Scalability**: S3 handles large datasets efficiently, enabling storage beyond the limits of an EC2 instance's disk space. - **Cost efficiency**: S3 storage costs are generally lower than expanding EC2 disk volumes. You only pay for the storage you use. @@ -52,15 +55,6 @@ The benefits will become more clear as you progress through these materials. How :::::::::::::::::::::::::::::::::::::::::::::::: - -## Recommended approach: Use S3 for data storage - -For flexibility, scalability, and cost efficiency, store data in S3 and load it into EC2 as needed. This setup allows: - -- Starting and stopping EC2 instances as needed -- Scaling storage without reconfiguring the instance -- Seamless integration across AWS services - ### Summary steps to access S3 and upload your dataset 1. Log in to AWS Console and navigate to S3. @@ -68,17 +62,17 @@ For flexibility, scalability, and cost efficiency, store data in S3 and load it 3. Upload your dataset files. 4. Use the object URL to reference your data in future experiments. -### Detailed procedure: +### Detailed procedure -1. **Sign in to the AWS Management Console**: +1. **Sign in to the AWS Management Console** - Log in to AWS Console using your credentials. -2. **Navigate to S3**: +2. **Navigate to S3** - Type "S3" in the search bar - Protip: select the star icon to save S3 as a bookmark in your AWS toolbar - Select **S3 - Scalable Storage in the Cloud** -4. **Create a new bucket**: +4. **Create a new bucket** - Click **Create Bucket** and enter a unique name. **Hackathon participants**: Use the following convention for your bucket name: `TeamName-DatasetName` (e.g., `MyAwesomeTeam-TitanicData`). - **Region**: Leave as is (likely `us-east-1` (US East N. Virginia)) - **Access Control**: Disable ACLs (recommended). 
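As a quick illustration of how the S3 URI gets used once your files are uploaded, the sketch below reads a CSV from a bucket into a pandas DataFrame with `boto3` (typically preinstalled on SageMaker notebook instances). The bucket and file names are placeholders based on this workshop's Titanic example; substitute the URI of your own object, and note that the role running the code needs read access to the bucket (see the permissions setup later in this episode).

```python
import boto3
import pandas as pd

# Placeholder bucket and key -- replace with your own S3 URI components
bucket_name = "myawesometeam-titanic"
object_key = "titanic_train.csv"

# Fetch the object from S3 and load it into a DataFrame
s3 = boto3.client("s3")
response = s3.get_object(Bucket=bucket_name, Key=object_key)
train_df = pd.read_csv(response["Body"])

print(train_df.shape)
```

Each read like this counts as a `GET` request against the bucket, which is relevant to the cost discussion later in this episode.
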
From 6a93343a1da8b74cf451ca08606768e766712fec Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 16:31:56 -0600 Subject: [PATCH 04/16] Update Data-storage-setting-up-S3.md --- episodes/Data-storage-setting-up-S3.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/episodes/Data-storage-setting-up-S3.md b/episodes/Data-storage-setting-up-S3.md index 089cfb2..1f76081 100644 --- a/episodes/Data-storage-setting-up-S3.md +++ b/episodes/Data-storage-setting-up-S3.md @@ -32,12 +32,16 @@ An Amazon EC2 (Elastic Compute Cloud) instance is a virtual server environment w #### When to store data directly on EC2 Using an EC2 instance for data storage can be useful for temporary or small datasets, especially during processing within a Jupyter notebook. However, this storage is not persistent; if the instance is stopped or terminated, the data is erased. Therefore, EC2 is ideal for one-off experiments or intermediate steps in data processing. -**Limitations of EC2 storage** +::::::::::::::::::::::::::::::::::::: callout + +### Limitations of EC2 storage - **Scalability**: EC2 storage is limited to the instance’s disk capacity, so it may not be ideal for very large datasets. - **Cost**: EC2 storage can be more costly for long-term use compared to S3. - **Data Persistence**: EC2 data may be lost if the instance is stopped or terminated, unless using Elastic Block Store (EBS) for persistent storage. +:::::::::::::::::::::::::::::::::::::::::::::::: + ### What is an S3 bucket? Storing data in an **S3 bucket** is generally preferred for machine learning workflows on AWS, especially when using SageMaker. An S3 bucket is a container in Amazon S3 (Simple Storage Service) where you can store, organize, and manage data files. Buckets act as the top-level directory within S3 and can hold a virtually unlimited number of files and folders, making them ideal for storing large datasets, backups, logs, or any files needed for your project. You access objects in a bucket via a unique **S3 URI** (e.g., `s3://your-bucket-name/your-file.csv`), which you can use to reference data across various AWS services like EC2 and SageMaker. @@ -55,6 +59,8 @@ For flexibility, scalability, and cost efficiency, store data in S3 and load it :::::::::::::::::::::::::::::::::::::::::::::::: +## Recommended approach: S3 buckets + ### Summary steps to access S3 and upload your dataset 1. Log in to AWS Console and navigate to S3. From 26349123c8881a4ffd195a99440c7b8f66997796 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 16:46:58 -0600 Subject: [PATCH 05/16] Update Data-storage-setting-up-S3.md --- episodes/Data-storage-setting-up-S3.md | 102 ++++++++++++++++++++----- 1 file changed, 83 insertions(+), 19 deletions(-) diff --git a/episodes/Data-storage-setting-up-S3.md b/episodes/Data-storage-setting-up-S3.md index 1f76081..4cdb0e5 100644 --- a/episodes/Data-storage-setting-up-S3.md +++ b/episodes/Data-storage-setting-up-S3.md @@ -125,46 +125,110 @@ Once the bucket is created, you'll be brought to a page that shows all of your c For hackathon attendees, this policy grants the `ml-sagemaker-use` IAM role access to specific S3 bucket actions, ensuring they can use the bucket for reading, writing, deleting, and listing parts during multipart uploads. Attendees should apply this policy to their buckets to enable SageMaker to operate on stored data. 
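If you would rather attach the policy from code than through the console, a minimal sketch using `boto3` is shown below. It assumes you have saved the policy JSON, with your bucket name already edited in, to a local file; the bucket name and file name here are placeholders.

```python
import boto3

# Placeholder names -- replace with your bucket and the path to your saved policy file
bucket_name = "myawesometeam-titanic"
policy_file = "bucket_policy.json"

# Load the policy document you prepared (bucket name already edited in)
with open(policy_file) as f:
    policy_json = f.read()

# Attach the policy to the bucket (requires permission to modify bucket policies)
s3 = boto3.client("s3")
s3.put_bucket_policy(Bucket=bucket_name, Policy=policy_json)
```
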
+::::::::::::::::::::::::::::::::::::: callout + ### General guidance for setting up permissions outside the hackathon -> For those not participating in the hackathon, it’s essential to create a similar IAM role (such as `ml-sagemaker-use`) with policies that provide controlled access to S3 resources, ensuring only the necessary actions are permitted for security and cost-efficiency. -> -> 1. **Create an IAM role**: Set up an IAM role for SageMaker to assume, with necessary S3 access permissions, such as `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, and `s3:ListMultipartUploadParts`, as shown in the policy above. -> -> 2. **Attach permissions to S3 buckets**: Attach bucket policies that specify this role as the principal, as in the hackathon example. -> -> 3. **More information**: For a detailed guide on setting up roles and policies for SageMaker, refer to the [AWS SageMaker documentation on IAM roles and policies](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). This resource explains role creation, permission setups, and policy best practices tailored for SageMaker’s operations with S3 and other AWS services. -> -> This setup ensures that your SageMaker operations will have the access needed without exposing the bucket to unnecessary permissions or external accounts. +For those not participating in the hackathon, it’s essential to create a similar IAM role (such as `ml-sagemaker-use`) with policies that provide controlled access to S3 resources, ensuring only the necessary actions are permitted for security and cost-efficiency. + +a. **Create an IAM role**: Set up an IAM role for SageMaker to assume, with necessary S3 access permissions, such as `s3:GetObject`, `s3:PutObject`, `s3:DeleteObject`, and `s3:ListMultipartUploadParts`, as shown in the policy above. + +b. **Attach permissions to S3 buckets**: Attach bucket policies that specify this role as the principal, as in the hackathon example. + +c. **More information**: For a detailed guide on setting up roles and policies for SageMaker, refer to the [AWS SageMaker documentation on IAM roles and policies](https://docs.aws.amazon.com/sagemaker/latest/dg/sagemaker-roles.html). This resource explains role creation, permission setups, and policy best practices tailored for SageMaker’s operations with S3 and other AWS services. + +This setup ensures that your SageMaker operations will have the access needed without exposing the bucket to unnecessary permissions or external accounts. + +:::::::::::::::::::::::::::::::::::::::::::::::: 7. **Upload files to the bucket**: - Navigate to the Objects tab of your bucket, then **Upload**. - **Add Files** (e.g., `titanic_train.csv`, `titanic_test.csv`) and click **Upload** to complete. 5. **Getting the S3 URI for your data**: - - After uploading, click on a file to find its **Object URI** (e.g., `s3://titanic-dataset-test/test.csv`). Use this URI to load data into SageMaker or EC2. + - After uploading, click on a file to find its **Object URI** (e.g., `s3://titanic-dataset-test/test.csv`). We'll use this URI to load data into SageMaker later. ## S3 bucket costs S3 bucket storage incurs costs based on data storage, data transfer, and request counts. -### Storage costs: -- Storage is charged per GB per month. -- Example: Storing 10 GB costs approximately $0.23/month in S3 Standard. -- **Pricing Tiers**: S3 offers multiple storage classes (Standard, Intelligent-Tiering, Glacier, etc.), with different costs based on access frequency and retrieval times. 
-- To calculate specific costs based on your needs, refer to AWS's [S3 Pricing Information](https://aws.amazon.com/s3/pricing/). +### Storage costs +- Storage is charged per GB per month. Typical: Storing 10 GB costs approximately $0.23/month in S3 Standard (us-east-1). +- Pricing Tiers: S3 offers multiple storage classes (Standard, Intelligent-Tiering, Glacier, etc.), with different costs based on access frequency and retrieval times. Standard S3 fits most purposes. If you're curious about other tiers, refer to AWS's [S3 Pricing Information](https://aws.amazon.com/s3/pricing/). +- To calculate specific costs based on your needs, storage class, and region, refer to AWS's [S3 Pricing Information](https://aws.amazon.com/s3/pricing/). -### Data transfer costs: +### Data transfer costs - **Uploading** data to S3 is free. -- **Downloading** data (out of S3) incurs charges (~$0.09/GB). +- **Downloading** data (out of S3) incurs charges (~$0.09/GB). Be sure to take note of this fee, as it can add up fast for large datasets. - **In-region transfer** (e.g., S3 to EC2) is free, while cross-region data transfer is charged (~$0.02/GB). > **[Data transfer pricing](https://aws.amazon.com/s3/pricing/)** -### Request costs: -- GET requests are $0.0004 per 1,000 requests. +### Request costs +- GET requests are $0.0004 per 1,000 requests. In the context of Amazon S3, "GET" requests refer to the action of retrieving or downloading data from an S3 bucket. Each time a file or object is accessed in S3, it incurs a small cost per request. This means that if you have code that reads data from S3 frequently, such as loading datasets repeatedly, each read operation counts as a GET request. > **[Request Pricing](https://aws.amazon.com/s3/pricing/)** +::::::::::::::::::::::::::::::::::::: challenge + +### Challenge Exercise: Calculate Your Project's Data Costs + +Estimate the total cost of storing your project data in S3 for one month, using the following dataset sizes and assuming: + +- Storage duration: 1 month +- Storage region: us-east-1 +- Storage class: S3 Standard +- Data will be retrieved 100 times for model training (`GET` requests) +- Data will be deleted after the project concludes, incurring data retrieval and deletion costs + +Dataset sizes to consider: +- 1 GB +- 10 GB +- 100 GB +- 1 TB + +**Hints**: +- S3 storage cost: $0.023 per GB per month (us-east-1) +- Data transfer cost (retrieval/deletion): $0.09 per GB (us-east-1 out to internet) +- `GET` requests cost: $0.0004 per 1,000 requests (each model training will incur one `GET` request) + +Check the [AWS S3 Pricing](https://aws.amazon.com/s3/pricing/) page for more details. + +:::::::::::::::::::::::::::::::::::::::::::::::: + +:::::::::::::::::::::::::::::::::::: solution + +### Solution + +Using the S3 Standard rate in us-east-1: + +1. **1 GB**: + - **Storage**: 1 GB * $0.023 = $0.023 + - **Retrieval/Deletion**: 1 GB * $0.09 = $0.09 + - **GET Requests**: 100 requests * $0.0004 per 1,000 = $0.00004 + - **Total Cost**: **$0.11304** + +2. **10 GB**: + - **Storage**: 10 GB * $0.023 = $0.23 + - **Retrieval/Deletion**: 10 GB * $0.09 = $0.90 + - **GET Requests**: 100 requests * $0.0004 per 1,000 = $0.00004 + - **Total Cost**: **$1.13004** + +3. **100 GB**: + - **Storage**: 100 GB * $0.023 = $2.30 + - **Retrieval/Deletion**: 100 GB * $0.09 = $9.00 + - **GET Requests**: 100 requests * $0.0004 per 1,000 = $0.00004 + - **Total Cost**: **$11.30004** + +4. 
**1 TB (1024 GB)**: + - **Storage**: 1024 GB * $0.023 = $23.55 + - **Retrieval/Deletion**: 1024 GB * $0.09 = $92.16 + - **GET Requests**: 100 requests * $0.0004 per 1,000 = $0.00004 + - **Total Cost**: **$115.71004** + +These costs assume no additional request charges beyond those for retrieval, storage, and `GET` requests for training. + +:::::::::::::::::::::::::::::::::::::::::::::::: + ## Removing unused data Choose one of these options: From 13153362f77253552662d83943bf4ba483e7b130 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 16:52:51 -0600 Subject: [PATCH 06/16] Update Data-storage-setting-up-S3.md --- episodes/Data-storage-setting-up-S3.md | 20 ++++++++++++-------- 1 file changed, 12 insertions(+), 8 deletions(-) diff --git a/episodes/Data-storage-setting-up-S3.md b/episodes/Data-storage-setting-up-S3.md index 4cdb0e5..e5466c5 100644 --- a/episodes/Data-storage-setting-up-S3.md +++ b/episodes/Data-storage-setting-up-S3.md @@ -73,12 +73,14 @@ For flexibility, scalability, and cost efficiency, store data in S3 and load it 1. **Sign in to the AWS Management Console** - Log in to AWS Console using your credentials. + 2. **Navigate to S3** - Type "S3" in the search bar - Protip: select the star icon to save S3 as a bookmark in your AWS toolbar - Select **S3 - Scalable Storage in the Cloud** -4. **Create a new bucket** + +3. **Create a new bucket** - Click **Create Bucket** and enter a unique name. **Hackathon participants**: Use the following convention for your bucket name: `TeamName-DatasetName` (e.g., `MyAwesomeTeam-TitanicData`). - **Region**: Leave as is (likely `us-east-1` (US East N. Virginia)) - **Access Control**: Disable ACLs (recommended). @@ -92,12 +94,13 @@ For flexibility, scalability, and cost efficiency, store data in S3 and load it - Click **Create Bucket** at the bottom once everything above has been configured -5. **Edit bucket policy** + +4. **Edit bucket policy** Once the bucket is created, you'll be brought to a page that shows all of your current buckets (and those on our shared account). We'll have to edit our bucket's policy to allow ourselves proper access to any files stored there (e.g., read from bucket, write to bucket). To set these permissions... 1. Click on the name of your bucket to bring up additional options and settings. 2. Click the Permissions tab - 3. Scroll down to Bucket policy and click Edit. Paste the following policy, editing the bucket name "aws-wksp-test" to reflect your bucket's name. + 3. Scroll down to Bucket policy and click Edit. Paste the following policy, **editing the bucket name "MyAwesomeTeam-TitanicData"** to reflect your bucket's name ```json { @@ -115,8 +118,8 @@ Once the bucket is created, you'll be brought to a page that shows all of your c "s3:ListMultipartUploadParts" ], "Resource": [ - "arn:aws:s3:::aws-wksp-test", - "arn:aws:s3:::aws-wksp-test/*" + "arn:aws:s3:::MyAwesomeTeam-TitanicData", + "arn:aws:s3:::MyAwesomeTeam-TitanicData/*" ] } ] @@ -140,11 +143,12 @@ This setup ensures that your SageMaker operations will have the access needed wi :::::::::::::::::::::::::::::::::::::::::::::::: -7. **Upload files to the bucket**: +5. **Upload files to the bucket** - Navigate to the Objects tab of your bucket, then **Upload**. - **Add Files** (e.g., `titanic_train.csv`, `titanic_test.csv`) and click **Upload** to complete. -5. **Getting the S3 URI for your data**: + +6. 
**Take note of S3 URI for your data** - After uploading, click on a file to find its **Object URI** (e.g., `s3://titanic-dataset-test/test.csv`). We'll use this URI to load data into SageMaker later. ## S3 bucket costs @@ -186,7 +190,7 @@ Dataset sizes to consider: - 100 GB - 1 TB -**Hints**: +**Hints** - S3 storage cost: $0.023 per GB per month (us-east-1) - Data transfer cost (retrieval/deletion): $0.09 per GB (us-east-1 out to internet) - `GET` requests cost: $0.0004 per 1,000 requests (each model training will incur one `GET` request) From 81db18fe516c29c0bb2008ad73c194c8fcc72545 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 16:54:50 -0600 Subject: [PATCH 07/16] Update Data-storage-setting-up-S3.md --- episodes/Data-storage-setting-up-S3.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/episodes/Data-storage-setting-up-S3.md b/episodes/Data-storage-setting-up-S3.md index e5466c5..6773370 100644 --- a/episodes/Data-storage-setting-up-S3.md +++ b/episodes/Data-storage-setting-up-S3.md @@ -86,7 +86,7 @@ For flexibility, scalability, and cost efficiency, store data in S3 and load it - **Access Control**: Disable ACLs (recommended). - **Public Access**: Turn on "Block all public access". - **Versioning**: Disable unless you need multiple versions of objects. - - **Tags**: Include suggested tags for easier cost tracking. Adding tags to your S3 buckets is a great way to track project-specific costs and usage over time, especially as data and resources scale up. While tags are required for hackathon participants, we suggest that all users apply tags to easily identify and analyze costs later. **Hackathon participants**: Use the following convention for your bucket name + - **Tags**: Adding tags to your S3 buckets is a great way to track project-specific costs and usage over time, especially as data and resources scale up. While tags are required for hackathon participants, we suggest that all users apply tags to easily identify and analyze costs later. 
**Hackathon participants**: Use the following convention for your bucket name - **Name**: Your Name - **ProjectName**: Your team's name - **Purpose**: Dataset name (e.g., TitanicData if you're following along with this workshop) From e7aae1db6a628627618632628c46da6ead65d864 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 16:56:10 -0600 Subject: [PATCH 08/16] Update Data-storage-setting-up-S3.md --- episodes/Data-storage-setting-up-S3.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/episodes/Data-storage-setting-up-S3.md b/episodes/Data-storage-setting-up-S3.md index 6773370..873b054 100644 --- a/episodes/Data-storage-setting-up-S3.md +++ b/episodes/Data-storage-setting-up-S3.md @@ -185,12 +185,14 @@ Estimate the total cost of storing your project data in S3 for one month, using - Data will be deleted after the project concludes, incurring data retrieval and deletion costs Dataset sizes to consider: + - 1 GB - 10 GB - 100 GB - 1 TB **Hints** + - S3 storage cost: $0.023 per GB per month (us-east-1) - Data transfer cost (retrieval/deletion): $0.09 per GB (us-east-1 out to internet) - `GET` requests cost: $0.0004 per 1,000 requests (each model training will incur one `GET` request) From 9b9295b63a66147a76d1954dcfcfddd1b5929322 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 17:00:56 -0600 Subject: [PATCH 09/16] Update SageMaker-notebooks-as-controllers.md --- episodes/SageMaker-notebooks-as-controllers.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/episodes/SageMaker-notebooks-as-controllers.md b/episodes/SageMaker-notebooks-as-controllers.md index 4268804..4f755ed 100644 --- a/episodes/SageMaker-notebooks-as-controllers.md +++ b/episodes/SageMaker-notebooks-as-controllers.md @@ -1,5 +1,5 @@ --- -title: "Notebooks as controllers" +title: "Notebooks as Controllers" teaching: 20 exercises: 10 --- @@ -21,9 +21,9 @@ exercises: 10 ## Step 2: Running Python code with SageMaker notebooks -Amazon SageMaker provides a managed environment to simplify the process of building, training, and deploying machine learning models. By using SageMaker, you can focus on model development without needing to manually provision resources or set up environments. Here, we'll guide you through setting up a Jupyter notebook instance and loading data to get started with model training and tuning in future episodes using the Titanic dataset in S3. +Amazon SageMaker provides a managed environment to simplify the process of building, training, and deploying machine learning models. By using SageMaker, you can focus on model development without needing to manually provision resources or set up environments. In this episode, we’ll guide you through setting up a **SageMaker notebook instance**—a Jupyter notebook hosted on AWS specifically for running SageMaker jobs. This setup allows you to efficiently manage and monitor machine learning workflows directly from a lightweight notebook controller. We’ll also cover loading data in preparation for model training and tuning in future episodes, using the Titanic dataset stored in S3. -> **Note**: We’ll use SageMaker notebook instances directly (instead of SageMaker Studio) for easier instance monitoring across users and streamlined resource management. +> **Note for hackathon attendees**: We’ll use SageMaker notebook instances (not the full SageMaker Studio environment) for simpler instance management and streamlined resource usage, ideal for collaborative projects or straightforward ML tasks. 
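Once the notebook instance you create below is running, a quick sanity check from a notebook cell confirms which IAM role and region your SageMaker session will use. This is a hedged sketch; the role ARN and region printed will depend on your own account and setup.

```python
import sagemaker

# Start a SageMaker session and look up the IAM role this notebook runs under
session = sagemaker.Session()
role = sagemaker.get_execution_role()

print("Execution role:", role)
print("Region:", session.boto_region_name)
```
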
## Using the notebook as a controller From fb8616c16150a29c7208f94ab94016e938c1ff11 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 17:02:19 -0600 Subject: [PATCH 10/16] Update SageMaker-notebooks-as-controllers.md --- episodes/SageMaker-notebooks-as-controllers.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/episodes/SageMaker-notebooks-as-controllers.md b/episodes/SageMaker-notebooks-as-controllers.md index 4f755ed..91a8454 100644 --- a/episodes/SageMaker-notebooks-as-controllers.md +++ b/episodes/SageMaker-notebooks-as-controllers.md @@ -37,8 +37,6 @@ In this setup, the notebook instance functions as a **controller** to manage mor 5. Use SageMaker SDK to launch training and tuning jobs on powerful instances (covered in next episodes). 6. View and monitor training/tuning progress (covered in next episodes). -> **Note**: In upcoming episodes, we’ll dive into training and tuning ML models with SageMaker from this notebook instance. - ## Detailed procedure ### 1. Navigate to SageMaker From 99f68645dcf1c3f933dc072943115308a6a96b8b Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 17:10:59 -0600 Subject: [PATCH 11/16] Update SageMaker-notebooks-as-controllers.md --- episodes/SageMaker-notebooks-as-controllers.md | 11 +++++------ 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/episodes/SageMaker-notebooks-as-controllers.md b/episodes/SageMaker-notebooks-as-controllers.md index 91a8454..36c61e9 100644 --- a/episodes/SageMaker-notebooks-as-controllers.md +++ b/episodes/SageMaker-notebooks-as-controllers.md @@ -40,14 +40,13 @@ In this setup, the notebook instance functions as a **controller** to manage mor ## Detailed procedure ### 1. Navigate to SageMaker -- In the AWS Console, search for **SageMaker** and select **SageMaker - Build, Train, and Deploy Models**. -- Click **Set up for single user** (if prompted) and wait for the SageMaker domain to spin up. -- Under **S3 Resource Configurations**, select the S3 bucket you created earlier containing your dataset. +- In the AWS Console, search for **SageMaker**. Protip: select the star icon to save SageMaker as a bookmark in your AWS toolbar +- Select **SageMaker - Build, Train, and Deploy Models**. ### 2. Create a new notebook instance -- In the SageMaker menu, go to **Notebooks > Notebook instances**, then click **Create notebook instance**. -- **Notebook name**: Enter a name (e.g., `Titanic-ML-Notebook`). -- **Instance type**: Start with a small instance type, such as `ml.t3.medium`. You can scale up later as needed for intensive tasks, which will be managed by launching separate training jobs from this notebook. +- In the SageMaker left-side menu, click on **Notebooks**, then click **Create notebook instance**. +- **Notebook name**: Enter a name that reflects your notebook's project and purpose. Hackathon attendees should follow this convention: TeamName_YourName_Dataset_Purpose_Model(s) (e.g., `MyAwesomeTeam_ChrisEndemann_TitanicData_Train-Tune_XGBoost-NN`). +- **Instance type**: Start with a small instance type, such as `ml.t3.medium`. You can scale up later as needed for intensive tasks, which will be managed by launching separate training jobs from this notebook. For guidance on common instances for ML procedures, refer to this [spreadsheet](https://docs.google.com/spreadsheets/d/1uPT4ZAYl_onIl7zIjv5oEAdwy4Hdn6eiA9wVfOBbHmY/edit?usp=sharing). - **Permissions and encryption**: - **IAM role**: Choose an existing role or create a new one. 
**Hackathon attendees should select 'ml-sagemaker-use'**. The role should include the `AmazonSageMakerFullAccess` policy to enable access to AWS services like S3. - **Root access**: Choose to enable or disable root access. If you’re comfortable with managing privileges, enabling root access allows for additional flexibility in package installation. From c71a2e207b4a01f88f6d88c7615a424454d72de3 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 17:15:18 -0600 Subject: [PATCH 12/16] Update SageMaker-notebooks-as-controllers.md --- episodes/SageMaker-notebooks-as-controllers.md | 6 ------ 1 file changed, 6 deletions(-) diff --git a/episodes/SageMaker-notebooks-as-controllers.md b/episodes/SageMaker-notebooks-as-controllers.md index 36c61e9..2cce9f4 100644 --- a/episodes/SageMaker-notebooks-as-controllers.md +++ b/episodes/SageMaker-notebooks-as-controllers.md @@ -60,12 +60,6 @@ Adding tags to your notebook instance helps track costs over time. ![Tag Setup Example](https://raw.githubusercontent.com/UW-Madison-DataScience/ml-with-aws-sagemaker/main/images/notebook_tags.PNG) Click **Create notebook instance**. It may take a few minutes for the instance to start. Once its status is **InService**, you can open the notebook instance and start coding. -::::::::::::::::::::::::::::::::::::: callout - - - -:::::::::::::::::::::::::::::::::::::::::::::::: - ### Managing training and tuning with the controller notebook After setting up the controller notebook, use the **SageMaker Python SDK** within the notebook to launch compute-heavy tasks on more powerful instances as needed. Examples of tasks to launch include: From 863fec5824da74ea9cda9251db0ba9aeed3a9489 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 17:18:55 -0600 Subject: [PATCH 13/16] Update Data-storage-setting-up-S3.md --- episodes/Data-storage-setting-up-S3.md | 2 ++ 1 file changed, 2 insertions(+) diff --git a/episodes/Data-storage-setting-up-S3.md b/episodes/Data-storage-setting-up-S3.md index 873b054..0383672 100644 --- a/episodes/Data-storage-setting-up-S3.md +++ b/episodes/Data-storage-setting-up-S3.md @@ -61,6 +61,8 @@ For flexibility, scalability, and cost efficiency, store data in S3 and load it ## Recommended approach: S3 buckets +**Hackathon attendees**: When you setup your bucket for your actual project, note that you will only need one bucket for your whole team. Team members will have the proper permissions to access buckets on our shared account. + ### Summary steps to access S3 and upload your dataset 1. Log in to AWS Console and navigate to S3. From 57c7cee5331254e1e3542050cb4c8d185a0aa8fa Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 17:25:09 -0600 Subject: [PATCH 14/16] Update Data-storage-setting-up-S3.md --- episodes/Data-storage-setting-up-S3.md | 10 +++++----- 1 file changed, 5 insertions(+), 5 deletions(-) diff --git a/episodes/Data-storage-setting-up-S3.md b/episodes/Data-storage-setting-up-S3.md index 0383672..1c24174 100644 --- a/episodes/Data-storage-setting-up-S3.md +++ b/episodes/Data-storage-setting-up-S3.md @@ -83,7 +83,7 @@ For flexibility, scalability, and cost efficiency, store data in S3 and load it 3. **Create a new bucket** - - Click **Create Bucket** and enter a unique name. **Hackathon participants**: Use the following convention for your bucket name: `TeamName-DatasetName` (e.g., `MyAwesomeTeam-TitanicData`). + - Click **Create Bucket** and enter a unique name, and note that bucket name must not contain uppercase characters. 
**Hackathon participants**: Use the following convention for your bucket name: `teamname_datasetname` (e.g., `myawesometeam-titanic`). - **Region**: Leave as is (likely `us-east-1` (US East N. Virginia)) - **Access Control**: Disable ACLs (recommended). - **Public Access**: Turn on "Block all public access". @@ -91,7 +91,7 @@ For flexibility, scalability, and cost efficiency, store data in S3 and load it - **Tags**: Adding tags to your S3 buckets is a great way to track project-specific costs and usage over time, especially as data and resources scale up. While tags are required for hackathon participants, we suggest that all users apply tags to easily identify and analyze costs later. **Hackathon participants**: Use the following convention for your bucket name - **Name**: Your Name - **ProjectName**: Your team's name - - **Purpose**: Dataset name (e.g., TitanicData if you're following along with this workshop) + - **Purpose**: Dataset name (e.g., titanic if you're following along with this workshop) ![Example of Tags for an S3 Bucket](https://raw.githubusercontent.com/UW-Madison-DataScience/ml-with-aws-sagemaker/main/images/bucket_tags.PNG){alt="Screenshot showing required tags for an S3 bucket"} - Click **Create Bucket** at the bottom once everything above has been configured @@ -102,7 +102,7 @@ Once the bucket is created, you'll be brought to a page that shows all of your c 1. Click on the name of your bucket to bring up additional options and settings. 2. Click the Permissions tab - 3. Scroll down to Bucket policy and click Edit. Paste the following policy, **editing the bucket name "MyAwesomeTeam-TitanicData"** to reflect your bucket's name + 3. Scroll down to Bucket policy and click Edit. Paste the following policy, **editing the bucket name "myawesometeam-titanic"** to reflect your bucket's name ```json { @@ -120,8 +120,8 @@ Once the bucket is created, you'll be brought to a page that shows all of your c "s3:ListMultipartUploadParts" ], "Resource": [ - "arn:aws:s3:::MyAwesomeTeam-TitanicData", - "arn:aws:s3:::MyAwesomeTeam-TitanicData/*" + "arn:aws:s3:::myawesometeam-titanic", + "arn:aws:s3:::myawesometeam-titanic/*" ] } ] From df0dfc55c70d0407a4e5cf2be735786a250b53c4 Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 17:32:40 -0600 Subject: [PATCH 15/16] Update SageMaker-notebooks-as-controllers.md --- episodes/SageMaker-notebooks-as-controllers.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/episodes/SageMaker-notebooks-as-controllers.md b/episodes/SageMaker-notebooks-as-controllers.md index 2cce9f4..076742d 100644 --- a/episodes/SageMaker-notebooks-as-controllers.md +++ b/episodes/SageMaker-notebooks-as-controllers.md @@ -40,12 +40,13 @@ In this setup, the notebook instance functions as a **controller** to manage mor ## Detailed procedure ### 1. Navigate to SageMaker -- In the AWS Console, search for **SageMaker**. Protip: select the star icon to save SageMaker as a bookmark in your AWS toolbar +- In the AWS Console, search for **SageMaker**. +- Protip: select the star icon to save SageMaker as a bookmark in your AWS toolbar - Select **SageMaker - Build, Train, and Deploy Models**. ### 2. Create a new notebook instance - In the SageMaker left-side menu, click on **Notebooks**, then click **Create notebook instance**. -- **Notebook name**: Enter a name that reflects your notebook's project and purpose. 
Hackathon attendees should follow this convention: TeamName_YourName_Dataset_Purpose_Model(s) (e.g., `MyAwesomeTeam_ChrisEndemann_TitanicData_Train-Tune_XGBoost-NN`). +- **Notebook name**: Enter a name that reflects your notebook's primary user (your name), dataset (titanic), purpose (train-tune), and models utilized (XGBoost-NN). **Hackathon attendees must use the following convention**: TeamName_YourName_Dataset_NotebookPurpose(s)_Model(s) (e.g., `MyAwesomeTeam_ChrisEndemann_Titanic_Train-Tune_XGBoost-NN`). - **Instance type**: Start with a small instance type, such as `ml.t3.medium`. You can scale up later as needed for intensive tasks, which will be managed by launching separate training jobs from this notebook. For guidance on common instances for ML procedures, refer to this [spreadsheet](https://docs.google.com/spreadsheets/d/1uPT4ZAYl_onIl7zIjv5oEAdwy4Hdn6eiA9wVfOBbHmY/edit?usp=sharing). - **Permissions and encryption**: - **IAM role**: Choose an existing role or create a new one. **Hackathon attendees should select 'ml-sagemaker-use'**. The role should include the `AmazonSageMakerFullAccess` policy to enable access to AWS services like S3. From 7d5315ebbb5e35a0f8c95f766933fc05b0731eea Mon Sep 17 00:00:00 2001 From: Chris Endemann Date: Tue, 5 Nov 2024 17:38:24 -0600 Subject: [PATCH 16/16] Update SageMaker-notebooks-as-controllers.md --- episodes/SageMaker-notebooks-as-controllers.md | 14 +++++++------- 1 file changed, 7 insertions(+), 7 deletions(-) diff --git a/episodes/SageMaker-notebooks-as-controllers.md b/episodes/SageMaker-notebooks-as-controllers.md index 076742d..e887b5f 100644 --- a/episodes/SageMaker-notebooks-as-controllers.md +++ b/episodes/SageMaker-notebooks-as-controllers.md @@ -48,18 +48,18 @@ In this setup, the notebook instance functions as a **controller** to manage mor - In the SageMaker left-side menu, click on **Notebooks**, then click **Create notebook instance**. - **Notebook name**: Enter a name that reflects your notebook's primary user (your name), dataset (titanic), purpose (train-tune), and models utilized (XGBoost-NN). **Hackathon attendees must use the following convention**: TeamName_YourName_Dataset_NotebookPurpose(s)_Model(s) (e.g., `MyAwesomeTeam_ChrisEndemann_Titanic_Train-Tune_XGBoost-NN`). - **Instance type**: Start with a small instance type, such as `ml.t3.medium`. You can scale up later as needed for intensive tasks, which will be managed by launching separate training jobs from this notebook. For guidance on common instances for ML procedures, refer to this [spreadsheet](https://docs.google.com/spreadsheets/d/1uPT4ZAYl_onIl7zIjv5oEAdwy4Hdn6eiA9wVfOBbHmY/edit?usp=sharing). +- **Platform identifier**: You can leave this as the default. - **Permissions and encryption**: - **IAM role**: Choose an existing role or create a new one. **Hackathon attendees should select 'ml-sagemaker-use'**. The role should include the `AmazonSageMakerFullAccess` policy to enable access to AWS services like S3. - - **Root access**: Choose to enable or disable root access. If you’re comfortable with managing privileges, enabling root access allows for additional flexibility in package installation. - - **Encryption key** (optional): Specify a KMS key for encrypting data at rest if needed. Otherwise, leave it blank. + - **Root access**: Enable root access to notebook. + - **Encryption key (optional)**: Specify a KMS key for encrypting data at rest if needed. Otherwise, leave it blank. - **Network (optional)**: Networking settings are optional. 
Configure them if you’re working within a specific VPC or need network customization.
-- **Git repositories configuration (optional)**: Connect a GitHub repository to automatically clone it into your notebook. Note that larger repositories consume more disk space, so manage storage to minimize costs. For this workshop, we'll run a clone command from jupyter to get our repo setup.
-- **Tags (required for hackathon attendees)**: Adding tags helps track and organize resources for billing and management. This is particularly useful when you need to break down expenses by project, task, or team. We recommend using tags like `Name`, `ProjectName`, and `Purpose` to help with future cost analysis.
-  - Please use the tags found in the below image to track your notebook's resource usage.
-Adding tags to your notebook instance helps track costs over time.
+- **Git repositories configuration (optional)**: You don't need to complete this configuration. Instead, we'll run a clone command from our notebook later to get our repo set up. This approach is a common strategy, allowing some flexibility in which repo you use for the notebook.
+- **Tags (required for hackathon attendees)**: Adding tags helps track and organize resources for billing and management. This is particularly useful when you need to break down expenses by project, task, or team. Please use the tags found in the below image to track your notebook's resource usage.

![Tag Setup Example](https://raw.githubusercontent.com/UW-Madison-DataScience/ml-with-aws-sagemaker/main/images/notebook_tags.PNG)

-Click **Create notebook instance**. It may take a few minutes for the instance to start. Once its status is **InService**, you can open the notebook instance and start coding.
+
+- Click **Create notebook instance**. It may take a few minutes for the instance to start. Once its status is **InService**, you can open the notebook instance and start coding.

### Managing training and tuning with the controller notebook
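
To make the controller pattern concrete, below is a hedged sketch of launching a single training job from the notebook with the SageMaker Python SDK, using the built-in XGBoost container as an example. The S3 paths, instance type, container version, and hyperparameters are placeholder assumptions to adapt; note that the built-in XGBoost algorithm expects CSV training data with the label in the first column and no header row.

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = sagemaker.get_execution_role()

# Placeholder S3 locations -- point these at your own bucket and prefixes
train_s3_uri = "s3://myawesometeam-titanic/train/"
output_s3_uri = "s3://myawesometeam-titanic/output/"

# Resolve a built-in XGBoost training image (version string is an assumption)
image_uri = sagemaker.image_uris.retrieve(
    framework="xgboost",
    region=session.boto_region_name,
    version="1.7-1",
)

# The heavy lifting runs on a separate training instance, not the notebook itself
estimator = Estimator(
    image_uri=image_uri,
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    output_path=output_s3_uri,
    sagemaker_session=session,
    hyperparameters={"objective": "binary:logistic", "num_round": 100},
)

# Launch the training job; fit() blocks and streams logs until the job finishes
estimator.fit({"train": TrainingInput(train_s3_uri, content_type="text/csv")})
```

The key point is that `fit()` provisions a separate, more powerful training instance, runs the job there, and releases it when finished, while the lightweight notebook instance simply submits and monitors the job.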