
::::::::::::::::::::::::::::::::::::::::::::::::

## Step 1: Data storage

> **Hackathon Attendees**: All data uploaded to AWS must relate to your specific Kaggle challenge, except for auxiliary datasets for transfer learning or pretraining. **DO NOT upload any restricted or sensitive data to AWS.**
## Options for storage: EC2 Instance or S3

### When to store data directly on EC2 (e.g., in a Jupyter Notebook instance)

Using EC2 for data storage can be a quick solution for certain temporary needs. An EC2 instance provides a virtual server environment with its own local storage, which can be used to store and process data directly on the instance. This method is suitable for temporary or small datasets and for one-off experiments that don’t require long-term data storage or frequent access from multiple services.

#### Limitations of EC2 storage

- **Scalability**: EC2 storage is limited to the instance’s disk capacity, so it may not be ideal for very large datasets.
- **Cost**: EC2 storage can be more costly for long-term use compared to S3.
- **Data persistence**: EC2 data may be lost if the instance is stopped or terminated, unless you use Elastic Block Store (EBS) for persistent storage.

### What is an S3 bucket?

Storing data in an **S3 bucket** is generally preferred for machine learning workflows on AWS, especially when using SageMaker. An S3 bucket is a container in Amazon S3 (Simple Storage Service) where you can store, organize, and manage data files. Buckets act as the top-level directory within S3 and can hold a virtually unlimited number of files and folders, making them ideal for storing large datasets, backups, logs, or any files needed for your project. You access objects in a bucket via a unique **S3 URI** (e.g., `s3://your-bucket-name/your-file.csv`), which you can use to reference data across various AWS services like EC2 and SageMaker.
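For a concrete picture of how that URI gets used, here is a minimal sketch using `boto3` (the AWS SDK for Python). The bucket and file names are hypothetical placeholders, and credentials are assumed to already be configured (e.g., via the AWS CLI or an attached IAM role):

```python
import boto3

# Create an S3 client; credentials are picked up from your AWS configuration
s3 = boto3.client("s3")

# Upload a local file into a bucket (names here are placeholders)
s3.upload_file(Filename="train.csv", Bucket="your-bucket-name", Key="data/train.csv")

# Download the same object later, e.g., from a different EC2 or SageMaker instance
s3.download_file(Bucket="your-bucket-name", Key="data/train.csv", Filename="train.csv")

# Many libraries also accept the S3 URI directly, e.g., pandas (requires s3fs):
# import pandas as pd
# df = pd.read_csv("s3://your-bucket-name/data/train.csv")
```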

::::::::::::::::::::::::::::::::::::: callout

### Benefits of using S3 (recommended for SageMaker and ML workflows)

The benefits will become clearer as you progress through these materials, but the most important are:

- **Scalability**: S3 handles large datasets efficiently, enabling storage beyond the limits of an EC2 instance's disk space.
- **Cost efficiency**: S3 storage costs are generally lower than expanding EC2 disk volumes. You only pay for the storage you use.
- **Separation of storage and compute**: You can start and stop EC2 instances without losing access to data stored in S3.
- **Integration with AWS services**: SageMaker can read directly from and write back to S3, making it ideal for AWS-based workflows.
- **Easy data sharing**: Datasets in S3 are easier to share with team members or across projects compared to EC2 storage.
- **Cost-effective data transfer**: When S3 and EC2 are in the same region, data transfer between them is free.

::::::::::::::::::::::::::::::::::::::::::::::::
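To make the SageMaker integration concrete, the sketch below shows a training job that reads its input from S3 and writes its model artifacts back to S3. The bucket, IAM role, and algorithm choice are all assumptions for illustration, not part of this lesson's setup:

```python
import sagemaker
from sagemaker.estimator import Estimator
from sagemaker.inputs import TrainingInput

session = sagemaker.Session()
role = "arn:aws:iam::123456789012:role/SageMakerRole"  # placeholder IAM role

# Training data is read directly from S3...
train_input = TrainingInput(
    s3_data="s3://your-bucket-name/data/train.csv",
    content_type="text/csv",
)

estimator = Estimator(
    image_uri=sagemaker.image_uris.retrieve(
        "xgboost", session.boto_region_name, version="1.7-1"
    ),
    role=role,
    instance_count=1,
    instance_type="ml.m5.large",
    # ...and model artifacts are written back to S3 when training finishes
    output_path="s3://your-bucket-name/models/",
    sagemaker_session=session,
)

estimator.fit({"train": train_input})
```

Because both input and output live in S3, you can stop or terminate the notebook or training instances at any point without losing data; this is the separation of storage and compute described above.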

## Recommended approach: use S3 for data storage

For flexibility, scalability, and cost efficiency, store data in S3 and load it into EC2 or SageMaker as needed. This allows:

- Scaling storage without reconfiguring the instance
- Seamless integration across AWS services

### Summary steps to access S3 and upload your dataset

1. Log in to AWS Console and navigate to S3.
2. Create a new bucket or use an existing one.
3. Upload your dataset files.
4. Use the object URL or S3 URI to reference your data in future experiments. (A scripted version of these steps is sketched below.)
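
If you prefer to script these steps rather than click through the console, a rough `boto3` equivalent is sketched below. The bucket name and tag values are placeholders; note that in `us-east-1` (the default region) `create_bucket` takes no location configuration:

```python
import boto3

s3 = boto3.client("s3")
bucket = "teamname-datasetname"  # placeholder; bucket names must be globally unique

# Step 2: create the bucket (in us-east-1, omit CreateBucketConfiguration)
s3.create_bucket(Bucket=bucket)

# Add tags for cost tracking (see the tagging discussion below)
s3.put_bucket_tagging(
    Bucket=bucket,
    Tagging={
        "TagSet": [
            {"Key": "Name", "Value": bucket},
            {"Key": "ProjectName", "Value": "your-project"},
            {"Key": "Purpose", "Value": "hackathon"},
        ]
    },
)

# Step 3: upload the dataset file
s3.upload_file(Filename="train.csv", Bucket=bucket, Key="data/train.csv")

# Step 4: the S3 URI to reference in future experiments
print(f"s3://{bucket}/data/train.csv")
```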

### Detailed procedure

1. **Sign in to the AWS Management Console**:
- Log in to [AWS Console](https://aws.amazon.com/console/) using your credentials.

2. **Navigate to S3**:
- Type “S3” in the search bar and select **S3 - Scalable Storage in the Cloud**.

3. **Create a New Bucket (or Use an Existing One)**:
- Click **Create Bucket** and enter a unique name. **Hackathon participants**: Use the following convention for your bucket name: `teamname-datasetname` (e.g., `emission-impossible-co2data`); note that S3 bucket names must be lowercase and globally unique.
- **Region**: Leave as is (likely `us-east-1`, US East N. Virginia).
- **Access Control**: Disable ACLs (recommended).
- **Public Access**: Turn on "Block all public access".
- **Versioning**: Disable unless you need multiple versions of objects.
- **Tags**: Include suggested tags for easier cost tracking. Adding tags to your S3 buckets is a great way to track project-specific costs and usage over time, especially as data and resources scale up. While tags are required for hackathon participants, we suggest that all users apply tags to easily identify and analyze costs later. **Hackathon participants**: Use the following tag keys:
  - Name
  - ProjectName
  - Purpose
- **Encryption**: Use **Server-side encryption with Amazon S3 managed keys (SSE-S3)**.

![Example of Recommended Tags for an S3 Bucket](path/to/your-image.png){alt="Screenshot showing recommended tags for an S3 bucket, such as Team, Dataset, and Environment"}


Suggested tags include:
- **Team**: Your team name (e.g., `EmissionPossible`)
- **Dataset**: The specific dataset name (e.g., `CO2`)
- **Environment**: The type of environment (e.g., `Development`, `Production`)

4. **Upload Files to the Bucket**: