
Commit b1d5699

kyleoconnell-NIH committed Dec 5, 2023
2 parents 73fc623 + 59a4f5c commit b1d5699
Showing 4 changed files with 10 additions and 9 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -62,8 +62,8 @@ There is some strategy to managing storage costs as well. When you have spun up
## **Spin up a Virtual Machine and run a workflow** <a name="vm"></a>
Virtual machines (VMs) on AWS are called Amazon Elastic Compute Cloud (EC2) instances. They are virtual computers that you access via SSH and that start as (nearly) blank slates. You have complete control over the VM configuration, beginning with the [operating system](https://docs.aws.amazon.com/systems-manager/latest/userguide/prereqs-operating-systems.html#prereqs-os-linux): you can choose from a variety of Linux flavors, as well as macOS and Windows. VMs are organized into [machine families](https://aws.amazon.com/ec2/instance-types/?trk=36c6da98-7b20-48fa-8225-4784bced9843&sc_channel=ps&sc_campaign=acquisition&sc_medium=ACQ-P|PS-GO|Brand|Desktop|SU|Compute|EC2|US|EN|Text&s_kwcid=AL!4422!3!536392622533!e!!g!!ec2%20instance%20types&ef_id=CjwKCAjw7IeUBhBbEiwADhiEMedsuBikka4KyMJjZdw2Qi63FwjhvKhPlmHr9EYefV3GIE14lRz-ixoCqWIQAvD_BwE:G:s&s_kwcid=AL!4422!3!536392622533!e!!g!!ec2%20instance%20types) with different functions, such as General Purpose, Compute Optimized, and Accelerated Computing. You can also select machines with graphics processing units (GPUs), which can greatly accelerate some workloads but generally cost more than CPU-only machines. Billing occurs on a per-second basis, and larger and faster machine types cost more per second. This is why it is important to stop or delete machines when not in use, and to consider always using an [idle shutdown script](/docs/auto-shutdown-instance.md).
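For illustration, here is a minimal sketch of pausing billing from the AWS CLI when you step away from a machine; the instance ID is a placeholder, and the CLI must already be configured with your credentials.

```bash
# List running instances to find the one you want to stop
aws ec2 describe-instances \
  --filters "Name=instance-state-name,Values=running" \
  --query "Reservations[].Instances[].[InstanceId,InstanceType,LaunchTime]" \
  --output table

# Stop the instance when not in use; per-second billing pauses until you restart it
aws ec2 stop-instances --instance-ids i-0123456789abcdef0  # placeholder ID
```

Note that a stopped instance still accrues storage charges for its attached volumes; those stop only when the instance and its volumes are deleted.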

-Many great resources exist on how to spin up, connect to, and work on a VM on AWS. The first place to direct you is the tutorial created by [NIH Common Data Fund](https://training.nih-cfde.org/en/latest/Cloud-Platforms/Introduction_to_Amazon_Web_Services/introtoaws3/). This tutorial expects that you will launch an instance and work with it interactively.
-[Here](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-to-linux-instance.html) is the Amazon documentation for different ways to connect to an EC2 instance. In NIH staff will be able to connect from their [local terminal via SSH](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-linux-inst-ssh.html) or in the browser via [Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html). If you are an NIH-affiliated researcher, you will only be able to use the Session Manager. We wrote a [guide with screen shots](/docs/connect_to_EC2.md) that walks through SSH options.
+Many great resources exist on how to spin up, connect to, and work on a VM on AWS. Start with
+[this Amazon documentation](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-to-linux-instance.html) for different ways to connect to an EC2 instance. NIH staff will be able to connect from their [local terminal via SSH](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/connect-linux-inst-ssh.html) or in the browser via [Session Manager](https://docs.aws.amazon.com/systems-manager/latest/userguide/session-manager.html). If you are an NIH-affiliated researcher, you will only be able to use the Session Manager. We wrote a [guide with screen shots](/docs/connect_to_EC2.md) that walks through SSH options.
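As a rough sketch, the two connection paths look like this from a terminal; the key file, hostname, and instance ID are placeholders, and Session Manager additionally requires the Session Manager plugin for the AWS CLI.

```bash
# Option 1: SSH from a local terminal (NIH staff)
ssh -i ~/.ssh/my-key.pem ec2-user@ec2-198-51-100-1.compute-1.amazonaws.com  # placeholders

# Option 2: Session Manager from the terminal; NIH-affiliated researchers
# can use the browser version from the EC2 console instead
aws ssm start-session --target i-0123456789abcdef0  # placeholder instance ID
```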

If you want to launch a Windows VM, check out this [tutorial](https://aws.amazon.com/getting-started/hands-on/launch-windows-vm/).

2 changes: 1 addition & 1 deletion docs/Install_AWSParallelCluster.md
@@ -132,7 +132,7 @@ export PATH="$HOME/mambaforge/bin:$PATH"
The output should display the version number.
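For example, assuming the `pcluster` CLI from the installation steps above is on your PATH:

```bash
pcluster version
```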

## 5. Move on to configuring and connecting to your cluster
-Follow the [next instructions]([https://github.com/STRIDES/NIHCloudLabAWS/blob/main/docs/Configure_AWSParallelCluster.md) on how to configure your cluster.
+Follow the [next instructions](/docs/Configure_AWSParallelCluster.md) on how to configure your cluster.



7 changes: 3 additions & 4 deletions tutorials/README.md
@@ -26,7 +26,7 @@ There are a lot of ways to run workflows on AWS. Here we list a few possibilities

- The simplest approach is probably to spin up an EC2 instance and run your command interactively, using `screen`, or as a [startup script](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/user-data.html) attached as metadata; see the sketch after this list. See the [GWAS tutorial](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud) below for more info on how to run a pipeline using EC2.
- You could also run your pipeline via a SageMaker notebook, either by splitting out each command as a different block, or by running a workflow manager (Nextflow etc.). See [here](https://aws.amazon.com/blogs/machine-learning/scheduling-jupyter-notebooks-on-sagemaker-ephemeral-instances/) about scheduling a notebook to let it run longer. You can find some example notebooks in the [tutorials below](/tutorials/notebooks/).
-- If you are running bioinformatic workflows, you can leverage the serverless functionality of AWS using [Amazon HealthOmics](https://aws.amazon.com/healthomics/). Read [this blog](https://aws.amazon.com/blogs/industries/automated-end-to-end-genomics-data-storage-and-analysis-using-amazon-omics/) for more detailed information and also see if any new blogs have come out. If you want to get some hands on experience with HealthOmics using Cloud Lab, follow [this on-demand workshop](https://catalog.workshops.aws/amazon-omics-end-to-end/en-US/001-getting-started/010-self-directed-workshop) from Amazon! Since you already have an account set up, skip directly to the _Workshop_ section and then you can decide if you want to complete the tutorial via the console, the CLI, or via Notebooks. If you go the notebook route, just spin up a notebook via [Sagemaker](https://github.com/STRIDES/NIHCloudLabAWS/tree/kao_update_docs#launch-a-sagemaker-notebook-). If you want to create a private workflow using Nextflow, you will need to migrate your containers to a private Amazon Elastic Container Registry (ECR). You can follow [this workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/76d4a4ff-fe6f-436a-a1c2-f7ce44bc5d17/en-US) to learn how that process works.
+- If you are running bioinformatic workflows, you can leverage the serverless functionality of AWS using [Amazon HealthOmics](https://aws.amazon.com/healthomics/). Read [this blog](https://aws.amazon.com/blogs/industries/automated-end-to-end-genomics-data-storage-and-analysis-using-amazon-omics/) for more detailed information and also see if any new blogs have come out. If you want to get some hands-on experience with HealthOmics using Cloud Lab, follow [this on-demand workshop](https://catalog.workshops.aws/amazon-omics-end-to-end/en-US/001-getting-started/010-self-directed-workshop) from Amazon! Since you already have an account set up, skip directly to the _Workshop_ section and then you can decide if you want to complete the tutorial via the console, the CLI, or via Notebooks. If you go the notebook route, just spin up a notebook via [SageMaker](/docs/Jupyter_notebook.md). If you want to create a private workflow using Nextflow, you will need to migrate your containers to a private Amazon Elastic Container Registry (ECR). You can follow [this workshop](https://catalog.us-east-1.prod.workshops.aws/workshops/76d4a4ff-fe6f-436a-a1c2-f7ce44bc5d17/en-US) to learn how that process works.
- If you are using a workflow manager other than WDL, Nextflow, or CWL (e.g., Snakemake), use the [AWS Genomics CLI](https://aws.amazon.com/genomics-cli/), which is a wrapper for genomics workflow managers and AWS Batch (a serverless computing cluster). See our [docs](/docs/agc.md) on how to set up the AGC CLI for Cloud Lab. You can also just run Snakemake locally within a VM. See our [Pangolin tutorial](/tutorials/notebooks/pangolin) for one example.
- Finally, one benefit of the cloud is access to GPUs for workflow acceleration. While a lot of GPU attention goes to AI/ML workflows, NVIDIA has software called Parabricks that accelerates genomic workflows at pretty low cost. Check the full list of command-line options [here](https://docs.nvidia.com/clara/parabricks/3.7.0/index.html) to see if your specific workflow is accelerated. The easiest way to run Parabricks right now is via AWS HealthOmics [Ready2Run workflows](https://docs.aws.amazon.com/omics/latest/dev/service-workflows.html), but to run it via EC2 see our [guide](/docs/parabricks.md).
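As a hedged illustration of the startup-script option in the first bullet above, a script can be attached as user data when the instance launches; the repository URL, AMI ID, and key name below are placeholders.

```bash
# Write a startup script; it runs once as root on first boot (contents illustrative)
cat > startup.sh <<'EOF'
#!/bin/bash
yum install -y git
git clone https://github.com/example/my-pipeline.git /opt/my-pipeline  # placeholder repo
/opt/my-pipeline/run.sh > /var/log/pipeline.log 2>&1
EOF

# Launch an instance with the script attached as user data (placeholder IDs)
aws ec2 run-instances \
  --image-id ami-0123456789abcdef0 \
  --instance-type t3.medium \
  --key-name my-key \
  --user-data file://startup.sh
```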

@@ -49,7 +49,6 @@ Genome-wide association studies (GWAS) are large-scale investigations that analyze
Medical imaging analysis involves large image files and often requires elastic storage and accelerated computing.
- Most medical imaging analyses are done using notebooks, so we would recommend accessing this [Jupyter Notebook](/tutorials/notebooks/SpleenLiverSegmentation) and cloning it into SageMaker. The tutorial walks through image segmentation.
- [This Sagemaker Studio on-demand workshop](https://catalog.workshops.aws/hcls-aiml/en-US/chest-xrays-object-detection) has a nice section on building a model on medical imaging data.
-- AWS has a nice intro to Machine Learning in a SageMaker notebook that predicts breast cancer from features extracted from image data, which walks you through both image analysis and some of the ML functionality of SageMaker, the notebook is found [here](https://github.com/aws/amazon-sagemaker-examples/blob/main/introduction_to_applying_machine_learning/breast_cancer_prediction/Breast%20Cancer%20Prediction.ipynb).
- You can also view this [AWS blog](https://aws.amazon.com/blogs/machine-learning/annotate-dicom-images-and-build-an-ml-model-using-the-monai-framework-on-amazon-sagemaker/) on how to annotate DICOM images and build a custom AI model with the data.
- You can learn to deidentify medical images following this AWS [tutorial](https://aws.amazon.com/blogs/machine-learning/de-identify-medical-images-with-the-help-of-amazon-comprehend-medical-and-amazon-rekognition/).

@@ -68,7 +67,7 @@ Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis
NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information. The NCBI team has written a version of BLAST for the cloud called ElasticBLAST, and you can read all about it [here](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/index.html). Essentially, ElasticBLAST helps you submit BLAST jobs to AWS Batch and write the results back to S3. Feel free to experiment with the example tutorial in Cloud Shell, or try our [notebook version](/tutorials/notebooks/ElasticBLAST/run_elastic_blast.ipynb).
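As a sketch of the ElasticBLAST flow, assuming you have written a config file naming your BLAST program, database, queries, and an S3 results bucket (see the ElasticBLAST docs linked above for the exact fields):

```bash
elastic-blast submit --cfg elastic-blast.ini   # submit the search to AWS Batch
elastic-blast status --cfg elastic-blast.ini   # poll until the search finishes
elastic-blast delete --cfg elastic-blast.ini   # tear down cloud resources when done
```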

## **Protein Folding** <a name="af"></a>
-You can run several protein folding algorithms including Alpha Fold on AWS. Because the databases are so large, the setup is normally pretty difficult, but AWS has created a StackFormation stack that automates spinning up all the resources necessary for running Alpha Fold and other protein folding algorithms. You can read about the AWS resources [here](https://aws.amazon.com/solutions/guidance/protein-folding-on-aws/), and view the GitHub page [here](https://github.com/aws-solutions-library-samples/aws-batch-arch-for-protein-folding). To get this to work, you will need to modify your security groups following [these instructions](https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html). You will also likely have to [grant additional permissions to the Role](https://github.com/STRIDES/NIHCloudLabAWS/blob/kao_update_docs/docs/update_sagemaker_role.md) that CloudFormation is using. If you get stuck, reach out to [email protected]. You can also run ESMFold [using this tutorial](https://catalog.workshops.aws/hcls-aiml/en-US/protein-analysis/esmfold).
+You can run several protein folding algorithms, including AlphaFold, on AWS. Because the databases are so large, the setup is normally pretty difficult, but AWS has created a CloudFormation stack that automates spinning up all the resources necessary for running AlphaFold and other protein folding algorithms. You can read about the AWS resources [here](https://aws.amazon.com/solutions/guidance/protein-folding-on-aws/), and view the GitHub page [here](https://github.com/aws-solutions-library-samples/aws-batch-arch-for-protein-folding). To get this to work, you will need to modify your security groups following [these instructions](https://docs.aws.amazon.com/fsx/latest/LustreGuide/limit-access-security-groups.html). You will also likely have to [grant additional permissions to the Role](/docs/update_sagemaker_role.md) that CloudFormation is using. If you get stuck, reach out to [email protected]. You can also run ESMFold [using this tutorial](https://catalog.workshops.aws/hcls-aiml/en-US/protein-analysis/esmfold).
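If you prefer the CLI to the console, launching a CloudFormation stack generally looks like the sketch below; the template URL and stack name are placeholders, and the real launch template comes from the GitHub repository linked above.

```bash
# Launch the stack (placeholder template URL; CAPABILITY_IAM lets it create roles)
aws cloudformation create-stack \
  --stack-name protein-folding \
  --template-url https://example-bucket.s3.amazonaws.com/protein-folding.yaml \
  --capabilities CAPABILITY_IAM

# Watch provisioning progress
aws cloudformation describe-stacks \
  --stack-name protein-folding \
  --query "Stacks[0].StackStatus"
```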

## **Long Read Sequence Analysis** <a name="long"></a>
Long read DNA sequence analysis involves analyzing sequencing reads typically longer than 10,000 base pairs (bp), compared with short-read sequencing, where reads are about 150 bp in length.
@@ -77,7 +76,7 @@ Oxford Nanopore has a pretty complete offering of notebook tutorials for handling
## **Drug Discovery** <a name="atom"></a>
The [Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium](https://atomscience.org/) created a series of [Jupyter notebooks](https://github.com/ATOMScience-org/AMPL/tree/master/atomsci/ddm/examples/tutorials) that walk you through the ATOM approach to Drug Discovery.

-These notebooks were created to run in Google Colab, so if you run them in AWS, you will need to make a few modification. First, we recommend you use a [Sagemaker Studio Notebook](https://github.com/STRIDES/NIHCloudLabAWS/blob/kao_update_docs/README.md#launch-a-sagemaker-notebook-) rather than a User-Managed notebook simply because it will have Tensorflow and other dependencies installed. Be sure to attach a GPU to your instance (T4 is fine). Also, you will need to comment out `%tensorflow_version 2.x` since that is a Colab-specific command. You will also need to `pip install` a few packages as needed. If you get errors with `deepchem`, try running `pip install --pre deepchem[tensorflow]` and/or `pip install --pre deepchem[torch]`. Also, some notebooks will require a Tensorflow kernel, while others require Pytorch. You may also run into a Pandas error, reach out to the ATOM GitHub developers for the best solution to this issue.
+These notebooks were created to run in Google Colab, so if you run them in AWS, you will need to make a few modifications. First, we recommend you use a [SageMaker Studio Notebook](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-updated.html) rather than a User-Managed notebook simply because it will have TensorFlow and other dependencies installed. Be sure to attach a GPU to your instance (T4 is fine). Also, you will need to comment out `%tensorflow_version 2.x` since that is a Colab-specific command. You will also need to `pip install` a few packages as needed. If you get errors with `deepchem`, try running `pip install --pre deepchem[tensorflow]` and/or `pip install --pre deepchem[torch]`. Also, some notebooks will require a TensorFlow kernel, while others require PyTorch. You may also run into a Pandas error; reach out to the ATOM GitHub developers for the best solution to this issue.
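A minimal sketch of those modifications, taken from the advice above (exact package needs vary by notebook; run the pip lines in a SageMaker terminal, or prefix them with `!` in a notebook cell):

```bash
# Install deepchem extras if the notebook errors on import
pip install --pre 'deepchem[tensorflow]'
pip install --pre 'deepchem[torch]'

# Colab-only magics must be commented out or deleted in SageMaker, e.g.:
#   %tensorflow_version 2.x
```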

## **Artificial Intelligence** <a name="ai"></a>
Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Artificial intelligence and machine learning algorithms are being applied to a variety of biomedical research questions, ranging from image classification to genomic variant calling. AWS has a long list of AI/ML tutorials available, and we have compiled a selection here. Most recent development focuses on generative AI, including use cases such as extracting information from text, transforming speech to text, and generating images from text. SageMaker Studio allows the user to rapidly create, test, and train generative AI models, and has ready-to-use models contained in [JumpStart](https://docs.aws.amazon.com/sagemaker/latest/dg/studio-jumpstart.html). These include foundation models, fine-tunable models, and task-specific solutions.
6 changes: 4 additions & 2 deletions tutorials/notebooks/SpleenLiverSegmentation/README.md
@@ -1,14 +1,16 @@
# Spleen Segmentation with Liver Example using NVIDIA Models and MONAI
_We have put together a training example that segments the spleen in 3D CT images. At the end is an example of combining both the Spleen model and the Liver model._

+*NVIDIA has changed some of the models used in this tutorial, so it may crash. If you have issues, try commenting out the liver model; we are working on a patch.*

## Introduction
Two pre-trained models from NVIDIA are used in this training: a Spleen model and a Liver model.
The Spleen model is additionally retrained on the Medical Decathlon spleen dataset: [http://medicaldecathlon.com/](http://medicaldecathlon.com/)
You do not need to download data before running the notebook; it downloads the data during its run.
The notebook uses the Python package [MONAI](https://monai.io/), the Medical Open Network for Artificial Intelligence.

-- Spleen Model - [clara_pt_spleen_ct_segmentation_V2](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/med/models/clara_pt_spleen_ct_segmentation)
-- Liver Model - [clara_pt_liver_and_tumor_ct_segmentation_V1](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/med/models/clara_pt_liver_and_tumor_ct_segmentation)
+- Spleen Model - [clara_pt_spleen_ct_segmentation_V2](https://catalog.ngc.nvidia.com/orgs/nvidia/teams/monaitoolkit/models/monai_spleen_ct_segmentation)
+- Liver Model - [clara_pt_liver_and_tumor_ct_segmentation_V1]()

## Outcomes
After following along with this notebook, the user will be familiar with:
