From 46e8313536280d0e4986b5e6f598f87d487b3e6c Mon Sep 17 00:00:00 2001 From: Kyle O'Connell Date: Wed, 13 Mar 2024 10:43:41 -0400 Subject: [PATCH 1/5] reformatted readme status, rm tutorials dir, updated markdown check links, updated readme with refesh and link reformat --- .markdown-link-check.json | 4 +- README.md | 220 ++++++++++-------- envs/environment.yaml | 49 ---- .../ATACseq_Tutorial1_Preprocessing.ipynb | 0 .../ATACseq_Tutorial2_PeakDetection.ipynb | 0 .../ATACseq_Tutorial3_Downstream.ipynb | 0 .../DL-gwas-gcp-example/1-d10-run-first.ipynb | 0 .../2-mse-run-second-in-jupyter.ipynb | 0 .../DL-gwas-gcp-example/README.md | 0 .../assets/00-create-new-notebook1.png | Bin .../assets/000-enable-apis.png | Bin .../assets/001-marketplace.png | Bin .../assets/002-name-minikf.png | Bin .../assets/003-pick-machine.png | Bin .../assets/004-notebook.png | Bin .../assets/01-create-new-notebook2.png | Bin .../assets/01-r2-create-new-notebook2.png | Bin .../assets/01-r3-create-new-notebook2.png | Bin .../assets/02-create-new-notebook3.png | Bin .../assets/03-open-notebook.png | Bin .../assets/04-upload-notebook-and-data.png | Bin .../assets/05-enable-kale.png | Bin .../assets/06-pipeline-parameters.png | Bin .../assets/07-pipeline-parameters-katib.png | Bin .../assets/08-pipeline-step.png | Bin .../assets/09-pipeline-metrics.png | Bin .../assets/10-setup-katib.png | Bin .../assets/11-r2-setup-job.png | Bin .../assets/11-setup-job.png | Bin .../DL-gwas-gcp-example/assets/12-compare.png | Bin .../assets/13-parallel-coords.png | Bin .../assets/14-successful-katib-run.png | Bin .../assets/click-set-cell-kind.png | Bin .../assets/dl-gwas-headline-1.png | Bin .../assets/dl-gwas-headline-2.png | Bin .../assets/enable-compute-engine.png | Bin .../assets/enable-service-management.png | Bin .../assets/n2-standard-16.png | Bin .../assets/old-11-r2-setup-job.png | Bin .../assets/old-x002-final-results-page.png | Bin .../assets/run-minikf-startup.png | Bin .../assets/service-management-api.png | Bin .../assets/service-usage-api.png | Bin .../DL-gwas-gcp-example/assets/ssh-link.png | Bin .../assets/stop-instance.png | Bin .../assets/updated-pipeline-params.png | Bin .../assets/x001-new-notebook.png | Bin .../assets/x002-final-results-page.png | Bin .../assets/x003-git-clone.png | Bin .../assets/x004-launch-terminal.png | Bin .../DL-gwas-gcp-example/assets/x006-skip.png | Bin .../DL-gwas-gcp-example/assets/x007-wait.png | Bin .../assets/x008-restart-ok.png | Bin .../assets/x009-preprocessing.png | Bin .../assets/xx-0002-pick-a-run.png | Bin .../assets/xx0001-navigate-to-experiment.png | Bin .../assets/xx0003-pick-pipeline-step.png | Bin .../assets/xx0006-step-logs.png | Bin .../assets/xx0007-step-io.png | Bin .../nb_assets/x002-final-results-page.png | Bin .../GWASCoatColor/GWAS_coat_color.ipynb | 0 .../GenAI/GCP_GenAI_Huggingface.ipynb | 0 .../GenAI/GCP_Pubmed_chatbot.ipynb | 0 .../GenAI/Gemini_Intro.ipynb | 0 .../GenAI/VertexAIStudioGCP.ipynb | 0 ...example_langchain_chat_llama_2_zeroshot.py | 0 ...mple_vectorsearch_chat_llama_2_zeroshot.py | 0 .../GenAI/langchain_on_vertex.ipynb | 0 .../GoogleBatch/.gitkeep | 0 .../nextflow/Part1_GBatch_Nextflow.ipynb | 0 .../nextflow/Part2_GBatch_Nextflow.ipynb | 0 .../nextflow/Part1_LS_API_Nextflow.ipynb | 0 .../nextflow/Part2_LS_API_Nextflow.ipynb | 0 .../snakemake/LS_API_Snakemake.ipynb | 0 .../SRADownload/SRA-Download.ipynb | 0 .../SpleenLiverSegmentation/README.md | 0 .../SpleenSeg_Pretrained-4_27.ipynb | 0 .../Spleen_best_metric_model_pretrained.pth | Bin 
.../elasticBLAST/run_elastic_blast.ipynb | 0 .../ncbi-stat-tutorial/STAT-tutorial.ipynb | 0 .../pangolin/pangolin_pipeline.ipynb | 0 tutorials/README.md | 128 ---------- 82 files changed, 125 insertions(+), 276 deletions(-) delete mode 100644 envs/environment.yaml rename {tutorials/notebooks => notebooks}/ATACseq/ATACseq_Tutorial1_Preprocessing.ipynb (100%) rename {tutorials/notebooks => notebooks}/ATACseq/ATACseq_Tutorial2_PeakDetection.ipynb (100%) rename {tutorials/notebooks => notebooks}/ATACseq/ATACseq_Tutorial3_Downstream.ipynb (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/1-d10-run-first.ipynb (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/2-mse-run-second-in-jupyter.ipynb (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/README.md (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/00-create-new-notebook1.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/000-enable-apis.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/001-marketplace.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/002-name-minikf.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/003-pick-machine.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/004-notebook.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/01-create-new-notebook2.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/01-r2-create-new-notebook2.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/01-r3-create-new-notebook2.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/02-create-new-notebook3.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/03-open-notebook.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/04-upload-notebook-and-data.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/05-enable-kale.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/06-pipeline-parameters.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/07-pipeline-parameters-katib.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/08-pipeline-step.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/09-pipeline-metrics.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/10-setup-katib.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/11-r2-setup-job.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/11-setup-job.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/12-compare.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/13-parallel-coords.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/14-successful-katib-run.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/click-set-cell-kind.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/dl-gwas-headline-1.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/dl-gwas-headline-2.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/enable-compute-engine.png (100%) rename {tutorials/notebooks => 
notebooks}/DL-gwas-gcp-example/assets/enable-service-management.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/n2-standard-16.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/old-11-r2-setup-job.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/old-x002-final-results-page.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/run-minikf-startup.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/service-management-api.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/service-usage-api.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/ssh-link.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/stop-instance.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/updated-pipeline-params.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/x001-new-notebook.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/x002-final-results-page.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/x003-git-clone.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/x004-launch-terminal.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/x006-skip.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/x007-wait.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/x008-restart-ok.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/x009-preprocessing.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/xx-0002-pick-a-run.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/xx0001-navigate-to-experiment.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/xx0003-pick-pipeline-step.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/xx0006-step-logs.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/assets/xx0007-step-io.png (100%) rename {tutorials/notebooks => notebooks}/DL-gwas-gcp-example/nb_assets/x002-final-results-page.png (100%) rename {tutorials/notebooks => notebooks}/GWASCoatColor/GWAS_coat_color.ipynb (100%) rename {tutorials/notebooks => notebooks}/GenAI/GCP_GenAI_Huggingface.ipynb (100%) rename {tutorials/notebooks => notebooks}/GenAI/GCP_Pubmed_chatbot.ipynb (100%) rename {tutorials/notebooks => notebooks}/GenAI/Gemini_Intro.ipynb (100%) rename {tutorials/notebooks => notebooks}/GenAI/VertexAIStudioGCP.ipynb (100%) rename {tutorials/notebooks => notebooks}/GenAI/example_scripts/example_langchain_chat_llama_2_zeroshot.py (100%) rename {tutorials/notebooks => notebooks}/GenAI/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py (100%) rename {tutorials/notebooks => notebooks}/GenAI/langchain_on_vertex.ipynb (100%) rename {tutorials/notebooks => notebooks}/GoogleBatch/.gitkeep (100%) rename {tutorials/notebooks => notebooks}/GoogleBatch/nextflow/Part1_GBatch_Nextflow.ipynb (100%) rename {tutorials/notebooks => notebooks}/GoogleBatch/nextflow/Part2_GBatch_Nextflow.ipynb (100%) rename {tutorials/notebooks => notebooks}/LifeSciencesAPI/nextflow/Part1_LS_API_Nextflow.ipynb (100%) rename {tutorials/notebooks => notebooks}/LifeSciencesAPI/nextflow/Part2_LS_API_Nextflow.ipynb (100%) rename 
{tutorials/notebooks => notebooks}/LifeSciencesAPI/snakemake/LS_API_Snakemake.ipynb (100%) rename {tutorials/notebooks => notebooks}/SRADownload/SRA-Download.ipynb (100%) rename {tutorials/notebooks => notebooks}/SpleenLiverSegmentation/README.md (100%) rename {tutorials/notebooks => notebooks}/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb (100%) rename {tutorials/notebooks => notebooks}/SpleenLiverSegmentation/monai_data/Spleen_best_metric_model_pretrained.pth (100%) rename {tutorials/notebooks => notebooks}/elasticBLAST/run_elastic_blast.ipynb (100%) rename {tutorials/notebooks => notebooks}/ncbi-stat-tutorial/STAT-tutorial.ipynb (100%) rename {tutorials/notebooks => notebooks}/pangolin/pangolin_pipeline.ipynb (100%) delete mode 100644 tutorials/README.md diff --git a/.markdown-link-check.json b/.markdown-link-check.json index d8132d2..d217923 100644 --- a/.markdown-link-check.json +++ b/.markdown-link-check.json @@ -19,8 +19,8 @@ "replacement": "https://github.com/STRIDES/NIHCloudLabGCP/tree/main/docs" }, { - "pattern": "^/tutorials", - "replacement": "https://github.com/STRIDES/NIHCloudLabGCP/tree/main/tutorials" + "pattern": "^/notebooks", + "replacement": "https://github.com/STRIDES/NIHCloudLabGCP/tree/main/notebooks" }, { "pattern": "^/images", diff --git a/README.md b/README.md index 2519524..415ded9 100644 --- a/README.md +++ b/README.md @@ -1,102 +1,128 @@ +# GCP Tutorial Resources ->This repository falls under the NIH STRIDES Initiative. STRIDES aims to harness the power of the cloud to accelerate biomedical discoveries. To learn more, visit https://cloud.nih.gov. -> -# NIH Cloud Lab for GCP +_We have pulled together a variety of tutorials here from disparate sources. Some use Compute Engine, others use Vertex AI notebooks and others use only managed services. Tutorials are organized by research method, but we try to designate what GCP services are used as well to help you navigate._ --------------------------------- - -There are a lot of resources available to learn about GCP, which can be overwhelming. NIH Cloud Lab’s goal is to make cloud very easy and accessible for you, so that you can spend less time on administrative tasks and focus on your research. - -Use this repository to learn about how to use GCP by exploring the linked resources and walking through the tutorials. If you are a beginner, we suggest you begin with this Jumpstart section. If you already have foundational knowledge of GCP and cloud, feel free to skip ahead to the [tutorials](/tutorials/) section for in-depth examples of how to run specific workflows such as genomic variant calling and medical image analysis. - - ## Overview of Page Contents -+ [Getting Started](#gs) -+ [Overview](#ov) -+ [Command Line Tools](#cli) -+ [Google Cloud Marketplace](#mark) -+ [Ingest and Store Data](#sto) -+ [Virtual Machines](#vm) -+ [Disk Images](#im) -+ [Jupyter Notebooks](#jup) -+ [Creating Conda Environments](#co) -+ [Managing Containers](#dock) -+ [Serverless Functionality](#ser) -+ [Clusters](#clu) -+ [Billing and Benchmarking](#bb) -+ [Cost Optimization](#cost) -+ [Managing Your Code](#code) -+ [Getting Support](#sup) -+ [Additional Training](#tr) - -## **Getting Started** -You can learn a lot of what is possible on GCP in the [GCP Getting Started Page](https://cloud.google.com/getting-started). 
There you can find links to [documentation](https://cloud.google.com/docs) for common GCP tools and resources, and short videos on various subjects called [cloud minute](https://www.youtube.com/playlist?list=PLIivdWyY5sqIij_cgINUHZDMnGjVx3rxi). You can also view the following [Google Cloud Essentials Playlist](https://www.youtube.com/playlist?list=PLIivdWyY5sqKh1gDR0WpP9iIOY00IE0xL) or [Cloud Bytes Playlist]( https://www.youtube.com/playlist?list=PLIivdWyY5sqIQ4_5PwyyXZVdsXr3wYhip) from Google to help you get started. - -Even with a wealth of resources it can be difficult to know where to start on learning how to use the cloud. To help you, we thought through some of the most common tasks you will encounter doing cloud-enabled research and gathered tutorials and guides specific to those topics. We hope the following materials are helpful as you explore migrating your research to the cloud. Please feel free to submit [issues](/issues) or even contribute to the codebase directly. If you want additional resources on bioinformatics in GCP, check out this [other repository](https://github.com/lynnlangit/gcp-for-bioinformatics) by Lynn Langit. - -Before going any further, **make sure you can open your GCP project**. For Intramural NIH staff, follow [this guide](/docs/open_GCP_project.md) if you have any trouble. If you are an external NIH-affiliated researcher, follow [this guide](/docs/open_GCP_project.md). - -:boom: *If you are looking for resources on **Generative AI**, please go to our [tutorials section](/tutorials#artificial-intelligence-and-machine-learning-)*. :boom: - -## **Overview** -There are three primary ways you can run analyses using GCP: using virtual machines, Jupyter Notebook instances, and Managed services. We give a brief overview of each of these here and go into more detail in the sections below. [Virtual machines](https://cloud.google.com/compute) are like your desktop computers, but you access them through the cloud console and you get to decide what resources are on that computer such as CPU and memory. In GCP, the platform that hosts these virtual machines is called Compute Engine. You access VMs via SSH (secure remote connections), either through the console or via the command line. Jupyter Notebook instances are virtual machines with Jupyter Lab preloaded onto them. On GCP these are run through [Vertex AI](https://cloud.google.com/vertex-ai) or the new [Colab Enterprise](https://cloud.google.com/colab/docs/introduction). You decide what kind of virtual machine you want to 'spin up' and then you can run Jupyter notebooks on that virtual machine. You access these notebooks through the console similar to the way you interact with Jupyter locally. Finally, Managed Services are services allow you to run things, an analysis, an app, a website, and not have to deal with your own servers (VMs). There are still servers running somewhere, you just don't have to manage them. All you have to do is call a command that runs your analysis in the background and copies the output files to a storage bucket. The most common managed serverless feature you will work with here is the [Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest), although in the near future this functionality will migrate to [Google Batch](https://cloud.google.com/batch/docs/get-started). Typically, these workflows are run from the command line, either from a VM, Cloud Shell, or your local terminal. 
- -## **Command Line Tools** -One other task that will enable all that comes below is installing and configuring the GCP SDK command line tools, which will allow you to interact with instances or Google Storage buckets from your local terminal. Command line interface (CLI) tools are those that you use directly in a terminal/shell as opposed to clicking within a graphical user interface (UI). Instructions for installing the CLI can be found [here](https://cloud.google.com/sdk/docs/install). Along the same lines, it is important to familiarize yourself with the two main CLI commands: [gcloud](https://cloud.google.com/sdk/docs/cheatsheet) and [gsutil](https://cloud.google.com/storage/docs/quickstart-gsutil). There are also other commands you may come across in some circumstances like [kubectl](https://kubernetes.io/docs/reference/kubectl/overview/). If you have trouble installing the CLI on your local computer, you can still use the same commands from a virtual machine or from [Cloud Shell](https://cloud.google.com/shell), which is a terminal environment available to users on the GCP console. - -## **Google Cloud Marketplace** -The [GCP Marketplace](https://console.cloud.google.com/marketplace?_ga=2.218253598.688556330.1669914945-1153994297.1666892959) is a platform where you can search for and launch pre-configured solutions such as Machine Images. Examples of images you may launch would be those with [enhanced security](https://console.cloud.google.com/marketplace/product/cis-public/cis-centos-linux-7-level-1) or ones opimized for various tasks like [machine learning](https://console.cloud.google.com/marketplace/browse?ga=2.218253598.688556330.1669914945-1153994297.1666892959&q=machine%20learning), or [accelerated genomics](https://console.cloud.google.com/marketplace/product/nvidia-ngc-public/nvidia-clara-parabricks-new-plan). One of our tutorials showcases using a Marketplace solution called miniKF, which you can test out [here](/tutorials/notebooks/DL-gwas-gcp-example/). - -## **Ingest and Store Data using Google Cloud Storage** -Data can be stored in two places on the cloud, either in a cloud storage bucket, which on GCP is called Google Cloud Storage (GCS), or on an instance, which usually has Elastic Block Storage. In general, you want to keep your compute and storage separate, so you should aim to storage data in GCS for access, then only copy the data you need to a particular instance to run an analysis, then copy the results back to GCS. In addition, the data on an instance is only available when the instance is running, whereas the data in GCS is always available. [Here](https://cloud.google.com/storage/docs/quickstart-console) is a great tutorial on how to use GCS and is worth going through to learn how it all works. - -We also wanted to give you a few other tips that may be helpful when it comes to moving and storing data. If your end goal is to move data to a GCS bucket, you can do that using the user interface and clicking the `Upload Files` button from within a bucket, or you can use the command line by typing `gsutil cp `. Of course, you need to first create the bucket, which you can do using the instructions in the tutorial linked above, by typing `gsutil mb `. If you want to move a whole folder, then use the recursive flag: `gsutil cp -r `. The same applies whether moving data from your local directory or from a VM. Likewise, you can move data from GCS back to your local machine or your VM with `gsutil cp `. 
To multithread a gsutil action, use the `-m` flag, for example: `gsutil -m -r `. Finally, if you are trying to move data from the Short Read Archive (SRA) to an instance, or to GCS, you can use [fasterq_dump](https://github.com/glarue/fasterq_dump) from the SRA toolkit. The best way is probably to install on an instance, then copy the data to local storage on your instance, then optionally copy it to GCS for backup or use elsewhere. See our [notebook](/tutorials/notebooks/SRADownload/) for an example. - -Another important aspect of managing data storage costs is to be strategic about storing data in GCP vs. on your instances. When you have spun up a VM, you have already paid for the storage on the VM since you are paying for the size of the disk (storage space), whereas bucket storage is charged based on how much data you put in GCS. This is something to think about when copying results files back to GCS for example. If they are not files you will need later, then leave them on the VM's block storage and save your money on more important data to put into GCS. Just make sure you are always either backing up by creating a machine image (see below) or keeping data you can't live without in object-based cloud storage. And make sure you delete the instance when you are done with it, because storing old virtual machines can get expensive. - -## **Spin up a Virtual Machine and run a workflow** -Google and other sources have a lot of great resources on how to spin up and use a VM. The first place we will point you is to the [NIH Common Data Fund resource](https://training.nih-cfde.org/en/latest/Cloud-Platforms/Introduction-to-GCP/gcp2/), which lays out how to spin up a VM, SSH into it, make a bucket, and move data around similar to what we did in the example notebooks above. One thing worth noting is that the NIH tutorial has you SSH into your instance using a gcloud command in the shell. This is one way to SSH in, but it is a lot easier to just double click the SSH button next to the instance name on the Compute Instances page. You can find the GCP specific documentation on how to spin up an instance [here](https://cloud.google.com/compute/docs/instances/create-start-instance#console). If you want to start a Windows VM, read [the documentation](https://cloud.google.com/compute/docs/quickstart-windows). Windows VMs can have extra costs associated with them, so make sure you understand what you are getting into before starting. We encourage you to follow our [auto-shutdown instructions](/docs/compute-engine-idle-shutdown.md) to prevent leaving machines running. - -## **Disk Images** -Part of the power of virtual machines is that they offer a blank slate for you to configure as desired. However, sometimes you want to recycle data or installed programs for your next VM instead of having to reinvent the wheel. One solution to this issue is using disk (or machine) images, where you copy your existing virtual disk to a [Machine Image](https://cloud.google.com/compute/docs/machine-images) which can serve as a backup or can be used to launch a new instance with the programs and data from a previous instance. - -## **Launch a Jupyter Notebook** -Jupyter notebooks are web based interactive code platforms. On GCP, notebooks are launched through the Vertex AI platform. Here we are going to launch a JupyterLab environment on GCP, and then import a custom notebook from this repo to walk through running commands in Vertex AI. 
Vertex AI is Google's current approach to Machine Learning and Artificial Intelligence workflows. You can read more about a [Vertex AI Overview](https://cloud.google.com/vertex-ai) and [technical documentation and tutorials](https://cloud.google.com/vertex-ai/docs). - -To spin up a Notebook instance and import an example training notebook, follow our [guide here](/docs/vertexai.md). - -If you want to practice using the terminal or review BASH commands in Jupyter, look at [this module](https://github.com/NIGMS/IntroBioinformaticsDartmouth) from Dartmouth developed for the NIGMS Sandbox. - -You can also now use [Colab Enterprise](https://cloud.google.com/colab/docs/introduction) from within VertexAI, which allows you to run Colab notebooks within Google Cloud. - -## **Creating a Conda Environment** -Virtual environments allow you to manage package versions without having package conflicts. For example, if you needed Python 3 for one analysis, but Python 2.7 for another, you could create separate environments to use the two versions of Python. One of the most popular package managers used for creating virtual environments is the [conda package manager](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html#:~:text=A%20conda%20environment%20is%20a,NumPy%201.6%20for%20legacy%20testing). - -Mamba is a reimplementation of conda written in C++ and runs much faster than legacy conda. Conda environments are created using configuration files in yaml format (yaml is a type of configuration file), or by listing the programs to install after the initial conda command. To create a conda environment on a virtual machine or Jupyter instance, follow [this guide](/docs/create_conda_env.md). We walk you through generating a generic conda environment on a Virtual Machine, as well as how to create a custom kernel for a notebook instance. - -## **Managing Containers with Google Artifact Registry** -You can host containers within either the older Google Container Registry, or else in the newer Google Artifact Registry, which can host containers as well as other artifacts. We outline how to build a container, push to an Artifact Registry, and pull to a compute environment in our [docs](/docs/containers.md). - -## **Serverless Functionality** - -Serverless services are those that allow you to run things, an analysis, an app, a website etc., and not have to deal with servers (VMs). There are still servers running somewhere, you just don't have to manage them. All you have to do is call a command that runs your analysis in the background and copies the output files to a storage bucket. The most relevant serverless feature on GCP to Cloud Lab users (especially 'omics' analyses) is [Google Batch](https://cloud.google.com/batch/docs/get-started). You can walk through a tutorial of this service using this [notebook](/tutorials/notebooks/GoogleBatch/nextflow) Those doing health informatics should look into the [Google Cloud Healthcare Data Engine](https://cloud.google.com/healthcare). You can find a variety of other tutorials from the [NIGMS Sandbox](https://github.com/NIGMS/NIGMS-Sandbox) as well as this [Google tutorial](https://cloud.google.com/batch/docs/nextflow). - -## **Clusters** -One great thing about the cloud is its ability to scale with demand. 
When you submit a job to a traditional computing cluster (a set of computers that work together to execute a function), you have to specify up front how many CPUs and how much memory you want to allocate to your job, and you may over- or under-utilize these resources. Alternatively, on the cloud you can leverage a feature called autoscaling, where the compute resources will scale up or down with the demand. This is more efficient and keeps costs down when demand is low, but prevents latency when demand is high (e.g., a whole hackathon submitting jobs to a cluster). For most users of Cloud Lab, the best way to benefit from autoscaling is to use an API like the Life Sciences API or Google BATCH, but in some cases, maybe for a whole lab group or a large project, it may make sense to spin up a [Kubernetes cluster](https://cloud.google.com/kubernetes-engine/docs/quickstart) and submit jobs to the cluster using a workflow manager like [Snakemake](https://snakemake.readthedocs.io/en/stable/getting_started/migration.html#command-line-interface). [One of our tutorials](/tutorials/notebooks/DL-gwas-gcp-example) uses a marketplace solution to deploy a small Kubernetes cluster and then run AI models using Kubeflow. Finally, you can spin up SLURM clusters on GCP, which is easiest to do through the Cloud Shell or a VM, but can be accessed via SSH from your terminal. Instructions are [here](https://cloud.google.com/architecture/deploying-slurm-cluster-compute-engine). - -## **Billing and Benchmarking** -Many Cloud Lab users are interested in understanding how to estimate the price of a large-scale project using a reduced sample size. Generally, you should be able to benchmark with a few representative samples to get an idea of the time and cost required for a larger scale project. In terms of cost, a great way to estimate costs is to use the [GCP cost calculator](https://cloud.google.com/products/calculator#id=b64e9c4f-c637-432f-8e2c-4a7238dde0f2) for an initial figure. The calculator is a tool that estimates costs based on location, VM type/size, and runtime. Then, you can run some benchmarks and double check that everything is acting as you expect. For example, if you know that your analysis on your on-premises cluster takes 4 hours to run for a single sample with 12 CPUs, and that each sample needs about 30 GB of storage to run a workflow, then you can extrapolate to determine total costs using the calculator (e.g., compute engine + cloud storage). - -To get a more precise estimate, you can use assign labels to your workflows, then generate a report for a specific label. You can learn how to do that in our [docs](docs/How%20to%20Create%20Labels%20and%20GCP%20Billing%20Report.docx.pdf). Note that it can take up to 24 hours to update the billing account, so you may need to wait a few hours after running an analysis before you will have an accurate report. You can also watch this helpful [video](https://anvilproject.org/learn/anvil-mooc/cloud-costs) from the AnVIL project to learn more about Cloud Costs. - -## **Cost Optimization** -As you go through all the tutorials, you can keep costs down by stopping and/or deleting resources (e.g., VMs or Buckets) that you no longer need. Another strategy is to ensure that you are using all the compute resources you have provisioned. If you spin up a VM with 16 CPUs, you can see if they are all being utilized using [Cloud Monitoring](https://cloud.google.com/monitoring#section-1). 
If you are only really using 8 CPUs for example, then just change your machine size to fit the analysis. You can also play with [Spot](https://cloud.google.com/spot-vms) instances for running workflows and end up saving a lot of money. Finally, you can create [Budget Alerts](https://cloud.google.com/billing/docs/how-to/budgets) to help you track your budget. - -## **Managing Your Code with the Google Source Repository** -You if want an alternative to GitHub built directly into your cloud console, you can try the Google Cloud Source Repository. It uses all the normal git commands and will feel very familiar if you are used to using GitHub. If you want more background, or want to try it out, view our [guide](/docs/cloud_source_repository.md) in the docs. - -## **Getting Support** -As part of the NIH Cloud Lab sign-up process, you will be added to the Cloud Lab Teams channel. Feel free to message others in the group for support and our team will also chime in and help. For NIH users, you can also submit a support ticket to the NIH IT Service Desk. For all other users, you can reach out to the Cloud Lab email with questions at `CloudLab@nih.gov`. Please be sure that your tickets/emails have a clear subject line, such as "GCP help with the Life Sciences API". For issues that the NIH Cloud Lab Support Team is unable to resolve, you can reach out to GCP enterprise support directly by clicking the question mark in the top right part of the console and opening a support case. ++ [Biomedical Workflows on GCP](#bds) ++ [Artificial Intelligence and Machine Learning](#ml) ++ [Medical Imaging](#mi) ++ [Download SRA Data](#sra) ++ [Variant Calling](#vc) ++ [VCF Query](#vcf) ++ [GWAS](#gwas) ++ [Proteomics](#pro) ++ [RNAseq and Transcriptome Assembly](#rna) ++ [scRNAseq](#sc) ++ [ATACseq and scATACseq](#atac) ++ [Methylseq](#ms) ++ [Metagenomics](#meta) ++ [Multiomics and Biomarker Analysis](#mo) ++ [BLAST+](#bl) ++ [Long Read Sequencing Analysis](#long) ++ [Drug Discovery](#atom) ++ [Using Google Batch](#gbatch) ++ [Using the Life Sciences API (deprecated)](#lsapi) ++ [Public Data Sets](#pub) + +## **Biomedical Workflows on GCP** +There are a lot of ways to run workflows on GCP. Here we list a few possibilities, each of which may work for different research aims. As you walk through the various tutorials below, think about how you might run each workflow more efficiently using one of the other methods listed here. + +- The simplest method is probably to spin up a Compute Engine instance and run your command interactively, within `screen`, or as a [startup script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux) attached as metadata (see the sketch after this list). +- You could also run your pipeline via a Vertex AI notebook, either by splitting out each command into a different cell, or by running a workflow manager (Nextflow etc.). [Schedule notebooks](https://codelabs.developers.google.com/vertex_notebook_executor#0) to let them run longer. +You can find a nice tutorial for using managed notebooks [here](https://codelabs.developers.google.com/vertex_notebook_executor#0). Note that there is now a difference between `managed notebooks` and `user managed notebooks`. The `managed notebooks` have more features and can be scheduled, but give you less control over conda environments/installs. +- You can interact with [Google Batch](https://cloud.google.com/batch/docs/get-started) or the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest) using a workflow manager like [Nextflow](https://cloud.google.com/life-sciences/docs/tutorials/Nextflow), [Snakemake](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/googlebatch.html), or [Cromwell](https://github.com/GoogleCloudPlatform/rad-lab/tree/main/modules/genomics_cromwell). We currently have example notebooks for [Nextflow and Snakemake with the Life Sciences API](/notebooks/LifeSciencesAPI/) and for [Google Batch with Nextflow](/notebooks/GoogleBatch/Nextflow), as well as a [local version of Snakemake run via Pangolin](/notebooks/pangolin). +- You may find other APIs better suit your needs, such as the [Google Cloud Healthcare Data Engine](https://cloud.google.com/healthcare). +- Most of the notebooks below require just a few CPUs. Start small (maybe 4 CPUs), then scale up as needed. Likewise, when you need a GPU, start with a smaller or older generation GPU (e.g. T4) for testing, then switch to a newer GPU (A100/V100) once you know things will work or you need more compute power.
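+For the first option, here is a minimal sketch of launching a VM with a startup script attached as metadata; the instance name, bucket, and tool are illustrative placeholders, not values from this repo's tutorials.
+```bash
+# Hypothetical example: write a startup script that fetches a file from GCS,
+# runs FastQC on it, and copies the results back to the bucket.
+cat > startup.sh <<'EOF'
+#!/bin/bash
+apt-get update && apt-get install -y fastqc
+gsutil cp gs://my-bucket/sample.fastq /tmp/          # my-bucket is a placeholder
+fastqc /tmp/sample.fastq -o /tmp
+gsutil cp /tmp/sample_fastqc* gs://my-bucket/results/
+EOF
+
+# Create the VM; the script runs automatically at first boot.
+gcloud compute instances create my-workflow-vm \
+    --zone=us-central1-a \
+    --machine-type=e2-standard-4 \
+    --image-family=debian-12 --image-project=debian-cloud \
+    --metadata-from-file=startup-script=startup.sh
+```
+Remember to delete the instance once the job finishes, for all the cost reasons discussed elsewhere in this repo.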
+ +## **Artificial Intelligence and Machine Learning** +Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Machine learning on GCP generally occurs within Vertex AI. You can learn more about machine learning on GCP at this [Google Crash Course](https://developers.google.com/machine-learning/crash-course). For hands-on examples, try out [this module](https://github.com/NIGMS/COVIDMachineLearningSFSU) developed by San Francisco State University or [this one from the University of Arkansas](https://github.com/NIGMS/MachineLearningUA) developed for the NIGMS Sandbox Project. + +Now that the age of **Generative AI** (Gen AI) has arrived, Google has released a host of Gen AI offerings within the Vertex AI suite. Generative AI models can, for example, extract key information from text, transform speech into text, generate images from descriptions (and vice versa), and much more. The [Vertex AI Studio](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/generative-ai-studio) console allows you to rapidly create, test, and train generative AI models on the cloud in a safe and secure setting; see our overview in [this tutorial](/notebooks/GenAI/VertexAIStudioGCP.ipynb). The studio also has ready-to-use models, all contained within the [Model Garden](https://cloud.google.com/vertex-ai/docs/start/explore-models). These range from foundation models to fine-tunable models to task-specific solutions. +- To learn more about Gen AI on GCP, take a look at our [GenAI tutorials](/notebooks/GenAI). They cover several GCP products, such as [Gemini](/notebooks/GenAI/Gemini_Intro.ipynb) and [Vector Search](/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb), and other tools like [Langchain](/notebooks/GenAI/langchain_on_vertex.ipynb) and [Huggingface](/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb), showing how to deploy, train, and prompt GenAI models and implement techniques like [Retrieval-Augmented Generation (RAG)](/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb). +- Google also provides many generative AI tutorials hosted on [GitHub](https://github.com/GoogleCloudPlatform/generative-ai/tree/main); some examples they provide are under [language here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/language).
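+Beyond the console, these models can also be called programmatically. A minimal sketch of hitting the Vertex AI REST endpoint from the command line might look like the following; the project ID and model name are assumptions, so check the Model Garden for current model IDs.
+```bash
+PROJECT_ID="my-project"    # placeholder project
+REGION="us-central1"
+MODEL="gemini-1.0-pro"     # check the Model Garden for current model names
+
+# Send one user prompt to the generateContent endpoint.
+curl -X POST \
+  -H "Authorization: Bearer $(gcloud auth print-access-token)" \
+  -H "Content-Type: application/json" \
+  "https://${REGION}-aiplatform.googleapis.com/v1/projects/${PROJECT_ID}/locations/${REGION}/publishers/google/models/${MODEL}:generateContent" \
+  -d '{"contents": [{"role": "user", "parts": [{"text": "Summarize what ATACseq measures in two sentences."}]}]}'
+```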
+ +## **Medical Image Segmentation** +Medical image analysis is the application of computational algorithms and techniques to extract meaningful information from medical images for diagnosis, treatment planning, and research purposes. It requires large image files and often benefits from elastic storage and accelerated computing. +- Most medical imaging analyses are done in notebooks, so we would recommend downloading the Jupyter Notebook from [here](/notebooks/SpleenLiverSegmentation) and then importing or cloning it into Vertex AI. The tutorial walks through image segmentation using the MONAI framework. +- You can also request early access to the new [Google Medical Imaging Suite](https://cloud.google.com/medical-imaging) to see if it would fit your use case. + +## **Download Data From the Sequence Read Archive (SRA)** +Next-generation sequencing data are housed in the NCBI Sequence Read Archive (SRA). You can access these data using the SRA Toolkit. We walk you through this using [this notebook](/notebooks/SRADownload), including how to use BigQuery to generate your list of accessions. You can also use BigQuery to create a list of accessions for download using [this setup guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery/) and this [query guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery-examples/). Additional example notebooks can be found at this [NCBI repo](https://github.com/ncbi/ASHG-Workshop-2021). In particular, we recommend [this notebook](https://github.com/ncbi/ASHG-Workshop-2021/blob/main/1_Basic_BigQuery_Examples.ipynb), which goes into more detail on using BigQuery to access the results of the SRA Taxonomic Analysis Tool; these often differ from the user-input species name due to contamination, error, or samples being metagenomic in nature. Further, [this notebook](https://github.com/ncbi/ASHG-Workshop-2021/blob/main/2_Array_Examples.ipynb) does a deep dive on parsing the BigQuery results and may give you some good ideas on how to search for samples from SRA. The SRA metadata and taxonomy analyses are in separate BigQuery tables; you can learn how to join those two tables using SQL from [this PowerPoint](https://github.com/NCBI-Codeathons/NCGAS-cloud-workshop/blob/main/5_BigQuery.pptx) or from our tutorial [here](/notebooks/ncbi-stat-tutorial/). Finally, NCBI released [this workshop](https://github.com/ncbi/workshop-asm-ngs-2022/wiki) that walks through a wide variety of BigQuery applications with NCBI datasets. + +## **Variant Calling** +Genomic variant calling is the process of identifying and characterizing genetic variations from DNA sequencing data to understand differences in an individual's genetic makeup. +- This [Google tutorial](https://cloud.google.com/life-sciences/docs/tutorials/gatk) shows you how to run the GATK Best Practices pipeline for genomic variant calling using the Life Sciences API. There is a section about increasing your account quotas; you can skip that. You could also run GATK using any of the workflow managers and submit to the Life Sciences API. +- One tutorial specific to somatic variant calling comes from the Sheffield Bioinformatics Core [here](https://sbc.shef.ac.uk/somatic-variants/index.nb.html). It runs on Galaxy, but can be adapted to run in GCP. At the very least, the [data](https://drive.google.com/drive/folders/1RhrmfW3vMhPwAiBGdFIKfINWMsdvIG6E) may prove useful to you.
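+If you just want to test a single variant-calling step interactively on a VM or notebook terminal before scaling out, a minimal sketch might look like this; the reference and BAM file names are placeholders, and we assume conda/mamba is available for the install.
+```bash
+# Install GATK4 from the bioconda channel (assumes conda/mamba is set up).
+mamba install -c bioconda gatk4
+
+# Call variants on one sample in GVCF mode; inputs are placeholders.
+gatk HaplotypeCaller \
+    -R reference.fasta \
+    -I sample.sorted.bam \
+    -O sample.g.vcf.gz \
+    -ERC GVCF
+```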
+ +## **Query a VCF file in Big Query** +The output of genomic variant calling workflows is a file in the variant call format (VCF). These are often large, structured data files that can be searched using database query tools such as Big Query. +- Learn how to use Big Query to run queries against large VCF files from gnomAD data using [this notebook](https://github.com/GoogleCloudPlatform/rad-lab/blob/main/modules/data_science/scripts/build/notebooks/Exploring_gnomad_on_BigQuery.ipynb). If any cells give you errors, try running that cell again; there seems to be some lag time between cells. + +## **Genome Wide Association Studies** +Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes. +- This [NIH CFDE written tutorial](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud) walks you through running a simple GWAS on AWS; we have rewritten it as a notebook that works on GCP [here](/notebooks/GWASCoatColor). Make sure you select R as your kernel when you spin up your notebook so that you can switch between R and Python (this only applies to 'User Managed Notebooks'). Note that our team experienced conda permission issues with the new Managed Notebooks for this tutorial, so we recommend using 'User Managed Notebooks'. Also, if the imported notebook has cells already printed out, just go to Kernel > Restart Kernel and Clear all Outputs. +- [This tutorial](https://github.com/david-thrower-nih/DL-gwas-gcp-example) from NIH NIEHS (credit to David Thrower) builds on a published deep learning method for GWAS of soybeans and uses Kubeflow and AutoML on a Kubernetes cluster.
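+At its core, a simple GWAS like the coat-color example reduces to an association test across variants; here is a minimal command-line sketch with PLINK (not taken from the notebooks above; the fileset name is a placeholder).
+```bash
+# Install PLINK from the bioconda channel (assumes conda/mamba is set up).
+mamba install -c bioconda plink
+
+# Basic case/control association test on a PLINK binary fileset
+# (mydata.bed/.bim/.fam are placeholders).
+plink --bfile mydata --assoc --allow-no-sex --out gwas_results
+
+# Inspect the most significant hits (P is column 9 of the .assoc output).
+tail -n +2 gwas_results.assoc | sort -gk9 | head
+```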
+ +## **Proteomics** +Proteomics is the study of the entire set of proteins in a cell, tissue, or organism, aiming to understand their structure, function, and interactions to uncover insights into biological processes and diseases. Although most primary proteomic analyses occur in proprietary software platforms, a lot of secondary analysis happens in Jupyter or R notebooks; we give several examples here: +- Use Big Query to run a Kruskal-Wallis test on proteomics data using [these notebooks](https://github.com/isb-cgc/Community-Notebooks/tree/master/FeaturedNotebooks). Clone the repo into Vertex AI, or just drag the notebooks into a Vertex AI Workbench instance. In the notebook titled 'ACM_BCB_2020_POSTER_KruskalWallisTest_ProteinGeneExpression_vs_ClinicalFeatures.ipynb', the first BigQuery cell may throw an error; ignore this and keep going, and the rest of the notebook should run fine. Also, in that first cell, make sure you add your Project ID. See this [doc](/docs/protein_setup.md) for environment setup instructions. +- Run AlphaFold in Vertex AI using [this notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/alphafold_on_workbench/AlphaFold.ipynb). Make sure you have a GPU for your notebook instance, and follow [these instructions](https://cloud.google.com/blog/products/ai-machine-learning/running-alphafold-on-vertexai) for setting up your environment. Namely, under Environment, select `Custom container`, and then for `Docker container image` paste in the following: `west1-docker.pkg.dev/cloud-devrel-public-resources/alphafold/alphafold-on-gcp:latest`. +- Conduct secondary analysis of proteomic data using this [NIGMS Sandbox notebook](https://github.com/NIGMS/ProteomicsUAMS), developed by the University of Arkansas for Medical Sciences. + +## **RNAseq and Transcriptome Assembly** +RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks. +- You can run this [Nextflow tutorial](https://nf-co.re/rnaseq/3.7) for RNAseq in a variety of ways on GCP. Following the instructions outlined above, you could use Compute Engine, the [Life Sciences API](https://cloud.google.com/life-sciences/docs/tutorials/Nextflow), or a Vertex AI notebook. +- For a notebook version of a complete RNAseq pipeline from FASTQ to Salmon quantification, go through these tutorials from the [NIGMS Sandbox Project](https://github.com/NIGMS/RNAseqUM) developed by The University of Maine. +- Likewise, [this multi-omics module](https://github.com/NIGMS/MultiomicsUND) from the University of North Dakota includes an RNAseq component. + +Transcriptome assembly is the process of reconstructing the complete set of RNA transcripts in a cell or tissue from fragmented sequencing data, providing valuable insights into gene expression and functional analysis. +- [This module](https://github.com/NIGMS/rnaAssemblyMDI) developed by the MDI Biological Laboratory for the NIGMS Sandbox Project walks you through transcriptome assembly using Nextflow. + +## **Single Cell RNAseq** +Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems. +- This [NVIDIA blog](https://developer.nvidia.com/blog/accelerating-single-cell-genomic-analysis-using-rapids/) details how to run an accelerated scRNAseq pipeline using RAPIDS. You can find a link to the GitHub repository that has lots of example notebooks [here](https://github.com/clara-parabricks/rapids-single-cell-examples). For each example use case they show some nice benchmarking data with time and cost for each machine type. You will see that most runs cost less than $1.00 with GPU machines. Pay careful attention to the environment setup, as there are a lot of dependencies for these notebooks. +- The [Scanpy tutorials](https://scanpy.readthedocs.io/en/stable/tutorials.html) page has a lot of good CPU-based examples you could run in Vertex AI. Clone this [GitHub repo](https://github.com/scverse/scanpy-tutorials) to get the notebooks directly. +- Alternatively, here is a [GitHub repository](https://github.com/mdozmorov/scRNA-seq_notes) with a curated list of scRNAseq resources and tutorials. We did not test these in Cloud Lab, but wanted to make them available in case you need additional resources. + +## **ATACseq and Single Cell ATACseq** +ATAC-seq is a technique that allows scientists to understand how DNA is packaged in cells by identifying the regions of DNA that are accessible and potentially involved in gene regulation. +- [This module](https://github.com/NIGMS/atacseqUNMC) walks you through an ATACseq and single-cell ATACseq workflow on Google Cloud. The module was developed by the University of Nebraska Medical Center for the NIGMS Sandbox Project.
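+For orientation, the peak-calling step at the heart of most ATACseq workflows boils down to a single command; here is a minimal sketch with MACS2 (not taken from the module above; file names are placeholders).
+```bash
+# Install MACS2 from the bioconda channel (assumes conda/mamba is set up).
+mamba install -c bioconda macs2
+
+# Call peaks on an aligned, filtered BAM; '-g hs' sets the human
+# effective genome size, and '-n sample' prefixes the output files.
+macs2 callpeak -t sample.filtered.bam -f BAM -g hs \
+    -n sample --outdir peaks/
+```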
+ +## **Methylseq** +As one of the most abundant and well-studied epigenetic modifications, DNA methylation plays an essential role in normal cell development and has various effects on transcription, genome stability, and DNA packaging within cells. Methylseq is a technique to identify methylated regions of the genome. +- The University of Hawai'i at Manoa developed [this set of notebooks](https://github.com/NIGMS/MethylSeqUH) that walk you through a Methylseq analysis as part of the NIGMS Sandbox Program. + +## **Metagenomics** +Metagenomics is the study of genetic material collected directly from environmental samples, enabling the exploration of microbial communities, their diversity, and their functional potential, without the need for laboratory culturing. +- [This module](https://github.com/NIGMS/MetagenomicsUSD) walks you through conducting a metagenomic analysis using the command line and Nextflow. The module was developed by the University of South Dakota as part of the NIGMS Sandbox Project. + +## **Multiomic Analysis and Biomarker Discovery** +Multiomic analysis involves integrating data across modalities (e.g., genomic, transcriptomic, phenotypic) to generate additive insights. +- [This set of notebooks](https://github.com/NIGMS/MultiomicsUND) gives you an example of conducting multiomic analysis in Jupyter notebooks and was developed by the University of North Dakota as part of the NIGMS Sandbox Project. + +Biomarker discovery is the process of identifying specific molecules or characteristics that can serve as indicators of biological processes, diseases, or treatment responses, aiding in diagnosis, prognosis, and personalized medicine. It is typically conducted through comprehensive analysis of various types of data, such as genomic, proteomic, metabolomic, and clinical data, using advanced techniques including high-throughput screening, bioinformatics, and statistical analysis to identify patterns or signatures that differentiate between healthy and diseased individuals, or responders and non-responders to specific treatments. +- [This module](https://github.com/NIGMS/BiomarkersURI), developed by the University of Rhode Island for the NIGMS Sandbox Project, walks you through conducting some common biomarker discovery analyses in R. + +## **BLAST+** +NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information. +- This [Common Data Fund](https://training.nih-cfde.org/en/latest/Cloud-Platforms/Introduction-to-GCP/gcp3/) tutorial explains how to use basic BLAST on GCP. +- We also rewrote [this ElasticBLAST tutorial](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/quickstart-gcp.html) as a [notebook](/notebooks/elasticBLAST) that will work in Vertex AI.
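+Before scaling up with ElasticBLAST, it can help to sanity-check a query with a plain BLAST+ call; here is a minimal sketch (the query file is a placeholder, and `-remote` sends the search to NCBI's servers instead of a local database).
+```bash
+# Install BLAST+ from the bioconda channel (assumes conda/mamba is set up).
+mamba install -c bioconda blast
+
+# Search a nucleotide query against NCBI's nt database remotely,
+# writing tabular output (-outfmt 6).
+blastn -query my_sequences.fasta -db nt -remote \
+    -outfmt 6 -out blast_results.tsv
+```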
+ +## **Long Read Sequence Analysis** +Long read DNA sequence analysis involves analyzing sequencing reads that are typically longer than 10,000 base pairs (bp), compared with short-read sequencing, where reads are about 150 bp. Oxford Nanopore has a pretty complete offering of notebook tutorials for handling long read data to do a variety of things, including variant calling, RNAseq, SARS-CoV-2 analysis, and much more. You can find a list and description of notebooks [here](https://labs.epi2me.io/nbindex/), or clone the [GitHub repo](https://github.com/epi2me-labs). Note that these notebooks expect you to be running locally and accessing the EPI2ME notebook server. To run them in Cloud Lab, skip the first cell that connects to the server; the rest of the notebook should then run correctly, with a few tweaks. + +## **Drug Discovery** +The [Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium](https://atomscience.org/) created a series of [Jupyter notebooks](https://github.com/ATOMScience-org/AMPL/tree/master/atomsci/ddm/examples/tutorials) that walk you through the ATOM approach to Drug Discovery. + +These notebooks were created to run in Google Colab, so if you run them in Google Cloud, you will need to make a few modifications. First, we recommend you use a [Google Managed Notebook](https://cloud.google.com/vertex-ai/docs/workbench/managed/introduction) rather than a User-Managed notebook, simply because the Google Managed notebooks already have TensorFlow and other dependencies installed. Be sure to attach a GPU to your instance (T4 is fine). Also, you will need to comment out `%tensorflow_version 2.x` since that is a Colab-specific command. You will also need to `pip install` a few packages as needed. If you get errors with `deepchem`, try running `pip install --pre deepchem[tensorflow]` and/or `pip install --pre deepchem[torch]`. Also, some notebooks will require a TensorFlow kernel, while others require PyTorch. You may also run into a Pandas error; reach out to the ATOM GitHub developers for the best solution to this issue. + +## **Using Google Batch** +You can interact with Google Batch directly to submit commands, or more commonly you can interact with it through orchestration engines like [Nextflow](https://www.Nextflow.io/docs/latest/google.html) and [Cromwell](https://cromwell.readthedocs.io/en/latest/backends/GCPBatch/). We have tutorials that utilize Google Batch using [Nextflow](/notebooks/GoogleBatch/Nextflow), where we run the nf-core Methylseq pipeline, as well as several from the NIGMS Sandbox, including [transcriptome assembly](https://github.com/NIGMS/rnaAssemblyMDI), [multiomics](https://github.com/NIGMS/MultiomicsUND), [methylseq](https://github.com/NIGMS/MethylSeqUH), and [metagenomics](https://github.com/NIGMS/MetagenomicsUSD). + +## **Using the Life Sciences API (deprecated)** +__The Life Sciences API is deprecated on GCP and will no longer be available on the platform after July 8, 2025;__ we recommend using Google Batch instead. For now you can still interact with the Life Sciences API directly to submit commands, or more commonly through orchestration engines like [Snakemake](https://snakemake.readthedocs.io/en/v7.0.0/executor_tutorial/google_lifesciences.html), which as of now only supports the Life Sciences API.
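+Since Google Batch is the recommended path forward, here is a minimal sketch of what the Nextflow integration described above can look like; the project ID, region, and bucket are placeholders, and the full set of options is in the Nextflow Google documentation linked above.
+```bash
+# Minimal nextflow.config pointing the google-batch executor at your project.
+cat > nextflow.config <<'EOF'
+process.executor = 'google-batch'
+google.project   = 'my-project'     // placeholder project ID
+google.location  = 'us-central1'
+workDir          = 'gs://my-bucket/work'
+EOF
+
+# Launch an nf-core pipeline with its built-in small test profile.
+nextflow run nf-core/methylseq -profile test --outdir gs://my-bucket/results
+```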
-## **Additional Training** -This repo only scratches the surface of what can be done in the cloud. If you are interested in additional cloud training opportunities. Please visit the [STRIDES Training page](https://cloud.nih.gov/training/). For more information on the STRIDES Initiative at the NIH, visit [our website](https://cloud.nih.gov/) or contact the NIH STRIDES team at STRIDES@nih.gov for more information. +## **Public Data Sets** +Google has a lot of public datasets available that you can use for your testing. These can be viewed [here](https://cloud.google.com/life-sciences/docs/resources/public-datasets) and can be accessed via [BigQuery](https://cloud.google.com/bigquery/public-data) or directly from the cloud bucket. For example, to view the 1000 Genomes Phase 3 data at the command line, type `gsutil ls gs://genomics-public-data/1000-genomes-phase-3`. diff --git a/envs/environment.yaml b/envs/environment.yaml deleted file mode 100644 index 286d891..0000000 --- a/envs/environment.yaml +++ /dev/null @@ -1,49 +0,0 @@ -name: genomics -channels: - - bioconda - - anaconda - - conda-forge - - defaults - - bih-cubi - - dranew - - hcc - -dependencies: - - bwa - - bowtie2 - - star - - spades - - velvet - - trinity - - samtools - - trimmomatic - - trim-galore - - cutadapt - - picard - - fastqc - - multiqc - - gatk4 - - muse - - lofreq - - somatic-sniper - - openjdk - - rtg-tools - - salmon - - pigz - - vcftools - - snakemake - - snakemake-wrapper-utils - - nextflow - - wdltool - - cwltool - - cromwell - - shapeit - - impute2 - - gemma - - plink2 - - asset - - admixture - - rfmix - - blast - - scanpy - - sra-tools diff --git a/tutorials/notebooks/ATACseq/ATACseq_Tutorial1_Preprocessing.ipynb b/notebooks/ATACseq/ATACseq_Tutorial1_Preprocessing.ipynb similarity index 100% rename from tutorials/notebooks/ATACseq/ATACseq_Tutorial1_Preprocessing.ipynb rename to notebooks/ATACseq/ATACseq_Tutorial1_Preprocessing.ipynb diff --git a/tutorials/notebooks/ATACseq/ATACseq_Tutorial2_PeakDetection.ipynb b/notebooks/ATACseq/ATACseq_Tutorial2_PeakDetection.ipynb similarity index 100% rename from tutorials/notebooks/ATACseq/ATACseq_Tutorial2_PeakDetection.ipynb rename to notebooks/ATACseq/ATACseq_Tutorial2_PeakDetection.ipynb diff --git a/tutorials/notebooks/ATACseq/ATACseq_Tutorial3_Downstream.ipynb b/notebooks/ATACseq/ATACseq_Tutorial3_Downstream.ipynb similarity index 100% rename from tutorials/notebooks/ATACseq/ATACseq_Tutorial3_Downstream.ipynb rename to notebooks/ATACseq/ATACseq_Tutorial3_Downstream.ipynb diff --git a/tutorials/notebooks/DL-gwas-gcp-example/1-d10-run-first.ipynb b/notebooks/DL-gwas-gcp-example/1-d10-run-first.ipynb similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/1-d10-run-first.ipynb rename to notebooks/DL-gwas-gcp-example/1-d10-run-first.ipynb diff --git a/tutorials/notebooks/DL-gwas-gcp-example/2-mse-run-second-in-jupyter.ipynb b/notebooks/DL-gwas-gcp-example/2-mse-run-second-in-jupyter.ipynb similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/2-mse-run-second-in-jupyter.ipynb rename to notebooks/DL-gwas-gcp-example/2-mse-run-second-in-jupyter.ipynb diff --git a/tutorials/notebooks/DL-gwas-gcp-example/README.md b/notebooks/DL-gwas-gcp-example/README.md similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/README.md rename to notebooks/DL-gwas-gcp-example/README.md diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/00-create-new-notebook1.png b/notebooks/DL-gwas-gcp-example/assets/00-create-new-notebook1.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/00-create-new-notebook1.png rename to notebooks/DL-gwas-gcp-example/assets/00-create-new-notebook1.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/000-enable-apis.png b/notebooks/DL-gwas-gcp-example/assets/000-enable-apis.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/000-enable-apis.png rename to notebooks/DL-gwas-gcp-example/assets/000-enable-apis.png diff --git
a/tutorials/notebooks/DL-gwas-gcp-example/assets/001-marketplace.png b/notebooks/DL-gwas-gcp-example/assets/001-marketplace.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/001-marketplace.png rename to notebooks/DL-gwas-gcp-example/assets/001-marketplace.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/002-name-minikf.png b/notebooks/DL-gwas-gcp-example/assets/002-name-minikf.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/002-name-minikf.png rename to notebooks/DL-gwas-gcp-example/assets/002-name-minikf.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/003-pick-machine.png b/notebooks/DL-gwas-gcp-example/assets/003-pick-machine.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/003-pick-machine.png rename to notebooks/DL-gwas-gcp-example/assets/003-pick-machine.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/004-notebook.png b/notebooks/DL-gwas-gcp-example/assets/004-notebook.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/004-notebook.png rename to notebooks/DL-gwas-gcp-example/assets/004-notebook.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/01-create-new-notebook2.png b/notebooks/DL-gwas-gcp-example/assets/01-create-new-notebook2.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/01-create-new-notebook2.png rename to notebooks/DL-gwas-gcp-example/assets/01-create-new-notebook2.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/01-r2-create-new-notebook2.png b/notebooks/DL-gwas-gcp-example/assets/01-r2-create-new-notebook2.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/01-r2-create-new-notebook2.png rename to notebooks/DL-gwas-gcp-example/assets/01-r2-create-new-notebook2.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/01-r3-create-new-notebook2.png b/notebooks/DL-gwas-gcp-example/assets/01-r3-create-new-notebook2.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/01-r3-create-new-notebook2.png rename to notebooks/DL-gwas-gcp-example/assets/01-r3-create-new-notebook2.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/02-create-new-notebook3.png b/notebooks/DL-gwas-gcp-example/assets/02-create-new-notebook3.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/02-create-new-notebook3.png rename to notebooks/DL-gwas-gcp-example/assets/02-create-new-notebook3.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/03-open-notebook.png b/notebooks/DL-gwas-gcp-example/assets/03-open-notebook.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/03-open-notebook.png rename to notebooks/DL-gwas-gcp-example/assets/03-open-notebook.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/04-upload-notebook-and-data.png b/notebooks/DL-gwas-gcp-example/assets/04-upload-notebook-and-data.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/04-upload-notebook-and-data.png rename to notebooks/DL-gwas-gcp-example/assets/04-upload-notebook-and-data.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/05-enable-kale.png b/notebooks/DL-gwas-gcp-example/assets/05-enable-kale.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/05-enable-kale.png rename to 
notebooks/DL-gwas-gcp-example/assets/05-enable-kale.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/06-pipeline-parameters.png b/notebooks/DL-gwas-gcp-example/assets/06-pipeline-parameters.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/06-pipeline-parameters.png rename to notebooks/DL-gwas-gcp-example/assets/06-pipeline-parameters.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/07-pipeline-parameters-katib.png b/notebooks/DL-gwas-gcp-example/assets/07-pipeline-parameters-katib.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/07-pipeline-parameters-katib.png rename to notebooks/DL-gwas-gcp-example/assets/07-pipeline-parameters-katib.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/08-pipeline-step.png b/notebooks/DL-gwas-gcp-example/assets/08-pipeline-step.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/08-pipeline-step.png rename to notebooks/DL-gwas-gcp-example/assets/08-pipeline-step.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/09-pipeline-metrics.png b/notebooks/DL-gwas-gcp-example/assets/09-pipeline-metrics.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/09-pipeline-metrics.png rename to notebooks/DL-gwas-gcp-example/assets/09-pipeline-metrics.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/10-setup-katib.png b/notebooks/DL-gwas-gcp-example/assets/10-setup-katib.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/10-setup-katib.png rename to notebooks/DL-gwas-gcp-example/assets/10-setup-katib.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/11-r2-setup-job.png b/notebooks/DL-gwas-gcp-example/assets/11-r2-setup-job.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/11-r2-setup-job.png rename to notebooks/DL-gwas-gcp-example/assets/11-r2-setup-job.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/11-setup-job.png b/notebooks/DL-gwas-gcp-example/assets/11-setup-job.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/11-setup-job.png rename to notebooks/DL-gwas-gcp-example/assets/11-setup-job.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/12-compare.png b/notebooks/DL-gwas-gcp-example/assets/12-compare.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/12-compare.png rename to notebooks/DL-gwas-gcp-example/assets/12-compare.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/13-parallel-coords.png b/notebooks/DL-gwas-gcp-example/assets/13-parallel-coords.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/13-parallel-coords.png rename to notebooks/DL-gwas-gcp-example/assets/13-parallel-coords.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/14-successful-katib-run.png b/notebooks/DL-gwas-gcp-example/assets/14-successful-katib-run.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/14-successful-katib-run.png rename to notebooks/DL-gwas-gcp-example/assets/14-successful-katib-run.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/click-set-cell-kind.png b/notebooks/DL-gwas-gcp-example/assets/click-set-cell-kind.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/click-set-cell-kind.png rename to 
notebooks/DL-gwas-gcp-example/assets/click-set-cell-kind.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/dl-gwas-headline-1.png b/notebooks/DL-gwas-gcp-example/assets/dl-gwas-headline-1.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/dl-gwas-headline-1.png rename to notebooks/DL-gwas-gcp-example/assets/dl-gwas-headline-1.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/dl-gwas-headline-2.png b/notebooks/DL-gwas-gcp-example/assets/dl-gwas-headline-2.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/dl-gwas-headline-2.png rename to notebooks/DL-gwas-gcp-example/assets/dl-gwas-headline-2.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/enable-compute-engine.png b/notebooks/DL-gwas-gcp-example/assets/enable-compute-engine.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/enable-compute-engine.png rename to notebooks/DL-gwas-gcp-example/assets/enable-compute-engine.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/enable-service-management.png b/notebooks/DL-gwas-gcp-example/assets/enable-service-management.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/enable-service-management.png rename to notebooks/DL-gwas-gcp-example/assets/enable-service-management.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/n2-standard-16.png b/notebooks/DL-gwas-gcp-example/assets/n2-standard-16.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/n2-standard-16.png rename to notebooks/DL-gwas-gcp-example/assets/n2-standard-16.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/old-11-r2-setup-job.png b/notebooks/DL-gwas-gcp-example/assets/old-11-r2-setup-job.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/old-11-r2-setup-job.png rename to notebooks/DL-gwas-gcp-example/assets/old-11-r2-setup-job.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/old-x002-final-results-page.png b/notebooks/DL-gwas-gcp-example/assets/old-x002-final-results-page.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/old-x002-final-results-page.png rename to notebooks/DL-gwas-gcp-example/assets/old-x002-final-results-page.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/run-minikf-startup.png b/notebooks/DL-gwas-gcp-example/assets/run-minikf-startup.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/run-minikf-startup.png rename to notebooks/DL-gwas-gcp-example/assets/run-minikf-startup.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/service-management-api.png b/notebooks/DL-gwas-gcp-example/assets/service-management-api.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/service-management-api.png rename to notebooks/DL-gwas-gcp-example/assets/service-management-api.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/service-usage-api.png b/notebooks/DL-gwas-gcp-example/assets/service-usage-api.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/service-usage-api.png rename to notebooks/DL-gwas-gcp-example/assets/service-usage-api.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/ssh-link.png b/notebooks/DL-gwas-gcp-example/assets/ssh-link.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/ssh-link.png rename 
to notebooks/DL-gwas-gcp-example/assets/ssh-link.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/stop-instance.png b/notebooks/DL-gwas-gcp-example/assets/stop-instance.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/stop-instance.png rename to notebooks/DL-gwas-gcp-example/assets/stop-instance.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/updated-pipeline-params.png b/notebooks/DL-gwas-gcp-example/assets/updated-pipeline-params.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/updated-pipeline-params.png rename to notebooks/DL-gwas-gcp-example/assets/updated-pipeline-params.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/x001-new-notebook.png b/notebooks/DL-gwas-gcp-example/assets/x001-new-notebook.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/x001-new-notebook.png rename to notebooks/DL-gwas-gcp-example/assets/x001-new-notebook.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/x002-final-results-page.png b/notebooks/DL-gwas-gcp-example/assets/x002-final-results-page.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/x002-final-results-page.png rename to notebooks/DL-gwas-gcp-example/assets/x002-final-results-page.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/x003-git-clone.png b/notebooks/DL-gwas-gcp-example/assets/x003-git-clone.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/x003-git-clone.png rename to notebooks/DL-gwas-gcp-example/assets/x003-git-clone.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/x004-launch-terminal.png b/notebooks/DL-gwas-gcp-example/assets/x004-launch-terminal.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/x004-launch-terminal.png rename to notebooks/DL-gwas-gcp-example/assets/x004-launch-terminal.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/x006-skip.png b/notebooks/DL-gwas-gcp-example/assets/x006-skip.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/x006-skip.png rename to notebooks/DL-gwas-gcp-example/assets/x006-skip.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/x007-wait.png b/notebooks/DL-gwas-gcp-example/assets/x007-wait.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/x007-wait.png rename to notebooks/DL-gwas-gcp-example/assets/x007-wait.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/x008-restart-ok.png b/notebooks/DL-gwas-gcp-example/assets/x008-restart-ok.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/x008-restart-ok.png rename to notebooks/DL-gwas-gcp-example/assets/x008-restart-ok.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/x009-preprocessing.png b/notebooks/DL-gwas-gcp-example/assets/x009-preprocessing.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/x009-preprocessing.png rename to notebooks/DL-gwas-gcp-example/assets/x009-preprocessing.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/xx-0002-pick-a-run.png b/notebooks/DL-gwas-gcp-example/assets/xx-0002-pick-a-run.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/xx-0002-pick-a-run.png rename to notebooks/DL-gwas-gcp-example/assets/xx-0002-pick-a-run.png diff --git 
a/tutorials/notebooks/DL-gwas-gcp-example/assets/xx0001-navigate-to-experiment.png b/notebooks/DL-gwas-gcp-example/assets/xx0001-navigate-to-experiment.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/xx0001-navigate-to-experiment.png rename to notebooks/DL-gwas-gcp-example/assets/xx0001-navigate-to-experiment.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/xx0003-pick-pipeline-step.png b/notebooks/DL-gwas-gcp-example/assets/xx0003-pick-pipeline-step.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/xx0003-pick-pipeline-step.png rename to notebooks/DL-gwas-gcp-example/assets/xx0003-pick-pipeline-step.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/xx0006-step-logs.png b/notebooks/DL-gwas-gcp-example/assets/xx0006-step-logs.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/xx0006-step-logs.png rename to notebooks/DL-gwas-gcp-example/assets/xx0006-step-logs.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/assets/xx0007-step-io.png b/notebooks/DL-gwas-gcp-example/assets/xx0007-step-io.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/assets/xx0007-step-io.png rename to notebooks/DL-gwas-gcp-example/assets/xx0007-step-io.png diff --git a/tutorials/notebooks/DL-gwas-gcp-example/nb_assets/x002-final-results-page.png b/notebooks/DL-gwas-gcp-example/nb_assets/x002-final-results-page.png similarity index 100% rename from tutorials/notebooks/DL-gwas-gcp-example/nb_assets/x002-final-results-page.png rename to notebooks/DL-gwas-gcp-example/nb_assets/x002-final-results-page.png diff --git a/tutorials/notebooks/GWASCoatColor/GWAS_coat_color.ipynb b/notebooks/GWASCoatColor/GWAS_coat_color.ipynb similarity index 100% rename from tutorials/notebooks/GWASCoatColor/GWAS_coat_color.ipynb rename to notebooks/GWASCoatColor/GWAS_coat_color.ipynb diff --git a/tutorials/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb b/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb similarity index 100% rename from tutorials/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb rename to notebooks/GenAI/GCP_GenAI_Huggingface.ipynb diff --git a/tutorials/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb b/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb similarity index 100% rename from tutorials/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb rename to notebooks/GenAI/GCP_Pubmed_chatbot.ipynb diff --git a/tutorials/notebooks/GenAI/Gemini_Intro.ipynb b/notebooks/GenAI/Gemini_Intro.ipynb similarity index 100% rename from tutorials/notebooks/GenAI/Gemini_Intro.ipynb rename to notebooks/GenAI/Gemini_Intro.ipynb diff --git a/tutorials/notebooks/GenAI/VertexAIStudioGCP.ipynb b/notebooks/GenAI/VertexAIStudioGCP.ipynb similarity index 100% rename from tutorials/notebooks/GenAI/VertexAIStudioGCP.ipynb rename to notebooks/GenAI/VertexAIStudioGCP.ipynb diff --git a/tutorials/notebooks/GenAI/example_scripts/example_langchain_chat_llama_2_zeroshot.py b/notebooks/GenAI/example_scripts/example_langchain_chat_llama_2_zeroshot.py similarity index 100% rename from tutorials/notebooks/GenAI/example_scripts/example_langchain_chat_llama_2_zeroshot.py rename to notebooks/GenAI/example_scripts/example_langchain_chat_llama_2_zeroshot.py diff --git a/tutorials/notebooks/GenAI/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py b/notebooks/GenAI/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py similarity index 100% rename from 
tutorials/notebooks/GenAI/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py rename to notebooks/GenAI/example_scripts/example_vectorsearch_chat_llama_2_zeroshot.py diff --git a/tutorials/notebooks/GenAI/langchain_on_vertex.ipynb b/notebooks/GenAI/langchain_on_vertex.ipynb similarity index 100% rename from tutorials/notebooks/GenAI/langchain_on_vertex.ipynb rename to notebooks/GenAI/langchain_on_vertex.ipynb diff --git a/tutorials/notebooks/GoogleBatch/.gitkeep b/notebooks/GoogleBatch/.gitkeep similarity index 100% rename from tutorials/notebooks/GoogleBatch/.gitkeep rename to notebooks/GoogleBatch/.gitkeep diff --git a/tutorials/notebooks/GoogleBatch/nextflow/Part1_GBatch_Nextflow.ipynb b/notebooks/GoogleBatch/nextflow/Part1_GBatch_Nextflow.ipynb similarity index 100% rename from tutorials/notebooks/GoogleBatch/nextflow/Part1_GBatch_Nextflow.ipynb rename to notebooks/GoogleBatch/nextflow/Part1_GBatch_Nextflow.ipynb diff --git a/tutorials/notebooks/GoogleBatch/nextflow/Part2_GBatch_Nextflow.ipynb b/notebooks/GoogleBatch/nextflow/Part2_GBatch_Nextflow.ipynb similarity index 100% rename from tutorials/notebooks/GoogleBatch/nextflow/Part2_GBatch_Nextflow.ipynb rename to notebooks/GoogleBatch/nextflow/Part2_GBatch_Nextflow.ipynb diff --git a/tutorials/notebooks/LifeSciencesAPI/nextflow/Part1_LS_API_Nextflow.ipynb b/notebooks/LifeSciencesAPI/nextflow/Part1_LS_API_Nextflow.ipynb similarity index 100% rename from tutorials/notebooks/LifeSciencesAPI/nextflow/Part1_LS_API_Nextflow.ipynb rename to notebooks/LifeSciencesAPI/nextflow/Part1_LS_API_Nextflow.ipynb diff --git a/tutorials/notebooks/LifeSciencesAPI/nextflow/Part2_LS_API_Nextflow.ipynb b/notebooks/LifeSciencesAPI/nextflow/Part2_LS_API_Nextflow.ipynb similarity index 100% rename from tutorials/notebooks/LifeSciencesAPI/nextflow/Part2_LS_API_Nextflow.ipynb rename to notebooks/LifeSciencesAPI/nextflow/Part2_LS_API_Nextflow.ipynb diff --git a/tutorials/notebooks/LifeSciencesAPI/snakemake/LS_API_Snakemake.ipynb b/notebooks/LifeSciencesAPI/snakemake/LS_API_Snakemake.ipynb similarity index 100% rename from tutorials/notebooks/LifeSciencesAPI/snakemake/LS_API_Snakemake.ipynb rename to notebooks/LifeSciencesAPI/snakemake/LS_API_Snakemake.ipynb diff --git a/tutorials/notebooks/SRADownload/SRA-Download.ipynb b/notebooks/SRADownload/SRA-Download.ipynb similarity index 100% rename from tutorials/notebooks/SRADownload/SRA-Download.ipynb rename to notebooks/SRADownload/SRA-Download.ipynb diff --git a/tutorials/notebooks/SpleenLiverSegmentation/README.md b/notebooks/SpleenLiverSegmentation/README.md similarity index 100% rename from tutorials/notebooks/SpleenLiverSegmentation/README.md rename to notebooks/SpleenLiverSegmentation/README.md diff --git a/tutorials/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb b/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb similarity index 100% rename from tutorials/notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb rename to notebooks/SpleenLiverSegmentation/SpleenSeg_Pretrained-4_27.ipynb diff --git a/tutorials/notebooks/SpleenLiverSegmentation/monai_data/Spleen_best_metric_model_pretrained.pth b/notebooks/SpleenLiverSegmentation/monai_data/Spleen_best_metric_model_pretrained.pth similarity index 100% rename from tutorials/notebooks/SpleenLiverSegmentation/monai_data/Spleen_best_metric_model_pretrained.pth rename to notebooks/SpleenLiverSegmentation/monai_data/Spleen_best_metric_model_pretrained.pth diff --git 
a/tutorials/notebooks/elasticBLAST/run_elastic_blast.ipynb b/notebooks/elasticBLAST/run_elastic_blast.ipynb similarity index 100% rename from tutorials/notebooks/elasticBLAST/run_elastic_blast.ipynb rename to notebooks/elasticBLAST/run_elastic_blast.ipynb diff --git a/tutorials/notebooks/ncbi-stat-tutorial/STAT-tutorial.ipynb b/notebooks/ncbi-stat-tutorial/STAT-tutorial.ipynb similarity index 100% rename from tutorials/notebooks/ncbi-stat-tutorial/STAT-tutorial.ipynb rename to notebooks/ncbi-stat-tutorial/STAT-tutorial.ipynb diff --git a/tutorials/notebooks/pangolin/pangolin_pipeline.ipynb b/notebooks/pangolin/pangolin_pipeline.ipynb similarity index 100% rename from tutorials/notebooks/pangolin/pangolin_pipeline.ipynb rename to notebooks/pangolin/pangolin_pipeline.ipynb diff --git a/tutorials/README.md b/tutorials/README.md deleted file mode 100644 index 9d4a898..0000000 --- a/tutorials/README.md +++ /dev/null @@ -1,128 +0,0 @@ -# GCP Tutorial Resources - -_We have pulled together a variety of tutorials here from disparate sources. Some use Compute Engine, others use Vertex AI notebooks and others use only managed services. Tutorials are organized by research method, but we try to designate what GCP services are used as well to help you navigate._ ---------------------------------- -## Overview of Page Contents - -+ [Biomedical Workflows on GCP](#bds) -+ [Artificial Intelligence and Machine Learning](#ml) -+ [Medical Imaging](#mi) -+ [Download SRA Data](#sra) -+ [Variant Calling](#vc) -+ [VCF Query](#vcf) -+ [GWAS](#gwas) -+ [Proteomics](#pro) -+ [RNAseq and Transcriptome Assembly](#rna) -+ [scRNAseq](#sc) -+ [ATACseq and scATACseq](#atac) -+ [Methylseq](#ms) -+ [Metagenomics](#meta) -+ [Multiomics and Biomarker Analysis](#mo) -+ [BLAST+](#bl) -+ [Long Read Sequencing Analysis](#long) -+ [Drug Discovery](#atom) -+ [Using Google Batch](#gbatch) -+ [Using the Life Sciences API (depreciated)](#lsapi) -+ [Public Data Sets](#pub) - -## **Biomedical Workflows on GCP** -There are a lot of ways to run workflows on GCP. Here we list a few possibilities each of which may work for different research aims. As you walk through the various tutorials below, think about how you could possibly run that workflow more efficiently using one of the other methods listed here. - -- The simplest method is probably to spin up a Compute Engine instance, and run your command interactively, or using `screen` or, as a [startup script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux) attached as metadata. -- You could also run your pipeline via a Vertex AI notebook, either by splitting out each command as a different block, or by running a workflow manager (Nextflow etc.). [Schedule notebooks](https://codelabs.developers.google.com/vertex_notebook_executor#0) to let them run longer. -You can find a nice tutorial for using managed notebooks [here](https://codelabs.developers.google.com/vertex_notebook_executor#0). Note that there is now a difference between `managed notebooks` and `user managed notebooks`. The `managed notebooks` have more features and can be scheduled, but give you less control about conda environments/install. 
-- You can interact with [Google Batch](https://cloud.google.com/batch/docs/get-started), or the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest) using a workflow manager like [Nextflow](https://cloud.google.com/life-sciences/docs/tutorials/nextflow), [Snakemake](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/googlebatch.html), or [Cromwell](https://github.com/GoogleCloudPlatform/rad-lab/tree/main/modules/genomics_cromwell). We currently have example notebooks for [Nextflow and Snakemake that use the Life Sciences API](/tutorials/notebooks/LifeSciencesAPI/), for [Google Batch with Nextflow](/tutorials/notebooks/GoogleBatch/nextflow), and for a [local version of Snakemake run via Pangolin](/tutorials/notebooks/pangolin). -- You may find other APIs better suit your needs, such as the [Google Cloud Healthcare Data Engine](https://cloud.google.com/healthcare). -- Most of the notebooks below require just a few CPUs. Start small (maybe 4 CPUs), then scale up as needed. Likewise, when you need a GPU, start with a smaller or older generation GPU (e.g. T4) for testing, then switch to a newer GPU (A100/V100) once you know things will work or you need more compute power. - -## **Artificial Intelligence and Machine Learning** -Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Machine learning on GCP generally occurs within VertexAI. You can learn more about machine learning on GCP at this [Google Crash Course](https://developers.google.com/machine-learning/crash-course). For hands-on examples, try out [this module](https://github.com/NIGMS/COVIDMachineLearningSFSU) developed by San Francisco State University or [this one from the University of Arkansas](https://github.com/NIGMS/MachineLearningUA) developed for the NIGMS Sandbox Project. - -Now that the age of **Generative AI** (Gen AI) has arrived, Google has released a host of Gen AI offerings within the Vertex AI suite. Generative AI models can, for example, extract key information from text, transform speech into text, generate images from descriptions (and vice versa), and much more. Vertex AI's [Vertex AI Studio](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/generative-ai-studio) console allows the user to rapidly create, test, and train generative AI models on the cloud in a safe and secure setting; see our overview in [this tutorial](/tutorials/notebooks/GenAI/VertexAIStudioGCP.ipynb). The studio also has ready-to-use models, all contained within the [Model Garden](https://cloud.google.com/vertex-ai/docs/start/explore-models). These include foundation models, fine-tunable models, and task-specific solutions. -- To learn more about Gen AI on GCP, take a look at our [GenAI tutorials](/tutorials/notebooks/GenAI), which go over several GCP products such as [Gemini](/tutorials/notebooks/GenAI/Gemini_Intro.ipynb) and [Vector Search](/tutorials/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb), and other tools like [Langchain](/tutorials/notebooks/GenAI/langchain_on_vertex.ipynb) and [Huggingface](/tutorials/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb), to deploy, train, and prompt GenAI models and to apply techniques like [Retrieval-Augmented Generation (RAG)](/tutorials/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb). -- Google also provides many generative AI tutorials hosted on [GitHub](https://github.com/GoogleCloudPlatform/generative-ai/tree/main); some examples are under [language](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/language).
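To give a flavor of what these tutorials cover, here is a minimal, hedged sketch of prompting a foundation model from Python with the Vertex AI SDK. The project ID, region, and model name are assumptions to swap for your own; check the Model Garden for what is available in your region.

```python
# Minimal sketch: prompt a Vertex AI foundation model from a notebook.
# Assumes `pip install google-cloud-aiplatform` and that the model name
# below is available to your project and region.
import vertexai
from vertexai.generative_models import GenerativeModel

vertexai.init(project="your-project-id", location="us-central1")  # placeholders

model = GenerativeModel("gemini-1.0-pro")
response = model.generate_content("In two sentences, what does ATAC-seq measure?")
print(response.text)
```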
- -## **Medical Image Segmentation** -Medical image analysis is the application of computational algorithms and techniques to extract meaningful information from medical images for diagnosis, treatment planning, and research purposes. It typically requires large image files, elastic storage, and accelerated computing. -- Most medical imaging analyses are done in notebooks, so we would recommend downloading the Jupyter notebook from [here](/tutorials/notebooks/SpleenLiverSegmentation) and then importing or cloning it into VertexAI. The tutorial walks through image segmentation using the MONAI framework. -- You can also request early access to the new [Google Medical Imaging Suite](https://cloud.google.com/medical-imaging) to see if it would fit your use case. - -## **Download Data From the Sequence Read Archive (SRA)** -Next-generation sequence data is housed in the NCBI Sequence Read Archive (SRA). You can access these data using the SRA Toolkit. We walk you through this in [this notebook](/tutorials/notebooks/SRADownload), including how to use BigQuery to generate your list of accessions (a minimal query sketch follows this section). You can also use BigQuery to create a list of accessions for download using [this setup guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery/) and this [query guide](https://www.ncbi.nlm.nih.gov/sra/docs/sra-bigquery-examples/). Additional example notebooks can be found at this [NCBI repo](https://github.com/ncbi/ASHG-Workshop-2021). In particular, we recommend [this notebook](https://github.com/ncbi/ASHG-Workshop-2021/blob/main/1_Basic_BigQuery_Examples.ipynb), which goes into more detail on using BigQuery to access the results of the SRA Taxonomic Analysis Tool, whose results often differ from the user-supplied species name due to contamination, error, or samples being metagenomic in nature. Further, [this notebook](https://github.com/ncbi/ASHG-Workshop-2021/blob/main/2_Array_Examples.ipynb) does a deep dive on parsing the BigQuery results and may give you some good ideas on how to best search for samples from SRA. The SRA metadata and taxonomy analyses are in separate BigQuery tables; you can learn how to join those two tables using SQL from [this Powerpoint](https://github.com/NCBI-Codeathons/NCGAS-cloud-workshop/blob/main/5_BigQuery.pptx) or from our tutorial [here](/tutorials/notebooks/ncbi-stat-tutorial/). Finally, NCBI released [this workshop](https://github.com/ncbi/workshop-asm-ngs-2022/wiki) in 2022 that walks through a wide variety of BigQuery applications with NCBI datasets; we highly recommend you take a look.
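Here is that minimal sketch, using the BigQuery Python client. The table and column names follow NCBI's BigQuery documentation, but treat them as assumptions to verify in the BigQuery console; queries are billed to the project you authenticate with.

```python
# Minimal sketch: pull a few recent human RNA-Seq accessions from the
# public SRA metadata table. Assumes `pip install google-cloud-bigquery`
# and that Application Default Credentials are available (they are in
# Vertex AI notebooks).
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT acc, organism, assay_type, releasedate
    FROM `nih-sra-datastore.sra.metadata`
    WHERE organism = 'Homo sapiens'
      AND assay_type = 'RNA-Seq'
    ORDER BY releasedate DESC
    LIMIT 10
"""

for row in client.query(query).result():
    print(row.acc, row.releasedate)
```

The resulting accession list can be fed straight to the SRA Toolkit's `prefetch`/`fasterq-dump` steps covered in the notebook above.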
- -## **Variant Calling** -Genomic variant calling is the process of identifying and characterizing genetic variations from DNA sequencing data to understand differences in an individual's genetic makeup. -- This [Google tutorial](https://cloud.google.com/life-sciences/docs/tutorials/gatk) shows you how to run the GATK Best Practices pipeline for genomic variant calling using the Life Sciences API. There is a section about increasing your account quotas; you can skip that. You could also run GATK using any of the workflow managers and submit to the Life Sciences API. -- One tutorial specific to somatic variant calling comes from the Sheffield Bioinformatics Core [here](https://sbc.shef.ac.uk/somatic-variants/index.nb.html). It runs on Galaxy, but can be adapted to run in GCP. At the very least, the [data](https://drive.google.com/drive/folders/1RhrmfW3vMhPwAiBGdFIKfINWMsdvIG6E) may prove useful to you. - -## **Query a VCF file in Big Query** -The output of genomic variant calling workflows is a file in the variant call format (VCF). These are often large, structured data files that can be searched using database query tools such as BigQuery. -- Learn how to use BigQuery to run queries against large VCF files from gnomAD data using [this notebook](https://github.com/GoogleCloudPlatform/rad-lab/blob/main/modules/data_science/scripts/build/notebooks/Exploring_gnomad_on_BigQuery.ipynb). If any cells give you errors, try running that cell again and it should work; there seems to be some lag time between cells. A pared-down query sketch follows.
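The sketch below shows the general shape of such a query with the BigQuery Python client. The exact table and column names are assumptions based on the public gnomAD dataset layout; browse the `bigquery-public-data` project in the BigQuery console to confirm them before running.

```python
# Minimal sketch: count gnomAD variants in a small window of chromosome 21.
# Table and column names are assumptions -- verify them in the BigQuery
# console under the `bigquery-public-data` project before running.
from google.cloud import bigquery

client = bigquery.Client()

query = """
    SELECT COUNT(*) AS n_variants
    FROM `bigquery-public-data.gnomAD.v3_genomes__chr21`
    WHERE start_position BETWEEN 5030000 AND 5040000
"""

result = list(client.query(query).result())
print(f"variants in window: {result[0].n_variants}")
```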
- -## **Genome Wide Association Studies** -Genome-wide association studies (GWAS) are large-scale investigations that analyze the genomes of many individuals to identify common genetic variants associated with traits, diseases, or other phenotypes. -- This [NIH CFDE written tutorial](https://training.nih-cfde.org/en/latest/Bioinformatic-Analyses/GWAS-in-the-cloud) walks you through running a simple GWAS using AWS, so we have rewritten it as a notebook that works on GCP [here](/tutorials/notebooks/GWASCoatColor). Make sure you select R as your kernel when you spin up your notebook so that you can switch between R and Python (this only applies to 'User Managed Notebooks'). Note that our team experienced conda permission issues with the new Managed Notebooks for this tutorial, so we recommend the 'User Managed Notebooks'. Also, if the imported notebook has cells already printed out, just go to Kernel > Restart Kernel and Clear all Outputs. -- [This tutorial](https://github.com/david-thrower-nih/DL-gwas-gcp-example) from NIH NIEHS (credit to David Thrower) builds on a published deep learning method for GWAS of soybeans and uses Kubeflow and AutoML on a Kubernetes instance. - -## **Proteomics** -Proteomics is the study of the entire set of proteins in a cell, tissue, or organism, aiming to understand their structure, function, and interactions to uncover insights into biological processes and diseases. Although most primary proteomic analyses occur in proprietary software platforms, a lot of secondary analysis happens in Jupyter or R notebooks; we give several examples here: -- Use BigQuery to run a Kruskal-Wallis test on proteomics data using [these notebooks](https://github.com/isb-cgc/Community-Notebooks/tree/master/FeaturedNotebooks). Clone the repo into Vertex AI, or just drag the notebooks into a Vertex AI Workbench instance. In the notebook titled 'ACM_BCB_2020_POSTER_KruskalWallisTest_ProteinGeneExpression_vs_ClinicalFeatures.ipynb', the first BigQuery cell may throw an error; ignore this and keep going, as the rest of the notebook should run fine. Also, in that first cell, make sure you add your Project ID. See this [doc](/docs/protein_setup.md) for environment setup instructions. -- Run AlphaFold in Vertex AI using [this notebook](https://github.com/GoogleCloudPlatform/vertex-ai-samples/blob/main/community-content/alphafold_on_workbench/AlphaFold.ipynb). Make sure you have a GPU for your notebook instance, and follow [these instructions](https://cloud.google.com/blog/products/ai-machine-learning/running-alphafold-on-vertexai) for setting up your environment. Namely, under Environment, select `Custom container`, and then for `Docker container image` paste in the following: `west1-docker.pkg.dev/cloud-devrel-public-resources/alphafold/alphafold-on-gcp:latest`. -- Conduct secondary analysis of proteomic data using this [NIGMS Sandbox notebook](https://github.com/NIGMS/ProteomicsUAMS), developed by the University of Arkansas for Medical Sciences. - -## **RNAseq and Transcriptome Assembly** -RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks. -- You can run this [Nextflow tutorial](https://nf-co.re/rnaseq/3.7) for RNAseq in a variety of ways on GCP. Following the instructions outlined above, you could use Compute Engine, the [Life Sciences API](https://cloud.google.com/life-sciences/docs/tutorials/nextflow), or a Vertex AI notebook. -- For a notebook version of a complete RNAseq pipeline from Fastq to Salmon quantification, go through these tutorials from the [NIGMS Sandbox Project](https://github.com/NIGMS/RNAseqUM) developed by The University of Maine INBRE. -- Likewise, [this multi-omics module](https://github.com/NIGMS/MultiomicsUND) from the University of North Dakota has an RNAseq workflow. - -Transcriptome assembly is the process of reconstructing the complete set of RNA transcripts in a cell or tissue from fragmented sequencing data, providing valuable insights into gene expression and functional analysis. -- [This module](https://github.com/NIGMS/rnaAssemblyMDI) developed by the MDI Biological Laboratory for the NIGMS Sandbox Project walks you through transcriptome assembly using Nextflow. - -## **Single Cell RNAseq** -Single-cell RNA sequencing (scRNA-seq) is a technique that enables the analysis of gene expression at the individual cell level, providing insights into cellular heterogeneity, identifying rare cell types, and revealing cellular dynamics and functional states within complex biological systems. -- This [NVIDIA blog](https://developer.nvidia.com/blog/accelerating-single-cell-genomic-analysis-using-rapids/) details how to run an accelerated scRNAseq pipeline using RAPIDS. You can find a link to the GitHub repository that has lots of example notebooks [here](https://github.com/clara-parabricks/rapids-single-cell-examples). For each example use case they show some nice benchmarking data with time and cost for each machine type. You will see that most runs cost less than $1.00 with GPU machines. Pay careful attention to the environment setup, as there are a lot of dependencies for these notebooks and it can be challenging to get set up. -- The [Scanpy tutorials](https://scanpy.readthedocs.io/en/stable/tutorials.html) page has a lot of good CPU-based examples you could run in Vertex AI (see the minimal sketch after this section). Clone this [GitHub repo](https://github.com/scverse/scanpy-tutorials) to get the notebooks directly. -- Alternatively, here is a [GitHub repository](https://github.com/mdozmorov/scRNA-seq_notes) with a curated list of scRNAseq resources and tutorials. We did not test these in Cloud Lab, but wanted to make them available in case you needed additional resources.
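As a minimal sketch of the kind of CPU-based workflow those tutorials cover, the following runs a basic cluster-and-embed pass on the small PBMC demo dataset bundled with scanpy (assumes `pip install scanpy leidenalg`):

```python
# Minimal sketch: QC, normalize, cluster, and embed the PBMC3k demo
# dataset that ships with scanpy's datasets module.
import scanpy as sc

adata = sc.datasets.pbmc3k()  # ~2,700 peripheral blood mononuclear cells

# Basic filtering and normalization.
sc.pp.filter_cells(adata, min_genes=200)
sc.pp.filter_genes(adata, min_cells=3)
sc.pp.normalize_total(adata, target_sum=1e4)
sc.pp.log1p(adata)

# Feature selection and dimensionality reduction.
sc.pp.highly_variable_genes(adata, n_top_genes=2000)
adata = adata[:, adata.var.highly_variable].copy()
sc.pp.pca(adata, n_comps=50)
sc.pp.neighbors(adata, n_neighbors=15)

# Graph-based clustering and a 2-D embedding for visualization.
sc.tl.leiden(adata)
sc.tl.umap(adata)
sc.pl.umap(adata, color="leiden")
```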
- -## **ATACseq and Single Cell ATACseq** -ATAC-seq is a technique that allows scientists to understand how DNA is packaged in cells by identifying the regions of DNA that are accessible and potentially involved in gene regulation. -- [This module](https://github.com/NIGMS/atacseqUNMC) walks you through an ATACseq and single-cell ATACseq workflow on Google Cloud. The module was developed by the University of Nebraska Medical Center for the NIGMS Sandbox Project. - -## **Methylseq** -As one of the most abundant and well-studied epigenetic modifications, DNA methylation plays an essential role in normal cell development and has various effects on transcription, genome stability, and DNA packaging within cells. Methylseq is a technique to identify methylated regions of the genome. -- The University of Hawai'i at Manoa developed [this set of notebooks](https://github.com/NIGMS/MethylSeqUH) that walks you through a Methylseq analysis as part of the NIGMS Sandbox Program. - -## **Metagenomics** -Metagenomics is the study of genetic material collected directly from environmental samples, enabling the exploration of microbial communities, their diversity, and their functional potential, without the need for laboratory culturing. -- [This module](https://github.com/NIGMS/MetagenomicsUSD) walks you through conducting a metagenomic analysis using the command line and Nextflow. The module was developed by the University of South Dakota as part of the NIGMS Sandbox Project. - -## **Multiomic Analysis and Biomarker Discovery** -Multiomic analysis involves integrating data across modalities (e.g., genomic, transcriptomic, phenotypic) to generate additive insights. -- [This set of notebooks](https://github.com/NIGMS/MultiomicsUND) gives you an example of conducting multiomic analysis in Jupyter notebooks and was developed by the University of North Dakota as part of the NIGMS Sandbox Project. - -Biomarker discovery is the process of identifying specific molecules or characteristics that can serve as indicators of biological processes, diseases, or treatment responses, aiding in diagnosis, prognosis, and personalized medicine. It is typically conducted through comprehensive analysis of genomic, proteomic, metabolomic, and clinical data, using techniques such as high-throughput screening, bioinformatics, and statistical analysis to identify patterns or signatures that differentiate between healthy and diseased individuals, or responders and non-responders to specific treatments. -- [This module](https://github.com/NIGMS/BiomarkersURI), developed by the University of Rhode Island for the NIGMS Sandbox Project, walks you through conducting some common biomarker discovery analyses in R. - -## **BLAST+** -NCBI BLAST (Basic Local Alignment Search Tool) is a widely used bioinformatics program provided by the National Center for Biotechnology Information (NCBI) that compares nucleotide or protein sequences against a large database to identify similar sequences and infer evolutionary relationships, functional annotations, and structural information. -- This [Common Data Fund](https://training.nih-cfde.org/en/latest/Cloud-Platforms/Introduction-to-GCP/gcp3/) tutorial explains how to use basic BLAST on GCP. -- We also rewrote [this ElasticBLAST tutorial](https://blast.ncbi.nlm.nih.gov/doc/elastic-blast/quickstart-gcp.html) as a [notebook](/tutorials/notebooks/elasticBLAST) that will work in VertexAI.
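For a quick programmatic taste of BLAST from a notebook, the hedged sketch below uses Biopython's web BLAST interface. Note that this submits the query to NCBI's public servers rather than running on GCP, so it is only appropriate for small, occasional searches; use ElasticBLAST, as described above, for anything at scale.

```python
# Minimal sketch: a small web BLAST search via Biopython (`pip install biopython`).
# This hits NCBI's public service, so keep queries small and expect a wait.
from Bio.Blast import NCBIWWW, NCBIXML

# First 70 bp of the E. coli K-12 genome, used here as a toy query.
sequence = "AGCTTTTCATTCTGACTGCAACGGGCAATATGTCTCTGTGTGGATTAAAAAAAGAGTGTCTGATAGCAGC"

result_handle = NCBIWWW.qblast("blastn", "nt", sequence)
record = NCBIXML.read(result_handle)

for alignment in record.alignments[:5]:
    hsp = alignment.hsps[0]
    print(f"{alignment.title[:60]}...  e-value: {hsp.expect}")
```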
- -## **Long Read Sequence Analysis** -Long-read DNA sequence analysis involves analyzing sequencing reads typically longer than 10,000 base pairs (bp), compared with short-read sequencing, where reads are about 150 bp in length. Oxford Nanopore has a fairly complete offering of notebook tutorials for handling long-read data to do a variety of things, including variant calling, RNAseq, SARS-CoV-2 analysis, and much more. You can find a list and description of notebooks [here](https://labs.epi2me.io/nbindex/), or clone the [GitHub repo](https://github.com/epi2me-labs). Note that these notebooks assume you are running locally and accessing the epi2me notebook server. To run them in Cloud Lab, skip the first cell that connects to the server; the rest of the notebook should then run correctly, with a few tweaks. If you are just looking to try out notebooks, don't start with these. If you are interested in long-read sequence analysis, some troubleshooting may be needed to adapt these to the Cloud Lab environment. You may even need to rewrite them in a fresh notebook by adapting the commands. - -## **Drug Discovery** -The [Accelerating Therapeutics for Opportunities in Medicine (ATOM) Consortium](https://atomscience.org/) created a series of [Jupyter notebooks](https://github.com/ATOMScience-org/AMPL/tree/master/atomsci/ddm/examples/tutorials) that walk you through the ATOM approach to Drug Discovery. - -These notebooks were created to run in Google Colab, so if you run them in Google Cloud, you will need to make a few modifications. First, we recommend you use a [Google Managed Notebook](https://cloud.google.com/vertex-ai/docs/workbench/managed/introduction) rather than a User-Managed notebook, simply because the Google Managed notebooks already have TensorFlow and other dependencies installed. Be sure to attach a GPU to your instance (T4 is fine). Also, you will need to comment out `%tensorflow_version 2.x` since that is a Colab-specific command, and `pip install` a few packages as needed. If you get errors with `deepchem`, try running `pip install --pre deepchem[tensorflow]` and/or `pip install --pre deepchem[torch]`. Note that some notebooks require a TensorFlow kernel, while others require PyTorch. You may also run into a Pandas error; if so, reach out to the ATOM developers on GitHub for the best solution to this issue. - -## **Using Google Batch** -You can interact with Google Batch directly to submit commands, or, more commonly, you can interact with it through orchestration engines like [Nextflow](https://www.nextflow.io/docs/latest/google.html) and [Cromwell](https://cromwell.readthedocs.io/en/latest/backends/GCPBatch/). We have tutorials that use Google Batch via [Nextflow](/tutorials/notebooks/GoogleBatch/nextflow), where we run the nf-core Methylseq pipeline, as well as several from the NIGMS Sandbox, including [transcriptome assembly](https://github.com/NIGMS/rnaAssemblyMDI), [multiomics](https://github.com/NIGMS/MultiomicsUND), [methylseq](https://github.com/NIGMS/MethylSeqUH), and [metagenomics](https://github.com/NIGMS/MetagenomicsUSD). - -## **Using the Life Sciences API (deprecated)** -__The Life Sciences API is deprecated on GCP and will no longer be available on the platform after July 8, 2025;__ we recommend using Google Batch instead.
For now, you can still interact with the Life Sciences API directly to submit commands, or, more commonly, through orchestration engines like [Snakemake](https://snakemake.readthedocs.io/en/v7.0.0/executor_tutorial/google_lifesciences.html); note that this version of Snakemake's Google Cloud executor supports only the Life Sciences API. - -## **Public Data Sets** -Google hosts many public datasets that you can use for testing. These can be viewed [here](https://cloud.google.com/life-sciences/docs/resources/public-datasets) and can be accessed via [BigQuery](https://cloud.google.com/bigquery/public-data) or directly from the Cloud Storage bucket. For example, to list the Phase 3 1000 Genomes data at the command line, type `gsutil ls gs://genomics-public-data/1000-genomes-phase-3`. From 68cb83d3f4c6848c3f954794dda0a0c4bfe2090d Mon Sep 17 00:00:00 2001 From: zbyosufzai <145053952+zbyosufzai@users.noreply.github.com> Date: Wed, 13 Mar 2024 10:53:35 -0400 Subject: [PATCH 2/5] Rename GCP_Pubmed_chatbot.ipynb to Pubmed_RAG_chatbot.ipynb --- .../GenAI/{GCP_Pubmed_chatbot.ipynb => Pubmed_RAG_chatbot.ipynb} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename notebooks/GenAI/{GCP_Pubmed_chatbot.ipynb => Pubmed_RAG_chatbot.ipynb} (100%) diff --git a/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb b/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb similarity index 100% rename from notebooks/GenAI/GCP_Pubmed_chatbot.ipynb rename to notebooks/GenAI/Pubmed_RAG_chatbot.ipynb From b6c655983cc46b9ea086ee61608c0c3c7eceecca Mon Sep 17 00:00:00 2001 From: zbyosufzai <145053952+zbyosufzai@users.noreply.github.com> Date: Wed, 13 Mar 2024 11:01:12 -0400 Subject: [PATCH 3/5] Update README.md --- README.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/README.md b/README.md index 415ded9..feac5a1 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,6 @@ # GCP Tutorial Resources -_We have pulled together a variety of tutorials here from disparate sources. Some use Compute Engine, others use Vertex AI notebooks and others use only managed services. Tutorials are organized by research method, but we try to designate what GCP services are used as well to help you navigate._ +_We have pulled together a variety of tutorials here from disparate sources. Some use Compute Engine, others use Vertex AI notebooks, and others use only managed services. Tutorials are organized by research method, but we try to designate what GCP services are used to help you navigate._ --------------------------------- ## Overview of Page Contents @@ -30,7 +30,7 @@ There are a lot of ways to run workflows on GCP. Here we list a few possibilitie - The simplest method is probably to spin up a Compute Engine instance, and run your command interactively, or using `screen` or, as a [startup script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux) attached as metadata. - You could also run your pipeline via a Vertex AI notebook, either by splitting out each command as a different block, or by running a workflow manager (Nextflow etc.). [Schedule notebooks](https://codelabs.developers.google.com/vertex_notebook_executor#0) to let them run longer. -You can find a nice tutorial for using managed notebooks [here](https://codelabs.developers.google.com/vertex_notebook_executor#0). Note that there is now a difference between `managed notebooks` and `user managed notebooks`. The `managed notebooks` have more features and can be scheduled, but give you less control about conda environments/install.
+You can find a nice tutorial for using managed notebooks [here](https://codelabs.developers.google.com/vertex_notebook_executor#0). Note that there is now a difference between `managed notebooks` and `user managed notebooks`. The `managed notebooks` have more features and can be scheduled, but give you less control for conda environments/installs. - You can interact with [Google Batch](https://cloud.google.com/batch/docs/get-started), or the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest) using a workflow manager like [Nextflow](https://cloud.google.com/life-sciences/docs/tutorials/Nextflow), [Snakemake](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/googlebatch.html), or [Cromwell](https://github.com/GoogleCloudPlatform/rad-lab/tree/main/modules/genomics_cromwell). We currently have example notebooks for both [Nextflow and Snakemake that use the Life Sciences API](/notebooks/LifeSciencesAPI/), as well as [Google Batch with Nextflow](/notebooks/GoogleBatch/Nextflow) as well as a [local version of Snakemake run via Pangolin](/notebooks/pangolin). - You may find other APIs better suit your needs, such as the [Google Cloud Healthcare Data Engine](https://cloud.google.com/healthcare). - Most of the notebooks below require just a few CPUs. Start small (maybe 4 CPUs), then scale up as needed. Likewise, when you need a GPU, start with a smaller or older generation GPU (e.g. T4) for testing, then switch to a newer GPU (A100/V100) once you know things will work or you need more compute power. @@ -39,7 +39,7 @@ Machine learning is a subfield of artificial intelligence that focuses on the development of algorithms and models that enable computers to learn from and make predictions or decisions based on data, without being explicitly programmed. Machine learning on GCP generally occurs within VertexAI. You can learn more about machine learning on GCP at this [Google Crash Course](https://developers.google.com/machine-learning/crash-course). For hands-on examples, try out [this module](https://github.com/NIGMS/COVIDMachineLearningSFSU) developed by San Francisco State University or [this one from the University of Arkansas](https://github.com/NIGMS/MachineLearningUA) developed for the NIGMS Sandbox Project. Now that the age of **Generative AI** (Gen AI) has arrived, Google has released a host of Gen AI offerings within the Vertex AI suite. Generative AI models can, for example, extract key information from text, transform speech into text, generate images from descriptions (and vice versa), and much more. Vertex AI's [Vertex AI Studio](https://cloud.google.com/vertex-ai/docs/generative-ai/learn/generative-ai-studio) console allows the user to rapidly create, test, and train generative AI models on the cloud in a safe and secure setting; see our overview in [this tutorial](/notebooks/GenAI/VertexAIStudioGCP.ipynb). The studio also has ready-to-use models all contained within the [Model Garden](https://cloud.google.com/vertex-ai/docs/start/explore-models). These include foundation models, fine-tunable models, and task-specific solutions.
-- To learn more about Gen AI on GCP take a look at our [GenAI tutorials](/notebooks/GenAI) that go over several GCP products such as [Gemini](/notebooks/GenAI/Gemini_Intro.ipynb) and [Vector Search](/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb) and other tools like [Langchain](/notebooks/GenAI/langchain_on_vertex.ipynb) and [Huggingface](/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb) to deploy, train, prompt, and implement techniques like [Retrieval-Augmented Generation (RAG)](/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb) to GenAI models. +- To learn more about Gen AI on GCP take a look at our [GenAI tutorials](/notebooks/GenAI) that go over several GCP products such as [Gemini](/notebooks/GenAI/Gemini_Intro.ipynb) and [Vector Search](/notebooks/GenAI/GCP_Pubmed_chatbot.ipynb) and other tools like [Langchain](/notebooks/GenAI/langchain_on_vertex.ipynb) and [Huggingface](/notebooks/GenAI/GCP_GenAI_Huggingface.ipynb) to deploy, train, prompt, and implement techniques like [Retrieval-Augmented Generation (RAG)](/notebooks/GenAI/Pubmed_RAG_chatbot.ipynb) to GenAI models. - Google also provides many generative AI tutorials hosted on [GitHub](https://github.com/GoogleCloudPlatform/generative-ai/tree/main). Some examples they provide are under [language here](https://github.com/GoogleCloudPlatform/generative-ai/tree/main/language). ## **Medical Image Segmentation** From b7a9e24ad286641f03647195f26b415c2226d0e4 Mon Sep 17 00:00:00 2001 From: zbyosufzai <145053952+zbyosufzai@users.noreply.github.com> Date: Wed, 13 Mar 2024 11:22:46 -0400 Subject: [PATCH 4/5] fixed link --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index feac5a1..f4c2eca 100644 --- a/README.md +++ b/README.md @@ -31,7 +31,7 @@ There are a lot of ways to run workflows on GCP. Here we list a few possibilitie - The simplest method is probably to spin up a Compute Engine instance, and run your command interactively, or using `screen` or, as a [startup script](https://cloud.google.com/compute/docs/instances/startup-scripts/linux) attached as metadata. - You could also run your pipeline via a Vertex AI notebook, either by splitting out each command as a different block, or by running a workflow manager (Nextflow etc.). [Schedule notebooks](https://codelabs.developers.google.com/vertex_notebook_executor#0) to let them run longer. You can find a nice tutorial for using managed notebooks [here](https://codelabs.developers.google.com/vertex_notebook_executor#0). Note that there is now a difference between `managed notebooks` and `user managed notebooks`. The `managed notebooks` have more features and can be scheduled, but give you less control for conda environments/installs. -- You can interact with [Google Batch](https://cloud.google.com/batch/docs/get-started), or the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest) using a workflow manager like [Nextflow](https://cloud.google.com/life-sciences/docs/tutorials/Nextflow), [Snakemake](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/googlebatch.html), or [Cromwell](https://github.com/GoogleCloudPlatform/rad-lab/tree/main/modules/genomics_cromwell). We currently have example notebooks for both [Nextflow and Snakemake that use the Life Sciences API](/notebooks/LifeSciencesAPI/), as well as [Google Batch with Nextflow](/notebooks/GoogleBatch/Nextflow) as well as a [local version of Snakemake run via Pangolin](/notebooks/pangolin).
+- You can interact with [Google Batch](https://cloud.google.com/batch/docs/get-started), or the [Google Life Sciences API](https://cloud.google.com/life-sciences/docs/reference/rest) using a workflow manager like [Nextflow](https://cloud.google.com/life-sciences/docs/tutorials/nextflow), [Snakemake](https://snakemake.github.io/snakemake-plugin-catalog/plugins/executor/googlebatch.html), or [Cromwell](https://github.com/GoogleCloudPlatform/rad-lab/tree/main/modules/genomics_cromwell). We currently have example notebooks for both [Nextflow and Snakemake that use the Life Sciences API](/notebooks/LifeSciencesAPI/), as well as [Google Batch with Nextflow](/notebooks/GoogleBatch/Nextflow) as well as a [local version of Snakemake run via Pangolin](/notebooks/pangolin). - You may find other APIs better suit your needs, such as the [Google Cloud Healthcare Data Engine](https://cloud.google.com/healthcare). - Most of the notebooks below require just a few CPUs. Start small (maybe 4 CPUs), then scale up as needed. Likewise, when you need a GPU, start with a smaller or older generation GPU (e.g. T4) for testing, then switch to a newer GPU (A100/V100) once you know things will work or you need more compute power. From 9af8b6c736ed08f202dd3371e37ff601d67c8bd5 Mon Sep 17 00:00:00 2001 From: zbyosufzai <145053952+zbyosufzai@users.noreply.github.com> Date: Wed, 13 Mar 2024 11:32:34 -0400 Subject: [PATCH 5/5] fixed link --- README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/README.md b/README.md index f4c2eca..5c120e4 100644 --- a/README.md +++ b/README.md @@ -73,7 +73,7 @@ Proteomics is the study of the entire set of proteins in a cell, tissue, or orga ## **RNAseq and Transcriptome Assembly** RNA-seq analysis is a high-throughput sequencing method that allows the measurement and characterization of gene expression levels and transcriptome dynamics. Workflows are typically run using workflow managers, and final results can often be visualized in notebooks. -- You can run this [Nextflow tutorial](https://nf-co.re/rnaseq/3.7) for RNAseq a variety of ways on GCP. Following the instructions outlined above, you could use Compute Engine, [Life Sciences API](https://cloud.google.com/life-sciences/docs/tutorials/Nextflow), or in Vertex AI notebook. +- You can run this [Nextflow tutorial](https://nf-co.re/rnaseq/3.7) for RNAseq in a variety of ways on GCP. Following the instructions outlined above, you could use Compute Engine, the [Life Sciences API](https://cloud.google.com/life-sciences/docs/tutorials/nextflow), or a Vertex AI notebook. - For a notebook version of a complete RNAseq pipeline from Fastq to Salmon quantification go through these tutorials from the [NIGMS Sandbox Project](https://github.com/NIGMS/RNAseqUM) developed by The University of Maine. - Likewise, [This multi-omics module](https://github.com/NIGMS/MultiomicsUND) from the University of North Dakota includes an RNAseq component.