diff --git a/01-intro.Rmd b/01-intro.Rmd
index 3b31860..d16b845 100644
--- a/01-intro.Rmd
+++ b/01-intro.Rmd
@@ -21,7 +21,7 @@ One of the key challenges in cancer informatics is dealing with and managing the
 This course is intended for researchers, including postdocs and students, with limited to intermediate experience with informatics research. The conceptual material will also be useful for those in management roles who are collecting data and using informatics pipelines.
-```{r, fig.align='center', echo = FALSE, fig.alt= "For individuals whom: Have no formal training in informatics. Are relatively new to informatics. Want to learn the basics of computers and shared computing resources. Want guidance for choosing computing options", out.width= "100%"}
+```{r for_individuals_who, fig.align='center', echo = FALSE, fig.alt= "For individuals whom: Have no formal training in informatics. Are relatively new to informatics. Want to learn the basics of computers and shared computing resources. Want guidance for choosing computing options", out.width= "100%"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHEAbES1Agjy7Ex2IpVAoUIoBFbsq0/edit#slide=id.g11db82d2864_1_65")
@@ -29,7 +29,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHE
 ## Topics covered:
-```{r, fig.align='center', echo = FALSE, fig.alt= "Concepts discussed in the Computing for Cancer Informatics course: How computer hardware and software work. Computing resources designed for research Data sizes and computational capacity. Guidance about computing resource decisions. How shared computing resources work. Etiquette for shared computing resources.", out.width= "100%"}
+```{r topics_covered, fig.align='center', echo = FALSE, fig.alt= "Concepts discussed in the Computing for Cancer Informatics course: How computer hardware and software work. Computing resources designed for research Data sizes and computational capacity. Guidance about computing resource decisions. How shared computing resources work. Etiquette for shared computing resources.", out.width= "100%"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHEAbES1Agjy7Ex2IpVAoUIoBFbsq0/edit#slide=id.g11db82d2864_1_81")
 ```
@@ -38,6 +38,6 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHE
 The course will cover key underlying principles and concepts in computing. We will go over concrete discussions of the differences between cloud and local computing. The course will also highlight a number of computing options and describe etiquette basics for using shared resources.
-```{r, fig.align='center', echo = FALSE, fig.alt= "Overall Course Learning Objectives. This course will demonstrate how to: 1.Recognize various data management systems especially for cancer research related data, 2.Compare and make informed decisions about computation platforms (including economic considerations),3.Implement best practices for data security and privacy, 4. Share data safely and securely in a variety of contexts,5.Handle IRB and data access requests,6.Apply ethical consideration in data management workflows", out.width= "100%"}
+```{r learning_objectives, fig.align='center', echo = FALSE, fig.alt= "Overall Course Learning Objectives. This course will demonstrate how to: 1.Recognize various data management systems especially for cancer research related data, 2.Compare and make informed decisions about computation platforms (including economic considerations),3.Implement best practices for data security and privacy, 4. Share data safely and securely in a variety of contexts,5.Handle IRB and data access requests,6.Apply ethical consideration in data management workflows", out.width= "100%"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHEAbES1Agjy7Ex2IpVAoUIoBFbsq0/edit#slide=id.gf5f8818810_1_5")
 ```
diff --git a/03-Binary_data_to_computations.Rmd b/03-Binary_data_to_computations.Rmd
index 94292bb..ac46581 100644
--- a/03-Binary_data_to_computations.Rmd
+++ b/03-Binary_data_to_computations.Rmd
@@ -113,7 +113,7 @@ Previously, back when a university might have one single computer, as they were
 ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHEAbES1Agjy7Ex2IpVAoUIoBFbsq0/edit#slide=id.gf96b1d997a_0_1")
 ```
-There were many [different kinds](https://www.jkmscott.net/data/Punched%20Cards.html) of punch cards over time, see @scott_collection_2016 for a collection.
+There were many [different kinds](https://www.jkmscott.net/data/PunchedCards/PunchedCards.html) of punch cards over time, see @scott_collection_2016 for a collection.
@@ -125,7 +125,7 @@ Also check out @hardware_history_2021 for really interesting and more extensive
 Also, here is some fascinating additional reading on the role of women as computer operators starting in the 1940s. Initially computer science was actually thought of as a field for women, however this changed over time (and now women and gender minorities are hopefully becoming more represented) :
-* [Article titled: Woman pioneered computer programming. Then men took their industry over](https://timeline.com/women-pioneered-computer-programming-then-men-took-their-industry-over-c2959b822523) [@visions_women_2017]
+* [Article titled: Woman pioneered computer programming. Then men took their industry over](https://pages.memoryoftheworld.org/library/Josh%20O%27Connor/Women%20pioneered%20computer%20programming.%20Then%20men%20took%20their%20industry%20over_%20%28321%29/Women%20pioneered%20computer%20programming.%20Then%20-%20Josh%20O%27Connor.pdf) [@visions_women_2017]
 * [Article titled: Untold History of AI: Invisible Women Programmed America's First Electronic Computer The “human computers” who operated ENIAC have received little credit](https://spectrum.ieee.org/untold-history-of-ai-invisible-woman-programmed-americas-first-electronic-computer) [@untold_2019]
diff --git a/04-Computing_Systems.Rmd b/04-Computing_Systems.Rmd
index 98ecc68..6936bf6 100644
--- a/04-Computing_Systems.Rmd
+++ b/04-Computing_Systems.Rmd
@@ -295,7 +295,7 @@ Many of us use cloud storage regularly for Google Docs and backing up photos usi
 Furthermore, this also allows for more opportunity to scale your work to a larger extent, as there is generally more computing capacity possible with most cloud resources [@cloudvstrad].
-Companies like Amazon, Google, Microsoft Azure, and others provide cloud computing resources. **Somewhere these companies have clusters of computers that paying customers use through the internet.** In addition to these commercial options, there are newer national government funded resource options like [Jetstream](https://portal.xsede.org/jetstream) (described in the next section). We will compare computing options in another chapter coming up.
+Companies like Amazon, Google, Microsoft Azure, and others provide cloud computing resources. **Somewhere these companies have clusters of computers that paying customers use through the internet.** In addition to these commercial options, there are occasionally national government funded resource options like Texas Advanced Computing Center (TACC) and others previously funded by the former project called [XSEDE](https://portal.xsede.org/) (described in the next section). We will compare computing options in another chapter coming up.
@@ -308,7 +308,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHE
 It's important to remember that all of the shared computing options that we previously described involve a [data center](https://en.wikipedia.org/wiki/Data_center) where are large number of computers are physically housed.
-```{r, fig.align='center', echo = FALSE, fig.alt= "Examples of servers or shared computers include clusters that may exist at your institution or national computing resources like Xsede.", out.width= "100%"}
+```{r, fig.align='center', echo = FALSE, fig.alt= "Examples of servers or shared computers include clusters that may exist at your institution or national computing resources", out.width= "100%"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHEAbES1Agjy7Ex2IpVAoUIoBFbsq0/edit#slide=id.gf9c252d058_0_23")
 ```
@@ -319,26 +319,30 @@ You may have access to a [HPC (which stands for High Performance Computing) clus
 If your university or institution has a HPC [cluster](https://en.wikipedia.org/wiki/Computer_cluster), this means that they have a group of computers acting like a server that people can use to store data or assist with intensive computations. Often institutions can support the cost of many computers within an HPC cluster. This means that multiple computers will simultaneously perform different parts of the computing required for a given task, thus significantly speeding up the process compared to you trying to perform the task on just your computer!
-If your institute doesn't have a shared computing resource like the HPCs we just described, you could also consider a national resource option like [Xsede](https://www.xsede.org/).
-[Xsede](https://www.xsede.org/) is led by the University of Illinois National Center for Supercomputing Applications (NCSA) and includes 18 other partnering institutions (which are mostly other universities). Through this partnership, they currently support 16 supercomputers. Universities and non-profit researchers in the United States can request access to their computational and data storage resources. See [here](https://portal.xsede.org/allocations/resource-info) for descriptions of the available resources.
+If your institute doesn't have a shared computing resource like the HPCs we just described, you could also consider a national resource option like the [Texas Advanced Computing Center (TACC)](https://en.wikipedia.org/wiki/Texas_Advanced_Computing_Center) which was funded by the National Science Foundation (NSF) [XSEDE](https://www.xsede.org/) program.
+Universities and non-profit researchers in the United States can request access to their computational and data storage resources. Other resource options include:
+- [San Diego Supercomputer Center (SDSC)](https://www.sdsc.edu/) at the University of California, San Diego
+- [National Institute for Computational Sciences (NICS)](https://www.nics.tennessee.edu/), at the University of Tennessee, Knoxville
+- [Pittsburgh Supercomputing Center (PSC)](https://www.psc.edu/) at the Carnegie Mellon University and University of Pittsburgh
-Here you can see a photo of Stampede2, one of the supercomputers that members of Xsede can utilize.
-```{r, fig.align='center', echo = FALSE, fig.alt= "An image of Stampede2 one of the supercomputers that members of Xsede can use.", out.width= "100%"}
+Here you can see a photo of Stampede2, one of the supercomputers that members of TACC could utilize (it has now been replaced with Stampede3).
+
+```{r, fig.align='center', echo = FALSE, fig.alt= "An image of Stampede2 one of the supercomputers that members of TACC could use.", out.width= "100%"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHEAbES1Agjy7Ex2IpVAoUIoBFbsq0/edit#slide=id.gf9c252d058_0_63")
 ```
-[[source](https://www.xsede.org/ecosystem/resources)]
+[[source](https://www.xsede.org/)]
 > Stampede2, generously funded by the National Science Foundation (NSF) through award ACI-1134872, is one of the Texas Advanced Computing Center (TACC), University of Texas at Austin's flagship supercomputers.
-See [here](https://portal.xsede.org/tacc-stampede2) for more information about how you could possibly connect to and utilize Stampede2.
+See [this article about Stampede2 and the transition to Stampede3](https://tacc.utexas.edu/news/latest-news/2023/07/24/taccs-new-stampede3-advances-nsf-supercomputing-ecosystem/) for more information about their resources and see [their getting started website](https://tacc.utexas.edu/use-tacc/getting-started) on how you could possibly use their resources.
-Importantly when you use shared computers like national resources like Stampede2 available through Xsede, as well as institutional HPCs, you will share these resources with many other people and so you need to learn the proper etiquette for using and sharing these resources. We will discuss this more in a coming chapter.
+Importantly when you use shared computers like national resources like [Stampede2](https://tacc.utexas.edu/systems/stampede2/) and [Stampede3](https://docs.tacc.utexas.edu/hpc/stampede3/), as well as institutional HPCs, you will share these resources with many other people and so you need to learn the proper etiquette for using and sharing these resources. We will discuss this more in a coming chapter.
-However, there is also now an option to access the different XSEDE computing resources through a cloud environment option called [Jetstream2](https://jetstream-cloud.org/).
+There is also an option to access national computing resources through a cloud environment option called [Jetstream2](https://jetstream-cloud.org/).
 Here is a video about Jetstream2:
@@ -348,7 +352,6 @@ knitr::include_url("https://www.youtube.com/embed/NQ3flxJANTw")
-
 We will also discuss how the use of these various computing options differ in the next chapters. Importantly there are also some computing platforms that have been especially designed for scientists and specific types of researchers, so it is also useful to know about these options.
@@ -367,6 +370,6 @@ In conclusion, here are some of the major take-home messages:
 7) A supercomputer is a computer that has much more storage, memory, and computing capacity than a typical personal computer. Supercomputers are generally much more expensive than using a group of more typical computers that together would have the same collective computing and storage capacity.
 8) There are two general types of servers: clusters and grids. Cluster approaches work by having several computers working on pieces of the same task simultaneously in a method called parallel computing. Grid approaches work by having different types of computers working on different tasks.
 9) Cloud computing is essentially the use of many servers accessed through the internet. This is often more reliable because there are many servers to use, even if one other users are performing large tasks or if a server goes down. We will talk more about the pros and cons of this option in the coming chapters.
-10) If your institute doesn't provide you access to a shared computing resource and you don't want to use a commercial cloud option, you could consider options like [Xsede](https://www.xsede.org/) and or [Jetstream2](https://jetstream-cloud.org/), which is a national resource that you can request access to.
+10) If your institute doesn't provide you access to a shared computing resource and you don't want to use a commercial cloud option, you could consider options like [TACC](https://en.wikipedia.org/wiki/Texas_Advanced_Computing_Center) and or [Jetstream2](https://jetstream-cloud.org/), which is a national resource that you can request access to.
diff --git a/05-Shared_computing_etiquette.Rmd b/05-Shared_computing_etiquette.Rmd
index d853dca..48d0257 100644
--- a/05-Shared_computing_etiquette.Rmd
+++ b/05-Shared_computing_etiquette.Rmd
@@ -30,7 +30,7 @@ Each cluster or other shared computing resource will have different rules and re
 One major aspect to consider is keeping the computers in the cluster safe from harm. You wouldn't want to lose your precious data stored on the cluster and neither would your colleagues!
- - Use a good [secure password](https://its.lafayette.edu/policies/strongpasswords/) that is not easy for someone else to guess.
+ - Use a good [secure password](https://help.lafayette.edu/guidelines-for-strong-passwords/) that is not easy for someone else to guess.
 Some people suggest using sentences that are easy for you to remember, you could consider a line of lyrics from song or poem that you like, or maybe a movie. Modify part of it to include symbols and numbers [@passwords].
@@ -138,7 +138,7 @@ Typically a program is used to schedule jobs. Remember that jobs are the individ
 Such job scheduling programs assign jobs to available node resources as they become available and if they have the required resources to meet the job. These programs have their own commands for running jobs, checking resources, and checking jobs. Remember to use the management system to run your jobs using the compute nodes not the login nodes (nodes for users to log in). There are often nodes set up for transferring files as well.
-In the case of the JHPCE, a program called Sun Grid Engine (SGE) is used, but there are others job management programs. See [here](https://jhpce.jhu.edu/wp-content/uploads/2021/06/JHPCE-Overview-2021-10.pdf) for more information on how people use SGE for the JHPCE shared resource.
+In the case of the JHPCE, a program called Sun Grid Engine (SGE) is used, but there are others job management programs. See [here](https://jhpce.jhu.edu/orient/images/sge-orient.pdf) for more information on how people use SGE for the JHPCE shared resource.
 ### Specifying memory (RAM) needs
diff --git a/06-General_Platforms.Rmd b/06-General_Platforms.Rmd
index 106d72e..d5ec267 100644
--- a/06-General_Platforms.Rmd
+++ b/06-General_Platforms.Rmd
@@ -48,7 +48,7 @@ The [ISB-CRC](https://isb-cgc.appspot.com/) platform allows users to browse and
 ### Galaxy
-This section was written by [Jeremy Goecks](https://goeckslab.org/people/jeremy.html):
+This section was written by [Jeremy Goecks](https://www.goeckslab.org/members/jeremy-goecks.html):
 Galaxy is a web-based computational workbench that connects analysis tools, biomedical datasets, computing resources, a graphical user interface, and a programmatic API. Galaxy (https://galaxyproject.org/) enables accessible, reproducible, and collaborative biomedical data science by anyone regardless of their informatics expertise. There are more than 8,000 analysis tools and 200 visualizations integrated into Galaxy that can be used to process a wide variety of biomedical datasets. This includes tools for analyzing genomic, transcriptomic (RNA-seq), proteomic, metabolomic, microbiome, and imaging datasets, tool suites for single-cell omics and machine learning, and thousands of more tools. Galaxy’s graphical user interface can be used with only a web browser, and there is a programmatic API for performing scripted and automated analyses with Galaxy.
@@ -112,9 +112,9 @@ It relies on Terra for the cloud based compute environment, Dockstore for stand
 ## CyVerse
-[CyVerse](https://cyverse.rocks/about) is a similar computing platform that also offers computing resources for storing, sharing, and working with data with a graphical interface, as well as an API. Computing was previously offered using the cloud computing platform from CyVerse called [Atmosphere](https://cyverse.org/refocusing-atmosphere-to-support-cloud-native-development), which relied on users using virtual machines. Users will now use a new version of Atmosphere with partnership with [Jetstream](https://jetstream-cloud.org/). This allows users to use containers for easier collaboration and also offers US users more computing power and storage. Originally called iPlant Collaborative, it was started by a funding from the National Science Foundation (NSF) to support life sciences research, particularly to support ecology, biodiversity, sustainability, and agriculture research. It is led by the University of Arizona, the Texas Advanced Computing Center, and Cold Spring Harbor Laboratory. It offers access to an environment for performing analyses with Jupyter (for Python mostly) and RStudio (for R mostly) and a variety of tools for Genomic data analysis. See [here](https://cyverse.atlassian.net/wiki/spaces/DEapps/pages/241882146/List+of+Applications) for a list of applications that are supported by CyVerse. Note that you can also install tools on both platforms. Both CyVerse and Galaxy offer lots of helpful documentation, to help users get started with informatics analyses.
+[CyVerse](https://cyverse.rocks/about) is a similar computing platform that also offers computing resources for storing, sharing, and working with data with a graphical interface, as well as an API. Computing was previously offered using the cloud computing platform from CyVerse called [Atmosphere](https://cyverse.org/news/refocusing-atmosphere-support-cloud-native-development), which relied on users using virtual machines. Users will now use a new version of Atmosphere with partnership with [Jetstream](https://jetstream-cloud.org/). This allows users to use containers for easier collaboration and also offers US users more computing power and storage. Originally called iPlant Collaborative, it was started by a funding from the National Science Foundation (NSF) to support life sciences research, particularly to support ecology, biodiversity, sustainability, and agriculture research. It is led by the University of Arizona, the Texas Advanced Computing Center, and Cold Spring Harbor Laboratory. It offers access to an environment for performing analyses with Jupyter (for Python mostly) and RStudio (for R mostly) and a variety of tools for Genomic data analysis. See [here](https://cyverse.atlassian.net/wiki/spaces/DEapps/pages/241882146/List+of+Applications) for a list of applications that are supported by CyVerse. Note that you can also install tools on both platforms. Both CyVerse and Galaxy offer lots of helpful documentation, to help users get started with informatics analyses.
-See [here](https://learning.cyverse.org/en/latest/) to learn more.
+See [here](https://learning.cyverse.org/) to learn more.
 ```{r, fig.align='center', echo = FALSE, fig.alt= "CyVerse graphical interface for performing analyses", out.width= "100%"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHEAbES1Agjy7Ex2IpVAoUIoBFbsq0/edit#slide=id.gfd56752f25_0_0
diff --git a/07-Computing_Decisions.Rmd b/07-Computing_Decisions.Rmd
index 0e5adbb..97a5f2d 100644
--- a/07-Computing_Decisions.Rmd
+++ b/07-Computing_Decisions.Rmd
@@ -111,7 +111,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHE
 If you plan on working with multiple data modalities like for example both imaging and genomic data, consider computing options that are flexible for such analyses. Some cloud computing options designed for research are more general and supportive of this, such as [Galaxy](https://galaxyproject.org/) to some extent, as well as to a larger extent [SciServer](https://jhudatascience.org/Computing_for_Cancer_Informatics/research-platforms.html), [Jetstream](https://jetstream-cloud.org/), or [CyVerse](https://cyverse.rocks/about).
-```{r, fig.align='center', echo = FALSE, fig.alt= "If you might use a variety of types of data, local shared resources or more general remote shared resources allow for this. This can be important to pay attention to when considering which shared resource option to choose. SciServer, CyVerse, XSEDE and Jetstream are examples of platforms that provide allow for storage and usage of multiple types of data.", out.width="100%"}
+```{r, fig.align='center', echo = FALSE, fig.alt= "If you might use a variety of types of data, local shared resources or more general remote shared resources allow for this. This can be important to pay attention to when considering which shared resource option to choose. SciServer, CyVerse, TACC and Jetstream are examples of platforms that provide allow for storage and usage of multiple types of data.", out.width="100%"}
 ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHEAbES1Agjy7Ex2IpVAoUIoBFbsq0/edit#slide=id.g117b5133acc_71_114")
 ```
@@ -179,7 +179,7 @@ ottrpal::include_slide("https://docs.google.com/presentation/d/1B4LwuvgA6aUopOHE
 ## Local shared resources vs remote shared resources
-Often the first decision about cloud computing resources is based on determining if your personal computer can handle your work. If you have already determined that you indeed need more computing power than your personal lab computers, the next decision is between local shared resources (like an institutional server) and remote shared resources, which include more traditional sharing resource options such as XSEDE, as well as more modern cloud computing options. See @fischer_jetstream_2019 for more information about the costs and benefits of using a cloud option like [Jetstream](https://jetstream-cloud.org/) for your computational work. We will now discuss some questions that you might ask yourself to make the decision between local and remote shared computing resources.
+Often the first decision about cloud computing resources is based on determining if your personal computer can handle your work. If you have already determined that you indeed need more computing power than your personal lab computers, the next decision is between local shared resources (like an institutional server) and remote shared resources, which include more traditional sharing resource options such as [TACC](https://tacc.utexas.edu/), as well as more modern cloud computing options. See @fischer_jetstream_2019 for more information about the costs and benefits of using a cloud option like [Jetstream](https://jetstream-cloud.org/) for your computational work. We will now discuss some questions that you might ask yourself to make the decision between local and remote shared computing resources.
 ```{r, fig.align='center', echo = FALSE, fig.alt= "Choosing between local and remote shared resource options.", out.width="100%"}
diff --git a/book.bib b/book.bib
index 2ba7a30..6cd9963 100644
--- a/book.bib
+++ b/book.bib
@@ -776,7 +776,7 @@ @article{hinkson_comprehensive_2017
 @misc{visions_women_2017,
  title = {Women pioneered computer programming. {Then} men took their industry over.},
- url = {https://timeline.com/women-pioneered-computer-programming-then-men-took-their-industry-over-c2959b822523},
+ url = {https://pages.memoryoftheworld.org/library/Josh%20O%27Connor/Women%20pioneered%20computer%20programming.%20Then%20men%20took%20their%20industry%20over_%20%28321%29/Women%20pioneered%20computer%20programming.%20Then%20-%20Josh%20O%27Connor.pdf},
  abstract = {How “computer girls” gave way to tech bros},
  language = {en},
  urldate = {2021-12-14},
@@ -1237,4 +1237,4 @@ @misc{scott_collection_2016
  author = {Scott, Jim},
  year = {2016}
 }
-
\ No newline at end of file
+
diff --git a/resources/exclude_files.txt b/resources/exclude_files.txt
index 5525a40..22c440e 100644
--- a/resources/exclude_files.txt
+++ b/resources/exclude_files.txt
@@ -6,3 +6,4 @@ CONTRIBUTING.md
 LICENSE.md
 code_of_conduct.md
 README.md
+getting_started.md
diff --git a/resources/ignore-urls.txt b/resources/ignore-urls.txt
new file mode 100644
index 0000000..2e10bc2
--- /dev/null
+++ b/resources/ignore-urls.txt
@@ -0,0 +1 @@
+https://en.wikipedia.org/wiki/Adder_(electronics