diff --git a/docs/no_toc/01-intro.md b/docs/no_toc/01-intro.md index 8a9bce7b..6ed130fd 100644 --- a/docs/no_toc/01-intro.md +++ b/docs/no_toc/01-intro.md @@ -3,7 +3,7 @@ # Introduction -Title image: Choosing Genomics Tools Written by: Candace Savonen. Part of the ITN (ITCR training Network) and created through the Johns Hopkins Data Science Lab +Title image: Choosing Genomics Tools Written by: Candace Savonen. Part of the ITN (ITCR training Network) and created through the Johns Hopkins Data Science Lab This is a *living* course meaning it is constantly changing and being updated. The goal for this course is to be a "wikipedia" of omic data. If you'd like to contribute, [you can file a pull request on GitHub](https://github.com/fhdsl/Choosing_Genomics_Tools) if you are comfortable with that sort of thing or email `csavonen@fredhutch.org` to ask how to get started. @@ -18,11 +18,11 @@ _This course is written for individuals who:_ - Want a basic overview of genomic data types. - Want to find resources for processing and interpreting genomics data. -For individuals who: Have genomic data and don’t know what to do with it. Want a basic overview of their genomic data type. Want to find resources for processing and interpreting genomics data +For individuals who: Have genomic data and don’t know what to do with it. Want a basic overview of their genomic data type. Want to find resources for processing and interpreting genomics data ## Topics covered: - + ## Motivation @@ -33,17 +33,17 @@ Often students and researchers need to utilize genomic data to reach the next st Often researchers receive their genomic data processed from another lab or institution, and although they are excited to gain insights from it to inform the next steps of their research, they may not have a practical understanding of how the data they have received came to be or what needs to be done with it. 
-This researcher is very excited because they’ve received their genomic data and are ready to gain insights from it to inform the next steps of their research. An email sent to them says ‘your data are ready’ +This researcher is very excited because they’ve received their genomic data and are ready to gain insights from it to inform the next steps of their research. An email sent to them says ‘your data are ready’ As an example, data file formats may not have been covered in their training, and the data they received seems unintelligible and not as straightforward as they hoped. -The researcher may attempt to open their newly received genomic data and be completely perplexed by the file formats or what these data even represent. The researcher says ‘What is this and what do I do with it’ +The researcher may attempt to open their newly received genomic data and be completely perplexed by the file formats or what these data even represent. The researcher says ‘What is this and what do I do with it’ This course attempts to give this researcher the basic bearings and resources regarding their data, in hopes that they will be equipped and informed about how to obtain the insights for their researcher they originally aimed to find. ## Curriculum -Overall Course Learning Objectives. This course will demonstrate how too: Understand the overall workflow associated with processing their genomic data  Be aware of caveats based on their specific type of data. Find tutorials to help them process their genomic data. Choose tools for processing their genomic data. Choose tools for interpreting their genomic data +Overall Course Learning Objectives. This course will demonstrate how too: Understand the overall workflow associated with processing their genomic data  Be aware of caveats based on their specific type of data. Find tutorials to help them process their genomic data. Choose tools for processing their genomic data. 
Choose tools for interpreting their genomic data **Goal of this course:** Equip learners with tutorials and resources so they can understand and interpret their genomic data in a way that helps them meet their goals and handle the data properly. diff --git a/docs/no_toc/02-genomics_overview.md b/docs/no_toc/02-genomics_overview.md index 095061f0..f13018c3 100644 --- a/docs/no_toc/02-genomics_overview.md +++ b/docs/no_toc/02-genomics_overview.md @@ -4,7 +4,7 @@ ## Learning Objectives -Learning objectives This chapter will demonstrate how to: Understand what will be covered in this course. Find information about your particular file format +Learning objectives This chapter will demonstrate how to: Understand what will be covered in this course. Find information about your particular file format In this chapter we are going to cover sequencing and microarray workflows at a very general high level overview to give you a first orientation. As we dive into specific data types and experiments, we will get into more specifics. Here we will cover the most common file formats. If you have a file format you are dealing with that you don't see listed here, it may be specific to your data type and we will discuss that more in that data type's respective chapter. We still suggest you go through this chapter to give you a basic understanding of commonalities of all genomic data types and workflows @@ -13,7 +13,7 @@ Here we will cover the most common file formats. If you have a file format you a In the most general sense, all genomics data when originally collected is raw, it needs to undergo processing to be normalized and ready to use. Then normalized data is generally summarized in a way that is ready for it to be further consumed. Lastly, this summarized data is what can be used to make inferences and create plots and results tables. -In the most general sense, all genomics data when originally collected is raw, it needs to undergo processing to be normalized and ready to use. 
Then normalized data is generally summarized in a way that is ready for it to be further consumed. Lastly this summarized data is what can be used to make inferences and create plots and results tables. +In the most general sense, all genomics data when originally collected is raw, it needs to undergo processing to be normalized and ready to use. Then normalized data is generally summarized in a way that is ready for it to be further consumed. Lastly this summarized data is what can be used to make inferences and create plots and results tables. ### Basic file formats diff --git a/docs/no_toc/03-whats-metadata.md b/docs/no_toc/03-whats-metadata.md index 33258c57..ecf40a79 100644 --- a/docs/no_toc/03-whats-metadata.md +++ b/docs/no_toc/03-whats-metadata.md @@ -5,7 +5,7 @@ ## Learning Objectives -Learning objectives This chapter will demonstrate how to: Understand what metadata are and why they are so critical. Learn the basics of creating crystal clear, readable metadata +Learning objectives This chapter will demonstrate how to: Understand what metadata are and why they are so critical. Learn the basics of creating crystal clear, readable metadata ## What are metadata? @@ -15,11 +15,11 @@ Metadata are critically important descriptive information about your data. Metadata describe how your data came to be, what organism or patient the data are from and include any and every relevant piece of information about the samples in your data set. -Question: What are metadata? Answer: Anything and everything that should be known about your samples! Samples labeled A-H are in test tubes. A corresponding spreadsheet has metadata such as mouse id, processing date, treatment and etc. The researcher says ‘I know everything I need to know about these samples from their metadata!’ +Question: What are metadata? Answer: Anything and everything that should be known about your samples! Samples labeled A-H are in test tubes. 
A corresponding spreadsheet has metadata such as mouse id, processing date, treatment and etc. The researcher says ‘I know everything I need to know about these samples from their metadata!’ Metadata includes but isn't limited to, the following example categories: -Examples of metadata categories: Patient/organism of origin, Patient/organism information including: Demographics, Disease state, Treatment state, Time point (if applicable). Metadata also includes: Processing information like: Batch information and Processing details (for example: Isolation methods: Poly-A vs Ribo-minus) Metadata is Anything that should be known about the samples and their handling! +Examples of metadata categories: Patient/organism of origin, Patient/organism information including: Demographics, Disease state, Treatment state, Time point (if applicable). Metadata also includes: Processing information like: Batch information and Processing details (for example: Isolation methods: Poly-A vs Ribo-minus) Metadata is Anything that should be known about the samples and their handling!
At this time it's important to note that if you work with human data or samples, your metadata will likely contain personally identifiable information (PII) and protected health information (PHI). It's critical that you protect this information! For more details on this, we encourage you to see our [course about data management](https://jhudatascience.org/Ethical_Data_Handling_for_Cancer_Research/data-privacy.html). @@ -74,13 +74,13 @@ Toward these two goals, [this excellent article](https://www.tandfonline.com/doi
Note that it is very dangerous to open gene data with Excel. According to @Ziemann2016, approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions. This happens because Excel silently converts gene symbols that resemble dates (for example, SEPT2 or MARCH1) into dates. We strongly caution against opening (and saving afterward) gene data in Excel. -‘Approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions’ Ziemann, Eren, El-Osta, 2016. On the left, a meme that shows Excel asking ‘is this a date?’ in response to seeing ‘any data at all’. +‘Approximately one-fifth of papers with supplementary Excel gene lists contain erroneous gene name conversions’ Ziemann, Eren, El-Osta, 2016. On the left, a meme that shows Excel asking ‘is this a date?’ in response to seeing ‘any data at all’.
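As a rough illustration of the Excel warning above, the sketch below flags gene symbols that look like a month followed by a day number, which is the shape of symbol most often reported as converted to a date. The pattern and the function names (`looks_date_like`, `flag_risky_symbols`) are assumptions made for this example. It is a heuristic, not a complete list of what Excel will change.

```python
import re

# Heuristic (an assumption for this sketch): a month-like prefix followed by
# one or two digits, such as SEPT2, MARCH1, or DEC1.
_DATE_LIKE = re.compile(
    r"^(JAN|FEB|MAR|MARCH|APR|MAY|JUN|JUL|AUG|SEP|SEPT|OCT|NOV|DEC)-?(\d{1,2})$",
    re.IGNORECASE,
)


def looks_date_like(symbol):
    """Return True if a gene symbol resembles a month plus a day number."""
    return _DATE_LIKE.match(symbol.strip()) is not None


def flag_risky_symbols(symbols):
    """Return the symbols that a spreadsheet may reinterpret as dates."""
    return [s for s in symbols if looks_date_like(s)]


if __name__ == "__main__":
    genes = ["SEPT2", "TP53", "MARCH1", "BRCA1", "DEC1"]
    print(flag_risky_symbols(genes))
```

Running a check like this on a gene list before sharing it can show whether the file is at risk. Keeping the data in plain text formats and reading it with a scripting language avoids the conversion altogether.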
### To recap: -Rules for creating metadata (from Broman & Woo, 2017) Be Consistent. Choose good names for things. Write Dates as YYYY-MM-DD.No Empty Cells. Put Just One Thing in a Cell. Make it a Rectangle +Rules for creating metadata (from Broman & Woo, 2017) Be Consistent. Choose good names for things. Write Dates as YYYY-MM-DD.No Empty Cells. Put Just One Thing in a Cell. Make it a Rectangle -Rules for creating metadata continued  (from Broman & Woo, 2017). Create a Data Dictionary. No Calculations in the Raw Data Files. Do Not Use Font Color or Highlighting as Data. Make Backups. Use Data Validation to Avoid Errors +Rules for creating metadata continued  (from Broman & Woo, 2017). Create a Data Dictionary. No Calculations in the Raw Data Files. Do Not Use Font Color or Highlighting as Data. Make Backups. Use Data Validation to Avoid Errors If you are not the person who has the information needed to create metadata, or you believe that another individual already has this information, make sure you get ahold of the metadata that correspond to your data. It will be critical for you to have to do any sort of meaningful analysis! diff --git a/docs/no_toc/04-considerations-for-choosing.md b/docs/no_toc/04-considerations-for-choosing.md index dcc31e08..bc9617df 100644 --- a/docs/no_toc/04-considerations-for-choosing.md +++ b/docs/no_toc/04-considerations-for-choosing.md @@ -5,7 +5,7 @@ ## Learning Objectives -This chapter will demonstrate how to: Recognize the key aspects of a tool that you should consider when constructing an analysis. Form questions to ask others for advice regarding your data +This chapter will demonstrate how to: Recognize the key aspects of a tool that you should consider when constructing an analysis. 
Form questions to ask others for advice regarding your data ## Overview @@ -13,7 +13,7 @@ In this course, we will introduce you to the fundamentals of various data types We will discuss the following considerations you should gather information and otherwise ponder when comparing one or more tools for your analysis: -Considerations for choosing tools: Is it appropriate for your data type? Is in an interface or programming language you feel comfortable with? How much computing power do you have? Are there benchmarking papers that compare the tool options? Is the tool well documented and usable? Is the tool well-maintained? Is the tool generally accepted by the field? +Considerations for choosing tools: Is it appropriate for your data type? Is in an interface or programming language you feel comfortable with? How much computing power do you have? Are there benchmarking papers that compare the tool options? Is the tool well documented and usable? Is the tool well-maintained? Is the tool generally accepted by the field? ### Is this tool appropriate for your data type? diff --git a/docs/no_toc/05-general-data-analysis-tools.md b/docs/no_toc/05-general-data-analysis-tools.md index fb100034..1acba1c0 100644 --- a/docs/no_toc/05-general-data-analysis-tools.md +++ b/docs/no_toc/05-general-data-analysis-tools.md @@ -5,7 +5,7 @@ ## Learning Objectives -This chapter will demonstrate how to: Understand the difference between command line and GUI based applications. Understand what R and Python languages are. Find many links to resources where you can learn R or Python +This chapter will demonstrate how to: Understand the difference between command line and GUI based applications. Understand what R and Python languages are. 
Find many links to resources where you can learn R or Python ## Command Line vs GUI diff --git a/docs/no_toc/06-sequencing-data.md b/docs/no_toc/06-sequencing-data.md index 5f5f0023..7b068a75 100644 --- a/docs/no_toc/06-sequencing-data.md +++ b/docs/no_toc/06-sequencing-data.md @@ -9,7 +9,7 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f ## Learning Objectives -This chapter will demonstrate how to: Understand the very general basics of sequencing data collection and processing workflow. Understand the limitations and strengths of sequencing data in general. +This chapter will demonstrate how to: Understand the very general basics of sequencing data collection and processing workflow. Understand the limitations and strengths of sequencing data in general. In this section, we are going to discuss generalities that apply to all sequencing data. This is meant to be a "primer" for you which data-type specific chapters will build off of to give you more specific and practical steps and advice in regards to your data type. @@ -31,7 +31,7 @@ At the end of this process, base sequences are called for the samples (with vary ### Inherent biases -Sequence related biases GC bias - guanine and cytosine bond melts at higher temp - if a sequence has a lot of G’s and C’s Sequence complexity - certain sequences more likely to have primers bound to them (and more likely to be sequenced). Length bias - longer targets are more likely to be amplified or sequenced. These biases are worsened by PCR amplification! +Sequence related biases GC bias - guanine and cytosine bond melts at higher temp - if a sequence has a lot of G’s and C’s Sequence complexity - certain sequences more likely to have primers bound to them (and more likely to be sequenced). Length bias - longer targets are more likely to be amplified or sequenced. These biases are worsened by PCR amplification! Sequences are not all sequenced or amplified at the same rate. 
In a perfect world, we could take a simple snapshot of the genome we are interested in and know exactly what and how many sequences were in a sample. But in reality, sequencing methods and the resulting data always have some biases we have to be aware of and hopefully use methods that attempt to mitigate the biases. diff --git a/docs/no_toc/07-microarray-data.md b/docs/no_toc/07-microarray-data.md index 8d76f43d..b3db7658 100644 --- a/docs/no_toc/07-microarray-data.md +++ b/docs/no_toc/07-microarray-data.md @@ -9,7 +9,7 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f ## Learning Objectives -This chapter will demonstrate how to: Understand the very general basics of microarray data collection and processing workflow. Understand the limitations and strengths of microarray data in general. +This chapter will demonstrate how to: Understand the very general basics of microarray data collection and processing workflow. Understand the limitations and strengths of microarray data in general. ## Summary of microarrays diff --git a/docs/no_toc/08-annotating-genomes.md b/docs/no_toc/08-annotating-genomes.md index 49cd2c02..b6fd1602 100644 --- a/docs/no_toc/08-annotating-genomes.md +++ b/docs/no_toc/08-annotating-genomes.md @@ -9,7 +9,7 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f ## Learning Objectives -The learning objectives for this chapter are to: Understand the fundamentals of annotating genomic data. Be aware of how reference genomes and their versions affect annotation. Be able to find genomic annotation from the respective databases +The learning objectives for this chapter are to: Understand the fundamentals of annotating genomic data. Be aware of how reference genomes and their versions affect annotation. 
Be able to find genomic annotation from the respective databases In this chapter, we are going to discuss methods that affect every genomic method and may take up the majority of your time as a genomic data analyst: Annotation. @@ -21,7 +21,7 @@ Proper annotation requires an understanding of how the annotation data you are u Every individual organism has its own DNA sequence that is unique to it. So how can we compare organisms to each other? In some studies, sequencing data is obtained and the genome is built de novo (aka from scratch) but this takes a lot of time and computing power. So instead, most genomic studies use the imperfect method of comparing to a reference genome. Reference genomes are built from prior data and available online. They inherently have biases in them. For example, human genomes are generally not made from diverse populations but instead from mostly males of european descent. It is inherently bad for both ethical and scientific reasons to to have [genome references that are too white](https://www.sciencenews.org/article/genetics-race-dna-databases-reference-genome-too-white). For more on the problems with reference genomes, [read this](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1774-4). -Reference genomes are often used to make sense of genomic data through comparison. Here we are showing a screenshot of Ensembl's website which has many different organisms and file types +Reference genomes are often used to make sense of genomic data through comparison. Here we are showing a screenshot of Ensembl's website which has many different organisms and file types In summary, reference genomes are used for comparison and as a 'source of truth' of sorts, but its important to note that this method is biased and better alternatives need to be realized. 
@@ -29,7 +29,7 @@ In summary, reference genomes are used for comparison and as a 'source of truth' If you are familiar with software development, or have used any app before, you're familiar with software updates and releases. Similarly, the genome has updates and releases as continued cloning and assemblies of organisms teaches us more. In the image below we are showing an example of what a genome version may be noted as (note that different databases may have different terminology -- here we are showing the Genome Reference Consortium). You may also notice on their website it shows the date the genome version was released and what was fixed. -Genome assemblies are changed and updated over time much like software packages. +Genome assemblies are changed and updated over time much like software packages. The details of how genome versions are fixed and released are not really of concern for your data analysis. This is merely to explain that genomes change and what is most important in your analysis is that: @@ -40,7 +40,7 @@ The details of how genome versions are fixed and released are not really of conc Although we can't walk you through every organism and database set up, we will walkthrough the files and structure of one example here. -Reference genomes are often used to make sense of genomic data through comparison. Here we are showing a screenshot of Ensembl's website which has many different organisms and file types +Reference genomes are often used to make sense of genomic data through comparison. Here we are showing a screenshot of Ensembl's website which has many different organisms and file types In the above screenshot, [from Ensembl](https://useast.ensembl.org/info/data/ftp/index.html), it shows different organisms in the rows, but also a variety of different files across the columns. In this example, DNA reference to the DNA sequence of the organism's genome, but cDNA refers to complementary DNA -- aka DNA that has been reversed transcribed from RNA. 
If you are working with RNA data you may want to use the cDNA file. Whereas CDS files are referring to only coding sequences and ncRNA files are showing only non coding sequences. Most of these files are FASTA files. Gene sets are also their own annotation files called GTF or GFF files. Ensembl provides more [detailed information about what these files contain](https://useast.ensembl.org/info/website/upload/gff.html), but briefly, each row is a feature and has information describing that feature such as genomic locations, the relevant feature type (gene, coding sequence, pseudogene, etc.), and the gene ID or name. For a reminder on what these different file types are [see the previous chapter](http://hutchdatascience.org/Choosing_Genomics_Tools/a-very-general-genomics-overview.html#basic-file-formats). diff --git a/docs/no_toc/09-DNA.md b/docs/no_toc/09-DNA.md index dc3be96d..664bb9da 100644 --- a/docs/no_toc/09-DNA.md +++ b/docs/no_toc/09-DNA.md @@ -11,7 +11,7 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f ## Learning Objectives -Learning objectives This chapter will demonstrate how to: Understand the goals and data collection for DNA sequence collection and variant identification. Compare and contrast the following methods: DNA/SNP microarrays, Whole Genome Sequencing, Whole Exome Sequencing, and Targeted Sequencing +Learning objectives This chapter will demonstrate how to: Understand the goals and data collection for DNA sequence collection and variant identification. Compare and contrast the following methods: DNA/SNP microarrays, Whole Genome Sequencing, Whole Exome Sequencing, and Targeted Sequencing ## What are the goals of analyzing DNA sequences? @@ -35,7 +35,7 @@ There are several larger goals behind DNA sequencing experiments ranging from as ## Comparison of DNA methods -Comparing DNA Sequencing Techniques. The most common DNA sequencing techniques are described. 
Whole genome sequencing coverages all genes and non-coding DNA. 3.2 billion bases are covered when applied to human samples. This the most expensive of the techniques. Depth of coverage required for 99.9% sensitivity is 30X. Whole exome sequencing coverage is the exome or expressed genes. Approximately 45 million bases are sequenced. This is a cost-effective technique. The depth of coverage required for 99.9% sensitivity is 100X. Targeted gene panel sequencing coverages 50-500 genes. 20,000 to 62 million bases are sequenced. This is the most cost-effective technique. Depth of coverage is >500X. +Comparing DNA Sequencing Techniques. The most common DNA sequencing techniques are described. Whole genome sequencing coverages all genes and non-coding DNA. 3.2 billion bases are covered when applied to human samples. This the most expensive of the techniques. Depth of coverage required for 99.9% sensitivity is 30X. Whole exome sequencing coverage is the exome or expressed genes. Approximately 45 million bases are sequenced. This is a cost-effective technique. The depth of coverage required for 99.9% sensitivity is 100X. Targeted gene panel sequencing coverages 50-500 genes. 20,000 to 62 million bases are sequenced. This is the most cost-effective technique. Depth of coverage is >500X. There are four DNA sequencing methods discussed in this chapter. The above graph compares WGS, WXS, and Targeted gene sequencing. The last section compares all 4. 1. Whole genome sequencing (WGS) @@ -81,6 +81,6 @@ If your research question does not pertain to non-coding regions of the genome o Furthermore, if you are able to narrow down even further what regions are of interest this would be better in terms of cost and detection abilities. A targeted sequencing panel or DNA microarray are ideal for assaying known groups of targets. 
DNA microarrays are the least costly of all the methods to identify DNA variants, but with both targeted sequencing and DNA microarray you will need to find or create a custom probe or primer set. Ideally a probe or primer set that hits your regions of interest already exists commercially but if not, then you will have to design your own -- which also costs time and money. -There are three general methods we will discuss for evaluating DNA sequences. Whole Genome Sequencing (WGS) assays more of the genome than other methods but is much more costly and computationally intensive. Depending on your goals WGS may be overkill. SNP microarrays on the other hand, are much more cost effective but are not able to be used for exploratory purposes. Whole Exome Sequencing (WXS or WES) and other targeted sequencing methods allow you to survey regions of the genome in way that is more cost effective and potentially at higher depths. +There are three general methods we will discuss for evaluating DNA sequences. Whole Genome Sequencing (WGS) assays more of the genome than other methods but is much more costly and computationally intensive. Depending on your goals WGS may be overkill. SNP microarrays on the other hand, are much more cost effective but are not able to be used for exploratory purposes. Whole Exome Sequencing (WXS or WES) and other targeted sequencing methods allow you to survey regions of the genome in way that is more cost effective and potentially at higher depths. In these upcoming chapters we will discuss in more detail each of these methods, what the data represent, what you need to consider, and what resources you can consult for analyzing your data. diff --git a/docs/no_toc/09a-WGS-and-WXS.md b/docs/no_toc/09a-WGS-and-WXS.md index ef295688..b07e55f4 100644 --- a/docs/no_toc/09a-WGS-and-WXS.md +++ b/docs/no_toc/09a-WGS-and-WXS.md @@ -9,14 +9,14 @@ This chapter is in a beta stage. 
If you wish to contribute, please [go to this f ## Learning Objectives -The learning objectives for this course are to: 1 Define the uses and applications of WGS/WXS 2 Describe the steps for generating WGS/WXS data 3 Understand the data analysis workflow for WGS/WXS +The learning objectives for this course are to: 1 Define the uses and applications of WGS/WXS 2 Describe the steps for generating WGS/WXS data 3 Understand the data analysis workflow for WGS/WXS The learning objectives for this course are to explain the use and application of Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES/WXS) for genomics studies, outline the technical steps in generating WGS/WXS data, and detail the processing steps for analyzing and interpreting WGS/WXS data. **To familiarize yourself with sequencing methods as a whole, we recommend you read our [chapter on sequencing first](http://hutchdatascience.org/Choosing_Genomics_Tools/sequencing-data.html).** ## WGS and WGS Overview -Whole genome sequencing overview, Process of determining entirety of DNA sequence of organism’s genome at single time. Includes sequencing all chromosomal data and DNA from mitochondria. Used to identify functional variants associated with disease +Whole genome sequencing overview, Process of determining entirety of DNA sequence of organism’s genome at single time. Includes sequencing all chromosomal data and DNA from mitochondria. Used to identify functional variants associated with disease The difference between WGS and WXS sequencing is whether or not the open reading frames and thus coding regions are targeted in sequencing. WGS attempts to sequence the whole genome, while for WXS only exons with open reading frames are targeted for sequencing. Both of these methods can be massively beneficial for studying rare and complex diseases. Thus, whole genome sequencing is a technique to thoroughly analyze the entire DNA sequence of an organism's genome. 
This includes sequencing all genes both coding and non-coding and all mitochondrial DNA. WGS is beneficial for identifying new and previously established variants related to disease and the regulatory elements of the genome including promoters, enhancers, and silencers. Increasingly non-coding RNAs have also been identified to play a functional role in biological mechanisms and diseases. In order to learn more about the non-coding regions of the genome, WGS is necessary. @@ -25,7 +25,7 @@ Alternatively whole exome sequencing is used to sequence the coding regions of a ## Advantages and Disadvantages of WGS vs WXS -Advantages and Disadvantages of WGS as opposed to WXS: Most complete account of individual variation, Ability to study: Structural rearrangements, Copy number variations, Insertion-Deletions, SNPs, Sequencing repeats, Coding, non-coding, and mitochondrial genome coverage, allows for discovery - identify causative variants; Disadvantages include higher cost and more resources for storing and analyzing data +Advantages and Disadvantages of WGS as opposed to WXS: Most complete account of individual variation, Ability to study: Structural rearrangements, Copy number variations, Insertion-Deletions, SNPs, Sequencing repeats, Coding, non-coding, and mitochondrial genome coverage, allows for discovery - identify causative variants; Disadvantages include higher cost and more resources for storing and analyzing data We more thoroughly discuss how to choose DNA sequencing methods [here in the previous chapter](http://hutchdatascience.org/Choosing_Genomics_Tools/dna-methods.html), but we will briefly cover this here. Alternatives to WGS include Whole Exome Sequencing (WES/WXS), which sequences the open reading frame areas of the genome or Targeted Gene Sequencing where probes have been designed to sequence only regions of interest. 
@@ -33,7 +33,7 @@ The main advantages of WGS include the ability to comprehensively analyze all re ## WGS/WXS Considerations -WGS/WXS Considerations , Genome type/size, Coverage requirements, Tissue source: fresh tissue, FFPE, blood, Library preparation protocol: PCR vs PCR-free +WGS/WXS Considerations , Genome type/size, Coverage requirements, Tissue source: fresh tissue, FFPE, blood, Library preparation protocol: PCR vs PCR-free Some important considerations for WGS/WXS include: - What genome you are studying and the size of this genome. Included in this considerations is whether this genome has been sequenced before and you will have a "reference" genome to compare your data against or whether you will have to make a reference genome yourself. [This bioinformatics resource](https://eriqande.github.io/eca-bioinf-handbook/alignment-of-sequence-data-to-a-reference-genome-and-associated-steps.html) provides a great overview of genome alignment. @@ -52,19 +52,19 @@ For WXS or other targeted sequencing specifically (so not relevant to WGS data), ## DNA Sequencing Pipeline Overview -Pipeline overview: Step 1: DNA extraction from sample, Step 2: library preparation, Step 3: Sequencing, Step 4: Analysis including data processing from Fastq, aligning reads to generate a BAM file, identifying variants to create a final VCF file +Pipeline overview: Step 1: DNA extraction from sample, Step 2: library preparation, Step 3: Sequencing, Step 4: Analysis including data processing from Fastq, aligning reads to generate a BAM file, identifying variants to create a final VCF file In order to create WGS/WXS data, DNA is first extracted from a specific sample type (tissue, blood samples, cells, FFPE blocks, etc.). Either traditional (involving phenol and chloroform) or commercial kits can be used for this first step. Next, the DNA sequencing libraries are prepared. 
This involves fragmenting the DNA, adding sequencing adapters, and DNA amplification if the input DNA is not of sufficient quantity. Recall that for WXS After sequencing, data is analyzed by converting and aligning reads to generate a BAM file. Many analysis tools will use the BAM file to identify variants, which then generates a VCF file. More information about sequencing and BAM and VCF file generation can be found [here](http://hutchdatascience.org/Choosing_Genomics_Tools/sequencing-data.html) in the sequencing data chapter. ## Data Pre-processing -Data pre-processing pipeline overview: Raw data from sequencing is transformed into a Fastq file, reads are aligned and a Bam file is created, the data is sorted and merged, duplicates are identified, and the base quality score is recalibrated to create a final BAM file +Data pre-processing pipeline overview: Raw data from sequencing is transformed into a Fastq file, reads are aligned and a Bam file is created, the data is sorted and merged, duplicates are identified, and the base quality score is recalibrated to create a final BAM file Raw sequencing reads are first transformed into a fastq file (more information about fastq files can be found [here](http://hutchdatascience.org/Choosing_Genomics_Tools/sequencing-data.html) in the sequencing data chapter in the Quality Controls section). Then the sequencing reads are aligned to a reference genome to create a BAM file. This data is sorted and merged, and PCR duplicates are identified. The confidence that each base was called correctly is reflected in the base quality score. This score must be recalibrated at this step before variants are called. A final BAM file is thus created. This can be used for future analysis steps, including variant or mutation identification, which is outlined on the following slide.
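
To make the base quality score a little more concrete, here is a minimal sketch of how quality characters in a fastq record map to scores and error probabilities. It assumes the common Phred+33 encoding (an assumption on our part; check how your own files were encoded), and the function names are hypothetical helpers written for this illustration, not part of any pipeline mentioned in this chapter.

```python
def decode_quality_string(qual, offset=33):
    """Convert a fastq quality string into a list of Phred scores.

    The offset of 33 (Phred+33) is an assumption; older files may use 64.
    """
    return [ord(ch) - offset for ch in qual]


def phred_to_error_prob(q):
    """Return the probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


# Toy quality string, invented for illustration only
scores = decode_quality_string("II5+#")
print(scores)
print([phred_to_error_prob(q) for q in scores])
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to a 1 in 10,000 chance, which is why recalibrating these scores matters before variants are called.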
## Commonly Used Tools -Tools commonly used in WGS data analysis +Tools commonly used in WGS data analysis The following link provides the data analysis pipeline written by researchers in the NCI division of the NIH and provides a helpful overview of the typical steps necessary for [WGS analysis](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/). Here are many of the tools and resources used by researchers for analyzing WGS data. diff --git a/docs/no_toc/10-RNA.md b/docs/no_toc/10-RNA.md index 0fa4783a..21488653 100644 --- a/docs/no_toc/10-RNA.md +++ b/docs/no_toc/10-RNA.md @@ -9,19 +9,19 @@ This chapter is in a beta stage. Some of it has been written with AI tools. If y ## Learning Objectives -Learning objectives This chapter will demonstrate how to: Understand the goals and data collection processes for gene expression assays. Compare and contrast the following methods: Bulk RNA-seq, Single cell RNA-seq, Gene expression microarrays +Learning objectives This chapter will demonstrate how to: Understand the goals and data collection processes for gene expression assays. Compare and contrast the following methods: Bulk RNA-seq, Single cell RNA-seq, Gene expression microarrays ## What are the goals of gene expression analysis? The goal of gene expression analysis is to quantify RNAs across the genome. This can signify the extent to which various RNAs are being transcribed in a particular cell. This can be informative for what kinds of activity a cell is undergoing and responding to. -The goal of gene expression analysis is to quantify RNAs on a genome wide level +The goal of gene expression analysis is to quantify RNAs on a genome wide level ## Comparison of RNA methods There are three general methods we will discuss for evaluating gene expression. RNA sequencing (whether bulk or single-cell) allows you to catch more targets than gene expression microarrays but is much more costly and computationally intensive. 
Gene expression microarrays have a lower dynamic range than RNA-seq generally but are much more cost effective. Spatial transcriptomics is the newest method on the block and has the ability to relate gene expression to tissue regions and subpopulations. -Gene expression microarrays are low cost and low computationally intensive. Bulk RNA-seq is higher cost, requires more computational resources but covers more targets than gene expression arrays. Single cell RNA-seq is higher cost, requires more computational resources but as opposed to Bulk RNA-seq gives single cell resolution. +Gene expression microarrays are low cost and low computationally intensive. Bulk RNA-seq is higher cost, requires more computational resources but covers more targets than gene expression arrays. Single cell RNA-seq is higher cost, requires more computational resources but as opposed to Bulk RNA-seq gives single cell resolution. ### Single-cell RNA-seq (scRNA-seq): diff --git a/docs/no_toc/10a-bulk-RNA-seq.md b/docs/no_toc/10a-bulk-RNA-seq.md index 6cba11da..f79cd13a 100644 --- a/docs/no_toc/10a-bulk-RNA-seq.md +++ b/docs/no_toc/10a-bulk-RNA-seq.md @@ -10,17 +10,17 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f ## Learning Objectives -This chapter will demonstrate how to: Understand the basics of RNA-Seq data collection and processing workflow. Identify the next steps for your particular RNA-seq data. Formulate questions to ask about your RNA-seq data +This chapter will demonstrate how to: Understand the basics of RNA-Seq data collection and processing workflow. Identify the next steps for your particular RNA-seq data. Formulate questions to ask about your RNA-seq data ## Where RNA-seq data comes from -Bulk RNA-seq data is generated by extracting total RNA and then isolating RNA specific species by either Poly-A selection, Ribo depletion, or size selection. The isolated RNA is then converted to cDNA so it is more stable for sequencing. 
This cDNA is used to construct a sequencing library. Lastly PCR amplification is used to make many copies to use for sequencing. +Bulk RNA-seq data is generated by extracting total RNA and then isolating RNA specific species by either Poly-A selection, Ribo depletion, or size selection. The isolated RNA is then converted to cDNA so it is more stable for sequencing. This cDNA is used to construct a sequencing library. Lastly PCR amplification is used to make many copies to use for sequencing. ## RNA-seq workflow In a very general sense, RNA-seq workflows involves first quantification/alignment. You will also need to conduct quality control steps that check the quality of the sequencing done. You may also want to trim and filter out data that is not trustworthy. After you have a set of reliable data, you need to normalize your data. After data has been normalized you are ready to conduct your downstream analyses. This will be highly dependent on the original goals and questions of your experiment. It may include dimension reduction, differential expression, or any number of other analyses. -In a very general sense, RNA-seq workflows involves first quantification/alignment. You will also need to conduct quality control steps that check the quality of the sequencing done. You may also want to trim and filter out data that is not trustworthy. After you have a set of reliable data, you need to normalize your data. After data has been normalized you are ready to conduct your downstream analyses. This will be highly dependent on the original goals and questions of your experiment. It may include dimension reduction, differential expression, or any number of other analyses. +In a very general sense, RNA-seq workflows involves first quantification/alignment. You will also need to conduct quality control steps that check the quality of the sequencing done. You may also want to trim and filter out data that is not trustworthy. 
After you have a set of reliable data, you need to normalize your data. After data has been normalized you are ready to conduct your downstream analyses. This will be highly dependent on the original goals and questions of your experiment. It may include dimension reduction, differential expression, or any number of other analyses. In this chapter we will highlight some of the more popular RNA-seq tools, that are generally suitable for most experiment data but there is no "one size fits all" for computational analysis of RNA-seq data [@Conesa2016]. You may find tools out there that better suit your needs than the ones we discuss here. @@ -34,7 +34,7 @@ In this chapter we will highlight some of the more popular RNA-seq tools, that a RNA-seq suffers from a lot of the common sequence biases which are further worsened by PCR amplification steps. We discussed some of the sequence biases in the [previous sequencing chapter](). -RNA-seq data has various biases introduced to the data upon data generation. RNA targets are more likely to be picked up if they are long, if they are from the 3 prime end, have a particular GC content and have a particular read start sequence. +RNA-seq data has various biases introduced to the data upon data generation. RNA targets are more likely to be picked up if they are long, if they are from the 3 prime end, have a particular GC content and have a particular read start sequence. These biases are nicely covered in [this blog by Mike Love](https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/) and we'll summarize them here: @@ -45,7 +45,7 @@ These biases are nicely covered in [this blog by Mike Love](https://mikelove.wor _Main Takeaway_: When looking for tools, you will want to see if the algorithms or options available attempt to account for these biases in some way. -When looking for tools, you will want to see if the algorithms or options available attempt to account for these biases in some way. 
+When looking for tools, you will want to see if the algorithms or options available attempt to account for these biases in some way. ## RNA-seq data considerations @@ -58,7 +58,7 @@ Most of the RNA in the cell is not mRNA or noncoding RNAs of interest, but inste [This blog by Sitools Biotech does a good summary](https://blog.sitoolsbiotech.com/2019/08/ribo-depletion-rna-seq-ribosomal-rna-depletion-method-works-best/) of the pros and cons of either selection method. -Poly A selection advantages: lower sequencing depth needed. Greater exonic coverage. Disadvantages of Poly A selection is that it does not detect non-polyA transcripts including miRNAs, snoRNAs, and some lncRNAs. It obtains less information on immature transcripts. It performs poorly for degraded RNA or Formalin-Fixed Paraffin-Embedded (FFPE) samples Bias towards 3’ end of transcripts. Cannot be used for prokaryotes. Ribo minus advantages are: It is able to detect small and non-polyadenylated RNAs. It detects long and short transcripts (no 3’ bias). It has better performance on degraded RNa or FFPE samples. It is applicable for prokaryotes. It can be applied toward other abundant RNA. The disadvantages of Ribo minus is that it will collect more intronic reads and immature RNAs (if you are not interested in those). And thus because of the greater quantity of the returned RNA pool. It requires greater sequencing depths. +Poly A selection advantages: lower sequencing depth needed. Greater exonic coverage. Disadvantages of Poly A selection is that it does not detect non-polyA transcripts including miRNAs, snoRNAs, and some lncRNAs. It obtains less information on immature transcripts. It performs poorly for degraded RNA or Formalin-Fixed Paraffin-Embedded (FFPE) samples Bias towards 3’ end of transcripts. Cannot be used for prokaryotes. Ribo minus advantages are: It is able to detect small and non-polyadenylated RNAs. It detects long and short transcripts (no 3’ bias). 
It has better performance on degraded RNa or FFPE samples. It is applicable for prokaryotes. It can be applied toward other abundant RNA. The disadvantages of Ribo minus is that it will collect more intronic reads and immature RNAs (if you are not interested in those). And thus because of the greater quantity of the returned RNA pool. It requires greater sequencing depths. ### Transcriptome mapping @@ -80,7 +80,7 @@ _Examples of pseudo aligners_: These strategies are discussed at greater length [in this excellent manuscript by Conesa et al, 2016](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8). -TopHat Uses an expectation-maximization approach that estimates transcript abundances. Cufflinks is designed to take advantage of PE reads, and may use GTF information to identify expressed transcripts, or can infer transcripts de novo from the mapping data alone. RSEM, eXpress, Sailfish, Kallisto, and Salmon - Quantify expression from transcriptome mapping and allocate multi-mapping reads among transcript and output within-sample normalized values corrected for sequencing biases. NURD Provides an efficient way of estimating transcript expression from SE reads with a low memory and computing cost. +TopHat Uses an expectation-maximization approach that estimates transcript abundances. Cufflinks is designed to take advantage of PE reads, and may use GTF information to identify expressed transcripts, or can infer transcripts de novo from the mapping data alone. RSEM, eXpress, Sailfish, Kallisto, and Salmon - Quantify expression from transcriptome mapping and allocate multi-mapping reads among transcript and output within-sample normalized values corrected for sequencing biases. NURD Provides an efficient way of estimating transcript expression from SE reads with a low memory and computing cost. 
### Abundance measures @@ -116,11 +116,11 @@ TPM has gained a popularity in recent years because it is more intuitive to unde > When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly. -When looking for analysis tools, pay attention to what abundance measures the tool expects to be given with respect to what transformations have already been done to your data (if any). +When looking for analysis tools, pay attention to what abundance measures the tool expects to be given with respect to what transformations have already been done to your data (if any). ### RNA-seq downstream analysis tools -For DESeq: Read count distribution assumes a negative binomial distribution. Raw counts are expected. Replicates are not dealt with.Normalization is done with respect to Library size. For edgeR: Read count distribution assumes Bayesian methods for negative binomial distribution Input is raw counts. Yes, can deal with replicates. Normalization is done with respect to Library size, TMM, RLE, Upperquartile. For baySeq: Read count distribution assumes bayesian methods for negative binomial distribution. Input is raw counts. Can deal with replicates. Library size, Quantile, TMM For NOISeq: Read count distribution is assumed to be Non-parametric. Input is raw or normalized counts Does not deal with replicates. Normalization is done with respect to Library size, RPKM, TMM, Upperquartile +For DESeq: Read count distribution assumes a negative binomial distribution. Raw counts are expected. Replicates are not dealt with.Normalization is done with respect to Library size. For edgeR: Read count distribution assumes Bayesian methods for negative binomial distribution Input is raw counts. Yes, can deal with replicates. 
Normalization is done with respect to Library size, TMM, RLE, Upperquartile. For baySeq: Read count distribution assumes bayesian methods for negative binomial distribution. Input is raw counts. Can deal with replicates. Library size, Quantile, TMM For NOISeq: Read count distribution is assumed to be Non-parametric. Input is raw or normalized counts Does not deal with replicates. Normalization is done with respect to Library size, RPKM, TMM, Upperquartile - [ComplexHeatmap](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html#:~:text=Complex%20heatmaps%20are%20efficient%20to,and%20supports%20various%20annotation%20graphics.) is great for visualizations - [DESEq2](https://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) and [edgeR](https://www.bioconductor.org/packages/release/bioc/html/edgeR.html) are great for differential expression analyses. diff --git a/docs/no_toc/10b-single-cell-RNA-seq.md b/docs/no_toc/10b-single-cell-RNA-seq.md index 450a8a12..cefcd671 100644 --- a/docs/no_toc/10b-single-cell-RNA-seq.md +++ b/docs/no_toc/10b-single-cell-RNA-seq.md @@ -9,17 +9,17 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f ## Learning Objectives -This chapter will demonstrate how to: Understand the basics of single cell RNA-Seq data collection and processing workflow. Identify the next steps for your particular single cell RNA-seq data. Formulate questions to ask about your single cell RNA-seq data +This chapter will demonstrate how to: Understand the basics of single cell RNA-Seq data collection and processing workflow. Identify the next steps for your particular single cell RNA-seq data. 
Formulate questions to ask about your single cell RNA-seq data ## Where single-cell RNA-seq data comes from -As opposed to bulk RNA-seq which can only tell us about tissue level and within patient variation, single-cell RNA-seq is able to tell us cell to cell variation in transcriptomics including intra-tumor heterogeneity +As opposed to bulk RNA-seq which can only tell us about tissue level and within patient variation, single-cell RNA-seq is able to tell us cell to cell variation in transcriptomics including intra-tumor heterogeneity As opposed to bulk RNA-seq, which can only tell us about tissue level and within patient variation, single-cell RNA-seq is able to tell us about cell to cell variation in transcriptomics, including intra-tumor heterogeneity. Single cell RNA-seq can give us cell level transcriptional profiles, whereas bulk RNA-seq masks cell to cell heterogeneity. If your research questions require cell-level transcriptional information, single-cell RNA-seq will be of interest to you. -Single cell RNA-seq can give us cell level transcriptional profiles. Whereas bulk RNA-seq masks cell to cell heterogeneity. +Single cell RNA-seq can give us cell level transcriptional profiles. Whereas bulk RNA-seq masks cell to cell heterogeneity. ## Single-cell RNA-seq data types @@ -30,9 +30,9 @@ There are broadly two categories of single-cell RNA-seq data methods we will dis Depending on your goals for your single cell RNA-seq analysis, you may want to choose one method over the other. -Full length single cell RNA-seq **Pros**: Can be paired end sequencing which has less 3' bias. More complete coverage of transcripts which may be better for transcript discovery purposes. Cons: Is not very efficient (96 wells per plate). Takes longer to run days/weeks depending on the sample size. Expensive. +Full length single cell RNA-seq **Pros**: Can be paired end sequencing which has less 3' bias. More complete coverage of transcripts which may be better for transcript discovery purposes.
Cons: Is not very efficient (96 wells per plate). Takes longer to run days/weeks depending on the sample size. Expensive. -Tag based single cell RNA-seq. Pros: Can profile up to millions of cells. Takes less computing power. File storage requirements are smaller. Much less expensive. Cons: More intense 3' bias. Coverage is not as deep as full length single cell RNA-seq +Tag based single cell RNA-seq. Pros: Can profile up to millions of cells. Takes less computing power. File storage requirements are smaller. Much less expensive. Cons: More intense 3' bias. Coverage is not as deep as full length single cell RNA-seq (Material borrowed from [@AlexsLemonade2022]). @@ -40,13 +40,13 @@ Depending on your goals for your single cell RNA-seq analysis, you may want to c Often Tag based single cell RNA-seq methods will include not only a cell barcode for cell identification but will also have a unique molecular identifier (UMI) for original molecule identification. The idea behind the UMIs is it is a way to have insight into the original snapshot of the cell and potentially combat PCR amplification biases. -Tag based single cell RNA-seq. Pros: Can profile up to millions of cells. Takes less computing power. File storage requirements are smaller. Much less expensive. Cons: More intense 3' bias. Coverage is not as deep as full length single cell RNA-seq +Tag based single cell RNA-seq. Pros: Can profile up to millions of cells. Takes less computing power. File storage requirements are smaller. Much less expensive. Cons: More intense 3' bias. Coverage is not as deep as full length single cell RNA-seq ## Single cell RNA-seq tools There are a lot of scRNA-seq tools for various steps along the way. -In a very general sense, single cell RNA-seq workflows involves first quantification/alignment. 
You will also need to conduct quality control steps that may involve using UMIs to check for what’s detected, detecting duplets, and using this information to filter out data that is not trustworthy. After you have a set of reliable data, you need to normalize your data. Single cell data is highly skewed - a lot of genes barely or not detected and a few genes that are detected a lot. After data has been normalized you are ready to conduct your downstream analyses. This will be highly dependent on the original goals and questions of your experiment. It may include dimension reduction, cell classification, differential expression, detecting cell trajectories or any number of other analyses. +In a very general sense, single cell RNA-seq workflows involves first quantification/alignment. You will also need to conduct quality control steps that may involve using UMIs to check for what’s detected, detecting duplets, and using this information to filter out data that is not trustworthy. After you have a set of reliable data, you need to normalize your data. Single cell data is highly skewed - a lot of genes barely or not detected and a few genes that are detected a lot. After data has been normalized you are ready to conduct your downstream analyses. This will be highly dependent on the original goals and questions of your experiment. It may include dimension reduction, cell classification, differential expression, detecting cell trajectories or any number of other analyses. In a very general sense, single cell RNA-seq workflows involves first quantification/alignment. You will also need to conduct quality control steps that may involve using UMIs to check for what’s detected, detecting doublets (also known as duplets), and using this information to filter out data that is not trustworthy. 
[Doublets are transcriptome data generated from two cells](https://bioconductor.org/books/3.15/OSCA.advanced/doublet-detection.html), and an undesired technical artifact when single cell RNA-seq workflows want data representing a single cell at a time. After you have a set of reliable data, you need to normalize your data. Single cell data is highly skewed - a lot of genes barely or not detected and a few genes that are detected a lot. After data has been normalized you are ready to conduct your downstream analyses. This will be highly dependent on the original goals and questions of your experiment. It may include dimension reduction, cell classification, differential expression, detecting cell trajectories or any number of other analyses. diff --git a/docs/no_toc/10c-spatial-transcriptomics.md b/docs/no_toc/10c-spatial-transcriptomics.md index a4b3719c..ed7b927d 100644 --- a/docs/no_toc/10c-spatial-transcriptomics.md +++ b/docs/no_toc/10c-spatial-transcriptomics.md @@ -2,13 +2,9 @@ # Spatial transcriptomics -::: warning -This chapter has currently been written by ChatGPT and has not been verified by experts. We need help writing and reviewing it! If you wish to contribute, please [go to this form](https://forms.gle/dqYgmKH8XXE2ohwD9) or our [GitHub page](https://github.com/fhdsl/Choosing_Genomics_Tools). -::: - ## Learning objectives -This chapter will demonstrate how to: Approach collection of spatial transcriptomics data and design a typical analysis pipeline. Adjust your analysis pipeline to the research question, opportunities, and limitations concerning you spatial transcriptomics project. Learn about the questions that can be addressed with spatial transcriptomics data +This chapter will demonstrate how to: Approach collection of spatial transcriptomics data and design a typical analysis pipeline. Adjust your analysis pipeline to the research question, opportunities, and limitations concerning you spatial transcriptomics project. 
Learn about the questions that can be addressed with spatial transcriptomics data ## What are the goals of spatial transcriptomic analysis? @@ -24,7 +20,7 @@ Spatial transcriptomics (ST) technologies have been developed as a solution to t There is a large diversity in approaches to spatially profile tissues. Some ST technologies allow profiling at coarse cellular resolution, where regions of interest (ROIs) are usually identified by a pathologist. These ROIs may include from tens of cells up to a few hundred (e.g., GeoMx @bergholtz2021best). Smaller ROI sizes can be found in other technologies such as Visium, where ROIs 55 µm in diameter (or "spots") often contain no more than 10 cells (). For finer cellular resolution, technologies such as MERFISH, SMI, or Xenium, among others, can measure gene expression in individual cells [@yue2023guidebook]. In general, there is a trade-off between the cellular resolution and molecular resolution, as the number of quantified genes and RNA molecules is lower in single-cell level spatial technologies compared to those at the ROI or spot level. In single-cell ST, often a panel of hundreds of genes is quantified, while in "mini-bulk" (ROI/spot) ST, it is possible to quantify genes at the whole transcriptome level. -A trade-off exists between the cellular resolution and molecular resolution in spatial transcriptomics. +A trade-off exists between the cellular resolution and molecular resolution in spatial transcriptomics. In addition to the differences in cellular and molecular resolution, there are fundamental differences in the chemistry used to count the RNA transcripts in the tissue [@wang2021spatial; @yue2023guidebook]. Capture or hybridization of RNA followed by sequencing, or fluorescent imaging are two of the most common techniques used in ST methods. Because of the large diversity in resolution and chemical procedures among ST technologies, data collection workflows are equally diverse.
Finally, each study poses specific questions that cannot be addressed with traditional scRNA-seq pipelines, requiring customized workflows. diff --git a/docs/no_toc/11-chromatin.md b/docs/no_toc/11-chromatin.md index 9c4021da..dbd2ed6d 100644 --- a/docs/no_toc/11-chromatin.md +++ b/docs/no_toc/11-chromatin.md @@ -11,7 +11,7 @@ In its existing form, this chapter has been written with AI and still needs furt ## Learning Objectives -This chapter will demonstrate how to: Understand the goals and data collection processes for chromatin assays. Compare and contrast ATAC-seq, Single cell ATAC-seq, ChIP-seq, CUT&RUN and CUT&Tag. +This chapter will demonstrate how to: Understand the goals and data collection processes for chromatin assays. Compare and contrast ATAC-seq, Single cell ATAC-seq, ChIP-seq, CUT&RUN and CUT&Tag. ## Why are people interested in chromatin? @@ -41,7 +41,7 @@ Therefore, understanding the mechanisms that regulate chromatin structure and fu ## Comparison of technologies -A table that compares all the technologies: +A table that compares all the technologies: ### ATAC-seq: diff --git a/docs/no_toc/11a-ATAC-Seq.md b/docs/no_toc/11a-ATAC-Seq.md index cbf404c9..cf9acd53 100644 --- a/docs/no_toc/11a-ATAC-Seq.md +++ b/docs/no_toc/11a-ATAC-Seq.md @@ -9,28 +9,28 @@ This chapter is incomplete! If you wish to contribute, please [go to this form]( ## Learning Objectives -Learning objectives This chapter will demonstrate how to: Understand the basics of ATAC-Seq data collection and processing workflow. Identify the next steps for your particular ATAC-Seq data. Formulate questions to ask about your ATAC-Seq data +Learning objectives This chapter will demonstrate how to: Understand the basics of ATAC-Seq data collection and processing workflow. Identify the next steps for your particular ATAC-Seq data. Formulate questions to ask about your ATAC-Seq data ## What are the goals of ATAC-Seq analysis? 
The goals of ATAC-seq are to identify the accessible regions of the genome in a particular set of samples. These data allow us to understand the relationships between the chromatin accessibility patterns and cell states, and to understand the mechanistic causes and consequences of these chromatin accessibility patterns. -What does accessibility to chromatin represent? In ATAC-seq we are able to sequence open chromatin and find out DNA sequences where chromatin is accessible for activity. +What does accessibility to chromatin represent? In ATAC-seq we are able to sequence open chromatin and find out DNA sequences where chromatin is accessible for activity. ATAC-seq data is generated by fragmenting the genome with the Tn5 endonuclease and sequencing the shorter DNA fragments. While most of the genome is associated with protein complexes that preclude the digestion of DNA by Tn5, some regions of the genome have accessible chromatin that can be cleaved by Tn5 resulting in short (<500bp) fragments. These regions of the genome are of biological interest as they are likely to harbor transcription factor binding sites and to constitute cis-regulatory elements, genomic regions that are involved in the regulation of gene expression. -Schematic of how Tn5 fragments open chromatin + inserts adapters. This step is important for the quick protocol and low required cell inputs of ATAC-seq +Schematic of how Tn5 fragments open chromatin + inserts adapters. This step is important for the quick protocol and low required cell inputs of ATAC-seq ### What questions can be answered with ATAC-seq? -What types of questions can we ask with ATAC-seq?What regions of the genome have accessible chromatin? How does accessibility differ between biological samples or change over time? What transcription factor motifs or transcription factor footprints can be found at accessible regions of interest? 
+What types of questions can we ask with ATAC-seq?What regions of the genome have accessible chromatin? How does accessibility differ between biological samples or change over time? What transcription factor motifs or transcription factor footprints can be found at accessible regions of interest? ## ATAC-Seq general workflow overview A basic ATAC-seq workflow involves mapping sequence reads to the genome, identifying peaks, assessing data quality, and identifying patterns of interest through clustering or identification of differentially accessible regions or other statistical means. -A basic ATAC-seq workflow involves mapping sequence reads to the genome, identifying peaks, assessing data quality, and identifying patterns of interest through clustering or identification of differentially accessible regions or other statistical means. +A basic ATAC-seq workflow involves mapping sequence reads to the genome, identifying peaks, assessing data quality, and identifying patterns of interest through clustering or identification of differentially accessible regions or other statistical means. ### Data quality metrics: @@ -38,13 +38,13 @@ A basic ATAC-seq workflow involves mapping sequence reads to the genome, identif #### Sequencing considerations: -Single end sequencing. Cheaper. OK for most standard applications. Paired-end sequencing. More expensive. Useful for looking at nucleosome positioning and transcription factor footprinting +Single end sequencing. Cheaper. OK for most standard applications. Paired-end sequencing. More expensive. Useful for looking at nucleosome positioning and transcription factor footprinting -Single vs. paired end sequencing. Single. Cheaper. OK for most standard applications. Paired-end sequencing. More expensive. Useful for looking at nucleosome positioning and transcription factor footprinting. Read length & read depth. 75bp or more read length (keep in mind nucleosomes are 147bp). ~50 million reads/sample usually recommended +Single vs. 
paired end sequencing. Single. Cheaper. OK for most standard applications. Paired-end sequencing. More expensive. Useful for looking at nucleosome positioning and transcription factor footprinting. Read length & read depth. 75bp or more read length (keep in mind nucleosomes are 147bp). ~50 million reads/sample usually recommended #### Pre-alignment QC: -Post-sequencing. Signal to noise ratio (link resources at end) Comparison with DNase hypersensitivity datasets (or other computational QC method- check current resources available) +Post-sequencing. Signal to noise ratio (link resources at end) Comparison with DNase hypersensitivity datasets (or other computational QC method- check current resources available) A tool like FastQC or similar should be used to check for GC content, read quality and length, and primer or adapter reads prior to alignment. Trimmomatic is a useful tool for removing primer and adapter sequences if they are present. ATAC-seq experiments should be sequenced with paired-end sequencing, and existing pipelines will expect paired-end. (2 files *_R1.fastq and *_R2.fastq) @@ -61,7 +61,7 @@ As for all DNA-sequencing based genomics technologies, a sufficient number of ma #### Post-alignment QC: -Post-sequencing. Signal to noise ratio (link resources at end) Comparison with DNase hypersensitivity datasets (or other computational QC method- check current resources available) +Post-sequencing. Signal to noise ratio (link resources at end) Comparison with DNase hypersensitivity datasets (or other computational QC method- check current resources available) Post alignment: check percent of matched, unmatched, unpaired and duplicated reads. Reads which are duplicated or unmatched should be filtered out. [Picard](https://broadinstitute.github.io/picard/) is a useful tool for this step. 
@@ -71,7 +71,7 @@ Reads on the + strand should be shifted +4bp, reads on the - strand should be sh ATAC-seq data is often generated using paired end sequencing technologies, which allow for characterization of ATAC-seq fragments. Histograms of these distributions using single base pair resolution bins reveal patterns of enrichment relative to the nucleosome scale of 147bp and the DNA-helix scale ~10.5bp. -Considerations for quality data: QC checkpoints. Pre-sequencing. Library distribution +Considerations for quality data: QC checkpoints. Pre-sequencing. Library distribution When comparing ATAC-seq samples, it is important to consider the fragment size distributions of the samples being compared. Differences in the distributions could lead to results that are unrelated to biology. @@ -81,7 +81,7 @@ When comparing ATAC-seq samples, it is important to consider the fragment size d ATAC-seq peak calling typically makes use of analysis tools developed for ChIP-seq. MACS2 is one of the most common choices for a peak calling tool, but HOMER or other common ChIP-seq peak callers are also acceptable. An input sample is not typically generated for ATAC-seq as it would be for a ChIP-seq experiment, so the major requirement for the peak caller is that it does not require the input control to call peaks. -Overview of ATAC-seq data analysis pipeline +Overview of ATAC-seq data analysis pipeline #### Number of peaks: Although the number of accessible chromatin regions can vary from one cell type to another, there are several regions that appear to be constitutively accessible across most cell types. At least 20,000 peaks can be identified in a high quality experiment. The deeper the sequencing, the more peaks will be detected in an ATAC-seq experiment. At a very high sequencing depth some of the statistically significant peaks might not be of biological interest. 
In an analysis of such data sets the fold enrichment relative to background, or absolute peak signal, in addition to statistical significance, ought to be taken into account. diff --git a/docs/no_toc/11b-sc-ATAC-Seq.md b/docs/no_toc/11b-sc-ATAC-Seq.md index c4d2464a..4a39aa53 100644 --- a/docs/no_toc/11b-sc-ATAC-Seq.md +++ b/docs/no_toc/11b-sc-ATAC-Seq.md @@ -9,7 +9,7 @@ This chapter is incomplete! If you wish to contribute, please [go to this form]( ## Learning Objectives -Learning objectives This chapter will demonstrate how to: Understand the basics of single cell ATAC-Seq data collection and processing workflow Identify the next steps for your particular single cell ATAC-Seq data. Formulate questions to ask about your single cell ATAC-Seq data +Learning objectives This chapter will demonstrate how to: Understand the basics of single cell ATAC-Seq data collection and processing workflow Identify the next steps for your particular single cell ATAC-Seq data. Formulate questions to ask about your single cell ATAC-Seq data ## What are the goals of scATAC-seq analysis? diff --git a/docs/no_toc/11c-ChIP-Seq.md b/docs/no_toc/11c-ChIP-Seq.md index 14b4c891..ac6aaa4c 100644 --- a/docs/no_toc/11c-ChIP-Seq.md +++ b/docs/no_toc/11c-ChIP-Seq.md @@ -9,12 +9,12 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f ## Learning Objectives -Learning objectives This chapter will demonstrate how to: Understand the basics of ChIP-Seq data collection and processing workflow. Identify the next steps for your particular ChIP-Seq data. Formulate questions to ask about your ChIP-Seq data +Learning objectives This chapter will demonstrate how to: Understand the basics of ChIP-Seq data collection and processing workflow. Identify the next steps for your particular ChIP-Seq data. Formulate questions to ask about your ChIP-Seq data ## What are the goals of ChIP-Seq analysis? 
-The goal of ChIP-seq is to identify, for a particular DNA binding protein, all of the DNA sequences that it binds to. +The goal of ChIP-seq is to identify, for a particular DNA binding protein, all of the DNA sequences that it binds to. ChIP-Seq (chromatin immunoprecipitation sequencing) and related approaches are used to identify genome-wide binding sites of specific proteins or protein complexes. Given the diversity of interactions at the DNA-protein interface, sequencing-based methods for targeted chromatin capture have evolved to meet precise research needs and improve the quality of the results. Specifically, ChIP-Seq builds on protein immunoprecipitation techniques (IP) by applying next generation sequencing to a pulldown product. IP followed by sequencing can be applied to any nucleic-acid binding protein for which an antibody is available, including a known or putative transcription factor (TF), chromatin remodeler or histone modifications, or other DNA- or chromatin-specific factors. ChIP-Seq approaches have been honed to increase signal-to-noise, reduce input material, and more specifically map protein-DNA interactions, for example by treating the IP product with an exonuclease that chews back unprotected DNA ends (e.g. ChIP-exo). diff --git a/docs/no_toc/12-methylation.md b/docs/no_toc/12-methylation.md index 34bed4c1..c227800f 100644 --- a/docs/no_toc/12-methylation.md +++ b/docs/no_toc/12-methylation.md @@ -9,7 +9,7 @@ This chapter is incomplete! If you wish to contribute, please [go to this form]( ## Learning Objectives -This chapter will demonstrate how to: Understand the basics of bisulfite sequencing data collection and processing workflow. Identify the next steps for your particular bisulfite  sequencing data. Formulate questions to ask about your bisulfite sequencing data +This chapter will demonstrate how to: Understand the basics of bisulfite sequencing data collection and processing workflow.
Identify the next steps for your particular bisulfite  sequencing data. Formulate questions to ask about your bisulfite sequencing data ## What are the goals of analyzing DNA methylation? @@ -47,7 +47,7 @@ Because of this, its been proposed that the most appropriate way to model these ## Methylation data workflow -In a very general sense, methylation workflow involves sequence quality control and genome alignment like many other sequencing methods. But next, the data needs to be used to identify methylation calls and calculations of methylation fractions. Lastly, you will likely want to group the methylated bases together to identify what regions of the genome are differentially methylated and of interest. +In a very general sense, methylation workflow involves sequence quality control and genome alignment like many other sequencing methods. But next, the data needs to be used to identify methylation calls and calculations of methylation fractions. Lastly, you will likely want to group the methylated bases together to identify what regions of the genome are differentially methylated and of interest. Like other sequencing methods, you will first need to start with quality control checks. Next, you will also need to align your sequences to the genome. Then, using the base calls, you will need to make methylation calls -- which are methylated and which are not. The details of this step depend on whether you are measuring 5mC and/or 5hmC methylation calls. Lastly, you will likely want to use your methylation calls as a whole to identify differentially methylated regions of interest. diff --git a/docs/no_toc/13-microbiome.md b/docs/no_toc/13-microbiome.md index fc0707a4..95c4a3f1 100644 --- a/docs/no_toc/13-microbiome.md +++ b/docs/no_toc/13-microbiome.md @@ -9,7 +9,8 @@ This chapter is incomplete! 
If you wish to contribute, please [go to this form]( ## Learning Objectives -Learning Objectives +Learning Objectives + ## A Brief Introduction to Microbiomes @@ -22,14 +23,16 @@ Microbes are everywhere. We have found these tiny organisms in the deepest regio If we looked hard enough, I think we’d find them on the surface of the moon and Mars, though they are probably microbes who stowed away on our spacecraft and are now patiently waiting for a drop of water that may or may not ever show up. If we ever colonize those worlds, microbes will be an indispensable ally in creating an environment that could sustain us. -Learning Objectives +Learning Objectives This figure is adapted from [@Tignat-Perrier2022] under Creative Commons license. Microbes almost never live alone in the real world (i.e., outside of a laboratory). Rather they exist in communities of different species who are interacting with each other and their environment. Some of these communities will have many different types of organisms, and some will have only a few. Because of the large number of species and individuals involved, no two communities will ever be exactly alike, and quantifying differences between microbial communities is an important area of research at the moment. The types of interactions between organisms are also highly varied. These can include mutualistic relationships, where both organisms benefit from the interaction; parasitic relationships, where one organism exclusively benefits to the detriment of the other; and the full gradient in between. Microbiome science is everywhere. There are tens of articles published daily in the scientific literature, and many popular science articles and books present these findings to the world of non-scientists. 
Understanding the promises and limitations of the methods of microbiome science can help avoid misconceptions about microbiome research, and it’s important for practitioners of microbiome science to understand and convey the promise and limitations of our field. Misconceptions abound, frequently arising from the same sources as high-quality popular science microbiome reporting. - For example, on 5 Feb 2015 an article appeared in the New York Times noting (almost offhand) that Yersinia pestis, the organism responsible for Bubonic plague, had been found in multiple locations throughout the New York City subway system as part of its normal built environment microbiome. This was rapidly followed up on 6 Feb 2015 with an article noting that there was probably not Bubonic plague on the subway system after all, but rather that the approaches used by the research team are limited in their taxonomic resolution, and that likely a harmless close relative of Y. pestis was observed: “What the researchers probably found, [a spokesman for the university where the study originated] said, was bacteria from an unknown species or from organisms that happened to share some gene sequences with the plague bacterium…”. +``` +For example, on 5 Feb 2015 an article appeared in the New York Times noting (almost offhand) that Yersinia pestis, the organism responsible for Bubonic plague, had been found in multiple locations throughout the New York City subway system as part of its normal built environment microbiome. This was rapidly followed up on 6 Feb 2015 with an article noting that there was probably not Bubonic plague on the subway system after all, but rather that the approaches used by the research team are limited in their taxonomic resolution, and that likely a harmless close relative of Y. 
pestis was observed: “What the researchers probably found, [a spokesman for the university where the study originated] said, was bacteria from an unknown species or from organisms that happened to share some gene sequences with the plague bacterium…”. +``` As microbiome services and products are increasingly marketed directly to the public, consumers of microbiome research findings, products, and services need to know how to critically evaluate these offerings and their associated claims. As practitioners in the field, we can help by ensuring that the methods we apply are appropriate and reliable, and that we make our work accessible. diff --git a/docs/no_toc/404.html b/docs/no_toc/404.html index 44f92ae3..e7a12887 100644 --- a/docs/no_toc/404.html +++ b/docs/no_toc/404.html @@ -6,12 +6,11 @@ Page not found | Choosing Genomics Tools - + - @@ -31,7 +30,6 @@ - @@ -49,31 +47,26 @@ - - - - - - Page not found | Title - - - - + + + + - - - +