diff --git a/docs/no_toc/01-intro.md b/docs/no_toc/01-intro.md
index 8a9bce7b..6ed130fd 100644
--- a/docs/no_toc/01-intro.md
+++ b/docs/no_toc/01-intro.md
@@ -3,7 +3,7 @@
# Introduction
-
+
This is a *living* course meaning it is constantly changing and being updated. The goal for this course is to be a "wikipedia" of omic data.
If you'd like to contribute, [you can file a pull request on GitHub](https://github.com/fhdsl/Choosing_Genomics_Tools) if you are comfortable with that sort of thing or email `csavonen@fredhutch.org` to ask how to get started.
@@ -18,11 +18,11 @@ _This course is written for individuals who:_
- Want a basic overview of genomic data types.
- Want to find resources for processing and interpreting genomics data.
-
+
## Topics covered:
-
+
## Motivation
@@ -33,17 +33,17 @@ Often students and researchers need to utilize genomic data to reach the next st
Often researchers receive their genomic data processed from another lab or institution, and although they are excited to gain insights from it to inform the next steps of their research, they may not have a practical understanding of how the data they have received came to be or what needs to be done with it.
-
+
As an example, data file formats may not have been covered in their training, and the data they received seems unintelligible and not as straightforward as they hoped.
-
+
This course attempts to give this researcher the basic bearings and resources regarding their data, in hopes that they will be equipped and informed about how to obtain the insights for their research that they originally aimed to find.
## Curriculum
-
+
**Goal of this course:**
Equip learners with tutorials and resources so they can understand and interpret their genomic data in a way that helps them meet their goals and handle the data properly.
diff --git a/docs/no_toc/02-genomics_overview.md b/docs/no_toc/02-genomics_overview.md
index 095061f0..f13018c3 100644
--- a/docs/no_toc/02-genomics_overview.md
+++ b/docs/no_toc/02-genomics_overview.md
@@ -4,7 +4,7 @@
## Learning Objectives
-
+
In this chapter we are going to give a very general, high-level overview of sequencing and microarray workflows as a first orientation. As we dive into specific data types and experiments, we will get into more specifics.
Here we will cover the most common file formats. If you have a file format you are dealing with that you don't see listed here, it may be specific to your data type and we will discuss that more in that data type's respective chapter. We still suggest you go through this chapter to give you a basic understanding of commonalities of all genomic data types and workflows.
@@ -13,7 +13,7 @@ Here we will cover the most common file formats. If you have a file format you a
In the most general sense, all genomics data when originally collected is raw; it needs to undergo processing to be normalized and ready to use. Then normalized data is generally summarized in a way that is ready for it to be further consumed. Lastly, this summarized data is what can be used to make inferences and create plots and results tables.
-
+
### Basic file formats
diff --git a/docs/no_toc/03-whats-metadata.md b/docs/no_toc/03-whats-metadata.md
index 33258c57..ecf40a79 100644
--- a/docs/no_toc/03-whats-metadata.md
+++ b/docs/no_toc/03-whats-metadata.md
@@ -5,7 +5,7 @@
## Learning Objectives
-
+
## What are metadata?
@@ -15,11 +15,11 @@ Metadata are critically important descriptive information about your data.
Metadata describe how your data came to be, what organism or patient the data are from and include any and every relevant piece of information about the samples in your data set.
-
+
Metadata include, but aren't limited to, the following example categories:
-
+
At this time it's important to note that if you work with human data or samples, your metadata will likely contain personally identifiable information (PII) and protected health information (PHI). It's critical that you protect this information! For more details on this, we encourage you to see our [course about data management](https://jhudatascience.org/Ethical_Data_Handling_for_Cancer_Research/data-privacy.html).
@@ -74,13 +74,13 @@ Toward these two goals, [this excellent article](https://www.tandfonline.com/doi
Note that it is very dangerous to open gene data with Excel. According to @Ziemann2016, approximately one-fifth of papers with Excel gene lists have errors. This happens because Excel wants to interpret everything as a date. We strongly caution against opening (and saving afterward) gene data in Excel.
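To illustrate the kind of damage described above, here is a minimal Python sketch that flags entries in a gene list that look like Excel-converted dates. It assumes the mangled values take the form `1-Mar` or `2-Sep` (commonly cited examples for symbols such as MARCH1 and SEPT2); the function name and the example list are hypothetical, and other date formats would need additional patterns.

```python
import re

# Assumption: mangled symbols look like "1-Mar" or "2-Sep" (day-month form).
# Other locale-specific date renderings are NOT covered by this pattern.
_MONTHS = "Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec"
DATE_LIKE = re.compile(rf"^\d{{1,2}}-(?:{_MONTHS})$", re.IGNORECASE)


def find_date_like_symbols(symbols):
    """Return the entries that look like Excel-converted dates."""
    return [s for s in symbols if DATE_LIKE.match(s.strip())]


# Hypothetical gene list, as it might look after a round trip through Excel.
genes = ["TP53", "1-Mar", "BRCA1", "2-Sep", "DEC1"]
print(find_date_like_symbols(genes))
```

A check like this only detects suspicious entries. It cannot reliably recover the original symbols, which is why avoiding the round trip through Excel is the safer choice.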
-
+
### To recap:
-
+
-
+
If you are not the person who has the information needed to create metadata, or you believe that another individual already has this information, make sure you get ahold of the metadata that correspond to your data. It will be critical for you to have to do any sort of meaningful analysis!
diff --git a/docs/no_toc/04-considerations-for-choosing.md b/docs/no_toc/04-considerations-for-choosing.md
index dcc31e08..bc9617df 100644
--- a/docs/no_toc/04-considerations-for-choosing.md
+++ b/docs/no_toc/04-considerations-for-choosing.md
@@ -5,7 +5,7 @@
## Learning Objectives
-
+
## Overview
@@ -13,7 +13,7 @@ In this course, we will introduce you to the fundamentals of various data types
We will discuss the following considerations, which you should gather information about and otherwise ponder when comparing one or more tools for your analysis:
-
+
### Is this tool appropriate for your data type?
diff --git a/docs/no_toc/05-general-data-analysis-tools.md b/docs/no_toc/05-general-data-analysis-tools.md
index fb100034..1acba1c0 100644
--- a/docs/no_toc/05-general-data-analysis-tools.md
+++ b/docs/no_toc/05-general-data-analysis-tools.md
@@ -5,7 +5,7 @@
## Learning Objectives
-
+
## Command Line vs GUI
diff --git a/docs/no_toc/06-sequencing-data.md b/docs/no_toc/06-sequencing-data.md
index 5f5f0023..7b068a75 100644
--- a/docs/no_toc/06-sequencing-data.md
+++ b/docs/no_toc/06-sequencing-data.md
@@ -9,7 +9,7 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f
## Learning Objectives
-
+
In this section, we are going to discuss generalities that apply to all sequencing data. This is meant to be a "primer" for you which data-type specific chapters will build off of to give you more specific and practical steps and advice in regards to your data type.
@@ -31,7 +31,7 @@ At the end of this process, base sequences are called for the samples (with vary
### Inherent biases
-
+
Sequences are not all sequenced or amplified at the same rate. In a perfect world, we could take a simple snapshot of the genome we are interested in and know exactly what and how many sequences were in a sample. But in reality, sequencing methods and the resulting data always have some biases we have to be aware of and hopefully use methods that attempt to mitigate the biases.
diff --git a/docs/no_toc/07-microarray-data.md b/docs/no_toc/07-microarray-data.md
index 8d76f43d..b3db7658 100644
--- a/docs/no_toc/07-microarray-data.md
+++ b/docs/no_toc/07-microarray-data.md
@@ -9,7 +9,7 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f
## Learning Objectives
-
+
## Summary of microarrays
diff --git a/docs/no_toc/08-annotating-genomes.md b/docs/no_toc/08-annotating-genomes.md
index 49cd2c02..b6fd1602 100644
--- a/docs/no_toc/08-annotating-genomes.md
+++ b/docs/no_toc/08-annotating-genomes.md
@@ -9,7 +9,7 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f
## Learning Objectives
-
+
In this chapter, we are going to discuss methods that affect every genomic method and may take up the majority of your time as a genomic data analyst: Annotation.
@@ -21,7 +21,7 @@ Proper annotation requires an understanding of how the annotation data you are u
Every individual organism has its own DNA sequence that is unique to it. So how can we compare organisms to each other? In some studies, sequencing data is obtained and the genome is built de novo (aka from scratch) but this takes a lot of time and computing power. So instead, most genomic studies use the imperfect method of comparing to a reference genome. Reference genomes are built from prior data and are available online. They inherently have biases in them. For example, human genomes are generally not made from diverse populations but instead from mostly males of European descent. It is inherently bad for both ethical and scientific reasons to have [genome references that are too white](https://www.sciencenews.org/article/genetics-race-dna-databases-reference-genome-too-white). For more on the problems with reference genomes, [read this](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-019-1774-4).
-
+
In summary, reference genomes are used for comparison and as a 'source of truth' of sorts, but it's important to note that this method is biased and better alternatives need to be realized.
@@ -29,7 +29,7 @@ In summary, reference genomes are used for comparison and as a 'source of truth'
If you are familiar with software development, or have used any app before, you're familiar with software updates and releases. Similarly, the genome has updates and releases as continued cloning and assemblies of organisms teach us more. In the image below we are showing an example of what a genome version may be noted as (note that different databases may have different terminology -- here we are showing the Genome Reference Consortium). You may also notice that their website shows the date the genome version was released and what was fixed.
-
+
The details of how genome versions are fixed and released are not really of concern for your data analysis. This is merely to explain that genomes change and what is most important in your analysis is that:
@@ -40,7 +40,7 @@ The details of how genome versions are fixed and released are not really of conc
Although we can't walk you through every organism and database setup, we will walk through the files and structure of one example here.
-
+
In the above screenshot, [from Ensembl](https://useast.ensembl.org/info/data/ftp/index.html), different organisms are shown in the rows and a variety of different files across the columns. In this example, DNA refers to the DNA sequence of the organism's genome, but cDNA refers to complementary DNA -- aka DNA that has been reverse transcribed from RNA. If you are working with RNA data you may want to use the cDNA file. CDS files, by contrast, contain only coding sequences, and ncRNA files contain only non-coding RNA sequences. Most of these files are FASTA files. Gene sets are also their own annotation files called GTF or GFF files. Ensembl provides more [detailed information about what these files contain](https://useast.ensembl.org/info/website/upload/gff.html), but briefly, each row is a feature and has information describing that feature such as genomic locations, the relevant feature type (gene, coding sequence, pseudogene, etc.), and the gene ID or name. For a reminder on what these different file types are [see the previous chapter](http://hutchdatascience.org/Choosing_Genomics_Tools/a-very-general-genomics-overview.html#basic-file-formats).
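To make the "each row is a feature" idea concrete, here is a minimal Python sketch that parses a single GTF line into its nine tab-separated columns. The example line, its IDs, and the function name are made up for illustration, and the attribute parsing is deliberately simple (it assumes no semicolons inside quoted values).

```python
# The nine standard GTF columns, in order.
GTF_COLUMNS = [
    "seqname", "source", "feature", "start", "end",
    "score", "strand", "frame", "attribute",
]


def parse_gtf_line(line):
    """Parse one GTF feature line into a dictionary (simplified sketch)."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != len(GTF_COLUMNS):
        raise ValueError(f"expected 9 tab-separated fields, got {len(fields)}")

    record = dict(zip(GTF_COLUMNS[:8], fields[:8]))
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])

    # Attributes look like: key "value"; key "value";
    # Assumption: values do not themselves contain semicolons.
    attributes = {}
    for item in fields[8].split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attributes[key] = value.strip('"')
    record["attributes"] = attributes
    return record


# Hypothetical feature line; the IDs and coordinates are invented.
example = '1\texample_source\tgene\t1000\t5000\t.\t+\t.\tgene_id "GENE0001"; gene_name "EXAMPLE1";'
feature = parse_gtf_line(example)
print(feature["feature"], feature["start"], feature["end"], feature["attributes"]["gene_id"])
```

For real analyses, an established annotation parser is usually a better choice than hand-rolled code, since it handles comments, edge cases, and GFF3 differences.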
diff --git a/docs/no_toc/09-DNA.md b/docs/no_toc/09-DNA.md
index dc3be96d..664bb9da 100644
--- a/docs/no_toc/09-DNA.md
+++ b/docs/no_toc/09-DNA.md
@@ -11,7 +11,7 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f
## Learning Objectives
-
+
## What are the goals of analyzing DNA sequences?
@@ -35,7 +35,7 @@ There are several larger goals behind DNA sequencing experiments ranging from as
## Comparison of DNA methods
-
+
There are four DNA sequencing methods discussed in this chapter. The above graph compares WGS, WXS, and Targeted gene sequencing. The last section compares all 4.
1. Whole genome sequencing (WGS)
@@ -81,6 +81,6 @@ If your research question does not pertain to non-coding regions of the genome o
Furthermore, if you are able to narrow down even further what regions are of interest this would be better in terms of cost and detection abilities. A targeted sequencing panel or DNA microarray are ideal for assaying known groups of targets. DNA microarrays are the least costly of all the methods to identify DNA variants, but with both targeted sequencing and DNA microarray you will need to find or create a custom probe or primer set. Ideally a probe or primer set that hits your regions of interest already exists commercially but if not, then you will have to design your own -- which also costs time and money.
-
+
In these upcoming chapters we will discuss in more detail each of these methods, what the data represent, what you need to consider, and what resources you can consult for analyzing your data.
diff --git a/docs/no_toc/09a-WGS-and-WXS.md b/docs/no_toc/09a-WGS-and-WXS.md
index ef295688..b07e55f4 100644
--- a/docs/no_toc/09a-WGS-and-WXS.md
+++ b/docs/no_toc/09a-WGS-and-WXS.md
@@ -9,14 +9,14 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f
## Learning Objectives
-
+
The learning objectives for this course are to explain the use and application of Whole Genome Sequencing (WGS) and Whole Exome Sequencing (WES/WXS) for genomics studies, outline the technical steps in generating WGS/WXS data, and detail the processing steps for analyzing and interpreting WGS/WXS data.
**To familiarize yourself with sequencing methods as a whole, we recommend you read our [chapter on sequencing first](http://hutchdatascience.org/Choosing_Genomics_Tools/sequencing-data.html).**
## WGS and WXS Overview
-
+
The difference between WGS and WXS sequencing is whether or not the open reading frames and thus coding regions are targeted in sequencing. WGS attempts to sequence the whole genome, while for WXS only exons with open reading frames are targeted for sequencing. Both of these methods can be massively beneficial for studying rare and complex diseases.
Thus, whole genome sequencing is a technique to thoroughly analyze the entire DNA sequence of an organism's genome. This includes sequencing all genes both coding and non-coding and all mitochondrial DNA. WGS is beneficial for identifying new and previously established variants related to disease and the regulatory elements of the genome including promoters, enhancers, and silencers. Increasingly non-coding RNAs have also been identified to play a functional role in biological mechanisms and diseases. In order to learn more about the non-coding regions of the genome, WGS is necessary.
@@ -25,7 +25,7 @@ Alternatively whole exome sequencing is used to sequence the coding regions of a
## Advantages and Disadvantages of WGS vs WXS
-
+
We more thoroughly discuss how to choose DNA sequencing methods [here in the previous chapter](http://hutchdatascience.org/Choosing_Genomics_Tools/dna-methods.html), but we will briefly cover this here. Alternatives to WGS include Whole Exome Sequencing (WES/WXS), which sequences the open reading frame areas of the genome or Targeted Gene Sequencing where probes have been designed to sequence only regions of interest.
@@ -33,7 +33,7 @@ The main advantages of WGS include the ability to comprehensively analyze all re
## WGS/WXS Considerations
-
+
Some important considerations for WGS/WXS include:
- What genome you are studying and the size of this genome. Included in this consideration is whether this genome has been sequenced before and you will have a "reference" genome to compare your data against or whether you will have to make a reference genome yourself. [This bioinformatics resource](https://eriqande.github.io/eca-bioinf-handbook/alignment-of-sequence-data-to-a-reference-genome-and-associated-steps.html) provides a great overview of genome alignment.
@@ -52,19 +52,19 @@ For WXS or other targeted sequencing specifically (so not relevant to WGS data),
## DNA Sequencing Pipeline Overview
-
+
In order to create WGS/WXS data, DNA is first extracted from a specific sample type (tissue, blood samples, cells, FFPE blocks, etc.). Either traditional methods (involving phenol and chloroform) or commercial kits can be used for this first step. Next, the DNA sequencing libraries are prepared. This involves fragmenting the DNA, adding sequencing adapters, and DNA amplification if the input DNA is not of sufficient quantity. Recall that for WXS After sequencing, data is analyzed by converting and aligning reads to generate a BAM file. Many analysis tools will use the BAM file to identify variants, which then generates a VCF file. More information about sequencing and BAM and VCF file generation can be found [here](http://hutchdatascience.org/Choosing_Genomics_Tools/sequencing-data.html) in the sequencing data chapter.
## Data Pre-processing
-
+
Raw sequencing reads are first transformed into a FASTQ file (more information about FASTQ files can be found [here](http://hutchdatascience.org/Choosing_Genomics_Tools/sequencing-data.html) in the Quality Controls section of the sequencing data chapter). Then the sequencing reads are aligned to a reference genome to create a BAM file. This data is sorted and merged, and PCR duplicates are identified. The confidence that each base in a read was sequenced correctly is reflected in the base quality score. This score must be recalibrated at this step before variants are called. A final BAM file is thus created. This can be used for future analysis steps, including variant or mutation identification, which is outlined on the following slide.
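As a rough illustration of what a base quality score encodes, the Python sketch below decodes a FASTQ quality string into Phred scores and converts them to error probabilities using Q = -10 * log10(P). It assumes the common Phred+33 ASCII encoding; the function names and the example quality string are hypothetical, and recalibration itself is done by dedicated tools rather than anything shown here.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer Phred scores.

    Assumption: qualities use the Phred+33 ASCII encoding.
    """
    return [ord(char) - offset for char in quality_string]


def error_probability(phred_score):
    """Convert a Phred score Q into the probability that the base call is wrong."""
    return 10 ** (-phred_score / 10)


# Hypothetical quality string for a three-base read.
scores = phred_scores("I5!")
for q in scores:
    print(q, error_probability(q))
```

A score of 20 corresponds to a 1 in 100 chance that the base call is wrong, and a score of 40 to a 1 in 10,000 chance.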
## Commonly Used Tools
-
+
The following link provides the data analysis pipeline written by researchers in the NCI division of the NIH and provides a helpful overview of the typical steps necessary for [WGS analysis](https://docs.gdc.cancer.gov/Data/Bioinformatics_Pipelines/DNA_Seq_Variant_Calling_Pipeline/).
Here are many of the tools and resources used by researchers for analyzing WGS data.
diff --git a/docs/no_toc/10-RNA.md b/docs/no_toc/10-RNA.md
index 0fa4783a..21488653 100644
--- a/docs/no_toc/10-RNA.md
+++ b/docs/no_toc/10-RNA.md
@@ -9,19 +9,19 @@ This chapter is in a beta stage. Some of it has been written with AI tools. If y
## Learning Objectives
-
+
## What are the goals of gene expression analysis?
The goal of gene expression analysis is to quantify RNAs across the genome. This can signify the extent to which various RNAs are being transcribed in a particular cell. This can be informative for what kinds of activity a cell is undergoing and responding to.
-
+
## Comparison of RNA methods
There are three general methods we will discuss for evaluating gene expression. RNA sequencing (whether bulk or single-cell) allows you to catch more targets than gene expression microarrays but is much more costly and computationally intensive. Gene expression microarrays have a lower dynamic range than RNA-seq generally but are much more cost effective. Spatial transcriptomics is the newest method on the block and has the ability to relate gene expression to tissue regions and subpopulations.
-
+
### Single-cell RNA-seq (scRNA-seq):
diff --git a/docs/no_toc/10a-bulk-RNA-seq.md b/docs/no_toc/10a-bulk-RNA-seq.md
index 6cba11da..f79cd13a 100644
--- a/docs/no_toc/10a-bulk-RNA-seq.md
+++ b/docs/no_toc/10a-bulk-RNA-seq.md
@@ -10,17 +10,17 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f
## Learning Objectives
-
+
## Where RNA-seq data comes from
-
+
## RNA-seq workflow
In a very general sense, RNA-seq workflows first involve quantification/alignment. You will also need to conduct quality control steps that check the quality of the sequencing done. You may also want to trim and filter out data that is not trustworthy. After you have a set of reliable data, you need to normalize your data. After data has been normalized you are ready to conduct your downstream analyses. This will be highly dependent on the original goals and questions of your experiment. It may include dimension reduction, differential expression, or any number of other analyses.
-
+
In this chapter we will highlight some of the more popular RNA-seq tools that are generally suitable for most experimental data, but there is no "one size fits all" for computational analysis of RNA-seq data [@Conesa2016]. You may find tools out there that better suit your needs than the ones we discuss here.
@@ -34,7 +34,7 @@ In this chapter we will highlight some of the more popular RNA-seq tools, that a
RNA-seq suffers from a lot of the common sequence biases which are further worsened by PCR amplification steps. We discussed some of the sequence biases in the [previous sequencing chapter]().
-
+
These biases are nicely covered in [this blog by Mike Love](https://mikelove.wordpress.com/2016/09/26/rna-seq-fragment-sequence-bias/) and we'll summarize them here:
@@ -45,7 +45,7 @@ These biases are nicely covered in [this blog by Mike Love](https://mikelove.wor
_Main Takeaway_: When looking for tools, you will want to see if the algorithms or options available attempt to account for these biases in some way.
-
+
## RNA-seq data considerations
@@ -58,7 +58,7 @@ Most of the RNA in the cell is not mRNA or noncoding RNAs of interest, but inste
[This blog by Sitools Biotech does a good summary](https://blog.sitoolsbiotech.com/2019/08/ribo-depletion-rna-seq-ribosomal-rna-depletion-method-works-best/) of the pros and cons of either selection method.
-
+
### Transcriptome mapping
@@ -80,7 +80,7 @@ _Examples of pseudo aligners_:
These strategies are discussed at greater length [in this excellent manuscript by Conesa et al, 2016](https://genomebiology.biomedcentral.com/articles/10.1186/s13059-016-0881-8).
-
+
### Abundance measures
@@ -116,11 +116,11 @@ TPM has gained a popularity in recent years because it is more intuitive to unde
> When you use TPM, the sum of all TPMs in each sample are the same. This makes it easier to compare the proportion of reads that mapped to a gene in each sample. In contrast, with RPKM and FPKM, the sum of the normalized reads in each sample may be different, and this makes it harder to compare samples directly.
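
The property described in the quote can be checked with a small Python sketch of the usual TPM calculation: divide each gene's read count by its length in kilobases, then rescale so the values sum to one million. The function name, counts, and gene lengths below are made up for illustration.

```python
def tpm(counts, lengths_bp):
    """Compute transcripts per million (TPM) for one sample.

    counts: read counts per gene; lengths_bp: gene lengths in base pairs.
    """
    if len(counts) != len(lengths_bp):
        raise ValueError("counts and lengths_bp must have the same length")

    # Step 1: reads per kilobase, which corrects for gene length.
    rates = [count / (length / 1000) for count, length in zip(counts, lengths_bp)]

    # Step 2: rescale so the values in this sample sum to one million.
    total = sum(rates)
    if total == 0:
        return [0.0] * len(rates)
    return [rate / total * 1_000_000 for rate in rates]


# Hypothetical counts and gene lengths for two samples.
lengths = [1000, 2000, 3000]
sample_1 = tpm([100, 200, 300], lengths)
sample_2 = tpm([50, 10, 900], lengths)
print(sum(sample_1), sum(sample_2))
```

Both samples sum to one million (up to floating-point rounding) regardless of their sequencing depth, which is what makes the per-gene proportions easier to compare.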
-
+
### RNA-seq downstream analysis tools
-
+
- [ComplexHeatmap](https://bioconductor.org/packages/release/bioc/html/ComplexHeatmap.html#:~:text=Complex%20heatmaps%20are%20efficient%20to,and%20supports%20various%20annotation%20graphics.) is great for visualizations
- [DESeq2](https://www.bioconductor.org/packages/release/bioc/html/DESeq2.html) and [edgeR](https://www.bioconductor.org/packages/release/bioc/html/edgeR.html) are great for differential expression analyses.
diff --git a/docs/no_toc/10b-single-cell-RNA-seq.md b/docs/no_toc/10b-single-cell-RNA-seq.md
index 450a8a12..cefcd671 100644
--- a/docs/no_toc/10b-single-cell-RNA-seq.md
+++ b/docs/no_toc/10b-single-cell-RNA-seq.md
@@ -9,17 +9,17 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f
## Learning Objectives
-
+
## Where single-cell RNA-seq data comes from
-
+
As opposed to bulk RNA-seq, which can only tell us about tissue-level and within-patient variation, single-cell RNA-seq is able to tell us about cell-to-cell variation in transcriptomics, including intra-tumor heterogeneity.
Single-cell RNA-seq can give us cell-level transcriptional profiles, whereas bulk RNA-seq masks cell-to-cell heterogeneity. If your research questions require cell-level transcriptional information, single-cell RNA-seq will be of interest to you.
-
+
## Single-cell RNA-seq data types
@@ -30,9 +30,9 @@ There are broadly two categories of single-cell RNA-seq data methods we will dis
Depending on your goals for your single cell RNA-seq analysis, you may want to choose one method over the other.
-
+
-
+
(Material borrowed from [@AlexsLemonade2022]).
@@ -40,13 +40,13 @@ Depending on your goals for your single cell RNA-seq analysis, you may want to c
Often Tag based single cell RNA-seq methods will include not only a cell barcode for cell identification but will also have a unique molecular identifier (UMI) for original molecule identification. The idea behind the UMIs is it is a way to have insight into the original snapshot of the cell and potentially combat PCR amplification biases.
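The idea behind UMIs can be sketched in a few lines of Python: reads that share the same cell barcode, gene, and UMI are treated as PCR copies of one original molecule and counted once. This is a simplified illustration with made-up barcodes and a hypothetical function name. It matches UMIs exactly, whereas real quantification tools typically also account for sequencing errors in the barcodes and UMIs.

```python
from collections import defaultdict


def count_unique_molecules(reads):
    """Count distinct UMIs per (cell barcode, gene) pair.

    reads: iterable of (cell_barcode, umi, gene) tuples.
    Assumption: UMIs are compared by exact match only.
    """
    umis_seen = defaultdict(set)
    for cell_barcode, umi, gene in reads:
        umis_seen[(cell_barcode, gene)].add(umi)
    return {key: len(umis) for key, umis in umis_seen.items()}


# Hypothetical reads: the first two are PCR duplicates of the same molecule.
reads = [
    ("AAAC", "TTGA", "GENE_A"),
    ("AAAC", "TTGA", "GENE_A"),
    ("AAAC", "CCGT", "GENE_A"),
    ("GGTC", "TTGA", "GENE_A"),
]
print(count_unique_molecules(reads))
```

Here cell `AAAC` has three reads for `GENE_A` but only two distinct UMIs, so it is counted as two original molecules.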
-
+
## Single cell RNA-seq tools
There are a lot of scRNA-seq tools for various steps along the way.
-
+
In a very general sense, single cell RNA-seq workflows first involve quantification/alignment. You will also need to conduct quality control steps that may involve using UMIs to check for what’s detected, detecting doublets (also known as duplets), and using this information to filter out data that is not trustworthy. [Doublets are transcriptome data generated from two cells](https://bioconductor.org/books/3.15/OSCA.advanced/doublet-detection.html), and are an undesired technical artifact when single cell RNA-seq workflows want data representing a single cell at a time. After you have a set of reliable data, you need to normalize your data. Single cell data is highly skewed: a lot of genes are barely detected or not detected at all, and a few genes are detected a lot. After data has been normalized you are ready to conduct your downstream analyses. This will be highly dependent on the original goals and questions of your experiment. It may include dimension reduction, cell classification, differential expression, detecting cell trajectories or any number of other analyses.
diff --git a/docs/no_toc/10c-spatial-transcriptomics.md b/docs/no_toc/10c-spatial-transcriptomics.md
index a4b3719c..ed7b927d 100644
--- a/docs/no_toc/10c-spatial-transcriptomics.md
+++ b/docs/no_toc/10c-spatial-transcriptomics.md
@@ -2,13 +2,9 @@
# Spatial transcriptomics
-::: warning
-This chapter has currently been written by ChatGPT and has not been verified by experts. We need help writing and reviewing it! If you wish to contribute, please [go to this form](https://forms.gle/dqYgmKH8XXE2ohwD9) or our [GitHub page](https://github.com/fhdsl/Choosing_Genomics_Tools).
-:::
-
## Learning objectives
-
+
## What are the goals of spatial transcriptomic analysis?
@@ -24,7 +20,7 @@ Spatial transcriptomics (ST) technologies have been developed as a solution to t
There is a large diversity in approaches to spatially profile tissues. Some ST technologies allow profiling at coarse cellular resolution, where regions of interest (ROIs) are usually identified by a pathologist. These ROIs may include tens of cells up to a few hundred (e.g., GeoMx @bergholtz2021best). Smaller ROI sizes can be found in other technologies such as Visium, where ROIs of 55 µm in diameter (or "spots") often contain no more than 10 cells. For finer cellular resolution, technologies such as MERFISH, SMI, or Xenium, among others, can measure gene expression in individual cells [@yue2023guidebook]. In general, there is a trade-off between the cellular resolution and molecular resolution, as the number of quantified genes and RNA molecules is lower in single-cell level spatial technologies compared to those at the ROI or spot level. In single-cell ST, often a panel of hundreds of genes is quantified, while in "mini-bulk" (ROI/spot) ST, it is possible to quantify genes at the whole transcriptome level.
-
+
In addition to the differences in cellular and molecular resolution, there are fundamental differences in the chemistry used to count the RNA transcripts in the tissue [@wang2021spatial; @yue2023guidebook]. Capture or hybridization of RNA followed by sequencing, or fluorescent imaging are two of the most common techniques used in ST methods. Because of the large diversity in resolution and chemical procedures among ST technologies, data collection workflows are equally diverse. Finally, each study poses specific questions that cannot be addressed with traditional scRNA-seq pipelines, requiring customized workflows.
diff --git a/docs/no_toc/11-chromatin.md b/docs/no_toc/11-chromatin.md
index 9c4021da..dbd2ed6d 100644
--- a/docs/no_toc/11-chromatin.md
+++ b/docs/no_toc/11-chromatin.md
@@ -11,7 +11,7 @@ In its existing form, this chapter has been written with AI and still needs furt
## Learning Objectives
-
+
## Why are people interested in chromatin?
@@ -41,7 +41,7 @@ Therefore, understanding the mechanisms that regulate chromatin structure and fu
## Comparison of technologies
-
+
### ATAC-seq:
diff --git a/docs/no_toc/11a-ATAC-Seq.md b/docs/no_toc/11a-ATAC-Seq.md
index cbf404c9..cf9acd53 100644
--- a/docs/no_toc/11a-ATAC-Seq.md
+++ b/docs/no_toc/11a-ATAC-Seq.md
@@ -9,28 +9,28 @@ This chapter is incomplete! If you wish to contribute, please [go to this form](
## Learning Objectives
-
+
## What are the goals of ATAC-Seq analysis?
The goals of ATAC-seq are to identify the accessible regions of the genome in a particular set of samples. These data allow us to understand the relationships between the chromatin accessibility patterns and cell states, and to understand the mechanistic causes and consequences of these chromatin accessibility patterns.
-
+
ATAC-seq data is generated by fragmenting the genome with the Tn5 endonuclease and sequencing the shorter DNA fragments. While most of the genome is associated with protein complexes that preclude the digestion of DNA by Tn5, some regions of the genome have accessible chromatin that can be cleaved by Tn5 resulting in short (<500bp) fragments. These regions of the genome are of biological interest as they are likely to harbor transcription factor binding sites and to constitute cis-regulatory elements, genomic regions that are involved in the regulation of gene expression.
-
+
### What questions can be answered with ATAC-seq?
-
+
## ATAC-Seq general workflow overview
A basic ATAC-seq workflow involves mapping sequence reads to the genome, identifying peaks, assessing data quality, and identifying patterns of interest through clustering or identification of differentially accessible regions or other statistical means.
-
+
### Data quality metrics:
@@ -38,13 +38,13 @@ A basic ATAC-seq workflow involves mapping sequence reads to the genome, identif
#### Sequencing considerations:
-
+
-
+
#### Pre-alignment QC:
-
+
A tool like FastQC or similar should be used to check for GC content, read quality and length, and primer or adapter reads prior to alignment. Trimmomatic is a useful tool for removing primer and adapter sequences if they are present. ATAC-seq experiments should be sequenced with paired-end sequencing, and existing pipelines will expect paired-end. (2 files *_R1.fastq and *_R2.fastq)
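Because pipelines expect both mates of each paired-end sample, it can help to confirm that every `*_R1.fastq` file has a matching `*_R2.fastq` file before starting. The Python sketch below does this on a list of file names; the function name and sample names are hypothetical, and it assumes the exact `_R1.fastq` / `_R2.fastq` suffixes mentioned above (compressed or differently named files would need adjusted suffixes).

```python
def pair_fastq_files(filenames):
    """Group paired-end FASTQ files by their shared prefix.

    Returns (pairs, unpaired), where pairs is a list of (R1, R2) tuples.
    Assumption: mates are named <sample>_R1.fastq and <sample>_R2.fastq.
    """
    names = set(filenames)
    pairs, unpaired = [], []
    for name in sorted(names):
        if name.endswith("_R1.fastq"):
            mate = name[: -len("_R1.fastq")] + "_R2.fastq"
            if mate in names:
                pairs.append((name, mate))
            else:
                unpaired.append(name)
        elif name.endswith("_R2.fastq"):
            mate = name[: -len("_R2.fastq")] + "_R1.fastq"
            if mate not in names:
                unpaired.append(name)
    return pairs, unpaired


# Hypothetical file listing: sampleB is missing its R2 file.
files = ["sampleA_R1.fastq", "sampleA_R2.fastq", "sampleB_R1.fastq"]
pairs, unpaired = pair_fastq_files(files)
print(pairs)
print(unpaired)
```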
@@ -61,7 +61,7 @@ As for all DNA-sequencing based genomics technologies, a sufficient number of ma
#### Post-alignment QC:
-
+
Post alignment: check percent of matched, unmatched, unpaired and duplicated reads. Reads which are duplicated or unmatched should be filtered out.
[Picard](https://broadinstitute.github.io/picard/) is a useful tool for this step.
@@ -71,7 +71,7 @@ Reads on the + strand should be shifted +4bp, reads on the - strand should be sh
ATAC-seq data is often generated using paired end sequencing technologies, which allow for characterization of ATAC-seq fragments. Histograms of these distributions using single base pair resolution bins reveal patterns of enrichment relative to the nucleosome scale of 147bp and the DNA-helix scale ~10.5bp.
-
+
When comparing ATAC-seq samples, it is important to consider the fragment size distributions of the samples being compared. Differences in the distributions could lead to results that are unrelated to biology.
@@ -81,7 +81,7 @@ When comparing ATAC-seq samples, it is important to consider the fragment size d
ATAC-seq peak calling typically makes use of analysis tools developed for ChIP-seq. MACS2 is one of the most common choices for a peak calling tool, but HOMER or other common ChIP-seq peak callers are also acceptable.
An input sample is not typically generated for ATAC-seq as it would be for a ChIP-seq experiment, so the major requirement for the peak caller is that it does not require the input control to call peaks.
-
+
#### Number of peaks:
Although the number of accessible chromatin regions can vary from one cell type to another, there are several regions that appear to be constitutively accessible across most cell types. At least 20,000 peaks can be identified in a high quality experiment. The deeper the sequencing, the more peaks will be detected in an ATAC-seq experiment. At a very high sequencing depth some of the statistically significant peaks might not be of biological interest. In an analysis of such data sets the fold enrichment relative to background, or absolute peak signal, in addition to statistical significance, ought to be taken into account.
diff --git a/docs/no_toc/11b-sc-ATAC-Seq.md b/docs/no_toc/11b-sc-ATAC-Seq.md
index c4d2464a..4a39aa53 100644
--- a/docs/no_toc/11b-sc-ATAC-Seq.md
+++ b/docs/no_toc/11b-sc-ATAC-Seq.md
@@ -9,7 +9,7 @@ This chapter is incomplete! If you wish to contribute, please [go to this form](
## Learning Objectives
-
+
## What are the goals of scATAC-seq analysis?
diff --git a/docs/no_toc/11c-ChIP-Seq.md b/docs/no_toc/11c-ChIP-Seq.md
index 14b4c891..ac6aaa4c 100644
--- a/docs/no_toc/11c-ChIP-Seq.md
+++ b/docs/no_toc/11c-ChIP-Seq.md
@@ -9,12 +9,12 @@ This chapter is in a beta stage. If you wish to contribute, please [go to this f
## Learning Objectives
-
+
## What are the goals of ChIP-Seq analysis?
-
+
ChIP-Seq (chromatin immunoprecipitation sequencing) and related approaches are used to identify genome-wide binding sites of specific proteins or protein complexes. Given the diversity of interactions at the DNA-protein interface, sequencing-based methods for targeted chromatin capture have evolved to meet precise research needs and improve the quality of the results. Specifically, ChIP-Seq builds on protein immunoprecipitation (IP) techniques by applying next generation sequencing to a pulldown product. IP followed by sequencing can be applied to any nucleic-acid binding protein for which an antibody is available, including a known or putative transcription factor (TF), chromatin remodeler or histone modifications, or other DNA- or chromatin-specific factors. ChIP-Seq approaches have been honed to increase signal-to-noise, reduce input material, and more specifically map protein-DNA interactions, for example by treating the IP product with an exonuclease that chews back unprotected DNA ends (e.g. ChIP-exo).
diff --git a/docs/no_toc/12-methylation.md b/docs/no_toc/12-methylation.md
index 34bed4c1..c227800f 100644
--- a/docs/no_toc/12-methylation.md
+++ b/docs/no_toc/12-methylation.md
@@ -9,7 +9,7 @@ This chapter is incomplete! If you wish to contribute, please [go to this form](
## Learning Objectives
-
+
## What are the goals of analyzing DNA methylation?
@@ -47,7 +47,7 @@ Because of this, its been proposed that the most appropriate way to model these
## Methylation data workflow
-
+
Like other sequencing methods, you will first need to start with quality control checks. Next, you will need to align your sequences to the genome. Then, using the base calls, you will need to make methylation calls -- determining which bases are methylated and which are not. The details of this step depend on whether you are measuring 5mC and/or 5hmC methylation calls. Lastly, you will likely want to use your methylation calls as a whole to identify differentially methylated regions of interest.
diff --git a/docs/no_toc/13-microbiome.md b/docs/no_toc/13-microbiome.md
index fc0707a4..95c4a3f1 100644
--- a/docs/no_toc/13-microbiome.md
+++ b/docs/no_toc/13-microbiome.md
@@ -9,7 +9,8 @@ This chapter is incomplete! If you wish to contribute, please [go to this form](
## Learning Objectives
-
+
+
## A Brief Introduction to Microbiomes
@@ -22,14 +23,16 @@ Microbes are everywhere. We have found these tiny organisms in the deepest regio
If we looked hard enough, I think we’d find them on the surface of the moon and Mars, though they are probably microbes who stowed away on our spacecraft and are now patiently waiting for a drop of water that may or may not ever show up. If we ever colonize those worlds, microbes will be an indispensable ally in creating an environment that could sustain us.
-
+
This figure is adapted from [@Tignat-Perrier2022] under Creative Commons license.
Microbes almost never live alone in the real world (i.e., outside of a laboratory). Rather they exist in communities of different species who are interacting with each other and their environment. Some of these communities will have many different types of organisms, and some will have only a few. Because of the large number of species and individuals involved, no two communities will ever be exactly alike, and quantifying differences between microbial communities is an important area of research at the moment. The types of interactions between organisms are also highly varied. These can include mutualistic relationships, where both organisms benefit from the interaction; parasitic relationships, where one organism exclusively benefits to the detriment of the other; and the full gradient in between.
Microbiome science is everywhere. There are tens of articles published daily in the scientific literature, and many popular science articles and books present these findings to the world of non-scientists. Understanding the promises and limitations of the methods of microbiome science can help avoid misconceptions about microbiome research, and it’s important for practitioners of microbiome science to understand and convey the promise and limitations of our field. Misconceptions abound, frequently arising from the same sources as high-quality popular science microbiome reporting.
- For example, on 5 Feb 2015 an article appeared in the New York Times noting (almost offhand) that Yersinia pestis, the organism responsible for Bubonic plague, had been found in multiple locations throughout the New York City subway system as part of its normal built environment microbiome. This was rapidly followed up on 6 Feb 2015 with an article noting that there was probably not Bubonic plague on the subway system after all, but rather that the approaches used by the research team are limited in their taxonomic resolution, and that likely a harmless close relative of Y. pestis was observed: “What the researchers probably found, [a spokesman for the university where the study originated] said, was bacteria from an unknown species or from organisms that happened to share some gene sequences with the plague bacterium…”.
+```
+For example, on 5 Feb 2015 an article appeared in the New York Times noting (almost offhand) that Yersinia pestis, the organism responsible for Bubonic plague, had been found in multiple locations throughout the New York City subway system as part of its normal built environment microbiome. This was rapidly followed up on 6 Feb 2015 with an article noting that there was probably not Bubonic plague on the subway system after all, but rather that the approaches used by the research team are limited in their taxonomic resolution, and that likely a harmless close relative of Y. pestis was observed: “What the researchers probably found, [a spokesman for the university where the study originated] said, was bacteria from an unknown species or from organisms that happened to share some gene sequences with the plague bacterium…”.
+```
As microbiome services and products are increasingly marketed directly to the public, consumers of microbiome research findings, products, and services need to know how to critically evaluate these offerings and their associated claims. As practitioners in the field, we can help by ensuring that the methods we apply are appropriate and reliable, and that we make our work accessible.
diff --git a/docs/no_toc/404.html b/docs/no_toc/404.html
index 44f92ae3..e7a12887 100644
--- a/docs/no_toc/404.html
+++ b/docs/no_toc/404.html
@@ -6,12 +6,11 @@
Page not found | Choosing Genomics Tools
-
+
-
@@ -31,7 +30,6 @@
-
@@ -49,31 +47,26 @@
-
-
-
-
-
- Page not found | Title
-
-
-
-
+
+
+
+
-
-
-
+