SIPmg-vignette.Rmd

--- 
title: "Introduction to SIPmg"
author: "Pranav Sampara, Kate Waring, Ryan Ziels"
date: "`r Sys.Date()`"
documentclass: book
cover-image: "images/EMBERR_logo-2.png"
description: |
  SIPmg is a statistical analysis package as part of a stable isotope probing
  (SIP) metagenomics study to identify isotope incorporating MAGs.The package
  utilizes either the qSIP model, ΔBD method, or DeSeq2-based methods such as
  HR-SIP and MW-HR-SIP based on the HTSSIP package. For more information on HTSSIP
  R package, please visit their vignette
output:
  bookdown::html_document
site: bookdown::bookdown_site
github-repo: ZielsLab/SIPmg.github.io
---

# Introduction

The SIPmg R package was designed to enable the exploration and statistical analysis of qSIP metagenomics data. Particularly, this package allows the identification of isotope incorporators and quantifies isotopic enrichment.

Please check out introductory vignettes with some [background information](#background) and a [quick example pipeline](#quick-example), before checking out vignettes on:

[Coverage normalization and scaling based on linear regression model](#scaling-coverage-data)

[Identification of incorporators based on qSIP, ΔBD, or DESeq2 methods](#identifying-incorporators)

<!--chapter:end:index.Rmd-->

# Background {#background}

This section introduces a bit of a background on the methodologies used in this pipeline, such as stable isotope probing, quantitative metagenomics, and quantitative stable isotope probing. References as a point of departure for more reading are also provided.

## Stable isotope probing

Stable isotope probing (SIP) targets active populations within complex microbiomes by providing growth substrates enriched in a heavy stable isotope of carbon, nitrogen, or oxygen. In DNA-SIP, the DNA of microbes actively incorporating the labelled substrate into their DNA during cell division and growth would become increasingly labelled and heavier compared to the microbes not incorporating the substrate into their DNA. Heavier DNA can be physically separated via isopycnic centrifugation on a CsCl gradient, followed by fractionation. Sequencing effort on the “heavier density fractions” could inform the identity and function of microbes that actively incorporated the isotopic substrate into their DNA. Thus, DNA-SIP is a powerful technique to link the identity and function of environmental microbes. 

However, traditional DNA-SIP is only a qualitative technique to identify active microbial populations and no quantitative estimates of isotopic enrichment of DNA are provided. For instance, the distinction between labelled and unlabelled microbes is defined by density gradient regions, limiting the resolution of taxon-specific responses to labelled or unlabelled. Moreover, there exists a guanine-cytosine (GC) bias in conventional SIP studies. The native buoyant density of DNA can differ by as much as 0.05 g/mL over the range of GC contents that can occur in complex communities. Lighter GC-content genomes may not shift into the researcher-assigned “heavier density fractions”, while high GC-content genomes may fall in the “heavier density fractions” without any isotopic substrate assimilation. Thus, a qualitative analysis may mask the true incorporators and, worse, falsely identify non-incorporators as incorporators, skewing the analysis of the study. 


## Quantitative metagenomics

Although metagenomics offers a comprehensive account of the metabolic repertoire in an environment, typically, the data is compositional. Relative abundance of genes or taxa is estimated to determine microbial community dynamics, which is directly influenced by the dynamics of rest of the community. Even if a certain gene remains in the same concentration, its relative abundance changes if a certain other gene increases/decreases in concentration. Absolute abundance of gene/genome can be critical, for instance, in the surveillance of a pathogenic gene/genome in an environment. For quantitative measurements of microbial taxon absolute abundance in a given metagenomic sample, [Hardwick et al. (2018)](https://www.nature.com/articles/s41467-018-05555-0) proposed synthetic spike-ins, or sequins, as internal reference standards. Reference standards in metagenomics can act as scaling factors to evaluate quantitative estimates of individual biological features, in this case, genes or genomes. 

## Quantitative SIP (qSIP)

In quantitative SIP (qSIP), quantitative estimates of isotope incorporation, expressed as atom fraction excess (AFE), are calculated based on a mathematical model. Quantitative estimates of isotopic enrichment of DNA facilitate accurate and sensitive estimates of substrate uptake measurements, and growth and mortality rates of individual taxa in a complex community. Moreover, qSIP analysis is less susceptible to GC and enrichment biases, and taxon abundance, making it amenable to quantify isotopic incorporations in a complex community such as the activated sludge microbiome, with in-situ conditions. Integration of qSIP and quantitative metagenomics (qSIP metagenomics) facilitates the exploration of active microbial populations' taxonomic and metabolic diversity with a quantitative estimate of their abundance and isotope incorporation. 


## For details on SIP, quantitative metagenomics, and qSIP, please refer to the following works:

### SIP
[Neufeld, Josh D., Jyotsna Vohra, Marc G. Dumont, Tillmann Lueders, Mike Manefield, Michael W. Friedrich, and J. Colin Murrell. 2007. “DNA Stable-Isotope Probing.” Nature Protocols 2 (4): 860–66.](https://www.nature.com/articles/nprot.2007.109)

[Neufeld JD, Dumont MG, Vohra J, Murrell JC. Methodological considerations for the use of stable isotope probing in microbial ecology. Microbial Ecology 2007; 53: 435–442.](https://link.springer.com/article/10.1007/s00248-006-9125-x)

[Buckley DH, Huangyutitham V, Hsu S-F, Nelson TA. Stable isotope probing with 15N achieved by disentangling the effects of genome G+ C content and isotope enrichment on DNA density. Applied and environmental microbiology 2007; 73: 3189–3195.](https://journals.asm.org/doi/10.1128/AEM.02609-06)

### Quantitative metagenomics
[Hardwick SA, Chen WY, Wong T, Kanakamedala BS, Deveson IW, Ongley SE, et al. Synthetic microbe communities provide internal reference standards for metagenome sequencing and analysis. Nature Communications 2018; 9: 3096.](https://www.nature.com/articles/s41467-018-05555-0)

### qSIP
[Hungate BA, Mau RL, Schwartz E, Caporaso JG, Dijkstra P, Gestel N van, et al. Quantitative Microbial Ecology through Stable Isotope Probing. Appl Environ Microbiol 2015; 81: 7570–7581.](https://journals.asm.org/doi/10.1128/AEM.02280-15)

[Sieradzki ET, Koch BJ, Greenlon A, Sachdeva R, Malmstrom RR, Mau RL, et al. Measurement error and resolution in quantitative stable isotope probing: implications for experimental design. Msystems 2020; 5: e00151-20.](https://journals.asm.org/doi/10.1128/mSystems.00151-20)

### HTSSIP R package

[Youngblut ND, Barnett SE, Buckley DH. HTSSIP: An R package for analysis of high throughput sequencing data from nucleic acid stable isotope probing (SIP) experiments. PLOS ONE 2018; 13: e0189616.](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0189616)

<!--chapter:end:01-Background.Rmd-->

# Using SIPmg R package {#quick-example}

This R package was designed to facilitate statistical analyses using SIP metagenomic data. Particularly, coverage and taxonomic classification of metagenome-assembled genomes and metadata of fractionated DNA is used to identify incorporators and quantify isotopic enrichment. 


Overall, the package can perform the following:

1. Coverage normalization to absolute concentrations or relative coverage. Scaling of coverages to absolute abundances based on post-fractionation reference standards, called sequins.

2. Statistical analysis based on either of 
  + AFE mathematical model by [Hungate et al. (2015)](https://journals.asm.org/doi/10.1128/AEM.02280-15) or
  + Delta buoyant density $AFE_{ΔBD}$ = $\frac{W_{Lab}- W_{Light}}{I_{max}}$.        
  Where Imax is the maximum linear shift in DNA BD (upon 100% labeling), as        discussed by [Birnie and Rickwood (1978)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2595786/)
  + Analyses as offered by [HTS-SIP](https://cran.r-project.org/web/packages/HTSSIP/index.html)
  
3. Data visualization for identification of isotope incorporators.

## Example workflow

qSIP analysis is typically performed using sequins, MAG coverage, and MAG taxonomy is performed using this R markdown. Briefly, coverages are normalized, MAG coverages are then scaled based on linear regression models from sequin coverage and concentrations, a phyloseq object is created from MAG absolute concentrations and taxonomy (GTDB taxonomy output format is required) data. \n
Note: Sequins that were spiked in the DNA-SIP fractions will be used in scaling and creating linear regression models for evaluating absolute MAG concentrations. Please see `sequin_scaling.R` \n
For normalizing coverage values please see `pooling_funs.R` \n

This R markdown primarily uses `Tidyverse`, `phyloseq`, and `HTSSIP` packages \n
This markdown uses functions from `functions_qSIP_MAGs.R` \n

For this data, MetaSIPSim (developed by [Barnett et al, 2020](https://bmcbioinformatics.biomedcentral.com/articles/10.1186/s12859-020-3372-6)) was used to simulate a data set from a hypothetical SIP workflow performed on the MBARC-26 mock community (as defined in [Singer et al, 2016](https://www.nature.com/articles/sdata201681)). The data set was further augmented with simulated post-fractionation sequin spike-ins and the data set was simulated to have all 26 organisms as incorporators. Some of the 26 organisms also had plasmids, which were assembled as distinct genomic units, adding the total genomic features in the simulated study to be 38

```{r Load libraries, warning = FALSE, message = FALSE}
#Load required libraries

list.of.packages <- c("tidyverse", "HTSSIP", "ggpubr","data.table")
new.packages <- list.of.packages[!(list.of.packages %in% installed.packages()[,"Package"])]
if(length(new.packages)) install.packages(new.packages, quiet = T, dependencies = T, repos = "http://cran.us.r-project.org")

if (!require("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("phyloseq")

library(tidyverse)
library(phyloseq)
library(HTSSIP)
library(ggpubr)
library(SIPmg)


set.seed(seed = 1000)
```

## Load required data 

The following files are required: \n

### MAG coverage data

Pooled coverages data as a comma separated file (.csv file) across samples for `Features` including sequins that were used as spike-ins. The followings columns are required: \n   

  1. Feature: A character string describing the `Feature` label
  
  2. Sample: The label for these n number of columns should be in the format of "'isotope'\_rep\_#\_fraction\_#". For instance, "12C\_rep\_1\_fraction\_1". The number of sample columns should be the same as the product of (replicates, fractions, isotopes)
    
### Sequin metadata

Load the sequins metadata as a comma separated file (.csv file) which has the following columns: \n

  1. Feature: As described above
    
  2. Length: Length of the sequin in bp
    
  3. GC_content: GC content of the sequin
  
  4. Sequence: Sequin nucleotide sequence
  
  5. Concentration: Concentration of the sequin in attamoles/uL
    
### Dilutions data

Load dilutions data as a comma separated file (.csv file) that contains the following columns \n

  1. Sample: Similar to the sample name as described above.
  
  2. Dilution: Dilution of sequins added to the fraction before sequencing.

### Fractions metadata

A fractions file as a comma separated file (.csv file) with the following columns: \n

  1. Replicate: Depends on how many replicates the study has \n
  
  2. Fractions: Typically in the range of 1-24 \n
  
  3. Buoyant_density: As calculated from the refractometer for each fraction and replicate \n
  
  4. Isotope - "12C", "13C", "14N", "15N" etc. \n
  
  5. DNA_concentration \n
  
  6. Sample - In the format "'isotope'\_rep\_#\_fraction\_#". For instance 12C\_rep\_1\_fraction\_1 \n

### GTDB style taxonomy data

A taxonomy file in the GTDB output format (.tsv format). Load the bacteria and archaea taxonomy outputs separately. The markdown requires loading the standard output files from GTDB-Tk separately for bacteria and archaea

### MAG absolute concentrations

MAG absolute concentrations obtained from scaling_sequins.R. \n
`mag_tab` object obtained from the above script is to be used as the input here

### GC content

GC content of the MAGs as a comma separated file (.csv file). The table should contain the following columns: \n

  1. OTU: MAG identifier such as the `Feature` label from the `sequin_scaling.R` script \n
  
  2. GC_content: GC content of the `Feature` in the range of 0-1 \n
  
### Log scale BOOLEAN 

True or False depending on how you would want the MAG coverages to be scaled. Select TRUE if you need MAG concentrations scaled on the log scale \n

### Coefficient of variation

Acceptable coefficient of variation for coverage and detection (eg. 20 for 20 % threshold of coefficient of variation) 

Coverages above the threshold value will be flagged in the plots. Here a value of 20 is used. \n


```{r Load data, warning = FALSE, message = FALSE}
## Load data
#Coverage metadata
#Uncomment if your coverage data is in the format mentioned above for this file. Remains commented if you are using the output from `checkm coverage`
f_tibble <- readr::read_csv("mock_input_data/coverage_metadata.csv")

#Sequins metadata
sequins <- readr::read_csv(file="mock_input_data/sequins_metadata.csv")

#Dilutions data
seq_dil = readr::read_csv(file = "mock_input_data/dilutions_data.csv")

#Log scale BOOLEAN. True or False depending on how you would want the MAG coverages to be scaled. Select TRUE if you need MAG concentrations scaled on the log scale
log_scale = TRUE

#coe_of_variation. Acceptable coefficient of variation for coverage and detection (eg. 20 - for 20 % threshold of coefficient of variation) (Coverages above the threshold value will be flagged in the plots)
coe_of_variation = 20 

#Taxonomy
gtdbtk_bac_summary = readr::read_delim("mock_input_data/gtdbtk.bac120.summary.tsv", 
                                 "\t", escape_double = FALSE, trim_ws = TRUE)
gtdbtk_archaea = readr::read_delim("mock_input_data/gtdbtk.ar122.summary.tsv", 
                             "\t", escape_double = FALSE, trim_ws = TRUE)
#GC content
GC_content <- readr::read_csv(file = "mock_input_data/GC_content.csv")

#Fractions
fractions = readr::read_csv("mock_input_data/fractions.csv")

```

## Coverage normalization and scaling

Coverage data of MAGs (or features of interest) and sequins is scaled using sequin concentration to obtain absolute concentration of MAGs (or features of interest). For this step, either linear regression or robust linear regression can be used. 

If linear regression is used, the function `scale_features_lm` is used. The functions in `sequin_scaling_lm.R` are used for this step.

If robust linear regression is used, the outliers impact on the regression models is minimized. For robust linear regression, `scale_features_rlm` is used. The functions in `sequin_scaling_rlm.R` are used for this step. In this example workflow, robust linear regression is used to scale coverage data in log10 scale. 

For more discussion on the choice of regression method and comparison of the two methods please see this [vignette](#scaling-coverage-data)

### This function uses the following datasets:

1. Coverage data (`f_tibble`)
2. Sequins metadata (`sequins`)
3. Dilution of sequins used to add into the fractions (`seq_dil`)

The regression scaling plots are saved in the project folder within a subfolder `sequin_scaling_plots`

```{r estimate absolute concentrations, warning = FALSE, message = FALSE}
taxonomy_tibble = dplyr::bind_rows(gtdbtk_bac_summary, gtdbtk_archaea) #Combine bacteria and archaea taxonomy files if it has not been done yet
#mag_tab is a tibble with absolute concentrations of MAGs obtained by scaling MAG coverages using linear regression models on sequin coverages and concentration

##Scale MAG coverages to obtain MAG absolute concentrations and save scaling plots in the working directory
#For rlm scaling using scale_features_rlm
#For rlm scaling using scale_features_lm
mag_tab_scaled <- SIPmg::scale_features_rlm(f_tibble, sequins, seq_dil, log_scale, coe_of_variation = coe_of_variation, save_plots = FALSE)
mag_tab = as.matrix(mag_tab_scaled$mag_tab) #Extract absolute abundances as a matrix
```

#### Example of a scaling plot

An example of a scaling plot is as below

```{r example scaling plot, echo = FALSE, warning = FALSE, message = FALSE}
mag_tab_scaled$plots$plots[[1]]
```

## Preparation of scaled data for qSIP analysis

The scaled abundance data, taxonomy data, and metadata is then converted to phyloseq objects and a master phyloseq object is created for qSIP analysis.

```{r Make phyloseq objects, warning = FALSE, message = FALSE}

mag.table = phyloseq::otu_table(mag_tab, taxa_are_rows = TRUE) #Phyloseq OTU table

taxonomy.object = SIPmg::tax.table(taxonomy_tibble) # Create a taxonomy phyloseq object
samples.object = SIPmg::sample.table(fractions) # Create a samples phyloseq object
phylo.qSIP = SIPmg::phylo.table(mag.table, taxonomy.object, samples.object) # Make a phyloseq table for downstream qSIP analysis
```

## Estimation of atom fraction excess

The following steps estimate the atom fraction excess (AFE) in the MAGs. This calculation is based on the mathematical model suggested by [Hungate et al. (2015)](https://journals.asm.org/doi/full/10.1128/AEM.02280-15). In the article, the authors suggest that GC contents of the individual biological features of interest should be accounted for to have a better estimate of AFE. Through the power of metagenomic analysis and the recovery of MAGs, one can estimate the GC content of features of interest and thereby obtain a better estimate of AFE. Thus, here the user would use the GC contents of MAGs in the AFE estimations and obtain isotopic enrichment of individual MAGs. 

An important consideration is that the AFE calculations are based on a mathematical model and are not an absolute estimate of isotopic enrichment. One of the goals of AFE estimation is to determine isotope incorporators. 

In this example workflow, the approach reported in [Hungate et al. (2015)](https://journals.asm.org/doi/full/10.1128/AEM.02280-15) is used to determine incorporators.

For this step, the following are used:

1. Master phyloseq object from the previous step
2. GC content

Thanks to [HTSSIP R package](https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0189616), bootstrapping can be performed faster with multithreading.

```{r Calculate atom fraction excess, warning = FALSE, message = FALSE}
atomX = SIPmg::qSIP_atom_excess_MAGs(phylo.qSIP,
                               control_expr='Isotope=="12C"',
                               treatment_rep='Replicate',
                               Gi = GC_content)
#Bootstrap confidence intervals
df_atomX_boot = SIPmg::qSIP_bootstrap_fcr(atomX, n_boot=10) #Change "parallel = FALSE" to compute using a single-core

```


## Plot the atom fraction excess plot

Plot the atom fraction excess data with 5 % and 95 % quantiles as the confidence limits. In the same way as reported in [Hungate et al. (2015)](https://journals.asm.org/doi/full/10.1128/AEM.02280-15), a MAG is considered an incorporator if the lower confidence interval of its AFE is above zero.

```{r Plot atom fraction excess, warning = FALSE, message = FALSE}
CI_threshold = 0
df_atomX_boot = df_atomX_boot %>%
  dplyr::mutate(Incorporator = A_CI_fcr_low > CI_threshold,
         OTU = reorder(OTU, -A))

(atom_f_excess_plot = ggplot2::ggplot(df_atomX_boot, aes(OTU, A, ymin=A_CI_low, ymax=A_CI_high, color=Incorporator)) +
  geom_pointrange(size=0.25) +
  geom_linerange() +
  geom_hline(yintercept=0, linetype='dashed', alpha=0.5) +
  labs(x='MAGs', y='Atom fraction excess') +
  theme_bw() +
  coord_flip() +
  ggtitle("Isotope incorporating MAGs"))
```

## Check incorporator list

```{r incorporators, warning = FALSE, message = FALSE}

#Get incorporator info
n_incorp = df_atomX_boot %>%
  dplyr::filter(Incorporator == TRUE) %>%
  nrow 
#Get incorporator list
incorporator_list = SIPmg::incorporators_taxonomy(taxonomy = taxonomy_tibble, bootstrapped_AFE_table = df_atomX_boot)
#Print incorporator information
cat('Number of incorporators:', n_incorp, '\n')
print(incorporator_list, n = nrow(incorporator_list))
```

It appears that not all incorporators were identified with the bootstrapping method. This could be due to the model unable to detect the incorporators. Another way to detect incorporators, as mentioned above, is to test with the ΔBD method or with the HR-SIP or MW-HR-SIP methods. For a more detailed discussion, please refer this [vignette](#identifying-incorporators)

```{r, warning = FALSE, message = FALSE}
sessionInfo()
```

<!--chapter:end:02-Quick-example.Rmd-->

# Scaling coverage data {#scaling-coverage-data}

In this example we will look at the two scaling approaches that can be utilized in the SIPmg package. The user can decide either 

1.	A robust linear regression model
2.	An ordinary least squares linear regression model
We will also briefly discuss data filtering methods when there are heavily influencing outliers, as a result of upstream methods.

## Basic approach for scaling

Sequencing effort on each fraction and subsequently obtained coverage is typically used to estimate relative abundance for a feature of interest. However, as we discussed in the introductory vignette, this data is compositional and a quantitative estimate of concentrations can be highly informative. For this purpose, synthetic spike-in standards (sequins) with known concentrations are added before recovery of fractionated DNA, from the CsCl medium into TE buffer or water. For details on sequins please refer to [Hardwick et al. (2018)](https://www.nature.com/articles/s41467-018-05555-0). These reference standards in each fraction, can now be used to estimate the concentration of the feature of interest in each fraction.

The approach used for scaling is detailed in the following steps:

1.	For each fraction, sequin coverages and known concentrations are obtained to make a standard curve.

2.	Limit of detection of sequins, i.e., the lowest concentration where at least one sequin has detectable coverage, is determined for each group of sequin concentrations.

3.	For each group of sequin concentrations, mean, standard deviation, and the coefficient of variation of coverage are determined. Groups of sequins with coefficient of variation greater than the set threshold will be flagged.

4.	Sequin groups with coverage values above the limit of detection and below the coefficient of variation threshold are filtered in preparation for regression.

5.	Linear regression or robust linear regression, based on the user choice, is performed on sequin coverage values as the independent variable and concentration of the sequin group as the dependent variable. The user can also decide on performing log scaling of coverage and concentration values for regression.

6.	The regression parameters are extracted and plots with regression parameters and best fit line are saved for inspection.

7.	Coverage values above the limit of detection and below the coefficient of variation threshold are filtered for further analysis. The rest of the values are flagged and reported.

8.	The filtered values are scaled based on regression parameters to estimate absolute concentrations in each library.

9.	Absolute concentrations of biological features are saved as a fresh dataset for the subsequent isotope incorporator identification and AFE estimation pipeline.

## Choice of regression model

Ordinary least squares regression can be very sensitive to outliers, resulting a poor model performance when the data is contaminated with outliers. Although there are methods, like Cook's distance based filtering to address outliers, it may not always be the best idea to remove these data points.

See [these](https://statisticsbyjim.com/basics/remove-outliers/) [discussions](https://stats.stackexchange.com/questions/175/how-should-outliers-be-dealt-with-in-linear-regression-analysis) and more for better insights into handling outliers. 

Robust linear regression is an alternative to this situation. Robust linear regression assigns appropriate weights to the data points, minimizing the influence of outliers on the model performance, without deleting data. In this pipeline, a Huber loss function from the [`MASS` R package](https://cran.r-project.org/web/packages/MASS/MASS.pdf) was used for robust linear regression.

The user is free to choose between the two methods for regression. 

## Scaling the data

For illustrating the differences in the regression models, a different dataset, that was impacted by one or many upstream methods, will be used than the one in the introductory vignette. 

This dataset can be found here

```{r Load data 2, echo = FALSE, warning = FALSE, message = FALSE}
##Load data
#Coverage data
f_tibble <- readr::read_csv("mock_input_data/coverages_outliers.csv")

#Sequins metadata
sequins <- readr::read_csv(file="mock_input_data/sequin_meta_outliers.csv")

#Dilutions data
seq_dil = readr::read_csv(file = "mock_input_data/seq_dilution_outliers.csv")

#Log scale BOOLEAN. True or False depending on how you would want the MAG coverages to be scaled. Select TRUE if you need MAG concentrations scaled on the log scale
log_scale = TRUE

#coe_of_variation. Acceptable coefficient of variation for coverage and detection (eg. 20 - for 20 % threshold of coefficient of variation) (Coverages above the threshold value will be flagged in the plots)
coe_of_variation = 50

```

For both scaling functions the following data and parameters are required

1.	Coverage data. The output of [checkm coverage](https://github.com/Ecogenomics/CheckM/wiki/Utility-Commands#coverage) command can be directly used.

2.	Sequin dilutions and metadata.

3.	Whether or not log10 scaling to be performed (a BOOLEAN value of TRUE or FALSE).

4.	Coefficient of variation. In this dataset, a higher coefficient of variation compared to the value used in the example workflow of the [introduction vignette](#quick-example) is used to allow data scaling to occur.

If the robust linear regression is chosen, the function is `scale_features_rlm` which is performed the function file `sequin_scaling_rlm.R`.

```{r robust linear regression, warning = FALSE, message = FALSE}
mag_tab_scaled_rlm <- SIPmg::scale_features_rlm(f_tibble, sequins, seq_dil, log_scale, coe_of_variation = coe_of_variation, save_plots = FALSE)
```


If a linear regression is chosen, the function is `scale_features_lm` which is performed by the function file `sequin_scaling_lm.R`. 

If the user chooses linear regression, they can choose to perform outlier filtering based on the traditional Cook’s distance threshold of 4/n (n being the sample size). Only the outliers in the sequin coverage data are flagged as outliers with the Cook’s distance threshold and are filtered out. Later, the remaining data is subjected to linear regression to obtain model parameters and to scale coverage values of feature of interest.


```{r linear regression, warning = FALSE, message = FALSE}
mag_tab_scaled_lm <- SIPmg::scale_features_lm(f_tibble, sequins, seq_dil, log_scale, coe_of_variation = coe_of_variation, cook_filtering = TRUE, save_plots = FALSE)
```

## Comparison of the fits

```{r load images, echo = F, warning = FALSE, message = FALSE}
rlm_example = knitr::include_graphics("rlm-example.png")

lm_example = knitr::include_graphics("lm-example.png")
```

Robust linear regression provides a better model fit compared to the linear regression. Additionally, robust linear regression has less variability compared to the linear regression model.

As discussed previously, removing the outliers may not always be the best idea. However, the outliers can be visualized to assess how "far out" are the outliers and how influential the data points are, negatively influencing the linear regression method.

The `mag_tab_scaled_lm` function provides the plots for the visualization of Cook's distance for the sequin data in each fraction. In the plot below, an outlier is visualized. For this data, the Cook's distance threshold was 0.09, and the outlier had a Cook's distance of 10.8. The highly influential data point could well be the reason the linear regression model had a poor fit.

```{r, outlier sequin coverages, warning = FALSE, message = FALSE}

#Cook's distance threshold of the data set
4/(length(mag_tab_scaled_lm$scale_fac$cooksd[[3]]))

#Cook's distance of the outlier before filtering the data
max(mag_tab_scaled_lm$scale_fac$cooksd[[3]])

#Before filtration
cooksd_example = knitr::include_graphics("cooksd-example.png")
```

Upon filtering the data to remove the outlier, the fit becomes much better for the linear regression model

```{r better fit, warning = FALSE, message = FALSE}
#Linear regression plot with filtered outliers in sequin data
filtered_lm_example = knitr::include_graphics("filtered_lm-example.png")
```

## What if there were no sequins added in the study
In our study, we realized that the sequin-scaled data provided the best correlation (Spearman correlation coefficient = 0.85), compared to methods without the use of sequins. However, not all studies would have access to sequins or could choose not to add sequins. We tested qSIP methods with absolute concentration of MAGs obtained as a product of fraction DNA concentration and either with MAG relative abundance (as performed by [Greenlon et al.](https://journals.asm.org/doi/10.1128/mSystems.00151-20)) or relative coverage (as performed by [Starr et al.](https://journals.asm.org/doi/10.1128/mSphere.00085-21)), or simply with relative coverage, i.e., without converting the MAG coverages to absolute concentration. We found that without sequins, simply using relative coverage provided better specificity (0.99) and good correlation (Spearman correlation coefficient = 0.76), compared to the other two methods. The user could choose any of the three approaches to obtain normalized coverages using the equation `coverage_normalization()` where the parameter approach would be chosen accordingly (either "greenlon", "starr", or "relative_coverage"). In this example, simple relative coverage will be estimated without converting to absolute concentration based on fraction DNA concentrations. The relative coverage table can then be converted to a phyloseq object to run incorporator identification and/or AFE estimation. For more details, please refer to the [incorporator identification section][SIP metagenomics and identification of isotope incorporators]

```{r, estimating relative coverage, warning = FALSE, message = FALSE}
f_tibble <- readr::read_csv("mock_input_data/coverage_metadata.csv")
rel.cov = SIPmg::coverage_normalization(f_tibble = f_tibble, approach = "relative_coverage")
mag.table = phyloseq::otu_table(as.matrix(rel.cov %>% tibble::column_to_rownames(var = "Feature")), taxa_are_rows = TRUE) #Phyloseq OTU table

```



## Decision on the method to scale coverages

It can be seen from the above plots that the regression parameters differ based on the method of regression and whether or not data filtering is used. These regression parameters directly influence the estimation of concentrations of features of interest in the microbiome. Thus, the method for regression must be a careful choice. 

```{r, warning = FALSE, message = FALSE}
sessionInfo()
```

<!--chapter:end:03-Scaling-coverage-data.Rmd-->

# SIP metagenomics and identification of isotope incorporators {#identifying-incorporators}

The isotope incorporators could be identified using the below methods:

+ AFE mathematical model by [Hungate et al. (2015)](https://journals.asm.org/doi/10.1128/AEM.02280-15) or
  + Delta buoyant density $AFE_{ΔBD}$ = $\frac{W_{Lab}- W_{Light}}{I_{max}}$.        
  Where Imax is the maximum linear shift in DNA BD (upon 100% labeling), as        discussed by [Birnie and Rickwood (1978)](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC2595786/)
  + MW-HR-SIP [Youngblut et al. ](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00570/full)

While the first two methods gives a quantitative estimate of isotopic enrichment in a taxon to determine if it is an incorporator, the third method relies on differential abundance analysis over multiple overlapping windows. Please refer to Youngblut et al. and [HTS-SIP R package](https://cran.r-project.org/web/packages/HTSSIP/index.html) for more details.


## Comparison of the methods

### AFE method

Isotope incorporators from the introductory vignette were identified using the AFE method
Reproducing the code here, we have

```{r Load data 3, echo = F, warning = FALSE, message = FALSE}
## Load data
#Coverage metadata
#Uncomment if your coverage data is in the format mentioned above for this file. Remains commented if you are using the output from `checkm coverage`
f_tibble <- readr::read_csv("mock_input_data/coverage_metadata.csv")

#Sequins metadata
sequins <- readr::read_csv(file="mock_input_data/sequins_metadata.csv")

#Dilutions data
seq_dil = readr::read_csv(file = "mock_input_data/dilutions_data.csv")

#Log scale BOOLEAN. True or False depending on how you would want the MAG coverages to be scaled. Select TRUE if you need MAG concentrations scaled on the log scale
log_scale = TRUE

#coe_of_variation. Acceptable coefficient of variation for coverage and detection (eg. 20 - for 20 % threshold of coefficient of variation) (Coverages above the threshold value will be flagged in the plots)
coe_of_variation = 20 

#Taxonomy
gtdbtk_bac_summary = readr::read_delim("mock_input_data/gtdbtk.bac120.summary.tsv", 
                                 "\t", escape_double = FALSE, trim_ws = TRUE)
gtdbtk_archaea = readr::read_delim("mock_input_data/gtdbtk.ar122.summary.tsv", 
                             "\t", escape_double = FALSE, trim_ws = TRUE)
#GC content
GC_content <- readr::read_csv(file = "mock_input_data/GC_content.csv")

#Fractions
fractions = readr::read_csv("mock_input_data/fractions.csv")

```

```{r estimate absolute concentrations 2, echo = F, warning = FALSE, message = FALSE}
taxonomy_tibble = dplyr::bind_rows(gtdbtk_bac_summary, gtdbtk_archaea) #Combine bacteria and archaea taxonomy files if it has not been done yet
#mag_tab is a tibble with absolute concentrations of MAGs obtained by scaling MAG coverages using linear regression models on sequin coverages and concentration

##Scale MAG coverages to obtain MAG absolute concentrations and save scaling plots in the working directory
#For rlm scaling using scale_features_rlm
#For rlm scaling using scale_features_lm
mag_tab_scaled_lm <- SIPmg::scale_features_lm(f_tibble, sequins, seq_dil, log_scale, coe_of_variation = coe_of_variation, cook_filtering = TRUE, save_plots = FALSE)
mag_tab = as.matrix(mag_tab_scaled_lm$mag_tab) #Extract absolute abundances as a matrix
```

```{r Make phyloseq objects 2, echo = F, warning = FALSE, message = FALSE}

mag.table = phyloseq::otu_table(mag_tab, taxa_are_rows = TRUE) #Phyloseq OTU table

taxonomy.object = SIPmg::tax.table(taxonomy_tibble) # Create a taxonomy phyloseq object
samples.object = SIPmg::sample.table(fractions) # Create a samples phyloseq object
phylo.qSIP = SIPmg::phylo.table(mag.table, taxonomy.object, samples.object) # Make a phyloseq table for downstream qSIP analysis
```

```{r AFE methodGet bootstrapped AFE table 2, warning = FALSE, message = FALSE}
atomX = SIPmg::qSIP_atom_excess_MAGs(phylo.qSIP,
                               control_expr='Isotope=="12C"',
                               treatment_rep='Replicate',
                               Gi = GC_content)
#Bootstrap confidence intervals
df_atomX_boot = SIPmg::qSIP_bootstrap_fcr(atomX, n_boot=10)
CI_threshold = 0
df_atomX_boot = df_atomX_boot %>%
  dplyr::mutate(Incorporator_qSIP = A_CI_fcr_low > CI_threshold,
                Incorporator_delbd = A_delbd - A_delbd_sd > 0,
         OTU = stats::reorder(OTU, -A))

df_atomX_boot = df_atomX_boot %>%
  dplyr::inner_join(taxonomy_tibble %>% 
               dplyr::select(user_genome, classification) %>%
               dplyr::rename(OTU = user_genome)) 
```

### MW-HR-SIP

Check incorporators in the overlapping windows of 1.71 - 1.74; 1.72 - 1.75; 1.73 - 1.76. The windows must be chosen in a more judicious manner that fits the hypothesis of the study.

```{r, MW-HR-SIP, warning = FALSE, message = FALSE}
windows = data.frame(density_min=c(1.71,1.72, 1.73), 
                     density_max=c(1.74,1.75,1.76))

padj_cutoff = 0.05
#ncores = 6
#doParallel::registerDoParallel(ncores)

mw.hr.sip = SIPmg::HRSIP(physeq = phylo.qSIP, design = ~Isotope,
                                    density_windows = windows,
                                    sparsity_threshold = seq(0, 0.3, 0.05), 
                                    padj_cutoff = padj_cutoff)

mw.hr.sip = mw.hr.sip %>%
  dplyr::mutate(incorp = padj < padj_cutoff)


```

Compare number of incorporators

```{r list incorporators 2, warning = FALSE, message = FALSE}
#Get incorporator info
qSIP_incorp = df_atomX_boot %>%
  dplyr::select(OTU, classification, A, A_sd, Incorporator_qSIP) %>%
  dplyr::filter(Incorporator_qSIP == TRUE) %>%
  dplyr::select(-classification)
n_qSIP_incorp = nrow(qSIP_incorp)

delbd_incorp = df_atomX_boot %>%
  dplyr::select(OTU, classification, A_delbd, A_delbd_sd, Incorporator_delbd) %>%
  dplyr::filter(Incorporator_delbd == TRUE) %>%
  dplyr::select(-classification)
n_delbd_incorp = nrow(delbd_incorp)

mw.hr.sip_incorp = mw.hr.sip %>%
  dplyr::select(OTU, taxa, incorp) %>%
  dplyr::filter(incorp == TRUE) %>%
  dplyr::rename("Incorporator_mw_hr.sip" = "incorp") %>%
  dplyr::select(-taxa)
n_mw.hr.sip_incorp = nrow(mw.hr.sip_incorp)

all_incorp_tibble = dplyr::full_join(qSIP_incorp, dplyr::full_join(delbd_incorp, mw.hr.sip_incorp, by = "OTU"), by = "OTU")

#Print incorporator information
cat('Number of incorporators detected by qSIP:', n_qSIP_incorp, '\n')
cat('Number of incorporators detected by ΔBD:', n_delbd_incorp, '\n')
cat('Number of incorporators detected by MW-HR-SIP:', n_mw.hr.sip_incorp, '\n')

rmarkdown::paged_table(all_incorp_tibble)

```

It is evident here that the AFE estimates using qSIP model or the ΔBD method are not the same, however, both provide the same incorporator information. Although the incorporators may not be the same everytime, we found an overlap between qSIP and ΔBD methods in our metagenome datasets. We also found that qSIP model provided more accurate estimates of AFE compared to ΔBD method. 
\n
It also appears that MW-HR-SIP outputs lower number of incorporators. In our study and by [Youngblut et al.](https://www.frontiersin.org/articles/10.3389/fmicb.2018.00570/full), MW-HR-SIP was found to have better sensitivity compared to qSIP or ΔBD methods. It is interesting that the AFE based methods and differential abundance based methods do not provide the same incorporator identity, which is perhaps due to comparistive analysis in only the multiple overlapping windows in the MW-HR-SIP method, while the AFE methods examine the complete BD gradient. Due to the higher sensitivity of the MW-HR-SIP method, we attempted a sequential analysis with first MW-HR-SIP to eliminate likely false positives, and then perform qSIP to estimate AFE. Such a method reduces multiple comparisons and increases confidence in detecting true incorporators due to higher statistical power.
\n
We are merely providing the different methods for the user to decide which method to apply for detecting incorporators in their study. The choice of the method is to the user's discretion that fits best their study.
## Session information

```{r, warning = FALSE, message = FALSE}
sessionInfo()
```

<!--chapter:end:04-Identifying-isotope-incorporators.Rmd-->