---
title: "Supporting Information 3 (SI 3): PCRedux package - an introduction"
author: "The PCRedux package authors"
date: "`r Sys.Date()`"
output:
rmarkdown::pdf_document:
number_sections: true
toc: true
toc_depth: 5
header-includes:
- \usepackage[font={small}]{caption}
bibliography: "literature.bib"
---
```{r, include=FALSE, echo=FALSE, eval=FALSE}
options(tinytex.verbose = TRUE)
```
```{r, echo=FALSE}
knitr::opts_chunk$set(fig.lp = "", out.extra='', warning = FALSE, message = FALSE)
amplification_curve_ROI <- "Phases of amplification curves as Region of Interest (ROI). A) For amplification curves, the fluorescence signal (RFU, relative fluorescence units) of the reporter dye is plotted against the cycle number. Positive amplification curves possess three ROIs: ground phase, exponential phase and plateau phase. These ROIs can be used to determine predictors such as the takedown point (`tdp`) or the standard deviation within the ground phase (`sd\\_bg`). The exponential range (red dots) is used to determine the Cq values and amplification efficiency (not shown). A linear regression model (red) can be used to calculate the slope in this region. B) PCRs without an amplification reaction usually show a flat (non-sigmoidal) signal. C) The exponential phase of PCR reactions can vary greatly depending on the DNA starting quantity and other factors. Amplification curves that appear in later cycles often have a lower slope in the exponential phase."
amplification_curve_ROI_short <- "Phases of amplification curves as Region of Interest (ROI)"
figure_quantification_points <- "Frequently used methods for the analysis of quantification points. A) The amplification curve is intersected by a gray horizontal line. This is the background signal (3$\\sigma$) determined from the \\textit{68-95-99.7 rule} from the fluorescence emission of cycles 1 to 10. The black horizontal line is the user-defined threshold (Ct value) in the exponential phase. Based on this, the cycle at which the amplification curve differs significantly from the background is calculated. B) The amplification curve can also be analyzed by fitting a multi-parametric model (black line, five parameters). The red line is the first derivative of the amplification curve with a maximum at 17.59 cycles. The first derivative maximum (`cpD1`) is used as a quantification point (Cq value) in some qPCR systems. The green line shows the second derivative of the amplification curve, with a maximum at 15.68 cycles and a minimum at 19.5 cycles. The maximum of the second derivative (`cpD2`) is used as the Cq value in many systems. The blue line shows the amplification efficiency estimated from the trajectory of the exponential region. The `Eff` value of 1.795 means that the amplification efficiency is approximately 89\\%. `cpDdiff` is the difference between the first and second derivative maximum ($cpDdiff = cpD1 - cpD2$)."
figure_quantification_points_short <- "Frequently used methods for the analysis of quantification points"
figure_curve_classification <- "Variations in the classification of amplification curves. A prerequisite for the development of machine-learning models is the availability of manually classified amplification curves. Amplification curves (n = 8858) from the `htPCR` data set have been classified by one user eight times at different points over time (classes: ambiguous (a), positive (y) or negative (n)). During this process, the amplification curves were presented in random order. The example shows that different (subjective) class mappings may occur for the same data set. While only a few amplification curves were classified as negative in the first three classification cycles (A-C), their proportion increased almost tenfold in later classification cycles (D-H)."
figure_curve_classification_short <- "Variations of the classification of amplification curves"
htPCR_nap <- "Examples of negative, ambiguous and positive amplification curves. A) A negative (black), ambiguous (red) and positive (green) amplification curve were selected from the `htPCR` data set. The negative amplification curve is non-sigmoid and has a positive trend. The ambiguous amplification curve is similar to a sigmoidal amplification curve, but shows a positive slope in the ground phase (cycle 1 $\\rightarrow$ 5). The positive amplification curve (green) is sigmoid. It starts with a flat baseline (cycle 1 $\\rightarrow$ 5). This is followed by the exponential phase (cycle 5 $\\rightarrow$ 25) and ends in a flat plateau phase (cycle 26 $\\rightarrow$ 35). B) Amplification curves of the `vermeulen1` data set were divided into groups with \\textit{negative}, \\textit{ambiguous} and \\textit{positive} classification. Negative amplification curves have a low signal level. Interestingly, there is a spontaneous increase (probably due to a sensor calibration) in cycles 1 to 2, followed by a linear signal decrease. In principle, the ambiguous amplification curves have a sigmoid curve shape. However, the plateau phase is fairly broad. When different qPCR users were asked what they find ambiguous, they responded that there is an additional change in the slope between cycles 15 and 25. This made some believe that the reaction is not valid. One of the ambiguous amplification curves begins to rise sharply at cycle 45. The positive amplification curves have a characteristic sigmoid curve shape."
htPCR_nap_short <- "Examples of negative, ambiguous and positive amplification curves"
htPCR_nap_frequency <- "Frequency of amplification curve classes and conformity in the `htPCR` data set. The `htPCR` data set was classified by hand eight times. Due to the unusual amplification curve shape and input errors during classification, many amplification curves were classified differently. A) Frequency of negative (black), ambiguous (red) and positive (green) amplification curves in the `htPCR` data set. The combined number of ambiguous and negative amplification curves appears to be higher than the number of positive amplification curves. B) The number of observations where all classification cycles made the same decision (conformity == TRUE) accounts for only 5\\% of the total number of observations. TRUE, all eight classifications of the amplification curve matched. FALSE, at least one of the eight classifications assigned a different class."
htPCR_nap_short_frequency <- "Frequency of amplification curve classes and conformity in the `htPCR` data set."
qPCR2fdata <- "Shape-based clustering of amplification curves. A) The clustering of the amplification curves of the `testdat` data set was based on the Hausdorff distance. B) The amplification curves were converted with the qPCR2fdata() function, and the Hausdorff distance of the curves was determined by cluster analysis. There were no errors in distinguishing between negative (n) and positive (y) amplification curves."
qPCR2fdata_short <- "Shape-based grouping of amplification curves"
HCU32 <- "Clustering and variation analysis of amplification curves. The amplification curves of the 32HCU were converted with the qPCR2fdata() function and then subjected to cluster analysis (Hausdorff distance). A) Amplification curves were plotted from the raw data. B) Overall, the signal-to-noise ratios of the amplification curves were similar between all cavities. C) The Cq values and amplification efficiency were calculated using the efficiency(pcrfit()) [\\texttt{qpcR}] function. The median Cq is shown as a vertical line. Cqs deviating by more than 0.1 from the median Cq ($\\tilde{x}$) are marked with observation labels. D) The cluster analysis showed no specific pattern with respect to the amplification curve signals. It appears that the observations D1, E1, F1, F3, G3 and H1 differ most from the other amplification curves."
HCU32_short <- "Clustering and variation analysis of amplification curves"
winklR_principle <- "Concept of the winklR() function. Analysis of the amplification curves of the `RAS002` data set with the winklR() function. Two amplification curves (A: positive, B: negative) were used. The red point shows the origin (first negative derivative maximum) while the green and blue points show the minimum and maximum of the second negative derivative. The angle is calculated from these points. Positive curves have smaller angles than negative curves."
winklR_principle_short <- "Concept of the winklR() function"
winklR <- "Analysis of the amplification curves of the `RAS002` data set with the winklR() function. A) All amplification curves of the `RAS002` data set were analyzed with the winklR() function. Negative amplification curves are shown in red and positive amplification curves in black. B) The stripchart of the analysis of positive and negative amplification curves shows a separation. C) The cdplot calculates the conditional densities of x based on the values of y, weighted by the marginal distribution of y. The densities are derived cumulatively over the values of y. The probability that the decision is negative (n) when the angle equals 30 is approximately 100\\%."
winklR_short <- "Variation analysis of amplification curves with the winklR() function"
curve_fit_fail <- "Incorrect model adjustment for amplification curves. A positive (black) and a negative (red) amplification curve were randomly selected from the `RAS002` data set. The positive amplification curve has a baseline signal of about 2500 RFU (relative fluorescence units) and a definite sigmoidal shape. The negative amplification curve has a baseline signal of approx. 4200 RFU, but only a moderately positive slope (no sigmoidal shape). A logistic function with seven parameters (`l7`) has been fitted to both amplification curves. A Cq value of 25.95 was determined for the positive amplification curve. The negative amplification curve had a Cq value of 9.41. However, it can be seen that the latter model fitting is not appropriate for calculating a trustworthy Cq value. An automatic calculation without user control would give a false-positive result. Note: This plot was shown in linear scale to demonstrate typical pitfalls."
curve_fit_fail_short <- "Incorrect model adjustment for amplification curves"
plot_models <- "Frequencies of the fitted multiparametric models and Cq values. The amplification curves (n = 3302) of the `data\\_sample` data set were analyzed with the encu() function. The amplification curves have been stratified according to their classes (negative: grey, positive: green). A) The optimal multiparametric model was selected for each amplification curve based on the Akaike information criterion. lNA stands for `no model` and l4 \\ldots l7 for a model with four to seven parameters. B) All Cq values were calculated from optimal multiparametric models. Cqs of positive amplification curves accumulate in the range between 15 and 30 PCR cycles (50\\%). For the negative amplification curves, the Cqs are distributed over the entire span of the cycles. Note: The Cqs of the negative amplification curves are false-positive!"
plot_models_short <- "Frequencies of the fitted multiparametric models and Cq values"
figure_cpD2_range <- "Location of the predictors `cpD2\\_range`, `bg.start`, `bg.stop` within an amplification curve. The minimum (cpD2m) and maximum (cpD2) of the second derivative were calculated numerically using the diffQ2() function. This function also returns the maximum of the first derivative (cpD1). The `cpD2\\_range` is defined as $cpD2\\_range = |cpD2 - cpD2m|$. Large `cpD2\\_range` values indicate a low amplification efficiency or a negative amplification reaction. The predictor `bg.start` is an estimate for the end of the ground phase. `bg.stop` is an approximation for the onset of the plateau phase."
figure_cpD2_range_short <- "Location of the predictors `cpD2\\_range`, `bg.start`, `bg.stop` within an amplification curve"
plot_dat_EffTop <- "Values of predictors calculated from negative and positive amplification curves. Amplification curve predictors from the `data\\_sample\\_subset` data set were used as they contain positive and negative amplification curves, and amplification curves that exhibit a \\textit{hook effect} or non-sigmoid shapes. A) `eff`, optimized PCR efficiency found within a sliding window. B) `sliwin`, PCR efficiency by the `window-of-linearity` method. C) `cpDdiff`, difference between the Cq values calculated from the first and the second derivative maximum. D) `loglin\\_slope`, slope from the cycle at the second derivative maximum to the second derivative minimum. E) `cpD2\\_range`, absolute value of the difference between the minimum and the maximum of the second derivative. F) `top`, takeoff point. G) `f.top`, fluorescence intensity at takeoff point. H) `tdp`, takedown point. I) `f.tdp`, fluorescence intensity at takedown point. J) `bg.stop`, estimated end of the ground phase. K) `amp.stop`, estimated end of the exponential phase. L) `convInfo\\_iteratons`, number of iterations until convergence."
plot_dat_EffTop_short <- "Analysis of location predictors"
loglin_slope <- "Concept of the `loglin\\_slope` predictor. The algorithm determines the fluorescence values of the raw data at the approximate positions of the maximum of the first derivative, the minimum of the second derivative and the maximum of the second derivative, which are in the exponential phase of the amplification curve. The data were taken from the `RAS002` data set. A linear model is created from these parameter sets and the slope is determined. A) Positive amplification curves have a clearly positive slope. B) Negative amplification curves usually have a low, sometimes negative slope."
loglin_slope_short <- "Concept of the `loglin\\_slope` predictor"
plot_sd_bg <- "Standard deviation in the ground phase of various qPCR devices. The `sd\\_bg` predictor was used to determine if the standard deviation between thermo-cyclers and between positive and negative amplification curves was different. The standard deviation was determined from the fluorescence values from the first cycle to the takeoff point. If the takeoff point could not be determined, the standard deviation from the first cycle to the eighth cycle was calculated. The Mann-Whitney test was used to compare the medians of the two populations (y, positive; n, negative). The differences were significant for A) LC\\_480 (Roche), B) CFX96 (Bio-Rad) and C) LC96 (Roche)."
plot_sd_bg_short <- "Standard deviation in the ground phase of various qPCR devices"
plot_bg_pt <- "Values of predictors calculated from negative and positive amplification curves. Amplification curve predictors from the `data\\_sample\\_subset` data set were used as they contain positive and negative amplification curves, as well as amplification curves that exhibit a \\textit{hook effect} or non-sigmoid shapes. A) `eff`, optimized PCR efficiency in a sliding window. B) `sliwin`, PCR efficiency according to the window-of-linearity method. C) `cpDdiff`, difference between the Cq values calculated from the first and the second derivative maximum. D) `loglin\\_slope`, slope from cycle at second derivative maximum to second derivative minimum. E) `cpD2\\_range`, absolute difference between the minimum and maximum of the second derivative. F) `top`, takeoff point. G) `f.top`, fluorescence intensity at takeoff point. H) `tdp`, takedown point. I) `f.tdp`, fluorescence intensity at the takedown point. J) `bg.stop`, estimated end of the ground phase. K) `amp.stop`, estimated end of the exponential phase. L) `convInfo\\_iteratons`, number of iterations until convergence when fitting a multiparametric model. The classes were compared using the Wilcoxon Rank Sum Test."
plot_bg_pt_short <- "Values of predictors calculated from negative and positive amplification curves"
plot_model_param <- "Values of predictors calculated from negative and positive amplification curves. Amplification curve predictors from the `data\\_sample\\_subset` data set were used as they contain positive and negative amplification curves, as well as amplification curves that exhibit a \\textit{hook effect} or non-sigmoid shapes. A) `c\\_model\\_param`, is the c model parameter of the seven parameter model. B) `d\\_model\\_param`, is the d model parameter of the seven parameter model. C) `e\\_model\\_param`, is the e model parameter of the seven parameter model. D) `f\\_model\\_param`, is the f model parameter of the seven parameter model. The classes were compared using the Wilcoxon Rank Sum Test."
plot_model_param_short <- "Values of predictors calculated from negative and positive amplification curves"
plot_Logistic_Regression <- "Machine classification by means of binomial logistic regression using the `loglin\\_slope` predictor. A) For the calculation of a binomial logistic regression model, the categorical response variable $Y$ (decision with classes: negative and positive) must be converted to a numerical value. With binomial logistic regression, the probability of a categorical response can be estimated using the $X$ predictor variable. In this example, the predictor variable `loglin\\_slope` is used. Grey measurement points (70\\% of the data set) were used for training. Red dots represent the values used for testing. The regression curve of the binomial logistic regression is shown in blue. The grey horizontal line at 0.5 marks the threshold of probability above which it is determined whether an amplification curve is negative or positive. B) The performance indicators were calculated using the performeR() function. Sensitivity, TPR; Specificity, SPC; Precision, PPV; Negative prediction value, NPV; Fall-out, FPR; False negative rate, FNR; False detection rate, FDR; Accuracy, ACC; F1 score, F1; Matthews correlation coefficient, MCC; Cohen's kappa (binary classification), kappa ($\\kappa$)."
plot_Logistic_Regression_short <- "Machine classification by means of binomial logistic regression using the `loglin\\_slope` predictor"
statistical_methods_amptester <- "Analysis of amplification curves with the ``amptester()`` function. A \\& B) The threshold test (THt) is based on the Wilcoxon rank-sum test and compares 20\\% of the fluorescence values of the ground phase with 15\\% of the plateau phase. In the example, a significant difference ($p = 0.000512$) was found for the positive amplification curve. However, this did not apply to the negative amplification curve ($p = 0.621$). C \\& D) A Q-Q diagram is used to graphically compare two probability distributions. In this study the probability distribution of the amplification curve was compared with a theoretical normal distribution. The orange line is the theoretical normal quantile-quantile line that passes through the probabilities of the first and third quartiles. The Shapiro-Wilk test (SHt) of normality tests the null hypothesis that the underlying measurement data of the amplification curve are normally distributed. Since the p-value of $7.09 e^{-9}$ of the positive amplification curve is below the significance level $\\alpha = 5e^{-4}$, the null hypothesis is rejected. However, this does not apply to the negative amplification curve ($p = 0.895$). E \\& F) The linear regression test (LRt) calculates the coefficient of determination ($R^{2}$) using an ordinary least square regression where all measured values are integrated into the model in a cycle-dependent manner. Experience shows that the non-linear part of an amplification curve has an $R^{2}$ smaller than 0.8, which is also shown in the example."
statistical_methods_amptester_short <- "Analysis of amplification curves with the amptester() function"
figure_autocorrelation_tau <- "Effect of tau"
figure_autocorrelation_tau_short <- "Effect of tau"
autocorrelation <- "Autocorrelation analysis of the amplification curves of the `RAS002` data set. A) Display of all amplification curves of the data set `RAS002`. Negative amplification curves are shown in red and positive amplification curves in black. The autocorrelation\\_test() function was used to analyze all amplification curves. B) The density diagram of the autocorrelation of positive and negative amplification curves shows a bimodal distribution. C) The cdplot calculates the conditional densities of x based on the values of y, weighted by the marginal distribution of y. The densities are derived cumulatively over the values of y. The probability that the decision is negative (n) when the autocorrelation equals 0.85 is approximately 100\\%. D) Performance analysis using the performeR() function (see \\autoref{section_helper_functions} for details)."
autocorrelation_short <- "Autocorrelation analysis for amplification curves of the `RAS002` data set"
earlyreg_slopes <- "Analysis of the ground phase with the earlyreg() function and the `C127EGHP` data set (n = 64 amplification curves). This data set consists of 32 samples, which were simultaneously monitored with the intercalator EvaGreen or hydrolysis probes. A) All amplification curves possess slightly different slopes and intercepts in the first cycles of the ground phase (ROI: Cycles 1 to 8). Both the slope and the intercept of each amplification curve were used for cluster analysis (k-means, Hartigan-Wong algorithm, number of centers \\textit{k = 2}). B) The amplification curves were assigned to two clusters, depending on their slope and their intercept (red, black). C) Finally, the clusters were associated with the detection chemistries (EvaGreen (EG) or hydrolysis probes (HP))."
earlyreg_slopes_short <- "Analysis of the ground phase with the earlyreg() function"
figure_head2tailratio <- "Ratio between the head and the tail of a quantitative PCR amplification curve. A) Plot of quantile normalized amplification curves from the `RAS002` data set. Data points used in the head and tail are highlighted by circles. The intervals for the Robust Linear Regression are automatically selected using the 25\\% and 75\\% quantiles. Therefore, not all data points are used in the regression model. The straight line is the regression line from the robust linear model. The slopes of the positive and negative amplification curves differ. B) Boxplot for the comparison of the $head/tail$ ratio. Positive amplification curves have a lower ratio than negative curves. The difference between the classes is significant."
figure_head2tailratio_short <- "Ratio between the head and the tail of a quantitative PCR amplification curve"
plot_mblrr <- "Robust local regression to analyze amplification curves. The amplification curves were arbitrarily selected from the `RAS002` data set. In the qPCR setup, the target genes beta globin (B. globin) and HPRT1 were simultaneously measured in a PCR cavity using two specific hydrolysis probes (duplex qPCR). Both positive (A, C, E) and negative (B, D, F) amplification curves were used. The amplification curves are normalized to the 99\\% quantile. The differences in slopes and intercepts are indicated by the blue and orange lines and dots. The mblrr() function is presumably useful for data sets which are accompanied by noise and artifacts."
plot_mblrr_short <- "Robust local regression to analyze amplification curves"
plot_FFTrees <- "Visualization of decisions in Fast and Frugal Trees after data analysis of amplification curves via the mblrr() function. \\textbf{Top row} `Data`) Overview of the data set, stating the total number of observations (N = 192) and the percentage of positive (22\\%) and negative (78\\%) amplification curves. \\textbf{Middle row} `FFT \\#1 (of 6)`) Decision tree with the number of observations classified at each level of the tree. Six predictors (nBG, intercept of head region; mBG, slope of head region; rBG, Pearson correlation of head region; nTP, intercept of tail region; mTP, slope of tail region; rTP, Pearson correlation of tail region) have been used for the analysis. After two tree levels (nBG, nTP), the decision tree is created, where all positive amplification curves (N = 40) are correctly classified. Two observations are classified as false-negative in the negative amplification curves. \\textbf{Lower row} `Performance`) The FFTrees() [FFTrees] function determines several performance statistics. For the training data, there is a classification table on the left side showing the relationship between the tree `decision` and the `truth`. The correct rejection (`Cor Rej`) and `Hit` are the right decisions. `Miss` and false alarm (`False Al`) are wrong decisions. The centre shows the cumulative tree performance in terms of mean of used cues (`mcu`), percentage of ignored cues (`pci`), sensitivity (`sens`), specificity (`spec`), accuracy (`acc`) and weighted accuracy (`wacc`). The receiver operating characteristic (ROC) curve on the right-hand side compares the performance of all trees in the FFTrees object. The system also displays the performance of the fast frugal trees (`\\#`, green), CART (`C`, red), logistic regression (`L`, blue), random forest (`R`, violet) and the support vector machine (`S`, yellow)."
plot_FFTrees_short <- "Visualization of decisions in Fast and Frugal Trees after data analysis of amplification curves via the mblrr() function"
plot_peaks_ratio <- "Working principle of the `peaks\\_ratio` predictor. The computation is based on a sequential linking of functions. The diffQ() function determines numerically the first derivative of an amplification curve. This derivative is passed to the mcaPeaks() [\\texttt{MBmca}] function. The output contains all local minima and maxima. The ranges are calculated from the minima and maxima. The lagged differences are determined from the ranges of the minima and maxima. Finally, the ratio of the differences (maximum/minimum) is calculated."
plot_peaks_ratio_short <- "Working principle of the `peaks\\_ratio` predictor"
plot_cp_area <- "Values of predictors calculated from negative and positive amplification curves. Amplification curve predictors from the `data\\_sample\\_subset` data set were used as they contain positive and negative amplification curves and amplification curves that exhibit a \\textit{hook effect} or non-sigmoid shapes. A) `polyarea`, is the area under the amplification curve determined by the Gauss polygon area formula. B) `peaks\\_ratio`, is the ratio of the local minima and the local maxima. C) `cp\\_e.agglo`, makes use of energy agglomerative clustering. Positive amplification curves have fewer change points than negative amplification curves. These two change point analyses generally separate positive and negative amplification curves. D) `cp\\_bcp`, analyses change points by a Bayesian approach. Positive amplification curves appear to contain more change points than negative amplification curves. Nevertheless, there is an overlap between the positive and negative amplification curves in both methods. This can lead to false-positive or false-negative classifications. E) `amptester\\_polygon` is the cycle normalized order of a polygon. F) `amptester\\_slope.ratio` is the slope (linear model) of the raw fluorescence values at the approximate first derivative maximum, second derivative minimum and second derivative maximum."
plot_cp_area_short <- "Analysis of predictors that describe the area and changepoints of an amplification curve"
plot_cpa <- "Bayesian and energy agglomerative change point analysis on negative and positive amplification curves. An analysis of a negative and a positive amplification curve from the `RAS002` data set was performed using the pcrfit\\_single() function. In this process, the amplification curves were analysed for change points using Bayesian change point analysis and energy agglomerative clustering. A) The negative amplification curve has a base signal of approximately 2450 RFU and only a small signal increase to 2650 RFU. There is a clear indication of the signal variation (noise). B) The first negative derivative amplifies the noise so that some peaks are visible. C) The change point analysis shows changes in energy agglomerative clustering at several positions (green vertical line). The Bayesian change point analysis rarely exceeds a probability of 0.6 (grey horizontal line). D) The positive amplification curve has a lower base signal ($\\sim 2450$ RFU) and increases up to the 40th cycle ($\\sim 3400$ RFU). A sigmoid shape of the curve is visible. E) The first negative derivative of the positive amplification curve shows a distinctive peak with a minimum at cycle 25. F) The change point analysis in energy agglomerative clustering shows changes (green vertical line) only at two positions. The Bayesian change point analysis shows a probability higher than 0.6 (grey horizontal line) at several positions."
plot_cpa_short <- "Bayesian and energy agglomerative change point analysis on negative and positive amplification curves"
plot_random_forest <- "The predictors `amptester\\_lrt` (lrt), `amptester\\_rgt` (rgt), `amptester\\_tht` (tht), `amptester\\_slt` (slt), `amptester\\_polygon` (polygon) and `amptester\\_slope.ratio` (slope.ratio) were used for classification using random forest. A) This plot shows the error depending on the number of trees. The error decreases as more and more trees are added and averaged. B) Mean Decrease Accuracy shows how much the model accuracy decreases if a variable is dropped. C) Mean Decrease Gini shows the importance of a variable based on the Gini impurity index used for the calculation of splits in trees."
plot_random_forest_short <- "Random Forest"
#----------Tables-------------------------------------
```
\newpage
\begin{figure}[ht]
\centering
\scalebox{0.6}{
\includegraphics[clip=true,trim=1cm 1cm 1cm 1cm]{Logo.pdf}
}
\end{figure}
- A comprehensive PDF version of this document (including domain knowledge about qPCRs and machine learning) is available as an **[online supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**. This online document also contains the code that was used to generate the plots in this introduction.
- The code for all figures and analysis in the manuscript "PCRedux: A Data Mining and Machine Learning Toolkit for qPCR Experiments" is available at **[https://github.com/PCRuniversum/PCRedux-supplements](https://github.com/PCRuniversum/PCRedux-supplements)**.
# Aims of the Project
A review of the literature (PubMed, Google Scholar; 1984-01-01 - `r Sys.Date()`) and discussion with peers revealed that there is no open source software package to calculate predictors from quantitative PCR amplification curves for machine learning applications. A predictor is a quantifiable *informative* property of an amplification curve. In particular, there is no information available about predictors that can be derived from amplification curves apart from measures that describe quantification points, amplification efficiencies and signal levels. Although several amplification curve data sets are available, no curated labeled data sets are described in the literature or repositories such as GitHub\footnote{\url{https://github.com/}}, Bitbucket\footnote{\url{https://bitbucket.org/}}, SourceForge\footnote{\url{https://sourceforge.net/}} or Kaggle\footnote{\url{https://www.kaggle.com/}}.
Therefore, the aim of the study was to:
1. create a collection of classified amplification curve data,
2. propose algorithms that can be used to calculate predictors from amplification curves,
3. evaluate pipelines that can be used for an automatic classification of amplification curves based on the curve shape and
4. bundle the findings in a publicly available open source software and open data package.
# Introduction and important information about the \texttt{PCRedux} package\label{label1}
In the **[online supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)** the reader finds an introduction to nucleic acids, including nucleic acid detection methods (e.g., melting curve analysis, photometric measurements) for the analysis of patient and forensic sample material. Special attention is paid to the quantitative Polymerase Chain Reaction (qPCR), since this method is the *de facto* standard for the detection and high-precision quantification of nucleic acids.
The focus of this study is the development of statistical and bioinformatical algorithms for the \texttt{PCRedux} software (version `r packageVersion("PCRedux")`). This software can be used to automatically calculate putative predictors (*features*) from qPCR amplification curves. A predictor herein refers to a quantifiable *informative* property of an amplification curve, employable for data mining, machine learning applications and classification tasks.
On the basis of these observations, concepts for predictors (*features*) were developed and implemented in algorithms to describe amplification curves. The functions described in the following are aimed at experimental studies. It is important to note that the concepts for the predictors proposed herein emerged from a *critical reasoning* process and the *domain knowledge* of the \texttt{PCRedux} package creator. The aim of the package is to propose a set of predictors, functions and data for independent research.
## Development, Implementation, Installation, Version Control and Continuous Integration
The **[online supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)** deals with elements of software engineering (e. g., continuous integration, Donald Knuth's *Literate Programming*, unit testing) used within the \texttt{PCRedux} software. \autoref{section_reasoning_and_analysis} gives an introduction to qPCR data and their analysis and explains why there is a need for the \texttt{PCRedux} software. In addition, the data analysis using machine learning is concisely described, after which the work focuses on the analysis of the measured data.
The proposed algorithms were partially tested with machine learning methods. For this purpose, a brief introduction to the subject \emph{machine learning} is given in the **[online supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**.
Information on the statistical analysis of qPCR amplification curves is presented in \autoref{section_DataAnalysis} ff. This covers the description of the curvature and the challenges of the calculations.
All scientific and engineering work depends on data. In particular, *open data* are becoming a cornerstone in science. As data sets of classified amplification curves were not available anywhere else, the **[online supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)** summarizes the aggregation, maintenance, and distribution of classified qPCR amplification curve data sets. The manual classification of amplification curves is a time-consuming and error-prone task when working with large data sets. To facilitate the manual analysis procedure, helper tools like ``humanrater()`` are presented in the **[online supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**. A novel approach for *curve-shape based group classification* is shown too.
An achievement of this study is the extensive portfolio of statistical algorithms for predictor calculation. Central findings of the research are presented in \autoref{section_Functions_of_PCRedux}.
It is expected that these implementations will allow the automatic analysis of large data sets for machine learning applications. The expectations of the findings are critically discussed in \autoref{section_Summary_and_conclusions}.
## PCRedux-app
\texttt{PCRedux-app} is a web server, based on the shiny technology [@shiny_2016], wrapped around the ``encu()`` function (\autoref{section_pcrfit_single_pcrfit_parallel}). A user can upload qPCR data and download the obtained amplification curve features.
There are different ways to use the function.
- Through \texttt{RScript} (Scripting Front-End for R):
- Enter the command ``Rscript -e 'PCRedux::run_PCRedux()'`` in a console and open the displayed URL in a browser.
- Through Graphical User Interfaces:
- The function can be started directly in \texttt{RStudio} or \texttt{RKWard} [@roediger_rkward_2012] by:
```{r, echo=TRUE, eval=FALSE}
# run the Shiny app
PCRedux::run_PCRedux()
```
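Independently of the web front-end, the underlying ``encu()`` function can also be called directly on a qPCR data set. The following minimal sketch assumes that ``encu()`` accepts a data frame in which the first column contains the cycle numbers and all further columns contain the fluorescence values (here, two curves of the `RAS002` data set):
```{r, echo=TRUE, eval=FALSE}
# Minimal sketch (assumption: encu() takes a data frame with the cycles in
# the first column and the fluorescence values in the remaining columns).
library(PCRedux)
res_features <- encu(RAS002[, 1:3])
head(res_features)
```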
## Analysis of Sigmoid-Shaped Curves for Data Mining and Machine Learning Applications\label{section_reasoning_and_analysis}
The following sections describe \texttt{PCRedux} regarding the analysis,
numerical description and predictor calculation from a sigmoid curve. A
predictor herein refers to a quantifiable *informative* property of a sigmoid
curve. The predictors (\autoref{section_Functions_of_PCRedux}), sometimes
referred to as descriptors, can be used for applications such as data mining,
machine learning and automatic classification (e. g., negative or positive
amplification).
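As a minimal sketch (assuming that ``pcrfit_single()`` accepts the fluorescence values of a single amplification curve as a numeric vector and returns a data frame of predictors), the predictors of one curve of the `RAS002` data set could be calculated as follows:
```{r, echo=TRUE, eval=FALSE}
# Minimal sketch: predictor calculation for one amplification curve.
# Assumption: pcrfit_single() takes a numeric vector of fluorescence values.
library(PCRedux)
single_curve_features <- pcrfit_single(RAS002[, "A01_gDNA.._unkn_B.Globin"])
str(single_curve_features)
```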
Machine learning is a scientific discipline that deals with the use of simple to
sophisticated algorithms to learn from large volumes of data. A number of
approaches to machine learning exist. Supervised learning algorithms are trained
with data which contain correct answers [@zielesny_curve_2011;
@walsh_correct_2015; @fernandez-delgado_we_2014]. This allows to create models
that assign the data to the answers and use these for further processing and
predictions [@tolson_machine_2001]. Unsupervised algorithms learn from data
without answers. They use large, diverse data sets for self-improvement. Neural
networks or artificial neural networks are a type of machine learning that
roughly resembles the function of human neurons. They are computer programs that
use several levels of nodes (neurons), work in parallel for learning, recognize
patterns and make decisions in a human-like manner [@gunther_neuralnet:_2010].
Deep Learning uses a deep neural network with many neuronal layers and an
extensive volume of data [@shin_deep_2016]. Such networks solve complex, non-linear
problems and are responsible for groundbreaking innovations through artificial
intelligence, such as the processing of a natural language or images
[@tolson_machine_2001]. Applications in the life sciences have already been
described for each of these methods. Up to now there appears to be no study that
uses machine learning for the classification of amplification curves in a
scientific setting.
The determination of quantification points such as the Cq value is a typical
task during the analysis of qPCR experiments. This is also covered by the
\texttt{PCRedux} software in dedicated sections and the **[online
supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**.
Characteristics of amplification curves that can be used for the statistical and
analytical description are discussed in the **[online
supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)** in more
detail. The examples described focus on the concepts for **binary
(dichotomous) classification** [@kruppa_probability_2014] as negative or
positive. The mere binary classification into the classes "positive" or "negative"
is not necessarily the aim of the \texttt{PCRedux} package. Instead, it aims
to provide a tool set for automatic **multicategory (polychotomous)
classification** of amplification curves into any conceivable class. Such
a classification could, for example, rate the quality of an amplification curve as
negative, ambiguous or positive (\autoref{htPCR_nap}A & B). A definition of
binary (dichotomous) classification and multicategory (polychotomous)
classification is presented in @kruppa_probability_2014.
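For example, the multicategory classification of the `vermeulen1` amplification curves used for \autoref{htPCR_nap}B can be tabulated as follows (a short sketch based on the classification file shipped with \texttt{PCRedux}):
```{r, echo=TRUE, eval=FALSE}
# Sketch: tabulate the three classes (n, a, y) of the classified vermeulen1
# amplification curves (same file and column as used in the figure below).
filename <- system.file("decision_res_vermeulen1.csv", package = "PCRedux")
dec <- read.csv(filename)[, 2]
table(factor(dec, levels = c("n", "a", "y")))
```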
```{r htPCR_nap, echo=FALSE, fig.cap=htPCR_nap, fig.scap=htPCR_nap_short}
# Create graphic device for the plot(s)
layout(matrix(c(1,1,2,3,4,5),2,3,byrow=TRUE))
data <- qpcR::htPCR
matplot(
data[, 1], log(data[, c(552, 512, 616)]), xlab = "Cycles", ylab = "log(RFU)",
main = "", type = "l", lty = 1, lwd = 2
)
legend("topleft", c(
paste("negative ", colnames(data)[552]),
paste("ambiguous ", colnames(data)[512]),
paste("positive ", colnames(data)[616])
), col = 1:3, pch = 19, bty = "n")
mtext("A", cex = 1, side = 3, adj = 0, font = 2)
# Empty plot
plot(NA, NA, xlim = c(0,1), ylim = c(0,1), xlab ="", ylab = "", xaxt ="n", yaxt="n", bty ="n")
# Load the decision_res_vermeulen1.csv data set from a csv file.
filename <- system.file("decision_res_vermeulen1.csv", package = "PCRedux")
dec <- read.csv(filename)[, 2]
usr_colors <- c(
adjustcolor("black", alpha.f = 0.80), adjustcolor("red", alpha.f = 0.80), adjustcolor("green", alpha.f = 0.80)
)
colors <- as.character(factor(dec,
levels = c("n", "a", "y"),
labels = usr_colors))
data <- qpcR::vermeulen1
dy_range <- range(data[, -1], na.rm=TRUE)
decision <- c("n", "a", "y")
labels <- c("B negative", " ambiguous", " positive")
for(i in 1L:length(decision)) {
samples <- ifelse(length(which(dec == decision[i])) > 50, 25, length(which(dec == decision[i])))
matplot(
data[, 1], log(data[, sample(which(dec == decision[i]) + 1, samples)]), xlab = "Cycles", ylab = "log(RFU)", main = "", type = "l", lty = 1, lwd = 2, col = usr_colors[i]#, ylim = y_range
)
abline(v = 2, col = "grey")
legend("topleft",
c("negative ", "ambiguous ", "positive "),
col = usr_colors, pch = 19, bty = "n")
mtext(labels[i], cex = 1, side = 3, adj = 0, font = 2)
}
```
## Relation of Machine Learning to the Classification of Amplification Curves\label{section_technologies_amplification_curves_ML}
A few scientific approaches have previously been shown in which machine learning
was used for the analysis of amplification curves. The intention of
@gunay_machine_2016 was to improve the determination of Cq values, without
dealing with classification. The authors postulated that they had developed an
improved prediction of Cq values using a modified three-parameter model. One
assumption of their approach was that their modified three-parameter model could
be applied to any amplification curve. However, there are reasons why such an
assumption is not valid.
In the **[online
supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)** it was described that a considerable proportion of
amplification curves deviate clearly from a three-parameter model.
Multiparametric models with more than four parameters are more frequently
adapted to amplification curves. In addition, the multiparametric models tend to
adapt to noise (\autoref{curve_fit_fail}). Unsurprisingly, a Cq value is then
calculated even for negative amplification curves, demonstrating that a
three-parameter model alone cannot provide reliable predictions. However, a
correct model is important for the extraction of Cq values, for the
determination of predictors from the curves and consequently for classification.
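A sketch of such a model selection is shown below. The chosen amplification curve and the candidate model list are arbitrary examples; mselect() [\texttt{qpcR}] is used to compare the l4 to l7 models by information criteria:
```{r, echo=TRUE, eval=FALSE}
# Sketch: fit one amplification curve with a four-parameter model and let
# mselect() [qpcR] propose a better model from l4 ... l7 (arbitrary example).
library(qpcR)
amp <- PCRedux::RAS002[, c("cyc", "A01_gDNA.._unkn_B.Globin")]
fit_l4 <- pcrfit(data = amp, cyc = 1, fluo = 2, model = l4)
fit_best <- mselect(fit_l4, fctList = list(l4, l5, l6, l7), do.all = TRUE)
summary(fit_best)
```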
Data mining and machine learning can be used for descriptive and predictive
tasks during the analysis of complex data sets. Data mining uses specific
methods from statistical inference, software engineering and domain knowledge to
get a better understanding of the data, and to extract *hidden knowledge* from
the preprocessed data [@kruppa_probability_2014; @herrera_multiple_2016]. All
this implies that a human being interacts with the data at the different stages
of the whole process as part of the workflow in data mining. Elements of the
data mining process are the preprocessing of the data, the description of the
data, the exploration of the data and the search for connections and causes.
The availability of classified amplification curve data sets and technologies
for the classification of amplification curves is of high importance to train
and validate models. This is dealt with in the **[online
supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**.
For machine learning, the type of learning task is the first thing that needs to
be defined. The learning task can be a classification, clustering or regression
problem. Next, suitable algorithms can be selected depending on the task. In the
case of classification problems, the goal is to predict a *discrete valued*
output. The labels ($y$) are usually categorical and represent a finite number
of classes (e. g. "negative", "positive" $\rightarrow$ binary
classification). In regression tasks, the goal is to predict a
*continuously valued* output. Clustering is primarily about forming groups
(clusters) based on their similarities. More examples are presented in the
**[online
supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**.
In contrast, machine learning uses instructions and data in software modules to
create models that can be used to make predictions on novel data. In machine
learning, the human being is far less involved in the overall process.
Processes (algorithms) are used to create models with tunable parameters. These
models automatically adapt their performance to the information (predictors)
from the data. Well-known examples of machine learning technologies are Decision
Trees (DT), Boosting, Random Forests (RF), Support Vector Machines (SVM),
generalized linear models (GLM), logistic regression (LR) and deep neural
networks (DNN) [@lee_statistical_2010]. The three following concepts of machine
learning that are frequently described in the literature are *Supervised
learning*, *Unsupervised learning* and *Reinforcement Learning* which are
described in detail in the **[online
supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**.
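Regardless of which of these classifiers is used, its predictions for amplification curves have to be compared against a reference classification. The following sketch uses made-up prediction and reference vectors and the ``performeR()`` function (\autoref{section_helper_functions}) to calculate common performance measures:
```{r, echo=TRUE, eval=FALSE}
# Sketch with made-up data: performance measures of a binary classification
# (1 = positive amplification curve, 0 = negative amplification curve).
library(PCRedux)
reference <- c(1, 1, 0, 0, 1, 0, 1, 0)   # reference classes
prediction <- c(1, 1, 0, 1, 1, 0, 0, 0)  # predicted classes
performeR(sample = prediction, reference = reference)
```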
# Why is there a need for the \texttt{PCRedux} software?\label{why_PCRedux}
The classification of an amplification curve is feasible using bioanalytical methods such as melting curve analysis [@roediger_RJ_2013] or electrophoretic separation [@westermeier2004]. However, this is not always possible or desirable.
- Melting curve analysis is used in some qPCRs as a post-processing step to identify samples which contain the specific target sequence (*positive*) based on a specific melting temperature. However, some detection probe systems like hydrolysis probes do not permit such classification. Moreover, nucleic acids with similar biochemical properties but different sequences may have the same melting temperature.
- An electrophoretic separation (classification of target DNA sequences by size and quantity) often requires too much effort for experiments with high sample throughput.
- There are mathematical qPCR analysis algorithms such as \texttt{linreg} [@ruijter_amplification_2009] that require information on whether an amplification curve is negative or positive for subsequent calculation.
- Raw data of amplification curves can be fitted with sigmoid functions. Sigmoid functions are non-linear, real-valued, have an S-shaped curvature (\autoref{figure_sigmoid_curve_models}) and can be differentiated (e. g., first derivative maximum, with one local minimum and one local maximum). With the obtained model, predictions can be made. For example, the position of the second derivative maximum can be calculated from this (\autoref{section_DataAnalysis}). In the context of amplification curves, the second derivative maximum is commonly used to describe the relationship between the cycle number and the PCR product formation (\autoref{section_DataAnalysis}). All software packages assume that the amplification resembles a sigmoid curve shape (ideal positive amplification reaction), or a flat low line (ideal negative amplification reaction). For example, @Ritz2008 published the \texttt{qpcR} \texttt{R} package that contains functions to fit several multi-parameter models. This includes the five-parameter Richards function [@richards_flexible_1959] (\autoref{l5}). The \texttt{qpcR} package [@Ritz2008] contains an amplification curve test via the ``modlist()`` function. The parameter `check="uni2"` offers an analytical approach, as part of a method for kinetic outlier detection. It tries to check for a sigmoid structure of the amplification curve. Then, ``modlist()`` tests for the location of the first derivative maximum and the second derivative maximum. However, multi-parameter functions fit "successfully" in most cases, including noise, and give false-positive results. This is shown exemplarily in later sections in combination with the ``amptester()`` [\texttt{chipPCR}] function [@roediger2015chippcr], which uses fixed thresholds and frequentist inference to identify amplification curves that exceed the threshold ($\mapsto$ classified as positive). However, the analysis can also lead to false-positive classifications as exemplified below and in \autoref{curve_fit_fail}. Therefore, additional classification concepts would be beneficial.
```{r curve_fit_fail, echo=FALSE, fig.cap=curve_fit_fail, fig.scap=curve_fit_fail_short, fig.width=6}
# Load the qpcR package for the model fit.
library(qpcR)
library(chipPCR)
# Select one positive and one negative amplification curve from the PCRedux
# package.
amp_data <- PCRedux::RAS002[, c("cyc", "A01_gDNA.._unkn_B.Globin",
"B07_gDNA.._unkn_HPRT1")]
colnames(amp_data) <- c("cyc", "positive", "negative")
# Arrange graphs in a matrix and set the plot parameters. Then plot the positive
# and negative amplification curves.
height <- c(3100, 4100)
plot(NA, NA, xlim = range(amp_data[, "cyc"]),
ylim = range(amp_data[, c("positive", "negative")]),
xlab = "Cycles", ylab = "RFU", main = "")
# Apply the amptester() function from the chipPCR package to the amplification
# curve data and write the results to the main of the plots.
for (i in 2:3) {
res.ampt <- suppressMessages(amptester(amp_data[, i]))
# Make a logical connection between three tests (shap.noisy, lrt.test and
# tht.dec) of amptester to decide if an amplification reaction is
# positive or negative.
decision <- ifelse(!res.ampt@decisions[1] &&
res.ampt@decisions[2] &&
res.ampt@decisions[4],
"positive", "negative"
)
# The amplification curves were fitted (l7 model) with pcrfit() function.
# The Cq was determined with the efficiency() function.
fit <- pcrfit(data = amp_data, cyc = 1, fluo = i, model = l7)
res <- efficiency(fit, plot = FALSE)
lines(predict(fit), pch = 19, lty = 1, xlab = "Cycles", ylab = "RFU",
main = "", col = i - 1)
abline(h = res[["fluo"]], col = "grey")
points(res[["cpD2"]], res[["fluo"]], pch = 19)
legend(1, height[i-1], paste0(colnames(amp_data)[i],
" curve -> Decision: ",
decision, " Cq: ", res[["cpD2"]]),
bty = "n", cex = 1, col = "red"
)
}
```
- The analysis and classification of sigmoid data (e. g., quantitative PCR) is a manageable task if the data volume is low, or dedicated analysis software is available. An example of a low number of amplification curves is shown in \autoref{figure_sigmoid_curve}A. All `r ncol(chipPCR::C127EGHP)-1` curves exhibit a sigmoid curve shape. It is trivial to classify them as positive by hand. In contrast, the vast number of amplification curves in \autoref{figure_sigmoid_curve}B is barely manageable with a reasonable effort by simple visual inspection. These data originate from a high-throughput experiment that encompasses in total `r suppressMessages(ncol(qpcR::htPCR))-1` amplification curves, of which only 200 are shown. A manual analysis of the data is time-consuming and prone to errors. Even for an experienced user, it is difficult to classify the amplification curves unambiguously and reproducibly, as will be shown later in \autoref{section_data_sets}.
- qPCRs are performed in thermo-cyclers, which are equipped with a real-time monitoring technology. There are numerous commercial manufacturers producing thermo-cyclers (\autoref{table-datasets}). An example of a thermo-cycler that originated from a scientific project is the VideoScan technology [@roediger_highly_2013]. Most of the thermo-cyclers have a thermal block with wells at certain positions. Reaction vessels containing the PCR mix are inserted into the wells. There are also thermo-cyclers that use capillary tubes that are heated and cooled by air (e. g., Roche Light Cycler 1.0). The thermo-cycler raises and lowers the temperature in the reaction vessels in discrete, pre-programmed steps so that PCR cycling can take place. Instruments with a real-time monitoring functionality have sensors to measure changes of the fluorescence intensity in the reaction vessel. All thermo-cycler systems use software to process the amplification curves. Plots of the fluorescence observations versus cycle number obtained from two different qPCR systems are shown in \autoref{figure_sigmoid_curve}A and B. The thermo-cyclers produce different amplification curve shapes even with the same sample material and PCR mastermix because of their technical design, sensors, and software. These factors need to be taken into account during the development of analysis algorithms.
```{r figure_sigmoid_curve_models, echo=FALSE, fig.scap="Sigmoid models of amplification curves", fig.height=6, out.extra='', fig.cap="A) Model function of a one-parameter sigmoid function. B) Model function of a sigmoid function with an intercept $n$ = 0.2 RFU (shift in base-line). C) Model function of a sigmoid function with an intercept ($n$ \\textasciitilde 0.2 RFU) and a square portion $m * x^{2}, m = -0.0005, n = 0.2 RFU$ (hook-effect-like). D) Model function of a sigmoid function with an intercept ($n$) and a square portion of $m * x^{2}$ and additional noise $\\epsilon$ (normal distributed, $\\mu = 0.01, \\sigma = 0.05$). Note: This plot was shown in linear scale to demonstrate typical pitfalls."}
x_val <- seq(-10, 10, 0.5)
y_val <- 1 / (1 + exp(-x_val))
y_val_slope <- 1 / (1 + exp(-x_val)) + 0.2
y_val_slope_quadratic <- 1 / (1 + exp(-x_val)) + -0.0005 * x_val ^ 2 + 0.2
y_val_slope_quadratic_noise <- 1 / (1 + exp(-x_val)) + -0.0005 * x_val ^ 2 + 0.2 + rnorm(length(x_val), mean = 0.01, sd = 0.05)
y_lim <- c(-0.05, max(c(
y_val, y_val_slope, y_val_slope_quadratic,
y_val_slope_quadratic_noise
)) * 1.2)
par(mfrow=c(2,2))
plot(x_val, y_val, type = "l", xlab = "x", ylab = "f(x)", ylim = y_lim)
abline(h = 0, col = "grey")
legend("topleft", expression(y == frac(1, (1 + e ^ {
-x
}))), bty = "n", cex = 0.9)
mtext("A", cex = 1, side = 3, adj = 0, font = 2)
plot(x_val, y_val_slope, type = "l", xlab = "x", ylab = "f(x)", ylim = y_lim)
abline(h = 0, col = "grey")
legend("topleft", expression(y == frac(1, (1 + e ^ {
-x
})) + n), bty = "n", cex = 0.9)
mtext("B", cex = 1, side = 3, adj = 0, font = 2)
plot(
x_val, y_val_slope_quadratic, type = "l", xlab = "x", ylab = "f(x)",
ylim = y_lim
)
abline(h = 0, col = "grey")
legend("topleft", expression(y == frac(1, (1 + e ^ {
-x
})) + m * x ^ 2 + n), bty = "n", cex = 0.9)
mtext("C", cex = 1, side = 3, adj = 0, font = 2)
plot(
x_val, y_val_slope_quadratic_noise, type = "l", xlab = "x", ylab = "f(x)",
ylim = y_lim
)
abline(h = 0, col = "grey")
legend("topleft", expression(y == frac(1, (1 + e ^ {
-x
})) + m * x ^ 2 + n + epsilon, epsilon %~% N(0, sigma)), bty = "n", cex = 0.9)
mtext("D", cex = 1, side = 3, adj = 0, font = 2)
```
```{r figure_sigmoid_curve, echo=FALSE, fig.scap="Shape of amplification curves", fig.height=6, out.extra='', fig.cap="Amplification curve data from an iQ5 (Bio-Rad) thermo-cycler and a high-throughput experiment in the Biomark HD (Fluidigm). A) The `C127EGHP` data set with 64 amplification curves was produced in a conventional thermo-cycler with an 8 x 12 PCR grid. B) The `htPCR` data set, which contains 8858 amplification curves, was produced in a 95 x 96 PCR grid. Only 200 amplification curves are shown. In contrast to `A)`, all amplification curves in `B)` have a strong off-set (intercept) between -2.5 and 0 log(RFU). This requires proper baselining."}
library(chipPCR)
library(qpcR)
# Create graphic device for the plot(s)
par(mfrow = c(2, 1), las = 0, bty = "o", oma = c(0, 0, 0, 0))
# Load C127EGHP data set from the chipPCR package
data <- chipPCR::C127EGHP
colors <- rainbow(1000, alpha = 0.7)
matplot(
data[, 2], log(data[, c(-1, -2)]), xlab = "Cycles", ylab = "log(RFU)",
main = "", type = "l", lty = 1, lwd = 2, col = sample(colors,
ncol(data) - 2)
)
abline(h = -2, col = "grey")
mtext("A iQ5 C127EGHP data set", cex = 1, side = 3, adj = 0, font = 2)
matplot(
htPCR[, 1], log(htPCR[, c(2L:201)]), xlab = "Cycles", ylab = "log(RFU)",
main = "", type = "l", lty = 1, lwd = 2, col = sample(colors, 200)
)
abline(h = -1, col = "grey")
mtext("B Biomark HD htPCR data set", cex = 1, side = 3, adj = 0, font = 2)
```
There are several open source and closed source software tools for the analysis of qPCR data [@pabinger_2014]. The software packages deal for example with challenges like missing values and non-detects [@mccall_non-detects_2014], quantification cycle estimation [@Ritz2008; @ruijter_evaluation_2013], relative gene expression analysis [@dvinge_htqpcr:_2009; @pabinger_qpcr:_2009; @neve_unifiedwmwqpcr:_2014] and data analysis pipelines [@pabinger_qpcr:_2009; @ronde_practical_2017; @mallona_pcrefficiency:_2011; @mallona_chainy:_nodate]. More information can be found in the **[online supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**.
However, a bottleneck of qPCR data analysis is the lack of predictors and software to build classifiers for amplification curves. A classifier herein refers to a vector of **interpretable** predictors that can be used to distinguish the amplification curves only by their shape. A predictor, also referred to as *feature*, is an entity that characterizes an object. A few potential predictors for amplification curves are described in the literature.
# Principles of Amplification Curve Data Analysis and Predictor Calculation\label{section_DataAnalysis}
The shape of a positive amplification curve is in most cases sigmoidal. Many factors such as the sample quality, qPCR chemistry, and technical problems (e. g., sensor errors) contribute to various curve shapes [@ruijter_2014]. The curvature of the amplification curve can be used as a quality measure. For example, fragmentation, inhibitors and sample material handling errors during the extraction can be identified. The kinetics of fluorescence emission is proportional to the quantity of the synthesized DNA. Typical amplification curves have three phases.
1. **Ground phase**: This phase occurs during the first cycles of the PCR, where the fluorescence emission is in most cases flat. Here, the sensor system detects only noise and no product formation, and the PCR product signal is an insignificantly small component of the total signal. This is often referred to as the base-line or background signal. Apparently, there is only a phase shift or no signal at all, primarily due to the limited sensitivity of the instrument. Even in a perfect PCR (doubling of the product per cycle), qPCR instruments cannot detect the fluorescence signal from the amplification during these early cycles. Fragmentation, inhibitors and sample handling errors would result in a prolonged ground phase. Nevertheless, this phase may indicate some typical properties of the qPCR system or probe system.
In many instruments, this phase is used to determine the base-line level for the calculation of the cycle threshold (Ct). The Ct value marks the cycle at which the increase of the signal originating from the amplicon is considered statistically relevant, i. e., it rises outside of the noise range (threshold). In some qPCR systems, a flat amplification signal is expected in this phase. Slight deviations from this trend are presumably due to changes (e. g., disintegration of probes) in the fluorophores. Background correction algorithms are often used here to ensure that flat amplification curves without slope are generated. However, this can result in errors and inevitably leads to a loss of information about the waveform of the raw data [@nolan_2006]. The slope, level and variance of this phase can serve as predictors.
2. **Exponential phase**: This phase follows the ground phase and is also called *log-linear phase*. It is characterized by a strong increase of the emitted fluorescence, as the DNA amount roughly doubles in each cycle under ideal conditions and the amount of the synthesized fluorescently labeled PCR product is high enough to be detected by the sensor system. This phase is used for the calculation of the quantification point (Cq) and the curve-specific amplification efficiency. The most important measurement from qPCRs is the Cq, which signifies the PCR cycle at which the fluorescence exceeds a ``threshold value``. However, there is an ongoing debate as to what a significant and robust threshold value is. An overview and performance comparison of Cq methods is given in @ruijter_evaluation_2013. There are several mathematical methods to calculate the Cq.
    - The 'classical' threshold value (cycle threshold, Ct) is the intersection between a manually defined straight horizontal line and the quasi-linear part of the exponential amplification phase (\autoref{figure_quantification_points}A & B). This simple-to-implement method requires that amplification curves are properly baselined prior to analysis. The Ct method assumes that the amplification efficiency (~ slope in the log-linear phase) is equal across all compared amplification curves [@ruijter_evaluation_2013]. Evidently, this is not always the case, as exemplified in \autoref{amplification_curve_ROI}C. The Ct method is widely used, presumably due to the familiarity of users with this approach (e. g., chemical analysis procedures). However, this method is statistically unreliable [@ruijter_evaluation_2013; @spiess_impact_2015; @spiess_system-specific_2016]. Moreover, the Ct method does not give stable predictions if different users are given the same data set to be analyzed. *Therefore, this method is not used within the \texttt{PCRedux} package*.
    - Another Cq method uses the maximum of the second derivative (SDM) [@roediger2015r] (\autoref{figure_quantification_points}C); a minimal code sketch for both Cq approaches is given below, after the description of the ROIs. In all cases, the Cq value can be used to calculate the concentration of the target sequence in a sample (low Cq \textrightarrow high target concentration). In contrast, negative or ambiguous amplification curves loosely resemble noise. This noise may appear linear or exhibit a curvature similar to a specific amplification curve (\autoref{htPCR_nap}). This, however, may result in a faulty interpretation of the amplification curves. Fragmentation, inhibitors and sample handling errors would decrease the slope of the amplification curve [@spiess_highly_2008; @Ritz2008]. The slope and its variation can be considered as predictors. Since the Cq depends on the initial template amount and the amplification efficiency, there seemingly is no immediate use of the Cq as a predictor.
3. **Plateau phase**: This phase follows the exponential phase and is a consequence of the exhaustion of limited reagents (incl. primers, nucleotides, enzyme activity) in the reaction vessel, which limits the amplification reaction so that the theoretical maximum amplification efficiency (doubling per cycle) no longer prevails. This turning point, and the progressive limitation of resources, finally leads to a plateau. In the plateau phase, there is in some cases a signal decrease called the *hook effect* (\autoref{why_PCRedux} and [@barratt_improving_2002; @isaac_essentials_2009; @burdukiewicz_algorithms_2018]). The slope (*hook effect*), level and variation can be considered as predictors.
If the amplification curve has only a slight positive slope and no perceptible/measurable exponential phase, it can be assumed that the amplification reaction did not occur (\autoref{amplification_curve_ROI}B). Causes may include poor specificity of the PCR primers (non-specific PCR products), degraded sample material, degraded probes or detector failures. If a lot of input DNA is present in a sample, the amplification curve starts to increase in early PCR cycles (cycles 1 - 12). Some PCR devices have software that corrects this feature without rechecking, resulting in an amplification curve with a negative trend.
The discussed phases are considered as regions of interest (ROI). As an example, the \textit{ground phase} is in the head area, while the \textit{plateau phase} is in the tail area. The \textit{exponential phase} is located between these two ROIs.
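As a minimal, hedged sketch of the two Cq approaches discussed above, a threshold-based Ct and an SDM-based Cq can be estimated with functions from the \texttt{qpcR} and \texttt{chipPCR} packages. The choice of the first ten cycles as background and the use of the `testdat` example curve are assumptions for illustration only, not fixed standards.
```{r, eval=FALSE}
# Sketch (not run): a background-based threshold (Ct) and an SDM-based
# Cq (cpD2) for one amplification curve of the testdat data set.
library(qpcR)
library(chipPCR)
cycles <- testdat[, 1]
rfu <- testdat[, 2]
# Threshold derived from the ground phase: mean + 3 standard deviations
# of the first ten cycles (illustrative rule)
threshold <- mean(rfu[1:10]) + 3 * sd(rfu[1:10])
res_ct <- th.cyc(cycles, rfu, r = threshold, linear = FALSE)
# Cq from the maximum of the second derivative (SDM) of a sigmoid fit
fit <- pcrfit(testdat, cyc = 1, fluo = 2, model = l5)
res_eff <- efficiency(fit, plot = FALSE)
c(Ct = res_ct[1, 1], cpD2 = res_eff$cpD2)
```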
```{r amplification_curve_ROI, echo=FALSE, fig.scap=amplification_curve_ROI_short, fig.cap=amplification_curve_ROI, fig.height=8, fig.width=8}
library(qpcR)
library(PCRedux)
colors <- rainbow(10, alpha = 0.15)
x_range <- 1L:35
d <- testdat[x_range, ]
amp_data <- data.frame(
d[, 1],
pos = d[, 3] + 0.9,
posReverse = (max(d[, 3]) - rev(d[, 3])) + 0.9,
neg = d[, 4] + 0.9 + 0.0005 * d[, 1] ^ 2
)
# Calculation for the normal data
res_amp_data <- pcrfit(amp_data, 1, 2, l5)
res_takeoff <- takeoff(res_amp_data)
# Calculation of sd_bg
res_sd_bg <- sd(amp_data[1:res_takeoff[[1]], 2])
# Calculation for the reversed data
res_amp_data_reverse <- pcrfit(amp_data, 1, 3, l5)
res_takeoff_reverse <- takeoff(res_amp_data_reverse)
res_takeoff_reverse[[1]] <- nrow(d) - res_takeoff_reverse[[1]]
res_takeoff_reverse[[2]] <- amp_data[res_takeoff_reverse[[1]], 2] - res_takeoff_reverse[[2]] + min(amp_data[, 3])
exponentialRange <- c((res_takeoff[[1]] + 1):(res_takeoff_reverse[[1]] - 1))
backgroundplateu <- function(x) {
bg <- mean(head(x, res_takeoff[[1]])) + 3 * sd(head(x, res_takeoff[[1]]))
plat <- mean(tail(x, 10)) - 3 * sd(tail(x, 10))
list(bg = bg, plateau = plat)
}
res_lm <- lm(amp_data[exponentialRange, 2] ~ amp_data[exponentialRange, 1])
y_lim <- max(amp_data[, 2:4]) * 1.15
res_bgpl <- unlist(backgroundplateu(amp_data[, 2]))
# Create graphic device for the plot(s)
layout(matrix(c(1, 1, 2, 3), 2, 2, byrow = TRUE), respect = TRUE)
plot(amp_data[, 1], amp_data[, 2], ylim = c(-0.1, y_lim), xlab = "Cycles", ylab = "RFU", type = "b", lwd = 2, pch = 19)
text(c(2,30), c(10.5,10.5), c("Head", "Tail"), cex = 1.2, col = "red")
rect(0, 0, res_takeoff[[1]] + 1, res_takeoff[[2]] * 1.25, col = colors[1], border = NA)
text(5, res_bgpl[1] * 1.45, "Ground phase")
rect(res_takeoff_reverse[[1]] - 1, res_takeoff_reverse[[2]] * 0.95, nrow(amp_data), y_lim, col = colors[5], border = NA)
text(32, res_bgpl[2] * 1.1, "Plateau phase")
text(res_takeoff_reverse[[1]], mean(amp_data[, 2]), "Exponential\nregion")
points(
c(res_takeoff[[1]], res_takeoff_reverse[[1]]),
c(res_takeoff[[2]], res_takeoff_reverse[[2]]), pch = 12, cex = 2.5
)
text(
c(res_takeoff[[1]], res_takeoff_reverse[[1]]),
c(res_takeoff[[2]], res_takeoff_reverse[[2]]) + c(1.05, -1.05), c("top", "tdp")
)
arrows(20, 0, 20, res_bgpl[1], code = 3, length = 0.1)
text(30, res_bgpl[1] / 2, "Background")
arrows(5, res_bgpl[2], 5, max(amp_data[, 2]), code = 3, length = 0.1)
text(15, res_bgpl[2] * 0.95, "Plateau")
abline(res_lm, col = "red")
points(amp_data[exponentialRange, 1], amp_data[exponentialRange, 2], pch = 19, col = "red")
abline(h = res_bgpl, col = c("green", "blue"))
abline(h = 0, col = "grey")
legend(2, 12, paste0(
"Slope: ", signif(coef(res_lm)[2], 3),
"\nBackground (mean): ", signif(res_bgpl[1], 3),
"\nsd_bg: ", signif(res_sd_bg, 3),
"\nPlateau: ", signif(res_bgpl[2], 3),
"\ntop: ", signif(res_takeoff[[1]], 3),
"\ntdp: ", signif(res_takeoff_reverse[[1]], 3)
), bty = "n")
mtext("A Positive", cex = 1, side = 3, adj = 0, font = 2)
y_lim <- 2
plot(amp_data[, 1], amp_data[, 4], ylim = c(-0.1, y_lim), xlab = "Cycles", ylab = "RFU", type = "b", lwd = 2, pch = 19)
res_bgpl <- unlist(backgroundplateu(amp_data[, 4]))
abline(h = res_bgpl, col = c("green", "blue"))
abline(h = 0, col = "grey")
mtext("B Negative", cex = 1, side = 3, adj = 0, font = 2)
curve_colors <- c(rainbow(ncol(boggy) - 1, alpha = .5))
matplot(boggy[, 1], boggy[, -1], type = "l", col = curve_colors, xlab = "Cycles", ylab = "RFU", lty = 1)
rect(22, 2, 40, 2.3, border = "blue")
text(27.5, 2.1, "Hook effect", col = "blue")
mtext("C boggy data set", cex = 1, side = 3, adj = 0, font = 2)
```
The amplification curve shape, the amplification efficiency and the Cq value are important measures to judge the outcome of a qPCR reaction. In all phases of the PCR, the curves should be smooth. Possible artifacts in the curves may be due to unstable light sources in the instrument or problems during sample preparation, such as the presence of bubbles in the reaction vessel, incorrectly assigned dye detectors, errors during the calibration of dyes for the instrument, errors during the preparation of the PCR master mix, sample degradation, lack of a sample in the PCR, too much sample material in the PCR mix or a low detection probe concentration [@ruijter_amplification_2009; @ruijter_2014; @spiess_impact_2015]. Smoothing and filtering cause alterations to the raw data that affect the Cq value and the amplification efficiency.
Most commercial qPCR systems do not display the raw data of the amplification curves on the screen. Instead, raw data are often processed by the instrument software to remove fluorophore-specific effects and noise in all ROIs. Commonly employed preprocessing steps in qPCR are smoothing and filtering to remove noise, which can have different causes [@spiess_impact_2015].
The ordinate often does not display the measured fluorescence, but rather the change in fluorescence per cycle ($\varDelta RFU = RFU_{cycle + 1} - RFU_{cycle}$). Some qPCR systems display periodicity in the amplification curve data, thereby exposing the risk of introducing artificial shifts in the Cq values [@spiess_system-specific_2016].
In particular, the cycle threshold method (Ct method) (\autoref{section_DataAnalysis}) is affected by these factors [@spiess_impact_2015; @spiess_system-specific_2016]. Therefore, it is advisable to clarify in advance which processing steps the amplification curves have been subjected to. Failure to do so may result in misinterpretations and incorrect amplification curve fitting models [@nolan_2006; @roediger2015r; @roediger2015chippcr; @spiess_impact_2015].
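As a brief illustration of the $\varDelta RFU$ formula above, the per-cycle change of the fluorescence can be computed with base R's `diff()` function; the use of the `testdat` data set is an example only.
```{r, eval=FALSE}
# Sketch (not run): per-cycle change of the fluorescence (Delta RFU)
# of one curve from the testdat data set (qpcR package).
library(qpcR)
delta_rfu <- diff(testdat[, 2])  # RFU_{cycle + 1} - RFU_{cycle}
head(delta_rfu)
```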
```{r figure_quantification_points, results='hide', message=FALSE, echo=FALSE, fig.scap=figure_quantification_points_short, fig.cap=figure_quantification_points, fig.height=8, fig.width=8}
library(qpcR)
library(chipPCR)
res_model <- pcrfit(testdat, cyc = 1, fluo = 2, model = l5)
res_takeoff <- takeoff(res_model, pval = 0.05, nsig = 3)
res_model_predict <- predict(res_model)
r_user <- 2.356
res_th.cyc <- th.cyc(testdat[, 1], testdat[, 2], r = r_user, linear = FALSE)
# Create graphic device for the plot(s)
layout(matrix(c(1, 2, 3, 3), 2, 2, byrow = TRUE), respect = TRUE)
plot(testdat[, 1], testdat[, 2], xlab = "Cycles", ylab = "Raw fluorescence")
abline(h = (mean(testdat[1:10, 2]) + 3 * sd(testdat[1:10, 2])), col = "grey")
abline(h = res_th.cyc[1, 2], col = "black")
text(28, r_user + 0.3, paste0("Threshold: ", r_user))
arrows(res_th.cyc[1, 1], res_th.cyc[1, 2], res_th.cyc[1, 1], 0, angle = 25,
length = 0.1, lwd = 2)
mtext(paste0("A ", "Ct = ", signif(res_th.cyc[1, 1], 4)), cex = 1, side = 3,
adj = 0, font = 2)
plot(testdat[, 1], log(testdat[, 2]), xlab = "Cycles",
ylab = "log(Raw fluorescence)")
abline(h = log(res_th.cyc[1, 2]), col = "black")
arrows(res_th.cyc[1, 1], log(res_th.cyc[1, 2]), res_th.cyc[1, 1],
min(log(testdat[, 2]), na.rm = TRUE), angle = 25, length = 0.1, lwd = 2)
mtext(paste0("B ", "Ct = ", signif(res_th.cyc[1, 1], 4)), cex = 1,
side = 3, adj = 0, font = 2)
res_efficiency <- efficiency(res_model)
cpDdiff <- sqrt((res_efficiency$cpD1 - res_efficiency$cpD2)^2)
arrows(res_takeoff[[1]], res_takeoff[[2]], res_takeoff[[1]], -0.2, angle = 25,
length = 0.1, lwd = 2)
abline(v = 19.5)
mtext(paste0("C ", "cpDdiff: ", cpDdiff), cex = 1, side = 3, adj = 0,
font = 2)
```
## Data Analysis Functions of the \texttt{PCRedux} Package \label{section_Functions_of_PCRedux}
### Helper Functions of the \texttt{PCRedux} Package \label{section_helper_functions}
The \texttt{PCRedux} package contains functions for analyzing amplification curves. These are divided into helper functions and analysis functions. Details about the two helper functions
- ``performeR()`` - Performance Analysis for Binary Classification and
- ``qPCR2fdata()`` - A Helper Function to Convert Amplification Curve Data to the `fdata` format
can be found in the **[online supplement](https://github.com/devSJR/PCRedux/raw/master/docs/articles/PCRedux.pdf)**.
### ``pcrfit_single()`` and ``encu()`` - Predictor Calculation from an Amplification Curve \label{section_pcrfit_single_pcrfit_parallel}
The following sections give a concise description of the algorithms used to calculate predictor vectors by the ``pcrfit_single()`` function. Based on considerations and experience, the algorithms of the ``pcrfit_single()`` function are restricted to ROIs (\autoref{amplification_curve_ROI}) to calculate specific predictors.
The ``encu()`` function is a wrapper for the ``pcrfit_single()`` function. ``encu()`` can be used to process large records of amplification curve data arranged in columns. The progress of processing is displayed in the form of a progress bar along with the estimated run-time. Additionally, ``encu()`` allows the user to specify which monitoring chemistry (e. g., DNA binding dye, sequence-specific probes) and which thermo-cycler was used. @ruijter_2014 demonstrated that the monitoring chemistry and the type of input DNA (single stranded, double stranded) are important when analysing qPCR data, because they have an influence on the shape of the amplification curve. For simplicity, the documentation will describe only the ``pcrfit_single()`` function.
The underlying hypotheses and concepts of the predictors are formulated and supported by *exemplary applications*. Different representative data sets were used to support a concept or predictor. For example, the `RAS002` data set represents a typical qPCR. This means that the positive amplification curves start with a flat ground phase (base-line) and then transition into the sigmoid shape with a plateau. The negative amplification curves display no significant peculiarities. For both positive and negative amplification curves, there is a shift from the origin. The `htPCR` data set serves as a problem example in several places, since it contains many observations (amplification curves from high-throughput experiments). In addition, the amplification curves have a high diversity of curve shapes that cannot be uniquely and reproducibly classified even by experienced users. Other data sets are used in the documentation, but these are not discussed in detail.
To underscore the usability of the algorithms and their predictors, `r d <- PCRedux::data_sample; paste0(nrow(d), " observations ", "(", sum(d[["decision"]] == "n"), " negative amplification curves, ", sum(d[["decision"]] == "y"), " positive amplification curves)")` from the `batsch1`, `boggy`, `C126EG595`, `competimer`, `dil4reps94`, `guescini1`, `karlen1`, `lievens1`, `reps384`, `rutledge`, `testdat`, `vermeulen1`, `VIMCFX96_60`, `stepone_std`, `RAS002`, `RAS003`, `HCU32_aggR` and `lc96_bACTXY` were analyzed with the ``encu()`` function and the results (predictors) were combined in the file **`data_sample.rda`**. Users of this function should independently verify and validate the results of the methods for their own applications.
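The exact call used to generate **`data_sample.rda`** is not reproduced here. As a hedged sketch, predictors for all curves of a single data set can be collected column by column with ``pcrfit_single()``; this is essentially what ``encu()`` automates, including the meta information mentioned above.
```{r, eval=FALSE}
# Sketch (not run): calculate the predictors for every amplification
# curve of the RAS002 data set and combine them into one data frame.
library(PCRedux)
res_predictors <- do.call(rbind, lapply(2L:ncol(RAS002), function(i) {
  pcrfit_single(RAS002[, i])
}))
head(res_predictors)
```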
A new data set called `data_sample_subset_balanced` has been compiled from the `data_sample` data set for some of the applications. Selection criteria included:
- both positive and negative amplification curves had to be included in a similar ratio,
- there should not be a dominating thermo-cycler platform,
- the amplification curves should represent typical amplification curves (subjective criterion).
The compilation of the data sets `batsch1`, `HCU32_aggR`, `lc96_bACTXY`, `RAS002`, `RAS003` and `stepone_std` met this requirement satisfactorily.
```{r, echo=FALSE}
data_sample_subset_balanced <- data_sample[data_sample$dataset %in%
c("batsch1", "boggy", "C126EG595", "HCU32_aggR", "lc96_bACTXY",
"RAS002", "RAS003", "stepone_std", "testdat"), ]
# Dimension of data_sample_subset_balanced
dim(data_sample_subset_balanced) ## Observations predictors
```
```{r, echo=FALSE, fig.width=3.8}
# Show the counts of negative and positive amplification
# curves in a bar plot
# Build a contingency table of the counts at each
# combination of factor levels.
dec_table <- table(data_sample_subset_balanced[["decision"]])
barplot(dec_table, ylab = "Number of Observations", col = c("green", "black"),
border = "white")
text(c(0.7, 1.9), rep(min(dec_table) * 0.9, length(dec_table)),
c(paste("y = ", dec_table[1]), paste("n = ", dec_table[2])),
col = c("black", "white"))
mtext("data_sample_subset_balanced", cex = 1, side = 3, adj = 0, font = 2,
las = 0)
```
For the comparison of predictors, the data set was enlarged. Selection criteria for the data sets were comparatively less stringent.
```{r, echo=TRUE}
data_sample_subset <- data_sample[data_sample$dataset %in% c("stepone_std",
"RAS002", "RAS003",
"lc96_bACTXY",
"C126EG595",
"dil4reps94",
"testdat",
"boggy"), ]
# Dimension of data_sample_subset
dim(data_sample_subset) ## Observations predictors
```
```{r, echo=TRUE, fig.width=3.8}
# Show the counts of negative and positive amplification
# curves in a bar plot
# Build a contingency table of the counts at each
# combination of factor levels.
dec_table <- table(data_sample_subset[["decision"]])
barplot(dec_table, ylab = "Number of Observations", col = c("green", "black"),
border = "white")
text(c(0.7, 1.9), rep(min(dec_table) * 0.9, length(dec_table)),
c(paste("y = ", dec_table[1]), paste("n = ", dec_table[2])),
col = c("black", "white"))
mtext("data_sample_subset", cex = 1, side = 3, adj = 0, font = 2,
las = 0)
```
The goal is to demonstrate the basic functionality of the algorithms for predictor calculation. Similar concepts are presented in groups. The algorithms are divided into the following broad categories:
- algorithms that determine slopes, signal levels,
- algorithms that determine turning points and
- algorithms that determine areas.
The algorithms in
- ``earlyreg()`` (\autoref{section_earlyreg}),
- ``head2tailratio()`` (\autoref{section_head2tailratio}),
- ``hookreg()`` & ``hookregNL()`` (\autoref{section_hookreg}),
- ``mblrr()`` (\autoref{section_mblrr}) and
- ``autocorrelation_test()`` (\autoref{section_autocorrelation_test})
were implemented as standalone functions to make them available for other applications.
The output below shows the predictors and their data types (`num`, numeric; `int`, integer; `Factor`, factor; `logi`, logical) that were determined with the ``pcrfit_single()`` function.
```{r, echo=TRUE}
library(PCRedux)
# Calculate predictor vector of column two from the RAS002 data set.
str(pcrfit_single(RAS002[, 2]))
```
### Amplification Curve Preprocessing
The ``pcrfit_single()`` function performs preprocessing steps before each calculation, including checking whether an amplification curve contains missing values. Missing values (NA) are measuring points in a data set where no measured values are available or where values have been removed arbitrarily. NAs may occur if no measurement has been carried out (e. g., defective detector) or if the lengths of the vectors (number of cycles) differ between observations. Such missing values are automatically imputed by spline interpolation as described in @roediger2015chippcr.
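A minimal sketch of such an imputation, here with base R's `spline()` rather than the internal implementation, is shown below; the artificially removed measuring point is for illustration only.
```{r, eval=FALSE}
# Sketch (not run): impute a single missing fluorescence value by spline
# interpolation. This illustrates the principle only and is not the
# internal implementation of pcrfit_single().
library(qpcR)
cycles <- testdat[, 1]
rfu <- testdat[, 2]
rfu[15] <- NA # artificially introduce a missing measuring point
ok <- !is.na(rfu)
rfu[!ok] <- spline(cycles[ok], rfu[ok], xout = cycles)$y[!ok]
```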
Values of an amplification curve are normalized to their 99\% quantile or, in rare cases, to the maximum for many calculations. The normalization is used to equalize the amplitude differences of amplification curves from different thermo-cyclers (sensor technology, software processing) and detection chemistries. To compare amplification curves from different thermo-cyclers, the values should always be scaled systematically using the same method. Although there are other normalization methods (e. g., minimum-maximum normalization, see @roediger2015chippcr), the normalization by the 99\% quantile preserves the information about the level of the background phase. A normalization to the maximum is not used, to avoid a strong influence of outliers. The data in \autoref{plot_bg_pt}D show that the `maxRFU` values after normalization are approximately 1. There is no statistically significant difference between the `maxRFU` values of positive and negative amplification curves.
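The principle of the quantile normalization can be sketched as follows (illustration only, not the internal code of ``pcrfit_single()``).
```{r, eval=FALSE}
# Sketch (not run): scale an amplification curve by its 99% quantile so
# that amplitudes from different thermo-cyclers become comparable while
# the level of the background phase is preserved.
library(qpcR)
rfu <- testdat[, 2]
rfu_norm <- rfu / quantile(rfu, 0.99, na.rm = TRUE)
summary(rfu_norm)
```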
Selected algorithms of the ``pcrfit_single()`` function use the ``CPP()`` [\texttt{chipPCR}] function to preprocess (e. g., base-lining, smoothing, imputation of missing values) the amplification curves. Further details are given in @roediger2015chippcr. Until package version 0.2.6-4, the ``visdat_pcrfit()`` function was part of the package. ``visdat_pcrfit()`` was used for visualizing the content of data from an analysis with the ``pcrfit_single()`` function. There are other, more powerful packages for this purpose, such as \texttt{visdat} by @Tierney2017, \texttt{assertr} by @assertr and \texttt{xray} by @Seibelt_xray.
During the analysis, several values are determined to describe the amplitude of an amplification curve. The resulting potential predictors are `minRFU` (minimum of the amplification curve, which is determined at the 1\% quantile to minimize the influence of outliers), `init2` (the initial template fluorescence from an exponential model) and `fluo` (raw fluorescence value at the second derivative maximum). The `minRFU`, `init2` and `fluo` values differ significantly between negative and positive amplification curves (\autoref{plot_bg_pt}C, E & F).
### Handling of Missing Predictors
Missing values (NA) can occur if the calculation of a predictor is impossible (e. g., if a logistic function cannot be fitted to noisy raw data). The lack of a predictor is nevertheless useful information (no predictor calculated $\mapsto$ amplification curve deviates from a sigmoid shape). The NAs were left unchanged in the \texttt{PCRedux} package up to version 0.2.5-1. Since version 0.2.6, the NAs are replaced by numerical values (e. g., total number of cycles) or factors (e. g., *lNA* for a non-fitted model). Under the term "imputation", there are a number of procedures based on statistical methods (e. g., neighboring median, spline interpolation) or on user-defined rules [@williams_rattle:_2009; @cook_interactive_2007; @hothorn_handbook_2014]. Rules are mainly used in the functions of \texttt{PCRedux} to relieve the user from the decision as to how to deal with missing values. For example, slope parameters of a model are set to zero when they cannot be determined. The disadvantage is that rules do not necessarily correspond to real-world values.
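A hedged sketch of such a rule is given below; the replacement values (total number of cycles for a missing Cq-like predictor, zero for a missing slope) are illustrative assumptions, not the exact rules of \texttt{PCRedux}.
```{r, eval=FALSE}
# Sketch (not run): rule-based replacement of missing predictors.
n_cycles <- 40
predictors <- list(cpD2 = NA, b_slope = NA)
predictors$cpD2 <- ifelse(is.na(predictors$cpD2), n_cycles, predictors$cpD2)
predictors$b_slope <- ifelse(is.na(predictors$b_slope), 0, predictors$b_slope)
unlist(predictors)
```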
### Multi-parametric Models for Amplification Curve Fitting\label{section_models}
Both the ``pcrfit_single()`` function and the ``encu()`` function use four multi-parametric models based on the findings of @spiess_highly_2008 and @Ritz2008. The ``pcrfit_single()`` function starts by fitting a seven-parameter model, since this model adapts more *easily* and more frequently to a data set (\autoref{plot_models}).
* **l7**:
\begin{equation}\label{l7}
f(x) = c + k1 \cdot x + k2 \cdot x^2 + \frac{d - c}{(1 + exp(b(log(x) - log(e))))^f}
\end{equation}
From that model, the ``pcrfit_single()`` function estimates the predictors `b_slope` and `c_intercept`, which describe the slope and the y-intercept. The number of iterations required to fit the model is also stored. That value is returned by the ``pcrfit_single()`` function as `convInfo_iteratons`. The higher the `convInfo_iteratons` value, the more iterations were necessary to converge from the start parameters (\autoref{plot_dat_EffTop}L). A low `convInfo_iteratons` value is an indicator of
* a sigmoid curve shape or
* close start parameters.
High iteration numbers imply
* noisy amplification curves or
* non-sigmoid amplification curves.
The amplification curve fitting process continues with the four-parameter model (*l4*, \autoref{l4}). This is followed by a model with five parameters (*l5*, \autoref{l5}) and six parameters (*l6*, \autoref{l6}).
* **l4**:
\begin{equation}\label{l4}
f(x) = c + \frac{d - c}{1 + exp(b(log(x) - log(e)))}
\end{equation}
* **l5**:
\begin{equation}\label{l5}
f(x) = c + \frac{d - c}{(1 + exp(b(log(x) - log(e))))^f}
\end{equation}
* **l6**:
\begin{equation}\label{l6}
f(x) = c + k \cdot x + \frac{d - c}{(1 + exp(b(log(x) - log(e))))^f}
\end{equation}
The optimal model is selected on the basis of the Akaike information criterion and used for all further calculations. The ``pcrfit_single()`` function returns `qPCRmodel` as a factor (*l4*, *l5*, *l6*, *l7*). In case no model could be fitted, an *lNA* is returned.
The model is an indicator of the amplification curve shape. Models with many parameters deviate more from an ideal sigmoid shape. For instance, a four-parameter model, unlike the six-parameter model, does not have a linear component. A negative linear slope in the plateau phase is an indicator of a *hook effect* [@burdukiewicz_algorithms_2018].
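A simplified sketch of this model selection is given below; it is not the exact implementation of ``pcrfit_single()``, and the use of `testdat` and of `AIC()` on the fitted model objects is an assumption for illustration.
```{r, eval=FALSE}
# Sketch (not run): fit the l4, l5, l6 and l7 models to one amplification
# curve and select the model with the lowest Akaike information criterion.
library(qpcR)
models <- list(l4 = l4, l5 = l5, l6 = l6, l7 = l7)
fits <- lapply(models, function(m)
  try(pcrfit(testdat, cyc = 1, fluo = 2, model = m), silent = TRUE))
aics <- vapply(fits, function(f)
  if (inherits(f, "try-error")) NA_real_ else AIC(f), numeric(1))
names(which.min(aics)) # name of the best fitting model, e.g., "l5"
```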
```{r plot_models, echo=FALSE, fig.cap=plot_models, fig.scap=plot_models_short, fig.height=3.5}
library(PCRedux)
x <- data_sample$decision
y <- factor(data_sample[["qPCRmodel"]], levels = c("lNA", "l4", "l5", "l6", "l7"))
res_fw <- rbind(negative = table(y[x == "n"]),
positive = table(y[x == "y"])
)
# Define custom colors for the classes
colors <- c(adjustcolor("black", alpha.f = 0.25), adjustcolor("green", alpha.f = 0.25))
# Create graphic device for the plot(s)
par(mfrow = c(1,2))
barplot(res_fw / sum(res_fw) * 100, beside = TRUE, col = colors, xlab = "",
ylab = "Percentage", border = "white")
legend("top", c(
paste0("Negative, n =", length(y[x == "n"])), paste0("Positive, n = ", length(y[x == "y"]))
), fill = colors, bty = "n")
mtext("A Fitted models", cex = 1, side = 3, adj = 0, font = 2)
# Data used for the analysis
data <- data_sample
# Predictor that is going to be analyzed
predictor <- c("cpD2")
# Classes assigned manually by a human beforehand
x <- data$decision
# Helper function `densR()` to add density plots to a stripchart
densR <- function(data, decision, size = 0.4, position = 1.35){
x <- decision
y <- data[, colnames(data) == predictor[i]]
y_density_neg <- density(y[x == "n"])
y_density_pos <- density(y[x == "y"])
max_density <- max(c(y_density_pos$y, y_density_neg$y))
polygon(y_density_neg$y/max_density * size + position, y_density_neg$x, col = colors[1], border = NA)
polygon(y_density_pos$y/max_density * size + position, y_density_pos$x, col = adjustcolor("green", alpha.f = 0.25), border = NA)
}
for(i in 1L:length(predictor)) {
y <- data[, colnames(data) == predictor[i]]
res <- stats::wilcox.test(y ~ x)
h <- max(na.omit(y))
l <- min(na.omit(y))
h_text <- rep(h * 0.976, 2)
par(bg=NA)
stripchart(y ~ x, vertical = TRUE, ylab = "Cq (cpD2)",
method = "jitter", pch = 20, cex = 0.7,
col = adjustcolor("black", alpha.f = 0.8),
ylim = c(l * 0.95, h * 1.05))
densR(data = data, decision = x, size = 0.4, position = 1.35)
arrows(1,-0.25,1.5,25, length = 0.1)
arrows(2,-0.25,1.5,5, length = 0.1)
boxplot(y ~ x, outline = FALSE, add = TRUE, boxwex = 0.35)
legend("topleft", paste0("P = ", signif(res[["p.value"]])),
cex = 1, bty = "n")
mtext(paste0("B", " ", predictor[i]), cex = 1, side = 3,
adj = 0, font = 2, col = ifelse(signif(res[["p.value"]], 2) < 0.05,
"black", "red"))
}
```
### ``winklR()`` - A function to calculate the central angle based on the first and the second derivative of amplification curve data
``winklR()`` is a function that calculates the central angle spanned by the maximum of the first derivative and the minimum and maximum of the second derivative (\autoref{winklR}) of amplification curve data from a quantitative PCR experiment. For the determination of the angle, the origin is the maximum of the first derivative. From this origin, the vectors to the approximate minimum and maximum of the second derivative are determined. The vectors result from the relation of the maximum of the first derivative to the minimum of the second derivative and from the maximum of the first derivative to the maximum of the second derivative. In a simple trigonometric approach, the scalar product of the two vectors is formed first and divided by the product of their absolute values (norms); the arc cosine of this ratio yields the angle. The assumption is that flat curves (negative amplification reactions) have a large angle, while sigmoid curves (positive amplification reactions) have a smaller angle. Another assumption is that this angle is independent of the rotation of the amplification curve. This means that systematic off-sets, such as those caused by incorrect background correction, are of no consequence. The range of cycles to be analyzed is defined by the user. The output contains the angle and the coordinates of the minima and maxima.
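The underlying trigonometry can be sketched with a generic helper; the function name and the coordinates are hypothetical and not part of \texttt{PCRedux}.
```{r, eval=FALSE}
# Sketch (not run): angle between the two vectors that point from the
# maximum of the first derivative (origin) to the minimum (p1) and the
# maximum (p2) of the second derivative.
angle_between <- function(origin, p1, p2) {
  v1 <- p1 - origin
  v2 <- p2 - origin
  acos(sum(v1 * v2) / (sqrt(sum(v1^2)) * sqrt(sum(v2^2)))) * 180 / pi
}
# Hypothetical (cycle, derivative value) coordinates for illustration
angle_between(origin = c(20, 1), p1 = c(17, 0.5), p2 = c(23, -0.5))
```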
```{r, echo=FALSE, fig.cap=winklR_principle, fig.scap=winklR_principle_short, fig.height=4, fig.width=9}
par(mfrow = c(1,2))
for(i in 1:2){
columns <- c(2,3) + 2
class <- c("positive", "negative")
res <- winklR(RAS002[, 1], RAS002[, columns[i]], preprocess = TRUE)
y_lim_range <- range(c(RAS002[, columns[i]], res$origin[2], res[["p1"]][2], res[["p2"]][2]))
plot(RAS002[, 1], RAS002[, columns[i]], type = "l", ylim = y_lim_range,
xlab = "Cycles", ylab = "RFU")
mtext(paste0(LETTERS[i], " ", class[i]), cex = 1, side = 3,
adj = 0, font = 2)
points(res$origin, col = "red", pch = 19)
points(res[["p1"]], col = "green", pch = 19)
points(res[["p2"]], col = "blue", pch = 19)
arrows(res$origin[[1]], res$origin[[2]], res[["p1"]][[1]], res[["p1"]][[2]], length = .05, col = "green")
arrows(res$origin[[1]], res$origin[[2]], res[["p2"]][[1]], res[["p2"]][[2]], length = .05, col = "blue")
text(res$origin[1], res$origin[2] + 250, signif(res[["angle"]], 2))
}
```
```{r winklR, echo=FALSE, fig.cap=winklR, fig.scap=winklR_short, fig.height=4, fig.width=9}
# Calculate the central angles for amplification curves from the RAS002 data set.
library(PCRedux)
# Load the amplification curves from the RAS002 data set.
DATA <- PCRedux::RAS002
# Load the RAS002_decisions data set.
dec <- RAS002_decisions