diff --git a/analysis/paper/03_revised_version/paper_revised.Rmd b/analysis/paper/03_revised_version/paper_revised.Rmd index 57c1426..b172171 100644 --- a/analysis/paper/03_revised_version/paper_revised.Rmd +++ b/analysis/paper/03_revised_version/paper_revised.Rmd @@ -85,11 +85,11 @@ This paper examines some of the objections with the help of simulations. This re # Background -```{r dependencies-chart, fig.cap="Interdependency between amount of information, intensity of pattern and desired uncertainity."} +```{r dependencies-chart, fig.cap="Interdependency between amount of information, intensity of pattern and desired uncertainty."} create_graph() %>% add_node(label = "Number of Data") %>% add_node(label = "Strength of Signal") %>% - add_node(label = "Certainity of Identification") %>% + add_node(label = "Certainty of Identification") %>% select_nodes() %>% set_node_attrs_ws("shape", "rectangle") %>% set_node_attrs_ws("fixedsize", "FALSE") %>% @@ -122,9 +122,9 @@ In their article, Contreras and Meadows [-@contreras_summed_2014] worked on this When using an over-representatively high number of data, or such a data density, with 1000 data for a period of 1000 -- 1700 BCE (density 1.43 data per year), the Black Plague would basically emerge [@contreras_summed_2014, fig. 3]. However, they argue that but the strength of the event in the resulting signal could not be attributed to such a disaster without prior knowledge [@contreras_summed_2014, 599]. At a density they consider to be closer to the archaeological reality – the authors assume this to be 0.29, representing 200 data for 700 years – the sampling effect would prevent the underlying demographic processes from being properly represented by the simulated ^14^C data [@contreras_summed_2014, fig. 6]. They write: 'Not only is the departure of these curves from the population distribution from which they are derived evident; the variability between samples is also notable: the most prominent fluctuations in each curve are not visible in most of the others' [@contreras_summed_2014, 601]. In general, the data density is decisive for the effectiveness of this estimator, whereby even with the maximum simulated number of dates (2000) the Black Death is 'far from obvious' as an event [@contreras_summed_2014, 602]. In addition, they argue, the temporal fixation of the event is problematic due to the scatter effect especially of legacy data with high standard deviation. Thus it would be not possible to separate signal from noise, to separate false-positive and false-negative from real results, and to identify the exact timing and magnitude of the underlying phenomenon [@contreras_summed_2014, 603--605]. In their concluding remarks they consequently state that 'even under ideal conditions, it is difficult to distinguish between real and spurious population patterns, or to accurately date sharp fluctuations, even with data densities much higher than in most published attempts' [@contreras_summed_2014, 605]. -With all the importance that the simulation approach adds to this paper, unfortunately, the authors do not use its full potential. Although creating different scenarios, each is only examine with five simulation runs (for 200, 1000 and 2000 samples respectively) [@contreras_summed_2014, 596]. Even if five is more than one, this certainly does not represent a statistically reliable basis for a far-reaching statement. 
In addition, they state as paraphrased above, that the Black Plague could have remained undetected, without further specification or quantification. A significantly higher number of simulations might be mandatory for such a statement.
+With all the importance that the simulation approach adds to this paper, unfortunately, the authors do not use its full potential. Although they create different scenarios of data density, each is examined with only five simulation runs (for 200, 1000 and 2000 samples respectively) [@contreras_summed_2014, 596]. Even if five is more than one, this certainly does not represent a statistically reliable basis for a far-reaching statement. In addition, they state, as paraphrased above, that the Black Plague could have remained undetected, without further specification or quantification. A significantly higher number of simulations might be mandatory for such a statement. A very important step in this direction has already been taken by @mclaughlin_applications_2019, who has reviewed the Black Plague scenario in his article on using the KDE model for similar analyses, and who has already come up with detection rates. A perfect pattern recognition was achieved with a sample number of 3000. Here, however, only 30 simulation runs were checked in each case, and the effect strength was not varied.

-Precisely against this background the triangle of effect strength, data quantity and certainty of identification should be quantified here. Using the same basic pattern, the Black Plague, the aim is to determine, for different scenarios of effect strength and data quantity, in how many of cases such a demographic catastrophe could have remained undetected. It is primarily a question of false-negative results. False positives can be meaningfully detected by other simulation approaches, as it has been discussed elsewhere [eg. @shennan_regional_2013] and as it will be applied in a later step (see below).
+Precisely against this background, the triangle of effect strength, data quantity and certainty of identification should be quantified here. Using the same basic pattern, the Black Plague, the aim is to determine, for different scenarios of effect strength and data quantity, in how many cases such a demographic catastrophe could have remained undetected. It is primarily a question of false-negative results. False positives can be meaningfully detected by other simulation approaches, as has been discussed elsewhere [e.g. @shennan_regional_2013; @edinborough_radiocarbon_2017] and as will be applied in a later step (see below).

# Methods

@@ -136,11 +136,11 @@ The overall approach and the implemented workflow consists of three main parts:

To simulate different densities of ^14^C dates, 18 scenarios were created (30--90 in steps of ten, 100--900 in steps of one hundred, 1000--2000 in steps of one thousand). For each scenario, 200 simulation runs were used. The whole process is controlled by a superimposed control structure.

-In the first part of the analysis, the original scenario of Contreras and Meadows [-@contreras_summed_2014] was reconstructed. The population curve was reconstructed and for different numbers of simulated samples, the signal was detected as described below. This process was repeated 200 times for each parameterization of the number of samples in order to obtain a statistical basis for the evaluation. The proportion of detected patterns was recorded, and the scenarios themselves were repeated 200 times to capture the range of variation between runs.
This resulted in 720,000 individual simulation runs. +In the first part of the analysis, the original scenario of Contreras and Meadows [-@contreras_summed_2014] was reconstructed. The population curve was reconstructed and for different numbers of simulated samples, the signal was detected as described below. This process was repeated 200 times for each parameterization of the number of samples in order to obtain a statistical basis for the evaluation. The proportion of detected patterns was recorded, and the scenarios themselves were repeated 200 times to capture the range of variation between runs. Although the scattering of the detection results with respect to the standard deviation of the successful detection is primarily a function of the sample size (200 repetitions) and the true detection rate, this exhaustive test setup was chosen in order to account for any nonlinear effects resulting from the shape of the calibration curve. This resulted in 720,000 individual simulation runs (200 batches of 200 simulations of 18 scenarios). The signal strength, i.e. the intensity with which the demographic signal decreases, is 77.4% in the 'real' data of the Black Plague. In the second part of the analysis, signal strengths of 30%-90% were simulated in steps of ten, respectively the data set of the Black Plague was changed in such a way that such a demographic change is predetermined by the data set. This results in a total of 126 scenarios. For each of the scenarios, 200 simulation runs were carried out, resulting in a total of 25,200 individual runs. The repetition of individual scenarios was omitted as this would have considerably increased the runtime of the algorithm. -This process was repeated for both settings including the test against false positives as described below. In total, the whole simulation includes 1,490,400 individual sum calibrations. The choice for the final number of runs and repetitions resulted from the total run time, which was 94480 seconds or 26 hours and 15 minutes (using parallel computing on 6 cores of an Intel(R) Xeon(R) CPU E3-1240 v5 at 3.50GHz with 16 GB RAM). +This process was repeated for both settings including the test against false positives as described below. In total, the whole simulation includes 1,490,400 individual sum calibrations. The choice for the final number of runs and repetitions resulted from the total run time, which was 94480 seconds or 26 hours and 15 minutes (using parallel computing on 6 cores of an Intel(R) Xeon(R) CPU E3-1240 v5 at 3.50GHz with 16 GB RAM).^[In the course of the review of this paper, both reviewers independently suggested that the simulation should be performed for other temporal positions in order to check whether the results of the signal detection are robust to artifacts in the calibration curve. I do not see per se any methodological reasons why this would lead to significantly different results, since the curve in this period is quite comparable e.g. with that of the later Neolithic (a plateau between 1100 - 1200 CE and a wiggle between 1300 - 1400 CE, comparable e.g. with a wiggle between 3500 - 3400 BCE and a plateau between 3300 - 3100 BCE). Nevertheless, this is an interesting starting point for a possible further paper, but it would go beyond the scope of the analysis presented here.] 
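The nested structure of scenarios, runs and batches described above can be summarised in a few lines of R. The following is only an illustrative sketch: `simulate_sum_cal()` stands in for the actual simulation pipeline (drawing calendar years weighted by the population curve, simulating ^14^C measurements and summing the calibrated dates), while `detect_pattern()` corresponds to the detection step described below; the authoritative control structure is the one implemented in the accompanying package.

```r
# Illustrative sketch of the scenario grid and batch structure (not the
# production code; simulate_sum_cal() is a placeholder for the actual
# simulation pipeline in the accompanying package).
n_samples_scenarios <- c(seq(30, 90, by = 10),       # 30--90 in steps of ten
                         seq(100, 900, by = 100),    # 100--900 in steps of one hundred
                         seq(1000, 2000, by = 1000)) # 1000--2000 in steps of one thousand
n_runs    <- 200  # simulation runs per batch
n_batches <- 200  # repetitions of every scenario to capture between-run variation

detection_rates <- sapply(n_samples_scenarios, function(n_samples) {
  replicate(n_batches, {
    detected <- replicate(n_runs, {
      this_sim <- simulate_sum_cal(n_samples)  # placeholder: simulate & sum-calibrate n dates
      detect_pattern(this_sim)                 # TRUE if the Black Death signal is found
    })
    mean(detected)                             # proportion of detected patterns per batch
  })
})
# 18 scenarios x 200 runs x 200 batches = 720,000 individual simulation runs
```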
## Simulation of the ^14^C dates @@ -174,7 +174,7 @@ do.call("grid.arrange", c(plot_collector, ncol=2)) The smoothing of the resulting calibration result with a moving average, as suggested by [@williams_use_2012] with a window of 500 years minimum, was considered, but rejected again. The reason for this is that the more turbulent curve of the calibration result produces a more realistic scenario (see fig. \@ref(fig:example-smoothed-vs-unsmoothed)). ## Detection of the signal -```{r example-rejected, fig.cap="Four examples of rejected results (signal not detected) using the original signal strength and 200 dates."} +```{r example-rejected, fig.cap="Four examples of rejected results (signal not detected) using the original signal strength and 200 dates. Orange Area: where a minimum should be present. Blue Area: Where the signal should be at least 10% higher than in the minimum on average."} rejected_sims <- 0 max_tries <- 50 i=0 @@ -188,7 +188,11 @@ while( rejected_sims < 4 & i < max_tries ){ smooth_sumcal_result(1) this_sim_detected <- detect_pattern(this_sim_result) if(!this_sim_detected) { - this_plot <- ggplot(this_sim_result, aes(x=dates,y=probabilities)) + geom_rect(aes(xmin=1310, xmax=1530, ymin=0, ymax=Inf), color="transparent", fill="orange", alpha=0.3) + geom_line() + theme_linedraw() + xlim(c(1000, 1800)) + this_plot <- ggplot(this_sim_result, aes(x=dates,y=probabilities)) + + geom_rect(aes(xmin=1310, xmax=1530, ymin=0, ymax=Inf), color="transparent", fill="orange", alpha=0.3) + + geom_rect(aes(xmin=1260 - 50, xmax=1260 + 50, ymin=0, ymax=Inf), color="transparent", fill="lightblue", alpha=0.3) + + geom_rect(aes(xmin=1580 - 50, xmax=1580 + 50, ymin=0, ymax=Inf), color="transparent", fill="lightblue", alpha=0.3) + + geom_line() + theme_linedraw() + xlim(c(1000, 1800)) plot_collector <- c(plot_collector, list(this_plot)) rejected_sims <- rejected_sims + 1 } @@ -197,7 +201,7 @@ while( rejected_sims < 4 & i < max_tries ){ do.call("grid.arrange", c(plot_collector, ncol=2)) ``` -```{r example-accepted, fig.cap="Four examples of accepted results (signal detected) using the original signal strength and 200 dates."} +```{r example-accepted, fig.cap="Four examples of accepted results (signal detected) using the original signal strength and 200 dates. Orange Area: where a minimum should be present. 
Blue Area: Where the signal should be at least 10% higher than in the minimum on average."} accepted_sims <- 0 max_tries <- 50 i=0 @@ -210,7 +214,11 @@ while( accepted_sims < 4 & i < max_tries ){ smooth_sumcal_result(1) this_sim_detected <- detect_pattern(this_sim_result) if(this_sim_detected) { - this_plot <- ggplot(this_sim_result, aes(x=dates,y=probabilities)) + geom_rect(aes(xmin=1310, xmax=1530, ymin=0, ymax=Inf), color="transparent", fill="orange", alpha=0.3) + geom_line() + theme_linedraw() + xlim(c(1000, 1800)) + this_plot <- ggplot(this_sim_result, aes(x=dates,y=probabilities)) + + geom_rect(aes(xmin=1310, xmax=1530, ymin=0, ymax=Inf), color="transparent", fill="orange", alpha=0.3) + + geom_rect(aes(xmin=1260 - 50, xmax=1260 + 50, ymin=0, ymax=Inf), color="transparent", fill="lightblue", alpha=0.3) + + geom_rect(aes(xmin=1580 - 50, xmax=1580 + 50, ymin=0, ymax=Inf), color="transparent", fill="lightblue", alpha=0.3) + + geom_line() + theme_linedraw() + xlim(c(1000, 1800)) plot_collector <- c(plot_collector, list(this_plot)) accepted_sims <- accepted_sims + 1 } @@ -219,7 +227,7 @@ while( accepted_sims < 4 & i < max_tries ){ do.call("grid.arrange", c(plot_collector, ncol=2)) ``` -To achieve an automated detection of the signal in the calibration result, an algorithm was written which performs this task. The local minima between 1210 and 1630 were recorded and the strongest minimum was selected. If this was not in the period between 1310 and 1530, i.e. the minimum in the population curve of the Black Plague, the result was discarded as non-match. It was then tested whether this minimum was at least 10% below the mean of the 100 years preceding and following the event with a lag of 50 years. Only if this was the case the signal was considered as detected. A selection of random examples of accepted and rejected calibration results can be found in Figure \@ref(fig:example-rejected) resp. \@ref(fig:example-accepted), or can be easily generated using the reproducible code itself. +To achieve an automated detection of the signal in the calibration result, an algorithm was written which performs this task. The local minima between 1210 and 1630 were recorded and the strongest minimum was selected. If this was not in the period between 1310 and 1530, i.e. the minimum in the population curve of the Black Plague, the result was discarded as non-match. It was then tested whether this minimum was at least 10% below the mean of the 100 years preceding and following the event with a lag of 50 years (1260 resp. 1580). Only if this was the case the signal was considered as detected. A selection of random examples of accepted and rejected calibration results can be found in Figure \@ref(fig:example-rejected) resp. \@ref(fig:example-accepted), or can be easily generated using the reproducible code itself. ## Combination of the results @@ -231,13 +239,15 @@ Accordingly, mean value, standard deviation, internal quartile and 95% interval Shennan et al. [-@shennan_regional_2013] used a Monte Carlo simulation method that produces simulated data distributions under an adjusted null model. These are then used to test characteristics in the observed data set for statistically significant patterns. A large number of individual simulations are carried out using the null model as the population curve, similar to the simulation technique described above. The interval in which the simulated data ranges reflects the element of random sample distribution. 
Since the 5% significance boundary is set as the statistical standard, the 95% interval (i.e. the quantiles 0.025 and 0.975) is usually taken from the simulated data. A signal, to be evaluated as significant and thus 'real', must lie outside this fluctuation range. -This approach, with slightly different settings, has since become established as the standard procedure for checking the patterns detected in sum calibrations. While, for example, Shennan et al. [-@shennan_regional_2013] uses an exponential generalized linear model for the null model, which is adapted to the data, a simpler approach is chosen here as in other publications [@hinz_chalcolithicbronze_2019]. The null model is a uniform distribution of the data within a specific time window. Thus, no assumption about a possible population development is made in advance, as would be the case with an exponential function in the sense of population growth. With this, I assume a stable population, and those events, which fall out of the hull generated by the simulation, can be considered as significantly different from this null model. A specific helper function is implemented in the package oxcAAR [@hinz_oxcaar_2018] (`oxcalSumSim()`), which can be used to easily perform such a simulation. It have to be noted that this function is based on `R_Simulate` of OxCal, and therefore shows rather wider uncertainty ranges than it would be necessary for `C_Simulate`. In the given context, this rather increases the robustness of the estimation. +This approach, with slightly different settings, has since become established as the standard procedure for checking the patterns detected in sum calibrations. While, for example, Shennan et al. [-@shennan_regional_2013] uses an exponential generalized linear model for the null model, which is adapted to the data, a simpler approach is chosen here as in other publications [@hinz_chalcolithicbronze_2019]. The null model is a uniform distribution of the data within a specific time window. Thus, no assumption about a possible population development is made in advance, as would be the case with an exponential function in the sense of population growth. With this, I assume a stable population, and those events, which fall out of the hull generated by the simulation, can be considered as significantly different from this null model. A specific helper function is implemented in the package oxcAAR [@hinz_oxcaar_2018] (`oxcalSumSim()`), which can be used to easily perform such a simulation. It has to be noted that this function is based on `R_Simulate` of OxCal, and therefore shows rather wider uncertainty ranges than it would be necessary for `C_Simulate`. In the given context, this rather increases the robustness of the estimation. + +For the original methodology of Shennan et al. [-@shennan_regional_2013] an extension has recently been proposed [@edinborough_radiocarbon_2017], which allows a more local and specific approach to hypothesis testing with respect to sum calibration. This expansion will not be further explored in the following, even though it has been successfully applied to the Black Death scenario. The reason is that in this paper I am mainly interested in the general detectability even in the absence of previous knowledge (as it may be available from literary sources), and therefore prefer the simplest possible parameterization. ## Reproducible Research in Simulation studies Reproducibility has not yet become the standard for archaeological analysis. 
In many cases, the way archaeological data are collected prevents complete reproducibility of results, as an excavation can only be carried out once. However, in the case of derived, secondary analyses, reproducibility is clearly a preferable design consideration in any research. This is all the more true for simulation studies, which naturally rely on random effects and should therefore be reproducible in their parameterization and which also create the perfect conditions for such a research design regarding their data base. -Unfortunately, especially in the field of summed ^14^C analyses, it is often the case that the argumentation relates on single observations or single calibration runs, i.e. only few results are presented pars pro toto. At the same time, the source code used to generate these numbers is usually not included in the paper and is also not accessible elsewhere. Therefore the results must be believed ab auctoritate. A listing of related papers is deliberately omitted here. +Unfortunately, especially in the field of summed ^14^C analyses, it is often the case that the argumentation relates on single observations or single calibration runs, i.e. only few results are presented pars pro toto. At the same time, the source code used to generate these numbers is usually not included in the paper and is also not accessible elsewhere. Therefore the results must be believed as *argumentum ab auctoritate*. A listing of related papers is deliberately omitted here. If the source code is available or at least reconstructable [as in @contreras_summed_2014], a big step towards reproducibility has already been taken. In this article I try to go one step further and choose an Open Science approach in the sense of reproducible research [in the sense of @marwick_computational_2017]. The code underlying the simulations is made available together with the article, based on the package rrtools (https://github.com/benmarwick/rrtools). It is available as an R package (sensitivity.sumcal.article.2020) and can be obtained directly (https://github.com/MartinHinz/sensitivity.sumcal.article.2020) or from a repository (Zenodo, doi: [10.5281/zenodo.3613674](https://doi.org/10.5281/zenodo.3613674)). With this, all results should be easily reproducible and verifiable, especially the settings of the simulation should be available for direct verification. @@ -256,21 +266,21 @@ render_orig_sim_result_table(result$orig_sim) ``` ```{r results='asis'} -table_caption("orig-sim-result-table", "Results from the simulation of the original setup of [@contreras_summed_2014].") +table_caption("orig-sim-result-table", "Results from the simulation (200 runs for each number of samples) of the original setup of [@contreras_summed_2014].") ``` -The results of the reproduction of the original scenario can be seen in Table \@ref(tab:orig-sim-result-table). For the situation of 1000 samples for 700 years described by the authors as super-ideal (results in a density of `r round(1000 / 700,2)`) a detection rate of `r round(mean(result$orig_sim["1000",]) * 100, 2)`% results. In half of the cases, the value was between `r round(quantile(result$orig_sim["1000",], c(0.25)) * 100, 2)`% and `r round(quantile(result$orig_sim["1000",], c(0.75)) * 100, 2)`%, 95% of the values lay between `r round(quantile(result$orig_sim["1000",], c(0.025)) * 100, 2)`% and `r round(quantile(result$orig_sim["1000",], c(0.975)) * 100, 2)`%. +The results of the reproduction of the original scenario can be seen in Table \@ref(tab:orig-sim-result-table). 
For the situation of 1000 samples for 700 years described by the authors as super-ideal (results in a density of `r round(1000 / 700,2)`) a detection rate of `r round(mean(result$orig_sim["1000",]) * 100, 1)`% results. In half of the cases, the value was between `r round(quantile(result$orig_sim["1000",], c(0.25)) * 100, 1)`% and `r round(quantile(result$orig_sim["1000",], c(0.75)) * 100, 1)`%, 95% of the values lay between `r round(quantile(result$orig_sim["1000",], c(0.025)) * 100, 1)`% and `r round(quantile(result$orig_sim["1000",], c(0.975)) * 100, 1)`%. -For the case of a sample size of 200 Contreras and Meadows [-@contreras_summed_2014, 601] estimated as realistic, the mean detection rate is `r round(mean(result$orig_sim["200",]) * 100, 2)`%, with the inner quartile between `r round(quantile(result$orig_sim["200",], c(0.25)) * 100, 2)`% and `r round(quantile(result$orig_sim["200",], c(0.75)) * 100, 2)`% and the 95% interval between `r round(quantile(result$orig_sim["200",], c(0.025)) * 100, 2)`% and `r round(quantile(result$orig_sim["200",], c(0.975)) * 100, 2)`%. +For the case of a sample size of 200 Contreras and Meadows [-@contreras_summed_2014, 601] estimated as realistic, the mean detection rate is `r round(mean(result$orig_sim["200",]) * 100, 1)`%, with the inner quartile between `r round(quantile(result$orig_sim["200",], c(0.25)) * 100, 1)`% and `r round(quantile(result$orig_sim["200",], c(0.75)) * 100, 1)`% and the 95% interval between `r round(quantile(result$orig_sim["200",], c(0.025)) * 100, 1)`% and `r round(quantile(result$orig_sim["200",], c(0.975)) * 100, 1)`%. -Thus, the estimation of Contreras and Meadows [-@contreras_summed_2014] was not completely unjustified. The signal could have been overlooked, following the original simulation setup, with a probability of 1/3. The detection chance seems to be relatively independent of the sample size. +Thus, the estimation of Contreras and Meadows [-@contreras_summed_2014] was not completely unjustified. The signal could have been overlooked, following the original simulation setup, with a probability of 1/3. The detection chance seems to be relatively independent of the sample size (once the sample density has surpassed 300). -```{r orig-sim-result-boxplot, fig.cap="The results of the simulation of the original setup with 100 runs for each number of samples, visualised as boxplot (comp. tab. \\@ref(tab:orig-sim-result-table))."} +```{r orig-sim-result-boxplot, fig.cap="The results of the simulation of the original setup with 200 runs for each number of samples, visualised as boxplot (comp. tab. \\@ref(tab:orig-sim-result-table))."} source(here::here('R/render_plots.R')) render_orig_sim_result_boxplot(result$orig_sim) ``` -```{r orig-sim-result-regression, fig.cap="The results of the simulation of the original setup with 100 runs for each number of samples, visualised as plot with smoothed trendline (comp. tab. \\@ref(tab:orig-sim-result-table)). Please note that the x-values are slightly jittered for better recognision of the individual dates, and the x-axis is logarithmic."} +```{r orig-sim-result-regression, fig.cap="The results of the simulation of the original setup with 200 runs for each number of samples, visualised as plot with smoothed trendline (comp. tab. \\@ref(tab:orig-sim-result-table)). 
Please note that the x-values are slightly jittered for better recognition of the individual dates, and the x-axis is logarithmic."}
render_orig_sim_result_regression(result$orig_sim)
```

@@ -296,17 +306,14 @@ render_full_sim_result_regression(result$full_sim)

It is obvious that the strength of the signal has a high influence on the detection rate (Fig. \@ref(fig:full-sim-result-regression)). Signals resulting from an underlying population reduced to 70% or less have a significantly higher detection rate, especially with higher sample numbers.

-
```{r full_sim_result_linear_model, echo=FALSE}
data_for_lm <- melt(result$full_sim)
colnames(data_for_lm) <- c("nsamples", "signal_strength", "p_signal_detected")
lm_sum <- summary(lm(p_signal_detected ~ nsamples + signal_strength, data = data_for_lm))
-
-lm_sum
```

-If the relationship between detection rate, sample size and signal strength is considered a linear model, then both factors are significant predictors for the detection rate, signal strength (coefficient of `r format(lm_sum$coefficients[2,1], scientific = T, digits = 3)` with a significance of `r format(lm_sum$coefficients[2,4], scientific = T, digits = 3)`) is clearly more dominant than the sample size (coefficient of `r format(lm_sum$coefficients[3,1], scientific = T, digits = 3)` with a significance of `r format(lm_sum$coefficients[3,4], scientific = T, digits = 3)`).
+If the relationship between detection rate, sample size and signal strength is considered a linear model (see Appendix A.1.), then both factors are significant predictors for the detection rate; signal strength (coefficient of `r format(lm_sum$coefficients[2,1], scientific = T, digits = 3)` with a p-value of `r format(lm_sum$coefficients[2,4], scientific = T, digits = 3)`) is clearly more dominant than the sample size (coefficient of `r format(lm_sum$coefficients[3,1], scientific = T, digits = 3)` with a p-value of `r format(lm_sum$coefficients[3,4], scientific = T, digits = 3)`).

It can be seen that a signal strength of 90% (corresponds to a reduction of 10%) with a small number of samples also shows a detection rate of more than 50%. This is rather surprising since the minimum difference necessary for recognition in the detection algorithm is set to 0.1. It is also surprising that this detection rate drops significantly with larger sample sizes (Fig. \@ref(fig:full-sim-result-regression)). This is a strong indication that false-positive signals, which result exclusively from the random distribution of the data and not from the underlying pattern, are also counted here. This touches one of the key questions posed by Contreras and Meadows [-@contreras_summed_2014]: Is it possible to distinguish real signals from false positives? To evaluate this, in a third step the same analysis was performed with the inclusion of a confidence envelope for false-positive signals.

@@ -324,10 +331,12 @@ table_caption("envelope-sim-result-table", "Results from the simulation of diffe
render_full_sim_result_regression(result$envelope_sim)
```

-In the same manner like the results above, Tab. \@ref(tab:envelope-sim-result-table) resp. Fig. \@ref(fig:envelope-sim-result-regression) visualise the effect of removal of false-positive pattern (section \@ref(elimination-of-false-positive-results)). In this version, the results for weak signals remain at a low level, while those for strong signals rise sharply from a sample size of about 200.
For all signal strengths greater than 0.6, at the latest for a sample size of 300 or more, these exceed the 50% mark. This implies that this method produces a much more reliable result and is a strong indicator of the effectiveness of this approach. The overall detection rate is significantly reduced, and it becomes clear that for reliable identification of events a much higher sample size is necessary than if the possible false positives are naively ignored.
+In the same manner as the results above, Tab. \@ref(tab:envelope-sim-result-table) and Fig. \@ref(fig:envelope-sim-result-regression) visualise the effect of the removal of false-positive patterns (section \@ref(elimination-of-false-positive-results)). In this version, the results for weak signals remain at a low level, while those for strong signals rise sharply from a sample size of about 200. For all signal strengths greater than 0.6, at the latest for a sample size of 300 or more, these exceed the 50% mark. This implies that this method produces a much more reliable result and is a strong indicator of the effectiveness of this approach. The overall detection rate is significantly reduced, and it becomes clear that for reliable identification of events a much higher sample size is necessary than if the possible false positives are naively ignored.

```{r envelope-orig-sim-result-table}
-render_full_sim_result_table(result$envelope_orig_sim)
+env_res_results <- result$envelope_orig_sim
+colnames(env_res_results)[1] <- "detection rate"
+render_full_sim_result_table(env_res_results)
```

```{r results='asis'}

@@ -338,11 +347,11 @@ table_caption("envelope-orig-sim-result-table", "Results from the simulation of
render_full_sim_result_regression(result$envelope_orig_sim)
```

-Finally, this combined method was applied again to the original example of the Black Plague with its fixed signal strength. The table \@ref(tab:envelope-orig-sim-result-table) or Fig. \@ref(fig:envelope-orig-sim-result-regression) show the corresponding results. The scope of the sample size was extended upwards. Considering possible false-positive results, the detection rate for this pattern is quite low. For the scenario with 200 data, corresponding to a density of `r round(200 / 700,2)` per year, there is a detection rate of `r result$envelope_orig_sim['200',]`, for a sample size of 1000, corresponding to a density of `r round(1000 / 700,2)` per year, the detection rate reaches `r result$envelope_orig_sim['200',]`. Only with a sample size of 2000, (density = `r round(2000 / 700, 2)`), a more or less reliable identification can be assumed.
+Finally, this combined method was applied again to the original example of the Black Plague with its fixed signal strength. Table \@ref(tab:envelope-orig-sim-result-table) and Fig. \@ref(fig:envelope-orig-sim-result-regression) show the corresponding results. The scope of the sample size was extended upwards. Considering possible false-positive results, the detection rate for this pattern is quite low. For the scenario with 200 data, corresponding to a density of `r round(200 / 700,2)` per year, there is a detection rate of `r result$envelope_orig_sim['200',]`; for a sample size of 1000, corresponding to a density of `r round(1000 / 700,2)` per year, the detection rate reaches `r result$envelope_orig_sim['1000',]`. Only with a sample size of 2000 (density = `r round(2000 / 700, 2)`) can a more or less reliable identification be assumed.
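To make the combined decision rule transparent, the following sketch condenses the two checks described in the Methods section (location and depth of the minimum, plus the comparison against the simulated confidence envelope) into base R. It assumes a calibration result `sumcal` with columns `dates` and `probabilities` on an annual grid and a matrix `null_spds` holding the summed probabilities of the null-model simulations on the same grid (one column per simulation); the function names are illustrative, and the package code remains the authoritative implementation.

```r
# Sketch of the two-stage decision rule (illustrative only; assumes `sumcal`
# with columns dates/probabilities and `null_spds` as a years x n_sim matrix
# of null-model SPDs on the same annual grid).
detect_minimum <- function(sumcal) {
  p   <- sumcal$probabilities
  yrs <- sumcal$dates
  # local minima within the search window 1210--1630 CE
  is_min <- c(FALSE, diff(sign(diff(p))) > 0, FALSE) & yrs >= 1210 & yrs <= 1630
  if (!any(is_min)) return(FALSE)
  strongest <- yrs[is_min][which.min(p[is_min])]
  if (strongest < 1310 || strongest > 1530) return(FALSE)  # not the Black Death minimum
  # flanking windows 1260 +/- 50 and 1580 +/- 50 years
  flank <- (yrs >= 1210 & yrs <= 1310) | (yrs >= 1530 & yrs <= 1630)
  min(p[is_min]) <= 0.9 * mean(p[flank])                   # at least 10% below the flanking mean
}

below_envelope <- function(sumcal, null_spds) {
  lower <- apply(null_spds, 1, quantile, probs = 0.025)    # lower bound of the 95% envelope
  any(sumcal$probabilities < lower & sumcal$dates >= 1310 & sumcal$dates <= 1530)
}

signal_detected <- function(sumcal, null_spds) {
  detect_minimum(sumcal) && below_envelope(sumcal, null_spds)
}
```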
# Discussion -When using estimators for reconstructions, it is clear that in addition to the existing uncertainties, the variability in the relationship of the estimator to the estimated variable has to be considered. Therefore, it is unrealistic to expect a perfect reconstruction. Nevertheless, the question of Contreras and Meadows [-@contreras_summed_2014] is, of course, justified as to whether such a disruptive event as the Black Plague could have been recognised through this methodology. Therefore, the example is very well chosen and was used here for the same reason. The answer to this question must be 'yes', even if one has to limit, 'not in every case', or more precisely, 'with `r round(mean(result$orig_sim["200",]) * 100, 2)`% probability', given the original setup of their analysis. +When using estimators for reconstructions, it is clear that in addition to the existing uncertainties, the variability in the relationship of the estimator to the estimated variable has to be considered. Therefore, it is unrealistic to expect a perfect reconstruction. Nevertheless, the question of Contreras and Meadows [-@contreras_summed_2014] is, of course, justified as to whether such a disruptive event as the Black Plague could have been recognised through this methodology. Therefore, the example is very well chosen and was used here for the same reason. The answer to this question must be 'yes', even if one has to limit, 'not in every case', or more precisely, 'with `r round(mean(result$orig_sim["200",]) * 100, 1)`% probability', given the original setup of their analysis. The other question raised by that paper is to what extent false positives can be distinguished from real signals. Here the approach of a hypothesis test based on bootstrapping was applied. This proved to be very capable of filtering out false positive signals due to their lower magnitude. On the other hand, when this parameterisation is applied, an average of 5% remains, which is not recognized as false positive. However, this is certainly by far an order of magnitude with which a science such as archaeology can operate, since very few approaches in our discipline offer 95 per cent confidence. @@ -352,7 +361,7 @@ A false-negative result, or in other words a Type II error, might be considered The results raise questions about the nature of the conclusions already made using ^14^C sum calibration. These results demonstrate distributions of ^14^C data on the temporal axis different from the results of random sampling processes. The simulation results in this study clearly support the assumption that the significance test based on a Monte Carlo simulation, in one form or another, is very well suited to filter out signals that only appear in the data due to sampling effects. Therefore, significant results are highly likely to show variations in the data background. The basic question is therefore not of a statistical nature, lies not in the insensitivity of the estimator himself, but rather in the fundamental methodological question of what information the absence of archaeological material at a certain period can provide us with. If, as in the case of the analysis of Shennan et al. [-@shennan_regional_2013], we see a reduction of 36% (signal strength of 0.64) and take it literally [estimated at @shennan_regional_2013, fig. 4, on 200 years rolling mean, with a peak at about 3500 BCE and a minimum at about 3000 BCE], is this a population change that we can assume to be realistic for a prehistoric population? 
If we take the estimates of [@muller_8_2015] into account, we are talking about 5 million people in Europe during this period. These would be reduced to about 3.2 million over 500 years. Another question is what true signal strength is indicated by an observed signal strength of 0.64, and how high the uncertainty range of such an estimate is. The latter estimate goes beyond the scope of this study but is an excellent question for another simulation-based study.

-The main problem in using summation calibration as a means of demographic reconstruction is therefore not the statistical conditions. If the sample size and sensitivity are too small, there are possibilities in this domain to identify such problems and to counteract them if necessary. The main problem lies rather in the often biased distribution by the production of the data in studies often carried out with specific scientific objectives as well as in the fact that often quite different deposition processes are treated equally. This is the original approach of Rick [-@rick_dates_1987], in which the number of data was largely equated directly with the intensity of human activity, even when he has already identified these biases. Therefore it is necessary to find appropriate countermeasures and to establish a best-practise catalogue for such investigations. An assessement of how strong a signal can be detected with what data density can be a valuable first step in direction of such a standardisation.
+The main problem in using summation calibration as a means of demographic reconstruction is therefore not the statistical conditions. If the sample size and the sensitivity are too low, there are possibilities in this domain to identify such problems and to counteract them if necessary. The main problem lies rather in the often biased distribution that results from data being produced in studies with specific scientific objectives, as well as in the fact that quite different deposition processes are often treated equally. This is the original approach of Rick [-@rick_dates_1987], in which the number of data was largely equated directly with the intensity of human activity, even though he had already identified these biases. Therefore it is necessary to find appropriate countermeasures and to establish a best-practice catalogue for such investigations. An assessment of how strong a signal can be detected at what data density can be a valuable first step in the direction of such a standardisation.

# Conclusion

In this article a simulation approach was used to move beyond the simple stateme
-### Colophon
+# Appendix
+### A.1. Linear Model of detection rate, sample size and signal strength
+
+```{r full_sim_result_linear_model_output}
+lm_sum
+```
+
+### A.2. Colophon

This report was generated on `r Sys.time()` using the following computational environment and dependencies:

diff --git a/analysis/paper/03_revised_version/paper_revised.docx b/analysis/paper/03_revised_version/paper_revised.docx
new file mode 100644
index 0000000..3ecd02d
Binary files /dev/null and b/analysis/paper/03_revised_version/paper_revised.docx differ
diff --git a/analysis/paper/03_revised_version/paper_revised.html b/analysis/paper/03_revised_version/paper_revised.html
new file mode 100644
index 0000000..db91d28
--- /dev/null
+++ b/analysis/paper/03_revised_version/paper_revised.html
@@ -0,0 +1,2197 @@
[2,197 lines of rendered HTML output (the knitted version of paper_revised.Rmd) omitted]
+ + + + + + + +
+

1 Introduction

+

In recent years, the use of more or less large collections of 14C data has almost become a standard tool as estimators for demographic developments of the past. The original approach of Rick (Rick 1987) assumes that ‘if archaeologists recovered and dated a random, known percentage of the carbon from a perfectly preserved carbon deposit to which each person-year of occupation contributed an equal and known amount, they could estimate the number of people who inhabited a region during a given period’ (Rick 1987: 56).

+

There is a history of debate on the use of radiocarbon SPDs. Several authors use the method in different elaborations to identify past processes (most often demographic processes, eg. Armit, Swindles and Becker 2013; Buchanan, Collard and Edinborough 2008; Collard et al. 2013; Gamble et al. 2005; Gkiasta et al. 2003; Hinz et al. 2012; Hoffmann, Lang and Dikau 2008; Johnstone, Macklin and Lewin 2006; Kelly et al. 2013; Mulrooney 2013; Rick 1987; Riede 2009; Rieth et al. 2011; Shennan 2009, 2012; Shennan and Edinborough 2007; Tallavaara, Pesonen and Oinonen 2010; Timpson et al. 2014; Whitehouse et al. 2014). Others reject the method in general, criticise certain aspects or point out weaknesses (Ballenger and Mabry 2011; Bamforth and Grund 2012; Bayliss et al. 2007; Bleicher 2013; Chiverrell, Thorndycraft and Hoffmann 2011; Contreras and Meadows 2014; Crombé and Robinson 2014; Culleton 2008; Prates, Politis and Steele 2013; Steele 2010; Surovell and Brantingham 2007; Surovell et al. 2009; Torfing 2015; Williams 2012). A critical assessment is very helpful for improving a method or identifying its shortcomings. However, what many critics do not emphasize, and many supporters do not sufficiently consider, is that already in Ricks original paper (Rick 1987: 57–59, fig. 1) the essential sources of error (Intervening, Creation, Preservation, and Investigation biases) have been identified.

+

Due to the widespread use of the methodology in recent years, it is essential to explore the conditions for the meaningful use of sum calibration. Ideally, a catalogue of prerequisites and methodological requirements should be compiled providing a comparable and high standard for the usage of this estimator and a quantification of the uncertainty in its application.

+

This paper examines some of the objections with the help of simulations. This relates in particular to the methodological questions related to the sensitivity of the method as put forward by Contreras and Meadows (2014). Similar to that paper, it is examined whether 14C summations can be used to identify patterns that can be related to population fluctuations in the past. Thereby it is assumed that the amount of archaeological and thus datable material reflects those changes in particular. The sampling bias and its effects will be addressed. For this purpose, simulated and thus artificial 14C dates are generated, drawing from a probability curve based on a historical event – the Black Plague. For different data densities (average number of samples per year), a random sample of years is drawn from the period in question. The probability of each year corresponds to the relative demographic trend for the same year. This means, that the probability of drawing a sample from a particular date is exactly proportional to the population size in that year. The data simulated in this way is treated using Oxcal in the same way as the data is processed in existing studies (i.e. using Oxcal’s Sum command). It is then checked whether the given patterns can be found in the calibration results. It should be noted that, as in the above-mentioned study, no further measures against other sources of error apart from the sampling bias are taken. Above all, no binning is used to standardise the data per site, which has now been established as a standard procedure. On the one hand, such an error does not exist in the simulated data, on the other hand, the method used should be as close as possible to that of the paper by Contreras and Meadows (2014). The main focus is on enriching the unquantified results of that study with a quantification of the detection probability.

+
+
+

2 Background

+
+
+ +

+Figure 2.1: Interdependency between amount of information, intensity of pattern and desired uncertainty. +

+
+

The potential of using SPDs to assert a statement about past (perhaps demographic) processes depends on three factors (see 2.1):

+ +

If a very strong signal is to be detected, less data may be sufficient to be able to identify it with a specified uncertainty. Conversely, increased certainty about the validity of the signal requires either more data or a stronger signal. This means that strong demographic fluctuations in the past can be detected with greater certainty even based on smaller amounts of 14C datings, whereby more data would be required in the case of weaker fluctuations. These relationships are fundamental to any kind of statistical hypothesis test.

+
+Population development during the time of the Black Plague, according to Contreras and Meadows [-@contreras_summed_2014]. +

+Figure 2.2: Population development during the time of the Black Plague, according to Contreras and Meadows (2014). +

+
+

In their article, Contreras and Meadows (2014) worked on this question. They put all other possible methodological problems to one side (although they did elaborate on them in detail) and investigated how well simulated demographic changes can be tracked by their effect on simulated 14C data. The paper went much further than others due to its simulation approach and its results are therefore largely transparent. This is a very valuable and useful contribution to the debate. Their main case study is the Black Plague, whose demographic influence can be adequately understood from written sources. They describe their demographic example as: ‘In our population curve, after rising relatively steadily for the first three centuries of this period, population declined abruptly between AD 1310 (87 million) and 1350 (71 million), and further declined to 67 million by AD 1415, before recovering to 79 million(AD 1451) and finally overtaking its pre-Black Death peak in c. AD1550’ (Contreras and Meadows 2014, comp. also Fig. 2.2) However, the details of the setting are irrelevant for the specific investigation, it could have been any arbitrary or randomly generated example. The chosen one serves the authors above all to show that such a devastating event as the Black Plague could remain undiscovered by 14C summations.

+

When using an over-representatively high number of data, or such a data density, with 1000 data for a period of 1000 – 1700 BCE (density 1.43 data per year), the Black Plague would basically emerge (Contreras and Meadows 2014: 3). However, they argue that but the strength of the event in the resulting signal could not be attributed to such a disaster without prior knowledge (Contreras and Meadows 2014: 599). At a density they consider to be closer to the archaeological reality – the authors assume this to be 0.29, representing 200 data for 700 years – the sampling effect would prevent the underlying demographic processes from being properly represented by the simulated 14C data (Contreras and Meadows 2014: 6). They write: ‘Not only is the departure of these curves from the population distribution from which they are derived evident; the variability between samples is also notable: the most prominent fluctuations in each curve are not visible in most of the others’ (Contreras and Meadows 2014: 601). In general, the data density is decisive for the effectiveness of this estimator, whereby even with the maximum simulated number of dates (2000) the Black Death is ‘far from obvious’ as an event (Contreras and Meadows 2014: 602). In addition, they argue, the temporal fixation of the event is problematic due to the scatter effect especially of legacy data with high standard deviation. Thus it would be not possible to separate signal from noise, to separate false-positive and false-negative from real results, and to identify the exact timing and magnitude of the underlying phenomenon (Contreras and Meadows 2014: 603–605). In their concluding remarks they consequently state that ‘even under ideal conditions, it is difficult to distinguish between real and spurious population patterns, or to accurately date sharp fluctuations, even with data densities much higher than in most published attempts’ (Contreras and Meadows 2014: 605).

+

With all the importance that the simulation approach adds to this paper, unfortunately, the authors do not use its full potential. Although creating different scenarios of data density, each is only examine with five simulation runs (for 200, 1000 and 2000 samples respectively) (Contreras and Meadows 2014: 596). Even if five is more than one, this certainly does not represent a statistically reliable basis for a far-reaching statement. In addition, they state as paraphrased above, that the Black Plague could have remained undetected, without further specification or quantification. A significantly higher number of simulations might be mandatory for such a statement. A very important step in this direction has already been taken by (McLaughlin 2019), who has reviewed the Black Plague scenario in his article on using the KDE model for similar analyses, and who has already come up with detection rates. A perfect pattern recognition was achieved with a sample number of 3000. Here, however, only 30 simulation runs were checked in each case, and the effect strength was not varied.

+

Precisely against this background the triangle of effect strength, data quantity and certainty of identification should be quantified here. Using the same basic pattern, the Black Plague, the aim is to determine, for different scenarios of effect strength and data quantity, in how many of cases such a demographic catastrophe could have remained undetected. It is primarily a question of false-negative results. False positives can be meaningfully detected by other simulation approaches, as it has been discussed elsewhere (eg. Shennan et al. 2013; Edinborough et al. 2017) and as it will be applied in a later step (see below).

+
+
+

3 Methods

+

The overall approach and the implemented workflow consists of three main parts:

+
    +
  1. The simulation of the 14C data from the underlying population curve,
  2. +
  3. the identification of the signal from the resulting summation curve, and
  4. +
  5. the combination of the results from the individual simulation runs.
  6. +
+

To simulate different densities of 14C dates, 18 scenarios were created (30–90 in steps of ten, 100–900 in steps of one hundred, 1000–2000 in steps of one thousand). For each scenario, 200 simulation runs were used. The whole process is controlled by a superimposed control structure.

+

In the first part of the analysis, the original scenario of Contreras and Meadows (2014) was reconstructed. The population curve was reconstructed and for different numbers of simulated samples, the signal was detected as described below. This process was repeated 200 times for each parameterization of the number of samples in order to obtain a statistical basis for the evaluation. The proportion of detected patterns was recorded, and the scenarios themselves were repeated 200 times to capture the range of variation between runs. Although the scattering of the detection results with respect to the standard deviation of the successful detection is primarily a function of the sample size (200 repetitions) and the true detection rate, this exhaustive test setup was chosen in order to account for any nonlinear effects resulting from the shape of the calibration curve. This resulted in 720,000 individual simulation runs (200 batches of 200 simulations of 18 scenarios).

+

The signal strength, i.e. the intensity with which the demographic signal decreases, is 77.4% in the ‘real’ data of the Black Plague. In the second part of the analysis, signal strengths of 30%-90% were simulated in steps of ten, respectively the data set of the Black Plague was changed in such a way that such a demographic change is predetermined by the data set. This results in a total of 126 scenarios. For each of the scenarios, 200 simulation runs were carried out, resulting in a total of 25,200 individual runs. The repetition of individual scenarios was omitted as this would have considerably increased the runtime of the algorithm.

+

This process was repeated for both settings including the test against false positives as described below. In total, the whole simulation includes 1,490,400 individual sum calibrations. The choice for the final number of runs and repetitions resulted from the total run time, which was 94480 seconds or 26 hours and 15 minutes (using parallel computing on 6 cores of an Intel(R) Xeon(R) CPU E3-1240 v5 at 3.50GHz with 16 GB RAM).[^1] +[^1]: In the course of the review of this paper, both reviewers independently suggested that the simulation should be performed for other temporal positions in order to check whether the results of the signal detection are robust to artifacts in the calibration curve. I do not see per se any methodological reasons why this would lead to significantly different results, since the curve in this period is quite comparable e.g. with that of the later Neolithic (a plateau between 1100 - 1200 CE and a wiggle between 1300 - 1400 CE, comparable e.g. with a wiggle between 3500 - 3400 BCE and a plateau between 3300 - 3100 BCE). Nevertheless, this is an interesting starting point for a possible further paper, but it would go beyond the scope of the analysis presented here.

+
+

3.1 Simulation of the 14C dates

+

For the simulation of the 14C data, the original curve from Contreras and Meadows (2014) was used. The data was converted into numerical values by digitizing (using the software Engauge). The corresponding data set is attached as supplementary material or can be accessed in the reproducible analysis.

+

The population numbers were then interpolated by linear approximation on an annual basis and converted into a probability distribution by normalization to the sum of 1. This distribution then served as weighting for a random drawing of calendar dates representing the individual sample. The sample size was defined as a scenario based on the given parameterisation (see above). In the second part of the analysis, this distribution was changed by parameterising the signal strength by linear rescaling in such a way that the drop from the peak before the demographic signal to the minimum of the curve corresponds to the given signal strength.

+

The random years obtained in this way, whose frequency corresponds to the given population curve of the Black Plague, were then processed as a sum calibration using C_Simulate and Sum and calibrated via OxCal (using the package oxcAAR, Hinz et al. 2018). As already in (Contreras and Meadows 2014), the standard deviation was randomly sampled equally distributed in the range of 20-40 years.

+
+Comparing the same result of a random sum calibration unsmoothed (left) and smoothed (right) with a 2-sided smoothing window of total 500 years. +

+Figure 3.1: Comparing the same result of a random sum calibration unsmoothed (left) and smoothed (right) with a 2-sided smoothing window of total 500 years. +

+
+

The smoothing of the resulting calibration result with a moving average, as suggested by (Williams 2012) with a window of 500 years minimum, was considered, but rejected again. The reason for this is that the more turbulent curve of the calibration result produces a more realistic scenario (see fig. 3.1).

+
+
+

3.2 Detection of the signal

+
+Four examples of rejected results (signal not detected) using the original signal strength and 200 dates. Orange Area: where a minimum should be present. Blue Area: Where the signal should be at least 10% higher than in the minimum on average. +

+Figure 3.2: Four examples of rejected results (signal not detected) using the original signal strength and 200 dates. Orange Area: where a minimum should be present. Blue Area: Where the signal should be at least 10% higher than in the minimum on average. +

+
+
+Figure 3.3: Four examples of accepted results (signal detected) using the original signal strength and 200 dates. Orange area: where a minimum should be present. Blue area: where the signal should on average be at least 10% higher than in the minimum.

+
+

To achieve an automated detection of the signal in the calibration result, an algorithm was written that performs this task. The local minima between 1210 and 1630 were recorded and the strongest minimum was selected. If this did not lie in the period between 1310 and 1530, i.e. the minimum in the population curve of the Black Plague, the result was discarded as a non-match. It was then tested whether this minimum was at least 10% below the mean of the 100 years preceding and following the event, with a lag of 50 years (i.e. up to 1260 and from 1580 onwards, respectively). Only if this was the case was the signal considered detected. A selection of random examples of rejected and accepted calibration results can be found in Figures 3.2 and 3.3, respectively, or can easily be generated using the reproducible code itself.
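Condensed into code, the detection logic can be sketched as follows. The function and argument names are mine and the input format (`spd` as a data frame with columns `year` and `prob`) is an assumption; the thresholds are the ones described in the text.

```r
detect_signal <- function(spd, search = c(1210, 1630), event = c(1310, 1530),
                          lag = 50, window = 100, threshold = 0.1) {
  sub <- spd[spd$year >= search[1] & spd$year <= search[2], ]
  p   <- sub$prob
  # local minima: interior values lower than both neighbours
  is_min <- c(FALSE,
              p[-c(1, length(p))] < p[-c(length(p) - 1, length(p))] &
              p[-c(1, length(p))] < p[-(1:2)],
              FALSE)
  if (!any(is_min)) return(FALSE)
  strongest <- sub[is_min, ][which.min(p[is_min]), ]
  # 1. the strongest minimum has to fall into the expected event window
  if (strongest$year < event[1] || strongest$year > event[2]) return(FALSE)
  # 2. it has to be at least 10% below the mean of the two flanking 100-year
  #    windows, offset by a lag of 50 years (i.e. up to 1260 and from 1580 on)
  before <- spd$prob[spd$year >= event[1] - lag - window & spd$year <= event[1] - lag]
  after  <- spd$prob[spd$year >= event[2] + lag & spd$year <= event[2] + lag + window]
  strongest$prob <= (1 - threshold) * mean(c(before, after))
}
```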

+
+
+

3.3 Combination of the results

+

The results of the individual runs were recorded and stored in tabular form. For the first part of the analysis, with the signal strength fixed to the value corresponding to the original study (Contreras and Meadows 2014), the number of detections per run, normalised to the total number of runs, was recorded. For the second part, since only one run with 200 repetitions was performed per scenario, only one value was recorded for each scenario.

+

Accordingly, mean, standard deviation, inner quartiles and 95% interval can be calculated for the original scenario. For the second part, on the other hand, only one value per scenario is shown, but the influence of sample size and signal strength can be assessed individually.
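From the recorded detection proportions, these summary statistics can be derived with a few lines of base R; the object and column names here are illustrative only.

```r
# Summary of the detection proportions of one scenario (invented example values).
summarise_scenario <- function(prop_detected) {
  c(mean = mean(prop_detected),
    sd   = sd(prop_detected),
    quantile(prop_detected, probs = c(0.25, 0.75)),    # inner quartiles
    quantile(prop_detected, probs = c(0.025, 0.975)))  # 95% interval
}

summarise_scenario(rbeta(200, 7, 3))
```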

+
+
+

3.4 Elimination of false positive results

+

Shennan et al. (2013) used a Monte Carlo simulation method that produces simulated data distributions under an adjusted null model. These are then used to test whether characteristics of the observed data set represent statistically significant patterns. A large number of individual simulations are carried out using the null model as the population curve, similar to the simulation technique described above. The range covered by the simulated data reflects the element of random sampling variation. Since the 5% significance boundary is the statistical standard, the 95% interval (i.e. the quantiles 0.025 and 0.975) is usually taken from the simulated data. A signal, to be evaluated as significant and thus ‘real’, must lie outside this fluctuation range.
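Stripped of all specifics, the envelope test can be sketched like this (the helper `simulate_null_spd()` is invented for the illustration and only mimics a sum calibration by a kernel density of uniformly drawn dates).

```r
# One simulation run under the null model: dates drawn uniformly in the window.
simulate_null_spd <- function(n_dates, years = 1000:1700) {
  draws <- sample(years, n_dates, replace = TRUE)
  density(draws, from = min(years), to = max(years), n = length(years))$y
}

set.seed(1)
null_spds <- replicate(1000, simulate_null_spd(n_dates = 200))

# per-year 95% envelope (quantiles 0.025 and 0.975) of the null model
envelope <- apply(null_spds, 1, quantile, probs = c(0.025, 0.975))

# an observed dip only counts as significant if it falls below the lower bound
observed        <- simulate_null_spd(200)        # placeholder for the observed SPD
significant_low <- observed < envelope[1, ]
```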

+

This approach, with slightly different settings, has since become established as the standard procedure for checking the patterns detected in sum calibrations. While, for example, Shennan et al. (2013) use an exponential generalised linear model fitted to the data as the null model, a simpler approach is chosen here, as in other publications (Hinz et al. 2019). The null model is a uniform distribution of the data within a specific time window. Thus, no assumption about a possible population development is made in advance, as would be the case with an exponential function in the sense of population growth. With this, I assume a stable population, and those events that fall outside the hull generated by the simulation can be considered significantly different from this null model. A specific helper function, oxcalSumSim(), is implemented in the package oxcAAR (Hinz et al. 2018) and can be used to easily perform such a simulation. It has to be noted that this function is based on R_Simulate of OxCal and therefore shows somewhat wider uncertainty ranges than would be necessary for C_Simulate. In the given context, this rather increases the robustness of the estimation.

+

For the original methodology of Shennan et al. (2013), an extension has recently been proposed (Edinborough et al. 2017) that allows a more local and specific approach to hypothesis testing with respect to sum calibration. This extension will not be explored further in the following, even though it has been successfully applied to the Black Death scenario. The reason is that in this paper I am mainly interested in the general detectability even in the absence of prior knowledge (as it may be available from literary sources), and therefore prefer the simplest possible parameterisation.

+
+
+

3.5 Reproducible Research in Simulation studies

+

Reproducibility has not yet become the standard for archaeological analysis. In many cases, the way archaeological data are collected prevents complete reproducibility of results, as an excavation can only be carried out once. However, in the case of derived, secondary analyses, reproducibility is clearly a preferable design consideration in any research. This is all the more true for simulation studies, which naturally rely on random effects and should therefore be reproducible in their parameterisation, and which, given their data basis, also provide ideal conditions for such a research design.

+

Unfortunately, especially in the field of summed 14C analyses, the argumentation often relies on single observations or single calibration runs, i.e. only a few results are presented pars pro toto. At the same time, the source code used to generate these numbers is usually not included in the paper and is not accessible elsewhere either. The results must therefore be taken on trust, as an argumentum ab auctoritate. A listing of related papers is deliberately omitted here.

+

If the source code is available or at least reconstructible (as in Contreras and Meadows 2014), a big step towards reproducibility has already been taken. In this article I try to go one step further and choose an Open Science approach in the sense of reproducible research (Marwick 2017). The code underlying the simulations is made available together with the article, based on the package rrtools (https://github.com/benmarwick/rrtools). It is available as an R package (sensitivity.sumcal.article.2020) and can be obtained directly (https://github.com/MartinHinz/sensitivity.sumcal.article.2020) or from a repository (Zenodo, doi: 10.5281/zenodo.3613674). With this, all results should be easily reproducible and verifiable; in particular, the settings of the simulation are available for direct verification.
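In practice this means that, assuming a standard rrtools-style compendium, the package can presumably be installed and loaded directly from the stated GitHub repository, for example via the `remotes` package:

```r
# install the research compendium with the simulation code (GitHub URL as given above)
# install.packages("remotes")
remotes::install_github("MartinHinz/sensitivity.sumcal.article.2020")
library(sensitivity.sumcal.article.2020)
```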

+
+
+
+

4 Results

+
+

4.1 Original Setup

+

| number of samples | sample density (per year) | mean proportion signal detected | standard deviation proportion signal detected | inner quartiles | 95% quantiles |
|---|---|---|---|---|---|
| 30 | 0.043 | 0.611 | 0.035 | 0.585 – 0.635 | 0.545 – 0.68 |
| 40 | 0.057 | 0.622 | 0.035 | 0.595 – 0.645 | 0.555 – 0.69 |
| 50 | 0.071 | 0.630 | 0.033 | 0.61 – 0.65 | 0.57 – 0.69 |
| 60 | 0.086 | 0.641 | 0.037 | 0.615 – 0.666 | 0.575 – 0.71 |
| 70 | 0.100 | 0.643 | 0.033 | 0.62 – 0.665 | 0.58 – 0.705 |
| 80 | 0.114 | 0.645 | 0.037 | 0.62 – 0.666 | 0.57 – 0.72 |
| 90 | 0.129 | 0.656 | 0.037 | 0.635 – 0.68 | 0.585 – 0.73 |
| 100 | 0.143 | 0.659 | 0.034 | 0.63 – 0.685 | 0.6 – 0.72 |
| 200 | 0.286 | 0.689 | 0.035 | 0.665 – 0.71 | 0.63 – 0.76 |
| 300 | 0.429 | 0.711 | 0.030 | 0.69 – 0.735 | 0.655 – 0.765 |
| 400 | 0.571 | 0.719 | 0.032 | 0.695 – 0.74 | 0.655 – 0.78 |
| 500 | 0.714 | 0.725 | 0.029 | 0.71 – 0.74 | 0.67 – 0.78 |
| 600 | 0.857 | 0.723 | 0.032 | 0.7 – 0.745 | 0.66 – 0.785 |
| 700 | 1.000 | 0.720 | 0.032 | 0.7 – 0.74 | 0.655 – 0.775 |
| 800 | 1.143 | 0.722 | 0.033 | 0.695 – 0.745 | 0.665 – 0.78 |
| 900 | 1.286 | 0.728 | 0.030 | 0.705 – 0.75 | 0.675 – 0.78 |
| 1000 | 1.429 | 0.721 | 0.029 | 0.7 – 0.74 | 0.665 – 0.77 |
| 2000 | 2.857 | 0.714 | 0.034 | 0.69 – 0.735 | 0.64 – 0.77 |

+Table 4.1: Results from the simulation (200 runs for each number of samples) of the original setup of Contreras and Meadows (2014).
+

The results of the reproduction of the original scenario can be seen in Table 4.1. For the situation of 1000 samples for 700 years, described by the authors as super-ideal (resulting in a density of 1.43), a detection rate of 72.1% is obtained. In half of the cases the value was between 70% and 74%; 95% of the values lay between 66.5% and 77%.

+

For the sample size of 200 that Contreras and Meadows (2014: 601) estimated as realistic, the mean detection rate is 68.9%, with the inner quartiles between 66.5% and 71% and the 95% interval between 63% and 76%.

+

Thus, the estimation of Contreras and Meadows (2014) was not completely unjustified: following the original simulation setup, the signal could have been overlooked with a probability of about one third. The detection chance, however, seems to be relatively independent of the sample size once the sample size has surpassed about 300 (a density of about 0.43).

+
+Figure 4.1: The results of the simulation of the original setup with 200 runs for each number of samples, visualised as a boxplot (cf. Tab. 4.1).

+
+
+Figure 4.2: The results of the simulation of the original setup with 200 runs for each number of samples, visualised as a plot with smoothed trend line (cf. Tab. 4.1). Please note that the x-values are slightly jittered for better recognition of the individual data points, and that the x-axis is logarithmic.

+
+

This can be seen in Table 4.1, and perhaps more clearly in the box plot of the results (Fig. 4.1) or the representation as a regression (with logarithmic x-axis, Fig. 4.2). Up to a sample size of about 300, corresponding to a density of 0.43 dates per year, the detection rate improves and then reaches a plateau.

+

The results indicate that, on the one hand, there is a clear chance of detecting an event like the Black Plague with a tool like sum-calibrated 14C data, if we leave aside the discussion of other methodological problems at this point. On the other hand, beyond a certain point the sample size seems to have little further influence on improving detection. Thus the systematic application of the simulation experiment of Contreras and Meadows (2014) cannot confirm the interpretations that they themselves deduce from their results. In this setup, it is not the number of samples that leads to a significant improvement in the detection rate. It is true that individual sum calibrations deviate considerably from the given curve of the underlying population. But through formalised detection with fixed parameters, it is still possible to detect events within the given time window with a relatively high probability. Before we turn to the question of what exactly these events represent and how well we can separate false positive from true positive results (section 4.3), the influence of signal strength should be examined.

+
+
+

4.2 Altering Signal Strength

+

| number of samples | sample density (per year) | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|
| 30 | 0.043 | 0.805 | 0.760 | 0.680 | 0.680 | 0.700 | 0.550 | 0.605 |
| 40 | 0.057 | 0.805 | 0.775 | 0.700 | 0.680 | 0.640 | 0.610 | 0.615 |
| 50 | 0.071 | 0.860 | 0.860 | 0.720 | 0.775 | 0.660 | 0.605 | 0.600 |
| 60 | 0.086 | 0.905 | 0.820 | 0.745 | 0.705 | 0.645 | 0.635 | 0.510 |
| 70 | 0.100 | 0.880 | 0.820 | 0.790 | 0.750 | 0.695 | 0.625 | 0.545 |
| 80 | 0.114 | 0.855 | 0.815 | 0.785 | 0.735 | 0.675 | 0.645 | 0.595 |
| 90 | 0.129 | 0.850 | 0.855 | 0.820 | 0.720 | 0.650 | 0.675 | 0.585 |
| 100 | 0.143 | 0.950 | 0.885 | 0.835 | 0.755 | 0.685 | 0.640 | 0.610 |
| 200 | 0.286 | 0.980 | 0.950 | 0.880 | 0.845 | 0.730 | 0.685 | 0.620 |
| 300 | 0.429 | 1.000 | 0.990 | 0.920 | 0.885 | 0.840 | 0.680 | 0.565 |
| 400 | 0.571 | 1.000 | 0.990 | 0.975 | 0.900 | 0.820 | 0.645 | 0.595 |
| 500 | 0.714 | 1.000 | 1.000 | 0.985 | 0.900 | 0.835 | 0.680 | 0.535 |
| 600 | 0.857 | 1.000 | 1.000 | 0.975 | 0.945 | 0.795 | 0.635 | 0.505 |
| 700 | 1.000 | 1.000 | 1.000 | 0.990 | 0.935 | 0.845 | 0.705 | 0.510 |
| 800 | 1.143 | 1.000 | 1.000 | 0.990 | 0.955 | 0.810 | 0.685 | 0.505 |
| 900 | 1.286 | 1.000 | 1.000 | 0.990 | 0.970 | 0.850 | 0.730 | 0.485 |
| 1000 | 1.429 | 1.000 | 1.000 | 0.995 | 0.975 | 0.910 | 0.730 | 0.435 |
| 2000 | 2.857 | 1.000 | 1.000 | 1.000 | 0.985 | 0.885 | 0.615 | 0.365 |

+Table 4.2: Results from the simulation of different signal strengths (column headings 0.3–0.9 denote the signal strength).
+

In the second part of the analysis, the intensity of the signal was, as described above, parameterised differently in order to check the influence of a stronger or weaker signal and thus to be able to predict the detection possibilities of demographic changes of different intensity. Table 4.2 shows the mean detection rates for the different scenarios. The originally used signal strength of 77.81% corresponds most closely to 0.8 in this parameterisation, which in this simulation leads to an average detection rate of 0.685 for 200 data or 0.73 for 1000 data. The results are therefore generally comparable with the reconstruction of the original simulation.

+
+Figure 4.3: The results of the simulation of different signal strengths with 100 runs for each number of samples (cf. Tab. 4.2). Please note that the x-axis is logarithmic.

+
+

It is obvious that the strength of the signal has a strong influence on the detection rate (Fig. 4.3). Signals resulting from an underlying population reduced to 70% or less have a significantly higher detection rate, especially with higher sample numbers.

+

If the relationship between detection rate, sample size and signal strength is modelled as a linear model (see Appendix A.1), both factors are significant predictors of the detection rate, whereby the signal strength (coefficient of -6.53e-01 with a p-value < 2e-16) is clearly more dominant than the sample size (coefficient of 8.89e-05 with a p-value of 3.56e-08).
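This corresponds to the model call reported in Appendix A.1 and can be refitted from the recorded simulation output (the data frame and column names follow the appendix listing):

```r
# linear model of the detection rate as a function of sample size and signal strength
model <- lm(p_signal_detected ~ nsamples + signal_strength, data = data_for_lm)
summary(model)
```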

+

It can be seen that a signal strength of 90% (corresponding to a reduction of 10%) also shows a detection rate of more than 50% with a small number of samples. This is rather surprising, since the minimum difference necessary for recognition in the detection algorithm is set to 0.1. It is also surprising that this detection rate drops significantly with larger sample sizes (Fig. 4.3). This is a strong indication that false-positive signals, which result exclusively from the random distribution of the data and not from the underlying pattern, are also counted here. This touches on one of the key questions posed by Contreras and Meadows (2014): is it possible to distinguish real signals from false positives? To evaluate this, in a third step the same analysis was performed with the inclusion of a confidence envelope for false-positive signals.

+
+
+

4.3 Results with testing for false positives

+

| number of samples | sample density (per year) | 0.3 | 0.4 | 0.5 | 0.6 | 0.7 | 0.8 | 0.9 |
|---|---|---|---|---|---|---|---|---|
| 30 | 0.043 | 0.245 | 0.230 | 0.175 | 0.130 | 0.155 | 0.185 | 0.130 |
| 40 | 0.057 | 0.225 | 0.255 | 0.165 | 0.135 | 0.190 | 0.110 | 0.105 |
| 50 | 0.071 | 0.280 | 0.285 | 0.230 | 0.185 | 0.130 | 0.140 | 0.125 |
| 60 | 0.086 | 0.300 | 0.245 | 0.195 | 0.140 | 0.185 | 0.150 | 0.110 |
| 70 | 0.100 | 0.425 | 0.270 | 0.225 | 0.230 | 0.200 | 0.140 | 0.090 |
| 80 | 0.114 | 0.425 | 0.330 | 0.210 | 0.260 | 0.130 | 0.115 | 0.135 |
| 90 | 0.129 | 0.415 | 0.350 | 0.185 | 0.200 | 0.140 | 0.145 | 0.100 |
| 100 | 0.143 | 0.560 | 0.390 | 0.275 | 0.195 | 0.170 | 0.145 | 0.090 |
| 200 | 0.286 | 0.810 | 0.590 | 0.495 | 0.350 | 0.215 | 0.175 | 0.120 |
| 300 | 0.429 | 0.915 | 0.735 | 0.610 | 0.390 | 0.255 | 0.140 | 0.095 |
| 400 | 0.571 | 0.970 | 0.900 | 0.770 | 0.480 | 0.325 | 0.230 | 0.125 |
| 500 | 0.714 | 0.980 | 0.950 | 0.830 | 0.635 | 0.310 | 0.280 | 0.125 |
| 600 | 0.857 | 0.995 | 0.980 | 0.855 | 0.625 | 0.380 | 0.260 | 0.140 |
| 700 | 1.000 | 1.000 | 0.985 | 0.895 | 0.680 | 0.525 | 0.240 | 0.105 |
| 800 | 1.143 | 1.000 | 0.985 | 0.970 | 0.770 | 0.495 | 0.285 | 0.185 |
| 900 | 1.286 | 1.000 | 1.000 | 0.955 | 0.820 | 0.550 | 0.270 | 0.160 |
| 1000 | 1.429 | 1.000 | 1.000 | 0.960 | 0.855 | 0.625 | 0.360 | 0.110 |
| 2000 | 2.857 | 1.000 | 1.000 | 1.000 | 0.970 | 0.695 | 0.380 | 0.110 |

+Table 4.3: Results from the simulation of different signal strengths under consideration of the removal of false positive results (column headings 0.3–0.9 denote the signal strength).
+
+Figure 4.4: The results of the simulation of different signal strengths with 100 runs for each number of samples under consideration of the removal of false positive results (cf. Tab. 4.3). Please note that the x-axis is logarithmic.

+
+

In the same manner as the results above, Tab. 4.3 and Fig. 4.4 visualise the effect of the removal of false-positive patterns (section 3.4). In this version, the results for weak signals remain at a low level, while those for strong signals rise sharply from a sample size of about 200. For all signals stronger than a signal strength of 0.6 (i.e. the scenarios 0.3–0.5), the detection rate exceeds the 50% mark at the latest for a sample size of 300 or more. This implies that this method produces a much more reliable result and is a strong indicator of the effectiveness of this approach. The overall detection rate is significantly reduced, and it becomes clear that for reliable identification of events a much higher sample size is necessary than if possible false positives are naively ignored.

+

| number of samples | sample density (per year) | detection rate |
|---|---|---|
| 30 | 0.043 | 0.145 |
| 40 | 0.057 | 0.165 |
| 50 | 0.071 | 0.145 |
| 60 | 0.086 | 0.120 |
| 70 | 0.100 | 0.100 |
| 80 | 0.114 | 0.130 |
| 90 | 0.129 | 0.140 |
| 100 | 0.143 | 0.115 |
| 200 | 0.286 | 0.210 |
| 300 | 0.429 | 0.160 |
| 400 | 0.571 | 0.235 |
| 500 | 0.714 | 0.285 |
| 600 | 0.857 | 0.170 |
| 700 | 1.000 | 0.320 |
| 800 | 1.143 | 0.365 |
| 900 | 1.286 | 0.330 |
| 1000 | 1.429 | 0.325 |
| 2000 | 2.857 | 0.395 |
| 3000 | 4.286 | 0.720 |
| 4000 | 5.714 | 0.710 |
| 5000 | 7.143 | 0.715 |
| 6000 | 8.571 | 0.705 |

+Table 4.4: Results from the simulation of the original setup of Contreras and Meadows (2014) under consideration of the removal of false positive results.
+
+Figure 4.5: Results from the simulation of the original setup of Contreras and Meadows (2014) under consideration of the removal of false positive results (cf. Tab. 4.4). Please note that the x-axis is logarithmic.

+
+

Finally, this combined method was applied again to the original example of the Black Plague with its fixed signal strength. Table 4.4 and Fig. 4.5 show the corresponding results; here, the range of sample sizes was extended upwards. Considering possible false-positive results, the detection rate for this pattern is quite low. For the scenario with 200 data, corresponding to a density of 0.29 per year, there is a detection rate of 0.21; for a sample size of 1000, corresponding to a density of 1.43 per year, the detection rate reaches 0.325. Only from a sample size of 3000 (density = 4.29) onwards, where the detection rate exceeds 0.7, can a more or less reliable identification be assumed.

+
+
+
+

5 Discussion

+

When using estimators for reconstructions, it is clear that, in addition to the existing uncertainties, the variability in the relationship of the estimator to the estimated variable has to be considered. It is therefore unrealistic to expect a perfect reconstruction. Nevertheless, the question of Contreras and Meadows (2014) as to whether such a disruptive event as the Black Plague could have been recognised through this methodology is, of course, justified. The example is therefore very well chosen and was used here for the same reason. The answer to this question must be ‘yes’, even if one has to qualify this with ‘not in every case’, or more precisely, ‘with a probability of 68.9%’, given the original setup of their analysis.

+

The other question raised by that paper is to what extent false positives can be distinguished from real signals. Here, the approach of a hypothesis test based on bootstrapping was applied. This proved to be very capable of filtering out false positive signals due to their lower magnitude. On the other hand, when this parameterisation is applied, an average of 5% remains that is not recognised as false positive. However, this is an order of magnitude with which a discipline such as archaeology can certainly operate, since very few approaches in our field offer 95 per cent confidence.

+

The detection of the event can thus not be assumed to be absolutely certain. Regarding the question of false-positive detections, the method of producing a confidence interval by simulating a uniform distribution (e.g. Shennan et al. 2013; Hinz et al. 2019) has been introduced into this simulation, going beyond the original setup of Contreras and Meadows. If the question is how well we can identify this event while taking random fluctuations into account, the result is much less favourable. Only at a relatively high temporal density of 14C dates does a reliable detection become realistic.

+

A false-negative result, or in other words a Type II error, might be considered less dramatic in the given situation than a Type I error, in which a non-existent event would be identified. There is, admittedly, serious discussion about which type of error is worse, and whether such a ranking is meaningful at all; this concerns above all situations in which a Type II error would lead to wrong decisions regarding, for example, the safety of patient treatment (see e.g. Carlson 2017: 169–170). In this specific case, the methodology opens up the chance of detecting an event at the risk of not detecting it. In the case of non-detection, one should simply not assume that it was an uneventful period. Above all, this demands that the past is not reconstructed with one estimator alone, but that different indicators are validated against each other and evaluated in multi-proxy approaches.

+

The results raise questions about the nature of the conclusions already drawn using 14C sum calibration. Such published results do demonstrate distributions of 14C data on the temporal axis that differ from what random sampling processes would produce. The simulation results in this study clearly support the assumption that the significance test based on a Monte Carlo simulation, in one form or another, is very well suited to filter out signals that appear in the data only due to sampling effects. Significant results are therefore highly likely to reflect variations in the data background. The basic question is consequently not of a statistical nature and does not lie in the insensitivity of the estimator itself, but rather in the fundamental methodological question of what information the absence of archaeological material in a certain period can provide us with. If, as in the case of the analysis of Shennan et al. (2013), we see a reduction of 36% (signal strength of 0.64) and take it literally (estimated from Shennan et al. 2013: 4, based on a 200-year rolling mean, with a peak at about 3500 BCE and a minimum at about 3000 BCE), is this a population change that we can assume to be realistic for a prehistoric population? If we take the estimates of Müller (2015) into account, we are talking about 5 million people in Europe during this period. These would be reduced to about 3.2 million over 500 years. Another question is what true signal strength is indicated by an observed signal strength of 0.64, and how large the uncertainty range of such an estimate is. The latter goes beyond the scope of this study but is an excellent question for another simulation-based study.

+

The main problem in using sum calibration as a means of demographic reconstruction is therefore not the statistical conditions. If the sample size and sensitivity are too small, there are possibilities in this domain to identify such problems and to counteract them if necessary. The main problem lies rather in the often biased distribution of the data, which are frequently produced in studies with specific research objectives, as well as in the fact that quite different deposition processes are often treated as equivalent. This goes back to the original approach of Rick (1987), in which the number of dates was largely equated directly with the intensity of human activity, even though he had already identified these biases. It is therefore necessary to find appropriate countermeasures and to establish a best-practice catalogue for such investigations. An assessment of how strong a signal can be detected at what data density can be a valuable first step in the direction of such a standardisation.

+
+
+

6 Conclusion

+

In this article, a simulation approach was used to move beyond the simple statement that ‘the Black Plague could have remained undiscovered by 14C sum calibration’ and to arrive at a quantification of the probability of detection, as well as a prediction of the detection potential of other, more or less pronounced events. As a result, it could be shown that no guarantee can be given for detection by this method, but that the chances outweigh the risks.

+
+
+

7 References

+
+

Armit, I, Swindles, G T and Becker, K 2013 From dates to demography in later prehistoric Ireland? Experimental approaches to the meta-analysis of large 14C data-sets, Journal of Archaeological Science, 40(1): 433–438. DOI: https://doi.org/10.1016/j.jas.2012.08.039.

+
+
+

Ballenger, J A M and Mabry, J B 2011 Temporal frequency distributions of alluvium in the American Southwest: Taphonomic, paleohydraulic, and demographic implications, Journal of Archaeological Science, 38(6): 1314–1325. DOI: https://doi.org/10.1016/j.jas.2011.01.007.

+
+
+

Bamforth, D B and Grund, B 2012 Radiocarbon calibration curves, summed probability distributions, and early Paleoindian population trends in North America, Journal of Archaeological Science, 39(6): 1768–1774. DOI: https://doi.org/10.1016/j.jas.2012.01.017.

+
+
+

Bayliss, A, Bronk Ramsey, C, Plicht, J van der and Whittle, A 2007 Bradshaw and Bayes: Towards a Timetable for the Neolithic, Cambridge Archaeological Journal, 17(Supplement S1): 1–28. DOI: https://doi.org/10.1017/S0959774307000145.

+
+
+

Bleicher, N 2013 Summed radiocarbon probability density functions cannot prove solar forcing of Central European lake-level changes, The Holocene, 23(5): 755–765. DOI: https://doi.org/10.1177/0959683612467478.

+
+
+

Buchanan, B, Collard, M and Edinborough, K 2008 Paleoindian demography and the extraterrestrial impact hypothesis, Proceedings of the National Academy of Sciences, 105(33): 11651–11654. DOI: https://doi.org/10.1073/pnas.0803762105.

+
+
+

Carlson, D L 2017 Quantitative Methods in Archaeology Using R. Cambridge Manuals in Archaeology. Cambridge University Press. DOI: https://doi.org/10.1017/9781139628730.

+
+
+

Chiverrell, R C, Thorndycraft, V R and Hoffmann, T O 2011 Cumulative probability functions and their role in evaluating the chronology of geomorphological events during the Holocene, Journal of Quaternary Science, 26(1): 76–85. DOI: https://doi.org/10.1002/jqs.1428.

+
+
+

Collard, M, Ruttle, A, Buchanan, B and O’Brien, M J 2013 Population Size and Cultural Evolution in Nonindustrial Food-Producing Societies, PLOS ONE, 8(9): e72628. DOI: https://doi.org/10.1371/journal.pone.0072628.

+
+
+

Contreras, D A and Meadows, J 2014 Summed radiocarbon calibrations as a population proxy: A critical evaluation using a realistic simulation approach, Journal of Archaeological Science, 52: 591–608. DOI: https://doi.org/10.1016/j.jas.2014.05.030.

+
+
+

Crombé, P and Robinson, E 2014 14C dates as demographic proxies in Neolithisation models of northwestern Europe: A critical assessment using Belgium and northeast France as a case-study, Journal of Archaeological Science, 52: 558–566. DOI: https://doi.org/10.1016/j.jas.2014.02.001.

+
+
+

Culleton, B J 2008 Crude demographic proxy reveals nothing about Paleoindian population, Proceedings of the National Academy of Sciences, 105(50): E111–E111. DOI: https://doi.org/10.1073/pnas.0809092106.

+
+
+

Edinborough, K, Porčić, M, Martindale, A, Brown, T J, Supernant, K and Ames, K M 2017 Radiocarbon test for demographic events in written and oral history, Proceedings of the National Academy of Sciences, 114(47): 12436–12441. DOI: https://doi.org/10.1073/pnas.1713012114.

+
+
+

Gamble, C, Davies, W, Pettitt, P, Hazelwood, L and Richards, M 2005 The Archaeological and Genetic Foundations of the European Population during the Late Glacial: Implications for, Cambridge Archaeological Journal, 15(02): 193–223. DOI: https://doi.org/10.1017/S0959774305000107.

+
+
+

Gkiasta, M, Russell, T, Shennan, S and Steele, J 2003 Neolithic transition in Europe: The radiocarbon record revisited, Antiquity, 77(295): 45–62. DOI: https://doi.org/10.1017/S0003598X00061330.

+
+
+

Hinz, M, Feeser, I, Sjögren, K-G and Müller, J 2012 Demography and the intensity of cultural activities: An evaluation of Funnel Beaker Societies (4200-2800 cal BC), Journal of Archaeological Science, 39(10): 3331–3340. DOI: https://doi.org/10.1016/j.jas.2012.05.028.

+
+
+

Hinz, M, Schirrmacher, J, Kneisel, J, Rinne, C and Weinelt, M 2019 The Chalcolithic–Bronze Age transition in southern Iberia under the influence of the 4.2 kyr event? A correlation of climatological and demographic proxies, Journal of Neolithic Archaeology, 21: 1–26–1–26. DOI: https://doi.org/10.12766/jna.2019.1.

+
+
+

Hinz, M, Schmid, C, Knitter, D and Tietze, C 2018 oxcAAR: Interface to ’OxCal’ Radiocarbon Calibration.

+
+
+

Hoffmann, T, Lang, A and Dikau, R 2008 Holocene river activity: Analysing 14C-dated fluvial and colluvial sediments from Germany, Quaternary Science Reviews, 27(21–22): 2031–2040. DOI: https://doi.org/10.1016/j.quascirev.2008.06.014.

+
+
+

Johnstone, E, Macklin, M G and Lewin, J 2006 The development and application of a database of radiocarbon-dated Holocene fluvial deposits in Great Britain, CATENA, 66(1–2): 14–23. DOI: https://doi.org/10.1016/j.catena.2005.07.006.

+
+
+

Kelly, R L, Surovell, T A, Shuman, B N and Smith, G M 2013 A continuous climatic impact on Holocene human population in the Rocky Mountains, Proceedings of the National Academy of Sciences, 110(2): 443–447. DOI: https://doi.org/10.1073/pnas.1201341110.

+
+
+

Marwick, B 2017 Computational Reproducibility in Archaeological Research: Basic Principles and a Case Study of Their Implementation, Journal of Archaeological Method and Theory, 24(2): 424–450. DOI: https://doi.org/10.1007/s10816-015-9272-9.

+
+
+

McLaughlin, T R 2019 On Applications of Space–Time Modelling with Open-Source 14C Age Calibration, Journal of Archaeological Method and Theory, 26(2): 479–501. DOI: https://doi.org/10.1007/s10816-018-9381-3.

+
+
+

Müller, J 2015 8 Million Neolithic Europeans: Social Demography and Social Archaeology on the Scope of Change – from the Near East to Scandinavia. In: Neustupný, E. and Kristiansen, K. (eds.) Paradigm found : Archaeological theory - past, present and future : Essays in honour of Evžen Neustupný. Oxford: Oxbow Books. pp. 200–214.

+
+
+

Mulrooney, M A 2013 An island-wide assessment of the chronology of settlement and land use on Rapa Nui (Easter Island) based on radiocarbon data, Journal of Archaeological Science, 40(12): 4377–4399. DOI: https://doi.org/10.1016/j.jas.2013.06.020.

+
+
+

Prates, L, Politis, G and Steele, J 2013 Radiocarbon chronology of the early human occupation of Argentina, Quaternary International, 301: 104–122. DOI: https://doi.org/10.1016/j.quaint.2013.03.011.

+
+
+

Rick, J W 1987 Dates as Data: An Examination of the Peruvian Preceramic Radiocarbon Record, American Antiquity, 52(1): 55–73. DOI: https://doi.org/10.2307/281060.

+
+
+

Riede, F 2009 Climate and Demography in Early Prehistory: Using Calibrated 14C Dates as Population Proxies, Human Biology, 81(3): 309–338. DOI: https://doi.org/10.3378/027.081.0311.

+
+
+

Rieth, T M, Hunt, T L, Lipo, C and Wilmshurst, J M 2011 The 13th century polynesian colonization of Hawai’i Island, Journal of Archaeological Science, 38(10): 2740–2749. DOI: https://doi.org/10.1016/j.jas.2011.06.017.

+
+
+

Shennan, S 2009 Evolutionary Demography and the Population History of the European Early Neolithic, Human Biology, 81(2-3): 339–355. DOI: https://doi.org/10.3378/027.081.0312.

+
+
+

Shennan, S 2012 Demographic Continuities and Discontinuities in Neolithic Europe: Evidence, Methods and Implications, Journal of Archaeological Method and Theory, 20(2): 300–311. DOI: https://doi.org/10.1007/s10816-012-9154-3.

+
+
+

Shennan, S, Downey, S S, Timpson, A, Edinborough, K, Colledge, S, Kerig, T, Manning, K and Thomas, M G 2013 Regional population collapse followed initial agriculture booms in mid-Holocene Europe, Nature Communications, 4: 2486. DOI: https://doi.org/10.1038/ncomms3486.

+
+
+

Shennan, S and Edinborough, K 2007 Prehistoric population history: From the Late Glacial to the Late Neolithic in Central and Northern Europe, Journal of Archaeological Science, 34(8): 1339–1345. DOI: https://doi.org/10.1016/j.jas.2006.10.031.

+
+
+

Steele, J 2010 Radiocarbon dates as data: Quantitative strategies for estimating colonization front speeds and event densities, Journal of Archaeological Science, 37(8): 2017–2030. DOI: https://doi.org/10.1016/j.jas.2010.03.007.

+
+
+

Surovell, T A and Brantingham, P J 2007 A note on the use of temporal frequency distributions in studies of prehistoric demography, Journal of Archaeological Science, 34(11): 1868–1877. DOI: https://doi.org/10.1016/j.jas.2007.01.003.

+
+
+

Surovell, T A, Byrd Finley, J, Smith, G M, Brantingham, P J and Kelly, R 2009 Correcting temporal frequency distributions for taphonomic bias, Journal of Archaeological Science, 36(8): 1715–1724. DOI: https://doi.org/10.1016/j.jas.2009.03.029.

+
+
+

Tallavaara, M, Pesonen, P and Oinonen, M 2010 Prehistoric population history in eastern Fennoscandia, Journal of Archaeological Science, 37(2): 251–260. DOI: https://doi.org/10.1016/j.jas.2009.09.035.

+
+
+

Timpson, A, Colledge, S, Crema, E, Edinborough, K, Kerig, T, Manning, K, Thomas, M G and Shennan, S 2014 Reconstructing regional population fluctuations in the European Neolithic using radiocarbon dates: A new case-study using an improved method, Journal of Archaeological Science, 52: 549–557. DOI: https://doi.org/10.1016/j.jas.2014.08.011.

+
+
+

Torfing, T 2015 Neolithic population and summed probability distribution of 14C-dates, Journal of Archaeological Science, DOI: https://doi.org/10.1016/j.jas.2015.06.004.

+
+
+

Whitehouse, N J, Schulting, R J, McClatchie, M, Barratt, P, McLaughlin, T R, Bogaard, A, Colledge, S, Marchant, R, Gaffrey, J and Bunting, M J 2014 Neolithic agriculture on the European western frontier: The boom and bust of early farming in Ireland, Journal of Archaeological Science, 51: 181–205. DOI: https://doi.org/10.1016/j.jas.2013.08.009.

+
+
+

Williams, A N 2012 The use of summed radiocarbon probability distributions in archaeology: A review of methods, Journal of Archaeological Science, 39(3): 578–589. DOI: https://doi.org/10.1016/j.jas.2011.07.014.

+
+
+
+
+

8 Appendix

+
+

8.0.1 A.1. Linear Model of detection rate, sample size and signal strength

+
#> 
+#> Call:
+#> lm(formula = p_signal_detected ~ nsamples + signal_strength, 
+#>     data = data_for_lm)
+#> 
+#> Residuals:
+#>      Min       1Q   Median       3Q      Max 
+#> -0.36761 -0.04656  0.00544  0.05289  0.14101 
+#> 
+#> Coefficients:
+#>                   Estimate Std. Error t value Pr(>|t|)    
+#> (Intercept)      1.142e+00  2.454e-02  46.558  < 2e-16 ***
+#> nsamples         8.894e-05  1.512e-05   5.883 3.56e-08 ***
+#> signal_strength -6.530e-01  3.734e-02 -17.485  < 2e-16 ***
+#> ---
+#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
+#> 
+#> Residual standard error: 0.08384 on 123 degrees of freedom
+#> Multiple R-squared:  0.7345, Adjusted R-squared:  0.7302 
+#> F-statistic: 170.2 on 2 and 123 DF,  p-value: < 2.2e-16
+
+
+

8.0.2 A.2. Colophon

+

This report was generated on 2020-06-16 17:36:08 using the following computational environment and dependencies:

+
#> ─ Session info ───────────────────────────────────────────────────────────────────────────────────────────────────────
+#>  setting  value                       
+#>  version  R version 4.0.1 (2020-06-06)
+#>  os       Arch Linux                  
+#>  system   x86_64, linux-gnu           
+#>  ui       X11                         
+#>  language (EN)                        
+#>  collate  de_DE.UTF-8                 
+#>  ctype    de_DE.UTF-8                 
+#>  tz       Europe/Berlin               
+#>  date     2020-06-16                  
+#> 
+#> ─ Packages ───────────────────────────────────────────────────────────────────────────────────────────────────────────
+#>  package                         * version date       lib source                          
+#>  assertthat                        0.2.1   2019-03-21 [1] CRAN (R 4.0.0)                  
+#>  backports                         1.1.7   2020-05-13 [1] CRAN (R 4.0.0)                  
+#>  base64enc                         0.1-3   2015-07-28 [1] CRAN (R 4.0.1)                  
+#>  bibtex                            0.4.2.2 2020-01-02 [1] CRAN (R 4.0.1)                  
+#>  bookdown                          0.19    2020-05-15 [1] CRAN (R 4.0.1)                  
+#>  broom                             0.5.6   2020-04-20 [1] CRAN (R 4.0.1)                  
+#>  callr                             3.4.3   2020-03-28 [1] CRAN (R 4.0.0)                  
+#>  cli                               2.0.2   2020-02-28 [1] CRAN (R 4.0.0)                  
+#>  codetools                         0.2-16  2018-12-24 [2] CRAN (R 4.0.1)                  
+#>  colorspace                        1.4-1   2019-03-18 [1] CRAN (R 4.0.0)                  
+#>  crayon                            1.3.4   2017-09-16 [1] CRAN (R 4.0.0)                  
+#>  data.table                        1.12.8  2019-12-09 [1] CRAN (R 4.0.1)                  
+#>  desc                              1.2.0   2018-05-01 [1] CRAN (R 4.0.0)                  
+#>  devtools                          2.3.0   2020-04-10 [1] CRAN (R 4.0.0)                  
+#>  DiagrammeR                      * 1.0.6.1 2020-05-08 [1] CRAN (R 4.0.1)                  
+#>  digest                            0.6.25  2020-02-23 [1] CRAN (R 4.0.0)                  
+#>  doParallel                        1.0.15  2019-08-02 [1] CRAN (R 4.0.1)                  
+#>  dplyr                             1.0.0   2020-05-29 [1] CRAN (R 4.0.0)                  
+#>  ellipsis                          0.3.1   2020-05-15 [1] CRAN (R 4.0.0)                  
+#>  evaluate                          0.14    2019-05-28 [1] CRAN (R 4.0.0)                  
+#>  fansi                             0.4.1   2020-01-08 [1] CRAN (R 4.0.0)                  
+#>  farver                            2.0.3   2020-01-16 [1] CRAN (R 4.0.0)                  
+#>  flextable                       * 0.5.10  2020-05-15 [1] CRAN (R 4.0.1)                  
+#>  foreach                           1.5.0   2020-03-30 [1] CRAN (R 4.0.1)                  
+#>  fs                                1.4.1   2020-04-04 [1] CRAN (R 4.0.0)                  
+#>  gbRd                              0.4-11  2012-10-01 [1] CRAN (R 4.0.1)                  
+#>  gdtools                           0.2.2   2020-04-03 [1] CRAN (R 4.0.1)                  
+#>  generics                          0.0.2   2018-11-29 [1] CRAN (R 4.0.0)                  
+#>  ggplot2                         * 3.3.1   2020-05-28 [1] CRAN (R 4.0.1)                  
+#>  ggthemes                          4.2.0   2019-05-13 [1] CRAN (R 4.0.1)                  
+#>  glue                              1.4.1   2020-05-13 [1] CRAN (R 4.0.0)                  
+#>  gridExtra                       * 2.3     2017-09-09 [1] CRAN (R 4.0.1)                  
+#>  gtable                            0.3.0   2019-03-25 [1] CRAN (R 4.0.0)                  
+#>  here                              0.1     2017-05-28 [1] CRAN (R 4.0.1)                  
+#>  highr                             0.8     2019-03-20 [1] CRAN (R 4.0.0)                  
+#>  htmltools                         0.4.0   2019-10-04 [1] CRAN (R 4.0.0)                  
+#>  htmlwidgets                       1.5.1   2019-10-08 [1] CRAN (R 4.0.0)                  
+#>  iterators                         1.0.12  2019-07-26 [1] CRAN (R 4.0.1)                  
+#>  jsonlite                          1.6.1   2020-02-02 [1] CRAN (R 4.0.0)                  
+#>  knitr                             1.28    2020-02-06 [1] CRAN (R 4.0.0)                  
+#>  labeling                          0.3     2014-08-23 [1] CRAN (R 4.0.0)                  
+#>  lattice                           0.20-41 2020-04-02 [2] CRAN (R 4.0.1)                  
+#>  lifecycle                         0.2.0   2020-03-06 [1] CRAN (R 4.0.0)                  
+#>  magrittr                        * 1.5     2014-11-22 [1] CRAN (R 4.0.0)                  
+#>  Matrix                            1.2-18  2019-11-27 [2] CRAN (R 4.0.1)                  
+#>  memoise                           1.1.0   2017-04-21 [1] CRAN (R 4.0.0)                  
+#>  mgcv                              1.8-31  2019-11-09 [2] CRAN (R 4.0.1)                  
+#>  munsell                           0.5.0   2018-06-12 [1] CRAN (R 4.0.0)                  
+#>  nlme                              3.1-148 2020-05-24 [2] CRAN (R 4.0.1)                  
+#>  officer                           0.3.11  2020-05-18 [1] CRAN (R 4.0.1)                  
+#>  oxcAAR                            1.0.0   2020-06-02 [1] Github (ISAAKiel/oxcAAR@ab94508)
+#>  pillar                            1.4.4   2020-05-05 [1] CRAN (R 4.0.0)                  
+#>  pkgbuild                          1.0.8   2020-05-07 [1] CRAN (R 4.0.0)                  
+#>  pkgconfig                         2.0.3   2019-09-22 [1] CRAN (R 4.0.0)                  
+#>  pkgload                           1.1.0   2020-05-29 [1] CRAN (R 4.0.0)                  
+#>  plyr                              1.8.6   2020-03-03 [1] CRAN (R 4.0.1)                  
+#>  prettyunits                       1.1.1   2020-01-24 [1] CRAN (R 4.0.0)                  
+#>  processx                          3.4.2   2020-02-09 [1] CRAN (R 4.0.0)                  
+#>  ps                                1.3.3   2020-05-08 [1] CRAN (R 4.0.0)                  
+#>  purrr                             0.3.4   2020-04-17 [1] CRAN (R 4.0.0)                  
+#>  R6                                2.4.1   2019-11-12 [1] CRAN (R 4.0.0)                  
+#>  RColorBrewer                      1.1-2   2014-12-07 [1] CRAN (R 4.0.0)                  
+#>  Rcpp                              1.0.4.6 2020-04-09 [1] CRAN (R 4.0.0)                  
+#>  Rdpack                            0.11-1  2019-12-14 [1] CRAN (R 4.0.1)                  
+#>  remotes                           2.1.1   2020-02-15 [1] CRAN (R 4.0.0)                  
+#>  reshape2                        * 1.4.4   2020-04-09 [1] CRAN (R 4.0.1)                  
+#>  rlang                             0.4.6   2020-05-02 [1] CRAN (R 4.0.0)                  
+#>  rmarkdown                         2.2     2020-05-31 [1] CRAN (R 4.0.1)                  
+#>  rprojroot                         1.3-2   2018-01-03 [1] CRAN (R 4.0.0)                  
+#>  rstudioapi                        0.11    2020-02-07 [1] CRAN (R 4.0.0)                  
+#>  scales                            1.1.1   2020-05-11 [1] CRAN (R 4.0.0)                  
+#>  sensitivity.sumcal.article.2020 * 1.0.0.0 2020-06-16 [1] local                           
+#>  sessioninfo                       1.1.1   2018-11-05 [1] CRAN (R 4.0.0)                  
+#>  stringi                           1.4.6   2020-02-17 [1] CRAN (R 4.0.0)                  
+#>  stringr                           1.4.0   2019-02-10 [1] CRAN (R 4.0.0)                  
+#>  systemfonts                       0.2.3   2020-06-09 [1] CRAN (R 4.0.1)                  
+#>  testthat                          2.3.2   2020-03-02 [1] CRAN (R 4.0.0)                  
+#>  tibble                            3.0.1   2020-04-20 [1] CRAN (R 4.0.0)                  
+#>  tidyr                             1.1.0   2020-05-20 [1] CRAN (R 4.0.1)                  
+#>  tidyselect                        1.1.0   2020-05-11 [1] CRAN (R 4.0.0)                  
+#>  usethis                           1.6.1   2020-04-29 [1] CRAN (R 4.0.0)                  
+#>  uuid                              0.1-4   2020-02-26 [1] CRAN (R 4.0.1)                  
+#>  vctrs                             0.3.1   2020-06-05 [1] CRAN (R 4.0.1)                  
+#>  visNetwork                        2.0.9   2019-12-06 [1] CRAN (R 4.0.1)                  
+#>  withr                             2.2.0   2020-04-20 [1] CRAN (R 4.0.0)                  
+#>  xfun                              0.14    2020-05-20 [1] CRAN (R 4.0.0)                  
+#>  xml2                              1.3.2   2020-04-23 [1] CRAN (R 4.0.0)                  
+#>  yaml                              2.2.1   2020-02-01 [1] CRAN (R 4.0.0)                  
+#>  zip                               2.0.4   2019-09-01 [1] CRAN (R 4.0.1)                  
+#> 
+#> [1] /home/martin/R/x86_64-pc-linux-gnu-library/4.0
+#> [2] /usr/lib/R/library
+

The current Git commit details are:

+
#> Local:    master /home/martin/r_projekte/sensitivity.sumcal.article.2020
+#> Remote:   master @ origin (git@github.com:MartinHinz/sensitivity.sumcal.article.2020.git)
+#> Head:     [8cb79cc] 2020-06-16: added actionable items for review
+
+
diff --git a/analysis/paper/03_revised_version/paper_revised.pdf b/analysis/paper/03_revised_version/paper_revised.pdf
new file mode 100644
index 0000000..9a2254b
Binary files /dev/null and b/analysis/paper/03_revised_version/paper_revised.pdf differ
diff --git a/analysis/paper/03_revised_version/paper_revised_marked.docx b/analysis/paper/03_revised_version/paper_revised_marked.docx
new file mode 100644
index 0000000..978edfb
Binary files /dev/null and b/analysis/paper/03_revised_version/paper_revised_marked.docx differ