---
title: "A hierarchical model to evaluate pest treatments from prevalence and intensity data"
author:
  - name: "Armand Favrot"
    corresponding: true
    email: [email protected]
    url: https://fr.linkedin.com/in/armand-favrot-469014150
    affiliations:
      - name: MIA Paris-Saclay, INRAE AgroParisTech Université Paris-Saclay, France
        url: https://mia-ps.inrae.fr/
  - name: "David Makowski"
    corresponding: true
    email: [email protected]
    url: https://mia-ps.inrae.fr/david-makowski
    affiliations:
      - name: MIA Paris-Saclay, INRAE AgroParisTech Université Paris-Saclay, France
        url: https://mia-ps.inrae.fr/
date: 01/09/2024
date-modified: last-modified
description: |
abstract: >+
  In plant epidemiology, pest abundance is measured in field trials using metrics assessing either pest prevalence (fraction of the plant population infected) or pest intensity (average number of pest individuals present in infected plants). Some of these trials rely on prevalence, while others rely on intensity, depending on the protocols.
  In this paper, we present a hierarchical Bayesian model able to handle both types of data. In this model, the intensity and prevalence variables are derived from a latent variable representing the number of pest individuals on each host individual, assumed to follow a Poisson distribution. Effects of pest treatments, time trend, and between-trial variability are described using fixed and random effects.
  We apply the model to a real data set in the context of aphid control in sugar beet fields. In this data set, prevalence and intensity were derived from aphid counts observed in either factorial trials testing different types of pesticide treatments or field surveys monitoring aphid abundance.
  Next, we perform simulations to assess the impacts of using either prevalence or intensity data, or both types of data simultaneously, on the accuracy of the model parameter estimates and on the ranking of pesticide treatment efficacy.
  Our results show that, when pest prevalence and pest intensity data are collected separately in different trials, the model parameters are more accurately estimated using both types of trials than using one type of trial only. When prevalence data are collected in all trials and intensity data are collected in a subset of trials, estimations and pest treatment ranking are more accurate using both types of data than using prevalence data only. When only one type of observation can be collected in a pest survey or in an experimental trial, our analysis indicates that it is better to collect intensity data than prevalence data when all or most of the plants are expected to be infested, but that both types of data lead to similar results when the level of infestation is low to moderate. Finally, our simulations show that it is unlikely to obtain accurate results with fewer than 40 trials when assessing the efficacy of pest control treatments based on prevalence and intensity data.
  Because of its flexibility, our model can be used to evaluate and rank the efficacy of pest treatments using either prevalence or intensity data, or both types of data simultaneously. As it can be easily implemented using standard Bayesian packages, we hope that it will be useful to agronomists, plant pathologists, and applied statisticians to analyze pest surveys and field experiments conducted to assess the efficacy of pest treatments.
keywords: [bayesian model, epidemiology, hierarchical model, pest control, trial, survey]
citation:
  type: article-journal
  container-title: "Computo"
  doi: "10.57750/6cgk-g727"
  publisher: "French Statistical Society"
  issn: "2824-7795"
  url: "https://computo.sfds.asso.fr/published-202312-favrot-hierarchical/"
  pdf-url: "https://computo.sfds.asso.fr/published-202312-favrot-hierarchical/published-202312-favrot-hierarchical.pdf"
google-scholar: true
bibliography: references.bib
github-user: computorg
repo: "published-202312-favrot-hierarchical"
draft: false
published: true
format:
  computo-html: default
  computo-pdf: default
execute:
  freeze: auto # re-render only when source changes
---
```{r dependencies, cache = FALSE, include = FALSE}
# Packages used throughout the document
library(ggplot2)
library(tidyverse)
library(gridExtra)
library(rjags)
library(latex2exp)
library(grid)
library(dplyr)
library(tidyr)
library(tibble)
library(kableExtra)
library(coda)

# Base font sizes: x for axis text, y for axis and legend titles, z for strip and plot titles
x = 6; y = 8; z = 7

# Global ggplot2 theme shared by all figures
theme_replace(
  axis.text = element_text(size = x),
  axis.title.x = element_text(size = y),
  axis.title.y = element_text(size = y, angle = 90, margin = margin(r = 5)),
  strip.text = element_text(size = z, face = "bold", color = "white"),
  legend.text = element_text(size = y),
  legend.title = element_text(size = y, face = "bold"),
  plot.title = element_text(size = z, hjust = 0.5, face = "bold"),
  strip.background = element_rect(fill = "#4345a1"),
  panel.background = element_rect(fill = "#f3faff"),
  panel.grid.major = element_line(colour = "white")
)

# Default fill colour for discrete scales
options(ggplot2.discrete.fill = list("#67c5a7"))
```
# Introduction
In plant epidemiology, pest and disease presence can be measured in a host population using different metrics. A first metric measures the presence or absence of the pest in the individuals (plants) of the host population. This metric is often called prevalence or incidence (@madden1999sampling, @shaw2018metrics). Prevalence describes the proportion of the host population in which the pest is present. This metric is relevant and widely used, but it does not account for the number of pest individuals per host individual. With prevalence, a plant infected by a single pest individual (e.g., an insect) and a plant infected by many pest individuals both count as one infected plant. For this reason, pest abundance is sometimes assessed using another metric representing the average number of pest individuals per host individual. This metric is called intensity or severity (@madden1999sampling, @shaw2018metrics), and describes the intensity of the disease in the target population. These two metrics generally do not require the same working time: measuring intensity takes much more time than measuring prevalence because counting all pest individuals is tedious, especially when individuals are small, numerous, and/or difficult to detect.
Pest prevalence and intensity are commonly measured in factorial field trials to test the efficacy of different treatments. In this paper, we focus on an important application: the evaluation of pesticide treatments that could replace neonicotinoids against aphids in sugar beet. Neonicotinoids were a popular chemical treatment to control aphids for many years, especially in sugar beet, a major crop in Europe. Recently, neonicotinoids were recognized as presenting high risks for the environment, with a negative impact on a wide range of non-target organisms, including bees (@wood2017environmental, @pisa2015effects), and this family of pesticides has been banned in several European countries. In order to find a substitute for neonicotinoids, a number of factorial field trials were conducted over several years in different countries to compare the efficacy of alternative treatments. Each trial consists of a set of plots divided into several blocks, themselves divided into several strips to which different pesticide treatments are randomly allocated. One strip always remains untreated to serve as a control. In each strip, aphid prevalence, aphid intensity, or both metrics are measured in a sample of plants (usually 10-20 plants per strip). Depending on the protocol and on working time constraints, either one or both metrics are measured. Consequently, for a given pest treatment, some trials may report prevalence data while others report intensity data or both types of data.
This heterogeneity raises several issues. A first issue concerns the statistical analysis of the trials reporting prevalence and intensity. Although it is easy to fit a generalized linear model to each type of data separately, it is less straightforward to fit a single model to the whole set of trials in order to obtain a single ranking of the pest treatments taking into account both types of data at the same time. Generally, factorial trials assessing treatment efficacy are analyzed with statistical models that take into account one of the two metrics but not both. Prevalence data are thus commonly analyzed using binomial generalized linear models and intensity data are frequently analyzed with Poisson generalized linear models (@michel2017framework, @LAURENT2023106140, @agresti2015foundations). As far as we know, no statistical model has been proposed to assess treatment efficacy based on the simultaneous analysis of prevalence and intensity data. In @osiewalski2019joint, the authors introduced a switching model designed to handle two count variables, one of which may be degenerate. This model was employed to characterize the counts of cash payments and bank card payments in Poland, utilizing data from both cardholders and non-cardholders. A generalized form of the bivariate negative binomial regression model was developed in @gurmu2000generalized, allowing for a more flexible representation of the correlation between the dependent variables. This model was applied to describe the number of visits to a doctor and the number of visits to non-doctor health professionals. It outperformed existing bivariate models across various model comparison criteria. In order to analyze data related to crash counts categorized by severity, @park2007multivariate employed a multivariate Poisson log-normal model, effectively addressing both over-dispersion and a fully generalized correlation structure within the data set. 
However, it should be noted that these models did not include any binomial distribution and thus could not be used to deal with proportion data, such as pest prevalence.
Another issue concerns the practical value of combining both prevalence and intensity data. It is unclear whether the simultaneous analysis of prevalence and intensity data may increase the accuracy of the estimated treatment efficacy compared to the use of a single type of data, and whether this may increase the probability of identifying the most efficient treatments. Finally, it is also unclear how future trials should be designed, in particular how many trials are required to obtain accurate estimations, and whether intensity data should be preferred to prevalence data.
In this paper, we propose a new flexible statistical model that can be used to rank pest treatments from trials including prevalence data, intensity data, or both. We apply it to a real data set including trials testing the efficacy of pesticides against aphids infesting sugar beets, considering contrasting scenarios of data availability, and we show how the proposed model can be used to evaluate the efficacy of different treatments. Based on simulations, we then quantify the reduction of mean absolute errors in the estimated treatment efficacy resulting from the use of both prevalence and intensity data during the statistical inference, compared to the use of either prevalence or intensity data. The rest of the paper is organized as follows. First, we present the structure of the data set including real prevalence and intensity data. Next, we describe in detail the proposed model, the inference method, and the simulation strategy. After checking the convergence of the fitting algorithm, we show how the model can be used to assess treatment efficacy. We finally present the results based on simulated data and we make several recommendations.
# Material and Method
## Description of the data
Data are collected in 32 field trials conducted in France, Belgium and the Netherlands to compare several treatments against aphids in sugar beets. Each trial consists of a plot located at a given site in a given year (site-year) and divided into one to four blocks. Each of these blocks is itself divided into strips where different treatments are tested, one of these treatments being an untreated control and the others corresponding to different types of insecticide. In each strip of each block, the number of aphids is counted on a sample of 10 beet plants (intensity). The number of infested plants (prevalence) is measured as well, but only in 15 trials out of 32. The data set contains 1128 intensity observations and 561 prevalence observations. Note that the number of aphids is not counted on each beet plant but in the whole plant sample. Intensity and prevalence are monitored at different times after treatment. As shown in @fig-one A, the data set is unbalanced, as fewer data are available for the treatment Mavrik Jet than for the others. @fig-one B shows that intensity and prevalence tend to increase with time.
```{r figure1, cache = FALSE, fig.width=12, fig.height=14}
#| echo: false
#| label: fig-one
#| fig-cap: "Description of the data set. **A** Number of observations according to the type of insecticide. **B** Examples of observed number of aphids averaged over the blocks (intensity) and number of infested beets out of ten (prevalence) averaged over the blocks, at different dates for two trials."
data_figure_1A = readRDS(file = "data/data_figure_1A.rds")
data_figure_1B = readRDS(file = "data/data_figure_1B.rds")
rectangle_fig1A <- grobTree(rectGrob(gp = gpar(fill = "#e9e9e9")), textGrob("A", x = 0.5, hjust = 0.5, gp = gpar(cex = 2.5, fontface = "bold")))
fig1A = ggplot(data_figure_1A) +
geom_bar(aes(x = Insecticide, y = n, fill = Insecticide), stat = "Identity") +
xlab("Insecticide") +
ylab("Number of observations") +
scale_fill_manual(values = c('#df626c', '#893f3d', '#ff842e', '#188038')) +
scale_x_discrete(limits = c("Untreated", "Mavrik Jet", "Movento", "Teppeki")) +
theme(legend.position = "none",
axis.title.x = element_text(margin = margin(b = 50, t = 20), size = 18),
axis.title.y = element_text(size = 18, margin = margin(r = 20)),
axis.text = element_text(size = 16))
rectangle_fig1B <- grobTree(rectGrob(gp = gpar(fill = "#e9e9e9")), textGrob("B", x = 0.5, hjust = 0.5, gp = gpar(cex = 2.5, fontface = "bold")))
fig1B_aphid_intensity = ggplot(data_figure_1B) +
geom_point(aes(x = DPT, y = Ymean, col = Insecticide), size = 2.5) +
geom_line(aes(x = DPT, y = Ymean, col = Insecticide), size = 0.2) +
scale_color_manual(values = c('#893f3d', '#ff842e', '#188038')) +
facet_wrap(~ ID, scales = "free", ncol = 2) +
theme(legend.position = "bottom") +
xlab("") +
ylab("Number of aphids") +
theme(plot.margin = margin(t = 40, r = 40, b = 0, l = 10),
axis.title.x = element_text(size = 18),
axis.title.y = element_text(size = 18, margin = margin(r = 20)),
axis.text = element_text(size = 16),
strip.text = element_text(size = 20),
legend.title = element_text(size = 18),
legend.text = element_text(size = 18))
fig1B_aphid_prevalence = ggplot(data_figure_1B) + geom_point(aes(x = DPT, y = Zmean, col = Insecticide), size = 2.5) +
geom_line(aes(x = DPT, y = Zmean, col = Insecticide), size = 0.2) +
scale_color_manual(values = c('#893f3d', '#ff842e', '#188038')) +
facet_wrap(~ ID, ncol = 2, scales = "free_x") + theme(legend.position = "none") +
xlab("Days post treatment") +
ylab("Number of infested beets") +
theme(panel.spacing = unit(1, "cm")) +
theme(plot.margin = margin(t = 0, r = 40, b = 0, l = 10),
axis.title.y = element_text(size = 18, margin = margin(r = 20)),
axis.text = element_text(size = 16),
axis.title.x = element_text(margin = margin(b = 20, t = 20), size = 18),
strip.text = element_text(size = 20))
legend <- cowplot::get_legend(fig1B_aphid_intensity)
do.call("grid.arrange",
c(list(rectangle_fig1A,
fig1A + theme(plot.margin = margin(t = 30, r = 20)),
rectangle_fig1B,
fig1B_aphid_intensity + theme(legend.position = "none"),
fig1B_aphid_prevalence), list(legend), list(ncol = 1,
layout_matrix = rbind(c(1), c(2), c(3), c(4), c(5), c(6)), heights = c(0.12, 1.2, 0.12, 1, 1, 0.12))))
```
## Model
### Specification
We introduce an unobserved variable representing the number of pest individuals (here, aphids) on each plant in a sample of $N$ plants (here, sugar beets). This variable is denoted $W$ and is assumed to follow a Poisson probability distribution whose mean value is a function of time.
We use the following indices: $i$ for the trial, $j$ for the treatment, $k$ for the block, $t$ for the time and $s$ for the plant number. The distribution of $W_{ijkts}$ is defined as:
$$
W_{ijkts}\sim\mathcal{P}(\lambda_{ijkt})
$$ {#eq-model_W}
$$
\log\ \lambda_{ijkt} = \alpha_0 + \beta_{0i} + \gamma_{0j} + (\alpha_1 + \gamma_{1j})\ X_t + u_{ij} + \epsilon_{ijkt}
$$ {#eq-model_lambda}
with
- $\beta_{0i}\sim\mathcal{N}(0, \sigma_0^2)$
- $u_{ij}\sim\mathcal{N}(0, \chi^2)$
- $\epsilon_{ijkt}\sim\mathcal{N}(0,\eta^2)$
The random variables are all assumed independent. The parameters $\alpha_0$, $\alpha_1$, $\gamma_{0j}$, and $\gamma_{1j}$ are considered as fixed. This model serves as a tool for conducting inference on a population of trials, from which the subset of trials comprising our data set is assumed to constitute a random sample. In essence, the trials contained within our database are leveraged to estimate the parameter values that characterize a target population, where the tested pest control treatments will be actually implemented. Consequently, all parameters contingent on individual trials have been defined as random effects.
The observed variables (intensity and prevalence) can be expressed as functions of $W$. We denote by:
- $Y_{ijkt}$ the number of pest individuals (aphids) in the sample of $N_i$ plants collected in trial $i$, treatment $j$, block $k$, at time $t$
- $Z_{ijkt}$ the number of infested plants among the $N_i$ plants collected in trial $i$, treatment $j$, block $k$, at time $t$
Then, assuming the $W$s are independent, we have:
$$
Y_{ijkt} = \sum\limits_{s = 1}^{N_i} W_{ijkts}\ \hspace{1cm}\ Y_{ijkt} \sim \mathcal{P}(N_{i}\lambda_{ijkt})
$$ {#eq-model_Y}
$$
Z_{ijkt} = \sum\limits_{s = 1}^{N_i} \boldsymbol{1}_{W_{ijkts}>0}\ \hspace{1cm} \ Z_{ijkt} \sim \mathcal{B}(N_{i},\ \pi_{ijkt})
$$ {#eq-model_Z}
where $\pi_{ijkt} = 1-\text{exp}(-\lambda_{ijkt})$. The different quantities used in the model are defined in @tbl-model-desc.
| | |
--|--
i | trial index
j | treatment index
k | block index
t | time index
s | plant index
$N_i$ | sample size (number of plants)
$\lambda_{ijkt}$ | mean number of pest individuals (aphids) on one plant
$\pi_{ijkt}$ | probability for a plant to be infested
$\alpha_0$ | mean number of pest individuals (aphids) in the untreated group
$\beta_{0i}$ | trial effect
$\gamma_{0j}$ | effect of treatment j at time 0 (time of treatment)
$\alpha_1$ | growth parameter of the number of pest individuals for the untreated group
$\gamma_{1j}$ | effect of the treatment on the time effect (interaction between treatment j and time)
$X_t$ | number of days post treatment
$u_{ij}$ | random interaction between trial and treatment
$\epsilon_{ijkt}$ | residuals
: Description of the indices, inputs and parameters used in the model {#tbl-model-desc tbl-colwidths="[15,85]"}
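
The distributions in @eq-model_Y and @eq-model_Z follow from two standard properties of the Poisson distribution, conditionally on $\lambda_{ijkt}$. The short derivation below is added for convenience; the numerical values in the last line are purely illustrative and are not estimates from the data.

$$
\begin{aligned}
Y_{ijkt} &= \sum_{s=1}^{N_i} W_{ijkts} \sim \mathcal{P}(N_i \lambda_{ijkt}) && \text{(a sum of independent Poisson variables is Poisson)} \\
\Pr(W_{ijkts} > 0) &= 1 - \Pr(W_{ijkts} = 0) = 1 - \exp(-\lambda_{ijkt}) = \pi_{ijkt} \\
Z_{ijkt} &= \sum_{s=1}^{N_i} \boldsymbol{1}_{W_{ijkts}>0} \sim \mathcal{B}(N_i,\ \pi_{ijkt}) && \text{(a sum of independent Bernoulli variables is binomial)}
\end{aligned}
$$

For example, with $N_i = 10$ and $\lambda_{ijkt} = 0.5$, the expected intensity is $N_i \lambda_{ijkt} = 5$ aphids per sample, the infestation probability is $\pi_{ijkt} = 1 - \exp(-0.5) \approx 0.393$, and the expected prevalence is about $3.9$ infested plants out of ten.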
From this model we define the efficacy of the $j$th treatment at time $t$ ($t$ days after pesticide application) by the quantity (@LAURENT2023106140):
$$
\text{Ef}_{jt} = \Big(1 - \text{exp}\big(\gamma_{0j}+\gamma_{1j} \times X_{t}\big)\Big) \times 100
$$ {#eq-Efficacy}
The quantity $\text{Ef}_{jt}$ corresponds to the expected percentage reduction of pest individuals (aphid numbers) for the j-th treatment compared to the untreated group, over trials and blocks.
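
As an illustration of @eq-Efficacy, the following minimal sketch computes the efficacy for given values of $\gamma_{0j}$ and $\gamma_{1j}$. The function name `efficacy` and the parameter values are hypothetical and chosen for illustration only; they are not estimates from the data set.

```r
# Efficacy (in %) of a treatment at x days post treatment, following @eq-Efficacy.
efficacy <- function(gamma0, gamma1, x) {
  (1 - exp(gamma0 + gamma1 * x)) * 100
}

# Hypothetical parameter values, for illustration only.
gamma0 <- log(0.5)  # the treatment halves the expected count at x = 0
gamma1 <- -0.05     # the treated population grows more slowly than the control

ef <- efficacy(gamma0, gamma1, x = c(0, 6, 12))
ef
```

With a negative $\gamma_{1j}$, the efficacy increases with time. If the time variable is centered when the model is fitted, `x` must be expressed on the same centered scale.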
Our Poisson log-linear model includes an additive random dispersion term associated with each individual observation ($\epsilon_{ijkt}$ in @eq-model_lambda). This is a standard and well-recognized approach to deal with over-dispersion (@harrison2014using). We perform a posterior predictive check to verify that the data are compatible with the model assumptions ([Supplementary material](#sec-supplementary-material)). Posterior predictive checks are frequently used to look for systematic discrepancies between real and simulated data (@gelman1995bayesian). To do so, we compute, with the fitted model (@eq-model_lambda), the probability of exceeding each individual observation. The computed probabilities all fall in the range 0.22-0.93 (except for the observations equal to 0, for which the probability of being greater is equal to 1), and are thus not extreme. This result indicates that the specified model is not incompatible with the observed data and that over-dispersion is correctly taken into account. In addition, we fit another model including a negative binomial distribution instead of a Poisson distribution. The results are almost identical between the two models ([Supplementary material](#sec-supplementary-material)).
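
The exceedance probabilities used in the posterior predictive check can be approximated by simulation. The sketch below is not the code of the supplementary material: the function `ppc_exceed` is hypothetical, the posterior draws of $\lambda$ are replaced by an arbitrary stand-in sample, and the probability is assumed to be $\Pr(Y^{rep} \geq y)$, an assumption that would be consistent with the value of 1 reported for observations equal to 0.

```r
# Posterior predictive exceedance probability for one observed count.
# lambda_draws: posterior draws of lambda for that observation.
# n_plants: number of plants in the sample.
ppc_exceed <- function(y_obs, lambda_draws, n_plants = 10) {
  # One replicated count per posterior draw
  y_rep <- rpois(length(lambda_draws), n_plants * lambda_draws)
  # Assumption: "exceeding" is taken as "greater than or equal to"
  mean(y_rep >= y_obs)
}

set.seed(1)
# Stand-in for MCMC output (hypothetical values, not posterior draws from the paper)
lambda_draws <- rlnorm(4000, meanlog = log(2), sdlog = 0.3)

p_25 <- ppc_exceed(y_obs = 25, lambda_draws)
p_0 <- ppc_exceed(y_obs = 0, lambda_draws)
c(p_25 = p_25, p_0 = p_0)
```

Probabilities very close to 0 or 1 for many non-zero observations would point to a systematic discrepancy between the model and the data.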
### Inference on real data
The model parameters are estimated using Bayesian inference with a Markov chain Monte Carlo (MCMC) method. We perform the inference using R, with the package rjags (@rjags). For each of the data sets listed in @tbl-data, we fit the model (@eq-model_lambda - @eq-model_Z) with the following weakly informative priors: $\mathcal{N}(0, 10^3)$ for the parameters $\alpha_0,\ \gamma_0,\ \alpha_1,\ \gamma_1$ and $\mathcal{U}([0, 10])$ for the parameters $\sigma_0, \chi, \eta$. We use two Markov chains with $2 \times 10^5$ iterations (after an adaptation phase of $10^5$ iterations), and we center the time variable $t$ to facilitate convergence.
The convergence of the MCMC algorithm is checked by inspecting the mixing of the two Markov chains and monitoring the Gelman-Rubin diagnostic statistic (@gelman1992inference). We then compute the posterior mean of the pesticide treatment efficacy (defined by @eq-Efficacy) as well as its 95% credibility interval. The code used to fit the model is provided below.
::: {.callout-note}
The following code presents the inference on an extract of the real data set, which includes both trials of @fig-one B (2020 - B1A97 ; 2020 - u1CwE). It is a demo for the "50% Y - 50% Z" scenario, and we set here the numbers of adaptation steps and iterations to 2000 in order to reduce computation time.
:::
```{r inference_example_real_data, cache = FALSE, results = 'hide'}
#| echo: true
#| eval: true
#| file: scripts/inference_example_real_data.R
```
\
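
The Gelman-Rubin check mentioned above compares the between-chain and within-chain variances of each parameter. In practice, `coda::gelman.diag()` can be applied to the `mcmc.list` returned by `rjags::coda.samples()`. The sketch below is a simplified base-R version of the statistic, without the degrees-of-freedom correction applied by coda, and it uses simulated chains rather than output from the fitted model; the function name `rhat_basic` is hypothetical.

```r
# Simplified potential scale reduction factor.
# chains: numeric matrix with one column per chain and one row per iteration.
rhat_basic <- function(chains) {
  n <- nrow(chains)
  w <- mean(apply(chains, 2, var))    # mean within-chain variance
  b <- n * var(colMeans(chains))      # between-chain variance
  var_hat <- (n - 1) / n * w + b / n  # pooled estimate of the posterior variance
  sqrt(var_hat / w)
}

set.seed(42)
# Two chains sampling the same distribution (good mixing)
mixed <- cbind(rnorm(2000), rnorm(2000))
# Two chains stuck in different regions (no convergence)
stuck <- cbind(rnorm(2000), rnorm(2000, mean = 5))

r_mixed <- rhat_basic(mixed)
r_stuck <- rhat_basic(stuck)
c(mixed = r_mixed, stuck = r_stuck)
```

Values close to 1 are expected when the chains have converged to the same distribution, whereas values well above 1 indicate that the chains disagree.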
In practice, it is common that only $Y$ or $Z$ data are available in some of the trials. In this case, the resulting data set includes observations of $Y$ in some trials and observations of $Z$ in others. The data set may even include only one type of observation, either $Y$ or $Z$, in all trials. To evaluate the consequences of using different types of data sets, we define four data subsets from the real data set, with contrasting levels of $Y$ and $Z$ availability (@tbl-data). The data subset "100% Y - 0% Z" includes the $Y$ data collected in the 32 trials. The data subset "50% Y - 0% Z" includes the $Y$ data collected in the 17 trials for which no $Z$ observation is available. The data subset "0% Y - 50% Z" includes the $Z$ data collected in the 15 trials for which $Z$ observations are available. The data subset "50% Y - 50% Z" includes the $Y$ data collected in 17 trials and the $Z$ data collected in the other 15 trials. The latter data subset does not include any trial reporting both $Y$ and $Z$ data. Throughout our analysis, missing data are assumed to be missing at random.
The hierarchical model defined above is fitted to each data subset in turn. Each fitted model is then used to compute the posterior mean and 95% credibility interval of $\text{Ef}_{jt}$ for each treatment at $t$=6 and 12 days after pesticide application.
| Type of data set | Description |
|---------|:-----|
| 100% Y - 0% Z | Y observations available in the 32 trials and no Z |
| 50% Y - 0% Z | Y observations available in 17 trials and no Z |
| 0% Y - 50% Z | Z observations available in 15 trials and no Y |
| 50% Y - 50% Z | Y observations available in 17 trials and Z observations available in the other 15 trials |
: Four data subsets defined from the original data set (real data). {#tbl-data tbl-colwidths="[35,65]"}
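
The posterior summaries of efficacy described above can be obtained by applying @eq-Efficacy to each MCMC draw and then summarising the resulting sample. The sketch below assumes that draws of $\gamma_{0j}$ and $\gamma_{1j}$ are available as numeric vectors; here they are replaced by arbitrary simulated values, and the function name `summarise_efficacy` is hypothetical.

```r
# Posterior mean and 95% interval of the efficacy at x days post treatment.
# gamma0_draws, gamma1_draws: MCMC draws for one treatment (same length).
summarise_efficacy <- function(gamma0_draws, gamma1_draws, x) {
  ef <- (1 - exp(gamma0_draws + gamma1_draws * x)) * 100  # one value per draw
  c(mean = mean(ef), quantile(ef, probs = c(0.025, 0.975)))
}

set.seed(1)
# Stand-ins for MCMC output (hypothetical values, not results from the paper)
gamma0_draws <- rnorm(4000, mean = -1, sd = 0.2)
gamma1_draws <- rnorm(4000, mean = -0.02, sd = 0.01)

res <- summarise_efficacy(gamma0_draws, gamma1_draws, x = 6)
res
```

Computing the efficacy draw by draw, rather than plugging posterior means of the parameters into the formula, propagates the joint uncertainty on $\gamma_{0j}$ and $\gamma_{1j}$ to the interval.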
### Simulations
Simulations are carried out to further investigate the impact of the type and amount of data available on the accuracy of the parameters and the ability of the model to identify the most and least effective treatments.
We consider three numbers of trials in turn: 20, 40 and 80. For data simulations, the model parameters are set equal to those estimated from the real data set "100% Y - 0% Z" defined in @tbl-data (posterior means). For each number of trials, we generate virtual data from the model (@eq-model_W - @eq-model_Z) and estimate the model parameters, according to the following procedure:
- Draw values of $\beta_{0i},\ u_{ij}$ and $\epsilon_{ijkt}$ from their distributions for each trial, 3 treatments (+ the untreated control), 3 dates ($t$=0, 6, 12), and 4 blocks,
- Calculate $\lambda_{ijkt}$ from @eq-model_lambda,
- Draw values of $W_{ijkts}$ from its Poisson distribution for 10 plants ($s$=1, ..., 10),
- Calculate $Y_{ijkt}$ and $Z_{ijkt}$ from the $W$s for each trial, treatment, date, and block,
- Generate the eight data subsets corresponding to the scenarios defined in @tbl-subsets (all values of $W$; the generated values of $Y$ only; the generated values of $Z$ only; both $Y$ and $Z$ values but not $W$; the values of $Y$ in 50% of the trials; the values of $Z$ in 50% of the trials; the values of $Y$ in 50% of the trials and the values of $Z$ in the other 50%; the values of $Y$ in 50% of the trials and the values of $Z$ in all the trials),
- Fit the model (@eq-model_W - @eq-model_Z) to each of the data subsets according to the procedure described above based on MCMC.
At the end of this procedure, we get eight sets of estimated parameters, corresponding to the eight scenarios defined in @tbl-subsets.
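
The data-generating steps of this procedure can be sketched as follows. This is a minimal, self-contained illustration written independently of `functions/simu_data.R`: the function name `simulate_trials` and all parameter values are hypothetical, and the time variable is used uncentered.

```r
# Simulate Y and Z for n_trials trials, following the model equations.
# gamma0 and gamma1 include the untreated control as first element (set to 0).
simulate_trials <- function(n_trials, alpha0, alpha1, gamma0, gamma1,
                            sigma0, chi, eta,
                            times = c(0, 6, 12), n_blocks = 4, n_plants = 10) {
  n_trt <- length(gamma0)
  beta0 <- rnorm(n_trials, 0, sigma0)                             # trial effects
  u <- matrix(rnorm(n_trials * n_trt, 0, chi), n_trials, n_trt)   # trial x treatment
  d <- expand.grid(trial = seq_len(n_trials), trt = seq_len(n_trt),
                   block = seq_len(n_blocks), time = times)
  eps <- rnorm(nrow(d), 0, eta)                                   # residuals
  log_lambda <- alpha0 + beta0[d$trial] + gamma0[d$trt] +
    (alpha1 + gamma1[d$trt]) * d$time + u[cbind(d$trial, d$trt)] + eps
  d$lambda <- exp(log_lambda)
  # Latent counts W: one row per observation, one column per plant
  w <- matrix(rpois(nrow(d) * n_plants, rep(d$lambda, n_plants)),
              nrow = nrow(d), ncol = n_plants)
  d$Y <- rowSums(w)      # intensity: total number of aphids in the sample
  d$Z <- rowSums(w > 0)  # prevalence: number of infested plants
  d
}

set.seed(0)
# Hypothetical parameter values, for illustration only
sim <- simulate_trials(n_trials = 20, alpha0 = 0.5, alpha1 = 0.1,
                       gamma0 = c(0, -1, -1.5, -2),
                       gamma1 = c(0, -0.02, -0.05, -0.08),
                       sigma0 = 1, chi = 0.5, eta = 0.5)
head(sim)
```

Because $Y$ and $Z$ are computed from the same latent counts, a sample has no infested plant exactly when its total count is zero, and $Z$ can never exceed $Y$ or the number of plants.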
This procedure is repeated 1000 times, each time with a different seed between 0 and 999. However, the computations performed with JAGS failed for 26 replicates, and thus 974 replicates were available for the analysis.
For each number of trials and each scenario defined in @tbl-subsets, the accuracy of the estimated parameters $\gamma$ (on which the treatment efficacies depend) is evaluated by computing a relative absolute error, averaged over the three treatments ($j$=1 corresponding to the control) as:
$$
E_{\gamma} = \frac{1}{2 \times 3}\sum\limits_{j = 2}^{4} \Big( \frac{|\gamma_{0j} - \hat{\gamma}_{0j}|}{|\gamma_{0j}|} + \frac{|\gamma_{1j} - \hat{\gamma}_{1j}|}{|\gamma_{1j}|}\Big)
$$ {#eq-E_gamma}
where the true parameter values are set equal to the posterior means computed with the real data set, and the parameter estimates ($\hat{\gamma}_{0j}$ and $\hat{\gamma}_{1j}$) are the posterior means computed for the j-th treatment. For each trial number and each scenario defined in @tbl-subsets, the 974 values of $E_{\gamma}$ obtained for the 974 generated data subsets are then averaged. The average values obtained for the eight scenarios are finally compared to determine the type of data leading to the most accurate estimated parameter values.
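
The criterion in @eq-E_gamma can be written as a short function. In this sketch, the function name `e_gamma` and the numerical values are hypothetical; the input vectors contain the three treated groups only ($j$ = 2, 3, 4).

```r
# Mean relative absolute error on gamma0 and gamma1, following @eq-E_gamma.
# All arguments are vectors of the same length (one element per treated group).
e_gamma <- function(gamma0, gamma1, gamma0_hat, gamma1_hat) {
  rel0 <- abs(gamma0 - gamma0_hat) / abs(gamma0)
  rel1 <- abs(gamma1 - gamma1_hat) / abs(gamma1)
  sum(rel0 + rel1) / (2 * length(gamma0))
}

# Hypothetical true values and estimates, for illustration only
gamma0 <- c(-1, -2, -3)
gamma1 <- c(-0.1, -0.2, -0.3)
err <- e_gamma(gamma0, gamma1, gamma0_hat = 1.1 * gamma0, gamma1_hat = 1.1 * gamma1)
err
```

In this example every estimate is off by 10%, so the criterion is equal to 0.1 up to floating-point error.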
In addition, we compare the eight scenarios according to another criterion measuring the difference between the estimated efficacy values and the true efficacy value (averaged over the three pesticide treatments considered), as follows:
$$
E_{\text{Ef}_{t}}\ = \frac{1}{3} \sum\limits_{j = 2}^4 \frac{|\text{Ef}_{jt} - \hat{\text{Ef}}_{jt}|} {|\text{Ef}_{jt}|}
$$ {#eq-E_efficacy}
where the true treatment efficacy is defined by @eq-Efficacy (setting the parameters $\gamma$ to the posterior means obtained with the real data set) and the estimated efficacy ($\hat{\text{Ef}}_{jt}$) is the posterior mean computed with the simulated data set. The 974 values of $E_{\text{Ef}_{t}}$ obtained from the 974 simulated data sets are then averaged for each trial number and each scenario. Finally, we evaluate the proportion of cases where the true best treatment (i.e., the treatment with the highest efficacy) is correctly identified.
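
The efficacy criterion of @eq-E_efficacy and the proportion of correct identifications of the best treatment can be sketched as follows. The function names and all numerical values are hypothetical; each row of `ef_hat_reps` stands for the estimated efficacies obtained from one simulated data set.

```r
# Mean relative absolute error on efficacy, following @eq-E_efficacy.
e_efficacy <- function(ef_true, ef_hat) {
  mean(abs(ef_true - ef_hat) / abs(ef_true))
}

# TRUE when the treatment with the highest estimated efficacy is the true best one.
best_identified <- function(ef_true, ef_hat) {
  which.max(ef_hat) == which.max(ef_true)
}

# Hypothetical true efficacies (%) of three treatments at a given time
ef_true <- c(50, 80, 60)
# Hypothetical estimates from three simulated data sets (one row per replicate)
ef_hat_reps <- rbind(c(55, 72, 66),
                     c(45, 88, 54),
                     c(52, 61, 65))

errors <- apply(ef_hat_reps, 1, function(h) e_efficacy(ef_true, h))
prop_best <- mean(apply(ef_hat_reps, 1, function(h) best_identified(ef_true, h)))
list(mean_error = mean(errors), prop_best = prop_best)
```

In this toy example the best treatment is correctly identified in two replicates out of three.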
In order to determine whether the difference of performance obtained with the types of data $Y$ and $Z$ depends on the pest abundance, we perform an additional series of simulations with three values for the model parameter $\alpha_0$, equal to $-1,\ 1$ and $2$, successively. These three values of $\alpha_0$ define three contrasting levels of pest abundance (the higher $\alpha_0$, the higher the abundance). We use the procedure outlined above considering only two scenarios of data availability, namely "100% Y - 0% Z" and "0% Y - 100% Z". This procedure is implemented with each value of $\alpha_0$ in turn. The results are used to compare the model performances using either $Y$ or $Z$ for parameter estimation, depending on the pest abundance specified by $\alpha_0$.
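
To see what these three values of $\alpha_0$ imply, the sketch below computes the corresponding mean number of aphids per plant and the infestation probability $\pi = 1 - \exp(-\lambda)$. It makes the simplifying assumption that all other terms of @eq-model_lambda are equal to zero (untreated group, time 0, no random effects).

```r
alpha0 <- c(-1, 1, 2)
lambda <- exp(alpha0)           # mean number of aphids per plant
prevalence <- 1 - exp(-lambda)  # probability that a plant is infested

abundance <- data.frame(alpha0,
                        lambda = round(lambda, 2),
                        prevalence = round(prevalence, 3))
abundance
```

Under this assumption, the infestation probability is about 0.31 for $\alpha_0 = -1$, 0.93 for $\alpha_0 = 1$ and above 0.99 for $\alpha_0 = 2$. When almost every plant is infested, $Z$ is close to the sample size in nearly all samples and carries little information on $\lambda$, whereas $Y$ keeps increasing with $\lambda$.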
|Type of data set| Description|
|---|---|
|100% W | W observations available in all the trials|
|100% Y - 0% Z | Y observations available in all the trials|
|0% Y - 100% Z | Z observations available in all the trials|
|100% Y - 100% Z | Y and Z observations available in all the trials|
|50% Y - 0% Z | Y observations available in half of the trials|
|0% Y - 50% Z | Z observations available in half of the trials|
|50% Y - 50% Z | Y observations available in half of the trials and Z observations available in the other half of the trials|
|50% Y - 100% Z | Y observations available in half of the trials and Z observations available in all the trials|
: Eight scenarios compared using simulated data. {#tbl-subsets tbl-colwidths="[35,65]"}
These simulations required a significant amount of computation time and were conducted on the INRAE MIGALE server. For a single seed and 20 trials, the computations took 80 minutes on a computer with an Intel Core i7 processor running at 1.90 GHz and 32 GB of RAM. The code used to perform them is presented below.
```{r simulation-function, cache = FALSE}
#| echo: true
#| eval: true
#| file: functions/simu_data.R
```
\
::: {.callout-note}
The following code presents the inference for one seed. As the computing time grows quickly with the number of trials and the number of sampled values, we set here the number of trials to 10 (I = 10) and the numbers of adaptation steps and iterations to 2000.
Our simulation results were obtained by running this code for I = 10, 20, 40, 80, for all seeds between 0 and 999, with the numbers of adaptation steps and iterations equal to 36000.
Scenarios "50% Y - 0% Z" and "0% Y - 50% Z" are obtained from scenarios "100% Y - 0% Z" and "0% Y - 100% Z", respectively. For example "50% Y - 0% Z" scenario with 40 trials corresponds to "100% Y - 0% Z" with 20 trials.
:::
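The correspondence between the "50%" scenarios and the "100%" runs described in the note can be sketched as follows. The helper `half_scenario_source` is hypothetical and is not part of the scripts used in this study; it only makes the mapping explicit.

```r
# Hypothetical helper: which "100%" run provides a given "50%" scenario.
# A "50%" scenario with n_trials trials reuses the matching "100%" run
# fitted with n_trials / 2 trials.
half_scenario_source <- function(scenario, n_trials) {
  source_scenario <- switch(scenario,
    "50% Y - 0% Z" = "100% Y - 0% Z",
    "0% Y - 50% Z" = "0% Y - 100% Z",
    stop("scenario not derived from a 100% run: ", scenario)
  )
  stopifnot(n_trials %% 2 == 0)
  list(scenario = source_scenario, n_trials = n_trials / 2)
}

src <- half_scenario_source("50% Y - 0% Z", 40)
src
```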
```{r simulations, results = 'hide', cache = TRUE}
#| echo: true
#| eval: true
#| file: scripts/simulations.R
```
# Results
## Results obtained with real data
```{r load_res_real_data, cache = FALSE, echo = FALSE}
resY = readRDS(file = "results/Real_data/res.Y.nadapt.150000.niter.3e+05.rds")
resYZ = readRDS(file = "results/Real_data/res.YhalfZhalf.nadapt.150000.niter.3e+05.rds")
resYhalf = readRDS(file = "results/Real_data/res.Yhalf.nadapt.150000.niter.3e+05.rds")
resZhalf = readRDS(file = "results/Real_data/res.Zhalf.nadapt.150000.niter.3e+05.rds")
```
We present here the results obtained with real data. First, we check the convergence of the model for the scenarios defined in @tbl-data. We then compare the estimated values and the credibility intervals of the treatment efficacy obtained in the different scenarios.
### Model convergence and posterior distributions
@fig-two presents the Markov chains associated with the model parameters and treatment efficacies for the different scenarios. The x-axis presents the iteration number and the y-axis presents the sampled value. Results show that the chains are well mixed. @fig-three presents the Gelman-Rubin statistic associated with the model parameters and treatment efficacies as a function of the iterations, for the different scenarios. We observe that this statistic converges to 1, which indicates the convergence of the algorithm.
@tbl-posterior gives a summary of the posterior distributions of the model parameters and treatment efficacies obtained for the "100% Y - 0% Z" scenario. The significantly positive value of $\alpha_1$ indicates that the aphid numbers tend to increase with time in untreated plots. The relatively high value of $\sigma_0$ (posterior mean equal to 1.87) reveals a strong variability in aphid numbers between trials. The posterior mean value of $\chi$ (0.27) suggests that the treatment efficacy varies across trials. The $\gamma_0$ parameter is negative for all three treatments, indicating a negative effect of the treatments on the aphid numbers at the time of pesticide spray. The Movento and Teppeki treatments have a similar effect with a posterior mean for $\gamma_0$ equal to -1.13 and -1.24, and a standard deviation equal to 0.12 and 0.11, respectively. The effect of treatment Mavrik Jet is weaker as its posterior mean for $\gamma_0$ is equal to -0.13 and its standard deviation is equal to 0.16. The $\gamma_1$ posterior means are negative for Movento and Teppeki (-0.14 and -0.15), suggesting that the effect of these treatments tends to increase with time, but the posterior mean value is positive for Mavrik Jet (0.24), suggesting that the effect of this treatment may decrease with time. However, the 95% credibility intervals of $\gamma_1$ include zero and these parameters are not very accurately estimated.
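For readers unfamiliar with the Gelman-Rubin statistic, the sketch below computes its basic form (the potential scale reduction factor) in base R for a single parameter. It is an illustration on simulated chains, not the code used to produce @fig-three; the function name `gelman_rubin` is hypothetical, and dedicated tools such as `gelman.diag` in the coda package report a closely related quantity with additional corrections.

```r
# Basic potential scale reduction factor for one parameter.
# `chains` is a matrix with one row per iteration and one column per chain.
gelman_rubin <- function(chains) {
  n <- nrow(chains)                      # iterations per chain
  chain_means <- colMeans(chains)
  B <- n * var(chain_means)              # between-chain variance
  W <- mean(apply(chains, 2, var))       # mean within-chain variance
  var_plus <- (n - 1) / n * W + B / n    # pooled variance estimate
  sqrt(var_plus / W)
}

set.seed(1)
# Three simulated chains drawn from the same distribution (well mixed).
mixed <- matrix(rnorm(3000), ncol = 3)
# Same chains with shifted means (chains that disagree with each other).
shifted <- sweep(mixed, 2, c(-2, 0, 2), "+")

rhat_mixed <- gelman_rubin(mixed)       # expected to be close to 1
rhat_shifted <- gelman_rubin(shifted)   # expected to be well above 1
```

Values close to 1 indicate that the between-chain variability is negligible compared with the within-chain variability, which is the behavior reported for all scenarios.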
```{r figure2, cache = FALSE, out.width="100%", fig.height = 7, echo = FALSE}
#| label: fig-two
#| fig-cap: "Model convergence - Markov chain for the model parameters and treatment efficacies, in the scenarios \"50% Y - 0% Z\" (**A**), \"50% Y - 50% Z\" (**B**), \"100% Y - 0% Z\" (**C**) and \"0% Y - 50% Z\" (**D**). The x-axis presents the iteration number and the y-axis presents the sampled value."
source(file = "functions/plot_chains.R")
options(warn = - 1)
col_background = "#e9e9e9"; col_text = "black"
cex = 0.85
nrow = 3
nbreaks = 3
strip_size = 5
fig2A = suppressMessages(plot_chains(resYhalf, nrow = nrow) + ylab("Sampled value") + scale_x_continuous(breaks = c(0, 1.5e4, 3e4))) + xlab("") +
scale_y_continuous(n.breaks = nbreaks) + theme(strip.text = element_text(size = strip_size))
my_g2A <- grobTree(rectGrob(gp = gpar(fill = col_background)), textGrob("A", x = 0.5, hjust = 0.5, gp = gpar(col = col_text, cex = cex, fontface = "bold")))
fig2B = suppressMessages(plot_chains(resYZ, nrow = nrow) + ylab("Sampled value") + scale_x_continuous(breaks = c(0, 1.5e4, 3e4))) + xlab("") +
scale_y_continuous(n.breaks = nbreaks) + theme(strip.text = element_text(size = strip_size))
my_g2B <- grobTree(rectGrob(gp = gpar(fill = col_background)), textGrob("B", x = 0.5, hjust = 0.5, gp = gpar(col = col_text, cex = cex, fontface = "bold")))
fig2C = suppressMessages(plot_chains(resY, nrow = nrow) + ylab("Sampled value") + scale_x_continuous(breaks = c(0, 1.5e4, 3e4))) + xlab("") +
scale_y_continuous(n.breaks = nbreaks) + theme(strip.text = element_text(size = strip_size))
my_g2C <- grobTree(rectGrob(gp = gpar(fill = col_background)), textGrob("C", x = 0.5, hjust = 0.5, gp = gpar(col = col_text, cex = cex, fontface = "bold")))
fig2D = suppressMessages(plot_chains(resZhalf, nrow = nrow) + ylab("Sampled value") + scale_x_continuous(breaks = c(0, 1.5e4, 3e4))) +
scale_y_continuous(n.breaks = nbreaks) + theme(strip.text = element_text(size = strip_size), axis.title.x = element_text(margin = margin(t = 5)))
my_g2D <- grobTree(rectGrob(gp = gpar(fill = col_background)), textGrob("D", x = 0.5, hjust = 0.5, gp = gpar(col = col_text, cex = cex, fontface = "bold")))
grid.arrange(my_g2A, fig2A, my_g2B, fig2B, my_g2C, fig2C, my_g2D, fig2D, heights = c(1,9,1,9,1,9,1,10))
```
\
```{r figure3, cache = FALSE, out.width="100%", fig.height = 7, echo = FALSE}
#| label: fig-three
#| fig-cap: "Model convergence - Gelman-Rubin statistics for the parameters of the model (@eq-model_W - @eq-model_Z) and for treatment efficacies (@eq-Efficacy), according to the scenarios \"50% Y - 0% Z\" (**A**), \"50% Y - 50% Z\" (**B**), \"100% Y - 0% Z\" (**C**) and \"0% Y - 50% Z\" (**D**). The x-axis presents the iteration number and the y-axis presents the Gelman-Rubin statistic."
source(file = "functions/gelman.plot2.R")
margey = 5
margex = 10
ncol = 9
strip_size = 5
breaks = c(1, 3, 5); limits = c(1, 5)
fig3A = gelman.plot2(resYhalf, ncol = ncol) +
scale_x_continuous(breaks = c(2e5, 4e5)) + xlab("") +
scale_y_continuous(limits = limits, breaks = breaks) +
theme(axis.title.y = element_text(margin = margin(l = - margey, r = margey)), legend.position = "none", strip.text = element_text(size = strip_size))
my_g3A <- grobTree(rectGrob(gp = gpar(fill = col_background)), textGrob("A", x = 0.5, hjust = 0.5, gp = gpar(col = col_text, cex = cex, fontface = "bold")))
fig3B = gelman.plot2(resYZ, ncol = ncol) +
scale_x_continuous(breaks = c(2e5, 4e5)) + xlab("") +
scale_y_continuous(limits = limits, breaks = breaks) +
theme(axis.title.y = element_text(margin = margin(l = - margey, r = margey)), legend.position = "none", strip.text = element_text(size = strip_size))
my_g3B <- grobTree(rectGrob(gp = gpar(fill = col_background)), textGrob("B", x = 0.5, hjust = 0.5, gp = gpar(col = col_text, cex = cex, fontface = "bold")))
fig3C = gelman.plot2(resY, ncol = ncol) +
scale_x_continuous(breaks = c(2e5, 4e5)) + xlab("") +
scale_y_continuous(limits = limits, breaks = breaks) +
theme(axis.title.y = element_text(margin = margin(l = - margey, r = margey)), legend.position = "none", strip.text = element_text(size = strip_size))
my_g3C <- grobTree(rectGrob(gp = gpar(fill = col_background)), textGrob("C", x = 0.5, hjust = 0.5, gp = gpar(col = col_text, cex = cex, fontface = "bold")))
fig3D = gelman.plot2(resZhalf, ncol = ncol) +
scale_x_continuous(breaks = c(2e5, 4e5)) + xlab("") +
scale_y_continuous(limits = limits, breaks = breaks) +
theme(axis.title.y = element_text(margin = margin(l = - margey, r = margey)), strip.text = element_text(size = strip_size), plot.margin = margin(b = - 5, t = 6, r = 6, l = 6))
my_g3D <- grobTree(rectGrob(gp = gpar(fill = col_background)), textGrob("D", x = 0.5, hjust = 0.5, gp = gpar(col = col_text, cex = cex, fontface = "bold")))
x_axis <- grobTree(textGrob("Last iteration in chain", x = 0.5, hjust = 0.5, gp = gpar(cex = 0.7)))
legend = cowplot::get_legend(fig3D)
grid.arrange(my_g3A, fig3A, my_g3B, fig3B, my_g3C, fig3C, my_g3D, fig3D + theme(legend.position = "none"), x_axis, legend, heights = c(1,9,1,9,1,9,1,9, 1, 2))
```
\
```{r table4, cache = FALSE, echo = FALSE}
#| label: tbl-posterior
#| tbl-cap: "Summary of the posterior distributions obtained with the \"100\\% Y - 0\\% Z\" scenario for the model parameters and treatment efficacies: posterior mean, standard deviation, 2.5\\% and 97.5\\% quantiles."
tab4 = cbind(summary(resY)$statistics, summary(resY)$quantiles) %>%
as.data.frame %>%
rownames_to_column(var = "Parameter") %>% select(- c(4, 5, c(7 : 9))) %>%
mutate(Parameter = recode(Parameter, "gamma0[1]" = "gamma0 - untreated", "gamma0[2]" = "gamma0 - Mavrik Jet",
"gamma0[3]" = "gamma0 - Movento", "gamma0[4]" = "gamma0 - Teppeki",
"gamma1[1]" = "gamma1 - untreated", "gamma1[2]" = "gamma1 - Mavrik Jet",
"gamma1[3]" = "gamma1 - Movento", "gamma1[4]" = "gamma1 - Teppeki" )) %>%
filter(!grepl("Eff", Parameter)) %>%
mutate(Parameter = recode(Parameter, "alpha0" = "\u03B1\u2080",
"alpha1" = "\u03B1\u2081",
"chi" = "\u03C7",
"eta" = "\u03B7",
"sigma0" = "\u03C3\u2080")) %>%
mutate(Parameter = gsub(x = Parameter, pattern = "gamma0", replacement = "\u03B3\u2080")) %>%
mutate(Parameter = gsub(x = Parameter, pattern = "gamma1", replacement = "\u03B3\u2081")) %>%
mutate_if(is.numeric, round, 2)
# re-ordering the parameters
tab4 = tab4[c(c(1 : 2), 13, c(3 : 12)), ]
rownames(tab4) = NULL
tab4 %>%
kbl(booktabs = TRUE, linesep = "") %>%
kable_styling()
```
### Estimated values of pesticide treatment efficacies
@fig-four presents the posterior means and the 95% credibility intervals of treatment efficacies at 6 days (A) and 12 days (B) after pesticide spray, for the "100% Y - 0% Z", "50% Y - 50% Z", "50% Y - 0% Z" and "0% Y - 50% Z" scenarios. Different scenarios are indicated by different colors. The x-axis presents the efficacy and the y-axis presents the treatments. Overall, the results obtained are consistent across scenarios; Teppeki and Movento show higher mean efficacies than Mavrik Jet, and the credibility intervals are narrower for Teppeki and Movento than for Mavrik Jet in all scenarios. The credibility interval of the "100% Y - 0% Z" scenario is narrower than that of the "50% Y - 50% Z" scenario, which is itself narrower than those of the "50% Y - 0% Z" and "0% Y - 50% Z" scenarios. The credibility interval sizes obtained with the "100% Y - 0% Z" scenario are 25% to 45% smaller than those obtained with the "50% Y - 0% Z" scenario (@tbl-sizes), in line with the principle that increased data availability leads to more precise estimates. Results also indicate that credibility intervals are frequently larger with "0% Y - 50% Z" than with "50% Y - 0% Z", suggesting that more precise estimates are achievable using Y than using Z, at least in this specific case study. Interestingly, the sizes of the credibility intervals are approximately 25% smaller with "50% Y - 50% Z" than with "50% Y - 0% Z", demonstrating that combining Y and Z observations collected from distinct trials is beneficial and reduces the uncertainty in the estimated treatment efficacy. This finding underscores the potential improvement of treatment efficacy estimation obtained by combining trials that record prevalence data with trials that record intensity data.
```{r figure4, cache = FALSE, echo=FALSE, out.width = "100%", fig.height = 2.6}
#| label: fig-four
#| fig-cap: "Estimated treatment efficacies after 6 days (**A**) and after 12 days (**B**), with their credibility intervals. Colors correspond to the different scenarios."
#######################################################################################################################################################
y1 = 0.075; y2 = 0.225
# PR at 6 days
df_resY = cbind(summary(resY)$statistics[1 : 3, ], summary(resY)$quantiles[1 : 3, ]) %>%
as.data.frame %>% select(Mean, `2.5%`, `97.5%`) %>% rownames_to_column("Insecticide") %>%
pivot_longer(cols = c(3, 4)) %>% rename(P = Mean) %>%
mutate(name = recode(name, `2.5%` = "b_inf", `97.5%` = "b_sup")) %>%
mutate(Insecticide = recode(Insecticide, "Eff[2,1]" = "Mavrik Jet", "Eff[3,1]" = "Movento", "Eff[4,1]" = "Teppeki"))
df_resYZ = cbind(summary(resYZ)$statistics[1 : 3, ], summary(resYZ)$quantiles[1 : 3, ]) %>%
as.data.frame %>% select(Mean, `2.5%`, `97.5%`) %>% rownames_to_column("Insecticide") %>%
pivot_longer(cols = c(3, 4)) %>% rename(P = Mean) %>%
mutate(name = recode(name, `2.5%` = "b_inf", `97.5%` = "b_sup")) %>%
mutate(Insecticide = recode(Insecticide, "Eff[2,1]" = "Mavrik Jet", "Eff[3,1]" = "Movento", "Eff[4,1]" = "Teppeki"))
df_resYhalf = cbind(summary(resYhalf)$statistics[1 : 3, ], summary(resYhalf)$quantiles[1 : 3, ]) %>%
as.data.frame %>% select(Mean, `2.5%`, `97.5%`) %>% rownames_to_column("Insecticide") %>%
pivot_longer(cols = c(3, 4)) %>% rename(P = Mean) %>%
mutate(name = recode(name, `2.5%` = "b_inf", `97.5%` = "b_sup")) %>%
mutate(Insecticide = recode(Insecticide, "Eff[2,1]" = "Mavrik Jet", "Eff[3,1]" = "Movento", "Eff[4,1]" = "Teppeki"))
df_resZhalf = cbind(summary(resZhalf)$statistics[1 : 3, ], summary(resZhalf)$quantiles[1 : 3, ]) %>%
as.data.frame %>% select(Mean, `2.5%`, `97.5%`) %>% rownames_to_column("Insecticide") %>%
pivot_longer(cols = c(3, 4)) %>% rename(P = Mean) %>%
mutate(name = recode(name, `2.5%` = "b_inf", `97.5%` = "b_sup")) %>%
mutate(Insecticide = recode(Insecticide, "Eff[2,1]" = "Mavrik Jet", "Eff[3,1]" = "Movento", "Eff[4,1]" = "Teppeki"))
Order = "binf"
df_IC = rbind(df_resYhalf %>% mutate(Data = "50% Y - 0% Z"),
df_resYZ %>% mutate(Data = "50% Y - 50% Z"),
df_resY %>% mutate(Data = "100% Y - 0% Z"),
df_resZhalf %>% mutate(Data = "0% Y - 50% Z"))
prod_rescale = df_resY %>% group_by(Insecticide) %>% summarise(P = ifelse(Order == "moyenne", max(P), min(value))) %>%
as.data.frame %>% arrange(P) %>% select(Insecticide) %>% as.matrix %>% as.vector
x = c(1 : length(prod_rescale)); names(x) = prod_rescale
df_IC_6 = df_IC %>% mutate(Ordonnee = recode(Insecticide, !!!x)) %>%
mutate(Ordonnee = ifelse(Data == "100% Y - 0% Z", Ordonnee + y2,
ifelse(Data == "50% Y - 50% Z", Ordonnee + y1,
ifelse(Data == "50% Y - 0% Z", Ordonnee - y1, Ordonnee - y2))),
jour = "6")
# IC reduction wrt scenario 50% Y - 0% Z
df_len = df_IC %>% select(Insecticide, name, value, Data) %>% pivot_wider(names_from = c(name)) %>% as.data.frame %>%
mutate(IC_length = b_sup - b_inf) %>% arrange(Insecticide)
df_remp = df_len %>% filter(Data == "50% Y - 0% Z") %>% select(Insecticide, IC_length)
x = df_remp$IC_length; names(x) = df_remp$Insecticide
df_len = df_len %>% mutate(length50pcY = recode(Insecticide, !!!x)) %>%
mutate(`For efficacy at 6 days` = (1 - (IC_length / length50pcY)) * (- 100));
df_len6 = df_len %>% rename(Insecticide = Insecticide) %>% select(Insecticide, Data, `For efficacy at 6 days`) %>%
filter(Data != "50% Y - 0% Z") %>% arrange(Data) %>% mutate_if(is.numeric, round, digits = 1)
#######################################################################################################################################################
# PR at 12 days
df_resY = cbind(summary(resY)$statistics[4 : 6, ], summary(resY)$quantiles[4 : 6, ]) %>%
as.data.frame %>% select(Mean, `2.5%`, `97.5%`) %>% rownames_to_column("Insecticide") %>%
pivot_longer(cols = c(3, 4)) %>% rename(P = Mean) %>%
mutate(name = recode(name, `2.5%` = "b_inf", `97.5%` = "b_sup")) %>%
mutate(Insecticide = recode(Insecticide, "Eff[2,2]" = "Mavrik Jet", "Eff[3,2]" = "Movento", "Eff[4,2]" = "Teppeki"))
df_resYZ = cbind(summary(resYZ)$statistics[4 : 6, ], summary(resYZ)$quantiles[4 : 6, ]) %>%
as.data.frame %>% select(Mean, `2.5%`, `97.5%`) %>% rownames_to_column("Insecticide") %>%
pivot_longer(cols = c(3, 4)) %>% rename(P = Mean) %>%
mutate(name = recode(name, `2.5%` = "b_inf", `97.5%` = "b_sup")) %>%
mutate(Insecticide = recode(Insecticide, "Eff[2,2]" = "Mavrik Jet", "Eff[3,2]" = "Movento", "Eff[4,2]" = "Teppeki"))
df_resYhalf = cbind(summary(resYhalf)$statistics[4 : 6, ], summary(resYhalf)$quantiles[4 : 6, ]) %>%
as.data.frame %>% select(Mean, `2.5%`, `97.5%`) %>% rownames_to_column("Insecticide") %>%
pivot_longer(cols = c(3, 4)) %>% rename(P = Mean) %>%
mutate(name = recode(name, `2.5%` = "b_inf", `97.5%` = "b_sup")) %>%
mutate(Insecticide = recode(Insecticide, "Eff[2,2]" = "Mavrik Jet", "Eff[3,2]" = "Movento", "Eff[4,2]" = "Teppeki"))
df_resZhalf = cbind(summary(resZhalf)$statistics[4 : 6, ], summary(resZhalf)$quantiles[4 : 6, ]) %>%
as.data.frame %>% select(Mean, `2.5%`, `97.5%`) %>% rownames_to_column("Insecticide") %>%
pivot_longer(cols = c(3, 4)) %>% rename(P = Mean) %>%
mutate(name = recode(name, `2.5%` = "b_inf", `97.5%` = "b_sup")) %>%
mutate(Insecticide = recode(Insecticide, "Eff[2,2]" = "Mavrik Jet", "Eff[3,2]" = "Movento", "Eff[4,2]" = "Teppeki"))
Order = "binf"
df_IC = rbind(df_resYhalf %>% mutate(Data = "50% Y - 0% Z"),
df_resYZ %>% mutate(Data = "50% Y - 50% Z"),
df_resY %>% mutate(Data = "100% Y - 0% Z"),
df_resZhalf %>% mutate(Data = "0% Y - 50% Z"))
prod_rescale = df_resY %>% group_by(Insecticide) %>% summarise(P = ifelse(Order == "moyenne", max(P), min(value))) %>%
as.data.frame %>% arrange(P) %>% select(Insecticide) %>% as.matrix %>% as.vector
x = c(1 : length(prod_rescale)); names(x) = prod_rescale
df_IC_12 = df_IC %>% mutate(Ordonnee = recode(Insecticide, !!!x)) %>%
mutate(Ordonnee = ifelse(Data == "100% Y - 0% Z", Ordonnee + y2,
ifelse(Data == "50% Y - 50% Z", Ordonnee + y1,
ifelse(Data == "50% Y - 0% Z", Ordonnee - y1, Ordonnee - y2))),
jour = "12")
# IC reduction wrt scenario 50% Y - 0% Z
df_len = df_IC %>% select(Insecticide, name, value, Data) %>% pivot_wider(names_from = c(name)) %>% as.data.frame %>%
mutate(IC_length = b_sup - b_inf) %>% arrange(Insecticide)
df_remp = df_len %>% filter(Data == "50% Y - 0% Z") %>% select(Insecticide, IC_length)
x = df_remp$IC_length; names(x) = df_remp$Insecticide
df_len = df_len %>% mutate(length50pcY = recode(Insecticide, !!!x)) %>%
mutate(`For efficacy at 12 days` = (1 - (IC_length / length50pcY)) * (- 100));
df_len12 = df_len %>% rename(Insecticide = Insecticide) %>% select(Insecticide, Data, `For efficacy at 12 days`) %>%
filter(Data != "50% Y - 0% Z") %>% arrange(Data) %>% mutate_if(is.numeric, round, digits = 1)
#######################################################################################################################################################
df_IC = rbind(df_IC_6, df_IC_12) %>% mutate(jour = recode(jour, "6" = "A", "12" = "B"));
col = c("#9F248FFF", "#017A4AFF", "#F9791EFF", "#244579FF") # c(paletteer_d("awtools::spalette", n = 5))[- c(2)]
ggplot(df_IC) + geom_point(aes(x = P, y = Ordonnee, color = Data), size = 1) +
geom_line(aes(x = value, y = Ordonnee, group = Ordonnee, color = Data), size = 0.4) +
scale_color_manual(values = col, limits = c("100% Y - 0% Z", "50% Y - 50% Z", "50% Y - 0% Z", "0% Y - 50% Z")) +
scale_y_discrete(limits = prod_rescale) +
theme(legend.position = "bottom", legend.key.width = unit(0.5, 'cm'), legend.key.size = unit(0.4, "cm")) + ylab("") +
xlab("Treatment efficacy") + geom_vline(xintercept = 0, col = "#990033", size = 0.5, linetype = "dashed") +
theme(axis.title.x = element_text(margin = margin(b = - 7, t = 5))) +
labs(color = "Scenario") +
facet_wrap(~ jour, ncol = 2, scales = "free_x")
```
```{r table5, cache = FALSE, echo = FALSE}
#| label: tbl-sizes
#| tbl-cap: "Differences in the sizes of the 95\\% credibility intervals (CI) of the estimated treatment efficacies for the scenarios \"50\\% Y - 50\\% Z\", \"0\\% Y - 50\\% Z\" and \"100\\% Y - 0\\% Z\", compared to \"50\\% Y - 0\\% Z\". The difference is given in percentage. A positive (negative) value indicates an increase (decrease) of the credibility interval size. The third column indicates the difference for the efficacy at 6 days after pesticide spray, and the fourth column indicates the difference for the efficacy at 12 days."
cbind(df_len6, df_len12 %>% select(- Data, - Insecticide)) %>%
kbl(booktabs = TRUE, linesep = "") %>%
kable_styling()
```
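The relative differences reported in @tbl-sizes compare each credibility interval width with the width obtained in the reference scenario "50% Y - 0% Z". The sketch below reproduces this calculation for one treatment; the interval bounds are hypothetical and serve only to illustrate the arithmetic.

```r
# Hypothetical 95% interval bounds for one treatment (illustrative only).
ci <- data.frame(
  scenario = c("50% Y - 0% Z", "50% Y - 50% Z", "100% Y - 0% Z"),
  lower = c(0.40, 0.45, 0.48),
  upper = c(0.80, 0.75, 0.72)
)

# Interval width in each scenario and in the reference scenario.
ci$width <- ci$upper - ci$lower
ref_width <- ci$width[ci$scenario == "50% Y - 0% Z"]

# Relative change in percent; negative values mean a narrower interval.
ci$change_pct <- (ci$width / ref_width - 1) * 100
ci
```

With these made-up bounds, the widths are 0.40, 0.30 and 0.24, so the relative changes are 0%, -25% and -40%.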
## Results obtained by simulation
```{r load_res_simulations, cache = FALSE, echo = FALSE}
df_col_simu = readRDS(file = "results/Simulations/df_col_simu.rds")
res_simu_sigma_1.87 = readRDS(file = "results/Simulations/res_simu_sigma_1.87.rds")
res_simu = res_simu_sigma_1.87
res_simu = res_simu %>% mutate(Type = recode(Type, "Y" = "100% Y - 0% Z", "Z" = "0% Y - 100% Z", "Zdemi" = "0% Y - 50% Z",
"Ydemi" = "50% Y - 0% Z", "YdemiZdemi" = "50% Y - 50% Z", "ZYdemi" = "50% Y - 100% Z"),
Erreur = recode(name, "MEgamma" = "A", "MEEf6" = "B", "MEEf12" = "C", "MEbest6" = "A", "MEbest12" = "B"),
value = ifelse(grepl("best", name), value * 100, value))
df_col = df_col_simu
types = res_simu$Type %>% unique
x = df_col %>% filter(type %in% types) %>% select(col) %>% as.matrix %>% as.vector; names(x) = types
res_simu = res_simu %>% mutate(Col = recode(Type, !!!x))
```
### Interest of combining trials with prevalence and trials with intensity
In this section, we consider the situation where only one type of observation is available per trial - pest prevalence or pest intensity. We compare the accuracy of the estimated parameters and estimated levels of treatment efficacy obtained by combining both types of trials compared to the results obtained using each set of trials separately.
The parameters used to generate the data are given in @tbl-params.
```{r table6, cache = FALSE, echo = FALSE}
#| label: tbl-params
#| tbl-cap: "Parameters used to generate virtual data"
tab6 = data.frame(Parameters = c(
# alpha_0 et alpha_1
c("\u03B1\u2080", "\u03B1\u2081"),
# gamma_0
c("\u03B3\u2080\u2080", "\u03B3\u2080\u2081", "\u03B3\u2080\u2082", "\u03B3\u2080\u2083"),
# gamma_1
c("\u03B3\u2081\u2080", "\u03B3\u2081\u2081", "\u03B3\u2081\u2082", "\u03B3\u2081\u2083"),
# sigma_0, eta et chi
c("\u03C3\u2080", "\u03B7", "\u03C7")
),
Values = c(alpha0, alpha1, gamma0, gamma1, sig0, eta, chi))
tab6 %>%
t %>%
kbl(booktabs = TRUE) %>%
kable_styling()
```
@fig-five represents the $E_{\gamma}$ (@eq-E_gamma) (**A**), $E_{Ef_6}$ (@eq-E_efficacy) (**B**) and $E_{Ef_{12}}$ (@eq-E_efficacy) (**C**) evaluation criteria for the "0% Y - 50% Z", "50% Y - 0% Z" and "50% Y - 50% Z" scenarios (different scenarios are indicated by different colors). The x-axis presents the number of trials and the y-axis the value of the criterion, averaged over the simulated data sets. For each number of trials and for each criterion, we observe that scenario "50% Y - 50% Z" gives a more accurate estimate than scenario "50% Y - 0% Z" which itself gives a more accurate estimate than scenario "0% Y - 50% Z". For example, for the efficacy at 6 days with 40 trials, the mean absolute error of scenario "50% Y - 0% Z" is 10% less than the mean absolute error of scenario "0% Y - 50% Z" (0.38 vs 0.42). The mean absolute error of scenario "50% Y - 50% Z" is 35% less than that of scenario "0% Y - 50% Z" (0.26 vs 0.42). The values of the three criteria decrease with the number of trials. The $E_{\gamma}$ criterion decreases from 0.62 with 20 trials to 0.32 with 80 trials for the "50% Y - 50% Z" scenario. A value of 20 trials is therefore not sufficient to obtain an accurate estimate of the parameters.
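The relative reductions quoted above are computed as the difference between two mean absolute errors divided by the larger (reference) one. A minimal sketch, using the rounded values quoted in the text for the efficacy at 6 days with 40 trials; the function name `relative_reduction` is hypothetical, and because the inputs are rounded, the result can differ slightly from the percentage obtained with unrounded values.

```r
# Relative reduction (in percent) of an error with respect to a reference.
relative_reduction <- function(error, reference) {
  100 * (reference - error) / reference
}

# "50% Y - 0% Z" (0.38) compared with "0% Y - 50% Z" (0.42).
red_y_vs_z <- relative_reduction(0.38, 0.42)
round(red_y_vs_z, 1)
```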
```{r figure5, cache = FALSE, out.width="100%", fig.height = 2.5, echo = FALSE}
#| label: fig-five
#| fig-cap: "Values of the $E_{\\gamma}$ (@eq-E_gamma) (**A**), $E_{Ef_6}$ (@eq-E_efficacy) (**B**) and $E_{Ef_{12}}$ (@eq-E_efficacy) (**C**) mean absolute error criteria for the \"0% Y - 50% Z\", \"50% Y - 0% Z\" and \"50% Y - 50% Z\" scenarios. The x-axis presents the number of trials and the y-axis presents the mean absolute error, averaged over the 974 simulated data sets. Different colors correspond to different scenarios."
types = c("50% Y - 0% Z", "50% Y - 50% Z", "0% Y - 50% Z")
res_temp = res_simu %>% filter(Type %in% types)
col_temp = df_col %>% arrange(type) %>% filter(type %in% types) %>% select(col) %>% as.matrix %>% as.vector
ggplot(res_temp %>% filter(!grepl("best", name))) + geom_point(aes(x = I, y = value, color = Type), size = 1) +
geom_line(aes(x = I, y = value, color = Type), size = 0.2) + xlab("Number of trials") +
facet_wrap(~ Erreur, nrow = 1, scales = "free") + theme(legend.position = "bottom") +
ylab("Mean absolute error") + theme(legend.title = element_blank(), axis.title.x = element_text(margin = margin(t = 5, b = -8))) +
scale_color_manual(values = col_temp)
```
@fig-six presents the percentages of cases where the best treatment at 6 days (A) and 12 days (B) has been correctly identified for the "50% Y - 0% Z", "0% Y - 50% Z" and "50% Y - 50% Z" scenarios. The x-axis presents the number of trials and the y-axis the percentage of cases where the treatment identification is correct. In general, the best treatment is better identified when the number of trials increases. With the "50% Y - 50% Z" scenario, the best treatment at 6 days is correctly identified in 69% of cases with 20 trials and in 85% of cases with 80 trials. For each number of trials, the percentage of correct identification is higher for the "50% Y - 50% Z" scenario than for the other two, and the scenario "50% Y - 0% Z" generally gives better results than the scenario "0% Y - 50% Z", except at 6 days with 20 trials. For example, at 12 days after treatment and with 40 trials, the percentage of correct identification is 5 percentage points higher with scenario "50% Y - 50% Z" than with scenario "50% Y - 0% Z" (78 vs 73), and 4 percentage points higher with scenario "50% Y - 0% Z" than with scenario "0% Y - 50% Z" (73 vs 69). These results show the interest of combining prevalence and intensity data for assessing the efficacy of treatments and identifying the best treatments.
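The percentage of correct identification can be sketched as follows: for each simulated data set, the treatment with the highest estimated efficacy is compared with the treatment with the highest true efficacy. All values below (true efficacies, noise level, number of data sets) are hypothetical and only illustrate the calculation.

```r
set.seed(42)
n_sets <- 200  # hypothetical number of simulated data sets

# Hypothetical true efficacies of the three treatments.
ef_true <- c("Mavrik Jet" = 0.10, "Movento" = 0.65, "Teppeki" = 0.70)

# Hypothetical estimates: true value plus noise, one row per data set
# and one column per treatment.
ef_hat <- matrix(
  rnorm(n_sets * 3, mean = rep(ef_true, each = n_sets), sd = 0.05),
  nrow = n_sets,
  dimnames = list(NULL, names(ef_true))
)

# Index of the best treatment: true value and estimate in each data set.
best_true <- unname(which.max(ef_true))
best_hat <- apply(ef_hat, 1, which.max)

# Percentage of data sets in which the best treatment is correctly identified.
pct_correct <- 100 * mean(best_hat == best_true)
pct_correct
```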
```{r figure6, cache = FALSE, out.width="100%", fig.height = 3, echo = FALSE}
#| label: fig-six
#| fig-cap: "Comparison of the proportion of cases where the best treatment is correctly identified in the \"0% Y - 50% Z\", \"50% Y - 0% Z\" and \"50% Y - 50% Z\" scenarios. The x-axis represents the number of trials and the y-axis represents the percentage of cases where the best treatment has been correctly identified at 6 days (A) and 12 days (B), over the 974 simulated data sets. Different colors correspond to different scenarios."
ggplot(res_temp %>% filter(grepl("best", name))) + geom_point(aes(x = I, y = value, color = Type), size = 1) +
geom_line(aes(x = I, y = value, color = Type), size = 0.2) + xlab("Number of trials") +
facet_wrap(~ Erreur, nrow = 1, scales = "free") + theme(legend.position = "bottom") +
ylab("Best treatment identification percentage") + theme(legend.title = element_blank(), axis.title.x = element_text(margin = margin(t = 5, b = - 8))) +
scale_color_manual(values = col_temp)
```
### Interest of adding intensity when prevalence is measured in all trials
We now consider a situation where prevalence is measured in each trial and intensity is measured in only some of these trials. We compare the results obtained when the data are combined and when they are used separately. As the prevalence data are usually more accessible in practice and the intensity data more costly, it is important to evaluate the interest of adding intensity data in the statistical analysis.
The parameters used to generate the data are the same as in 3.2.1.
@fig-seven presents the evaluation criteria $E_{\gamma}$ (Fig. 7A), $E_{Ef_6}$ (Fig. 7B) and $E_{Ef_{12}}$ (Fig. 7C) for the scenarios "0% Y - 100% Z", "50% Y - 0% Z", "50% Y - 100% Z" and "50% Y - 50% Z". The x-axis presents the number of trials and the y-axis presents the value of the criterion, averaged over the simulated data sets. Results show that the mean absolute errors are lower in scenarios "50% Y - 100% Z" and "50% Y - 50% Z" than in "0% Y - 100% Z", and that the mean absolute errors are lower in the "0% Y - 100% Z" scenario than in the "50% Y - 0% Z" scenario. For example, considering the treatment efficacy at 12 days with 40 trials (Fig. 7C), the mean absolute errors are 13% lower in scenarios "50% Y - 100% Z" and "50% Y - 50% Z" than in "0% Y - 100% Z" (0.30 vs 0.34), and the mean absolute error is 11% lower in "0% Y - 100% Z" than in "50% Y - 0% Z" (0.34 vs 0.38). Clearly, adding intensity data to prevalence data improves the accuracy of the estimations. The mean absolute errors decrease with the number of trials. For example, the $E_{\gamma}$ criterion decreases from 0.62 with 20 trials to 0.32 with 80 trials for the "50% Y - 50% Z" scenario. As noted above, 20 trials are clearly not sufficient to obtain accurate results.
```{r figure7, cache = FALSE, out.width="100%", fig.height = 2.5, echo = FALSE}
#| label: fig-seven
#| fig-cap: "Values of $E_{\\gamma}$ (@eq-E_gamma) (**A**), $E_{Ef_6}$ (@eq-E_efficacy) (**B**) and $E_{Ef_{12}}$ (@eq-E_efficacy) (**C**) for the \"0% Y - 100% Z\", \"50% Y - 0% Z\", \"50% Y - 100% Z\" and \"50% Y - 50% Z\" scenarios. The x-axis presents the number of trials and the y-axis presents the absolute error averaged over the 974 simulated data sets. Different colors correspond to different scenarios."
types = c("50% Y - 0% Z", "0% Y - 100% Z", "50% Y - 100% Z", "50% Y - 50% Z")
res_temp = res_simu %>% filter(Type %in% types )
col_temp = df_col %>% arrange(type) %>% filter(type %in% types) %>% select(col) %>% as.matrix %>% as.vector
ggplot(res_temp %>% filter(!grepl("best", name))) + geom_point(aes(x = I, y = value, color = Type), size = 1) +
geom_line(aes(x = I, y = value, color = Type), size = 0.2) + xlab("Number of trials") +
facet_wrap(~ Erreur, nrow = 1, scales = "free") + theme(legend.position = "bottom") +
ylab("Mean absolute error") + theme(legend.title = element_blank(), axis.title.x = element_text(margin = margin(t = 5, b = - 8))) +
scale_color_manual(values = col_temp)
```
### Is it better to measure intensity or prevalence in new pest surveys?
In order to optimize the design of new pest surveys that might be conducted in the future, we determine which type of observations should be favored. For that purpose, we compare the results obtained with the "100% Y - 0% Z", "0% Y - 100% Z", "100% Y / 100% Z" and "100% W" scenarios (recall that W represents the unobserved number of aphids on each plant in the sample; see @eq-model_W), for different values of $\alpha_0$, which defines the average number of infested plants. With $\alpha_0 = -1$, the proportion of infested plants is generally much lower than one, while with $\alpha_0 = 2$, 100% of plants are generally infested. The case $\alpha_0 = 1$ leads to intermediate levels of infestation.
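To give an order of magnitude for the link between $\alpha_0$ and the proportion of infested plants, the sketch below relies on a deliberately simplified assumption that is not taken from the model equations: the number of aphids per plant follows a Poisson distribution with mean $\exp(\alpha_0)$, ignoring trial effects, time and treatment effects. Under this assumption, a plant is infested with probability $1 - \exp(-\exp(\alpha_0))$.

```r
# Simplifying assumption (for illustration only): W ~ Poisson(lambda) with
# log(lambda) = alpha0, ignoring trial, time and treatment effects.
alpha0 <- c(A = -1, B = 1, C = 2)

lambda <- exp(alpha0)            # expected number of aphids per plant
p_infested <- 1 - exp(-lambda)   # P(W > 0) under the Poisson assumption

round(p_infested, 3)
```

Under this simplification, the probability of infestation is roughly 0.31 for $\alpha_0 = -1$, 0.93 for $\alpha_0 = 1$ and above 0.99 for $\alpha_0 = 2$, which is consistent with the three contrasting levels of infestation described above. The between-trial variability of the full model spreads the actual proportions around these values.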
The three parameter sets used to generate the data are given in @tbl-params-future and are labeled A, B and C.
```{r table7, cache = FALSE, echo = FALSE}
#| label: tbl-params-future
#| tbl-cap: "Parameters considered for the design of future pest surveys."
tab7 = data.frame(Parameters = c(
# alpha_0 and alpha_1
c("\u03B1\u2080", "\u03B1\u2081"),
# gamma_0
c("\u03B3\u2080\u2080", "\u03B3\u2080\u2081", "\u03B3\u2080\u2082", "\u03B3\u2080\u2083"),
# gamma_1
c("\u03B3\u2081\u2080", "\u03B3\u2081\u2081", "\u03B3\u2081\u2082", "\u03B3\u2081\u2083"),
# sigma_0, eta and chi
c("\u03C3\u2080", "\u03B7", "\u03C7")
),
"Values set A" = c(-1, alpha1, gamma0, gamma1, sig0, eta, chi),
"Values set B" = c(1, alpha1, gamma0, gamma1, sig0, eta, chi),
"Values set C" = c(2, alpha1, gamma0, gamma1, sig0, eta, chi)
)
tab7 %>%
t %>%
`rownames<-`(c("Set", "A", "B", "C")) %>%
kbl(booktabs = TRUE) %>%
column_spec(2, bold = TRUE) %>%
kable_styling()
```
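To make the role of $\alpha_0$ concrete, the following minimal sketch converts each value of $\alpha_0$ into an expected proportion of infested plants. It assumes, for illustration only, that the number of aphids per plant W follows a Poisson distribution with log-mean $\alpha_0$, ignoring the time effect, the treatment effects and the random trial effects of the full model. The helper name `expected_prevalence` is hypothetical and is not part of the model code.

```r
# Illustrative sketch (assumption): W ~ Poisson(lambda) with log(lambda) = alpha_0,
# so a plant is infested (W > 0) with probability 1 - exp(-lambda).
expected_prevalence <- function(alpha0) {
  lambda <- exp(alpha0)  # mean number of aphids per plant
  1 - exp(-lambda)       # P(W > 0)
}

# Values of alpha_0 for parameter sets A, B and C
alpha0_sets <- c(A = -1, B = 1, C = 2)
prevalence <- expected_prevalence(alpha0_sets)
round(prevalence, 3)
```

Under these simplifying assumptions, the expected prevalence is about 0.31 for set A, about 0.93 for set B and above 0.99 for set C, in line with the low, intermediate and near-complete infestation levels described above.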
@fig-height (A.1, B.1 and C.1) shows the mean absolute error $E_{\gamma}$ (@eq-E_gamma) as a function of the number of trials for the four scenarios "100% Y - 0% Z", "0% Y - 100% Z", "100% Y / 100% Z" and "100% W". @fig-height (A.2, B.2 and C.2) shows the distributions of infested plants with 40 trials corresponding to the three values of $\alpha_0$ reported in @tbl-params-future.
In case A (@tbl-params-future), the distribution of Z is such that Z is rarely close to 1 and often lower than 0.5 (@fig-height A.2). In case C, the distribution of Z is such that Z is often very close to 1 (100% of plants infested). Case B is intermediate. The accuracy of the estimated values of the model parameters $\gamma$ is better with scenario "100% Y - 0% Z" than with scenario "0% Y - 100% Z", for all numbers of trials. The advantage of "100% Y - 0% Z" is stronger in cases of high pest prevalence (i.e., cases B and C) but very small in the case of low pest prevalence (case A). For example, with 20 trials, the mean absolute error is 27% lower in the scenario "100% Y - 0% Z" than in "0% Y - 100% Z" for parameter set C (0.55 vs. 0.75), 10% lower for parameter set B (0.55 vs. 0.62), and not different for parameter set A (0.64). The "100% W" scenario leads to results similar to those of "100% Y - 0% Z", regardless of $\alpha_0$ and the number of trials. Results obtained with "100% Y / 100% Z" are generally similar to those obtained with "100% Y - 0% Z" and "100% W" but better than those obtained with the scenario "0% Y - 100% Z" in cases B and C. Here again, results show that 20 trials are not sufficient to obtain accurate parameter estimates.
```{r figure8, cache = FALSE, out.width="100%", fig.height = 6, echo = FALSE}
#| label: fig-height
#| fig-cap: "Comparison of the \"100% Y - 0% Z\", \"0% Y - 100% Z\", \"100% Y / 100% Z\" and \"100% W\" scenarios according to the distribution of $Z$ and the number of trials, using the $E_{\\gamma}$ criterion (@eq-E_gamma). **A**, **B** and **C** correspond to different $Z$ distributions which are given by A.2, B.2 and C.2 (distribution for a number of trials equal to 40). A, B and C respectively correspond to $\\alpha_0 =$ -1, 1 and 2. The details of the simulation parameters are given in @tbl-params-future. A.1, B.1 and C.1 represent the absolute error $E_{\\gamma}$ averaged over the 974 simulated data sets as a function of the number of trials. Colors correspond to the different scenarios."
res_simu_q3_1 = readRDS(file = "results/Simulations/res_simu_q3_1.rds");
res_simu_q3_2 = readRDS(file = "results/Simulations/res_simu_q3_2.rds")
res_simu_q3_1 = res_simu_q3_1 %>% mutate(Type = recode(Type,
"100% Z" = "0% Y - 100% Z",
"100% Y" = "100% Y - 0% Z"))
types = res_simu_q3_1$Type %>% unique
col = df_col %>% arrange(type) %>% filter(type %in% types) %>% select(col) %>% as.matrix %>% as.vector
g1 = suppressWarnings(ggplot(res_simu_q3_1) + geom_point(aes(x = I, y = Mean_MAE, color = Type), size = 1) +
geom_line(aes(x = I, y = Mean_MAE, color = Type), size = 0.2) +
facet_wrap(~ alpha0, ncol = 1, labeller = label_wrap_gen(multi_line = FALSE)) +
ylab(TeX("Mean relative absolute error for $\\gamma$")) + scale_color_manual(values = col[c(1, 2, 3, 4)]) + theme(plot.margin = margin(l = 0)) +
xlab("Number of trials") + theme(legend.position = "bottom", plot.margin = margin(t = 0, l = 0)))
legend = cowplot::get_legend(g1)
g2 = suppressWarnings(ggplot(res_simu_q3_2) + geom_histogram(aes(Z), stat = "count", fill = "#4345a1") + facet_wrap(~ alpha0, ncol = 1) +
ylab("Number of observations") + theme(plot.margin = margin(r = 0, l = 15, t = 0)) + xlab("Number of infested beets"))
suppressWarnings(do.call("grid.arrange", c(list(legend), list(g1 + theme(legend.position = "none"), g2), list(ncol = 2, layout_matrix = rbind(c(2, 3), c(1, 1)), heights = c(1, 0.2)))))
```
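The loss of information in prevalence data at high infestation levels can be illustrated with a small simulation. The sketch below is not the hierarchical model fitted in this paper: it assumes a single homogeneous plot in which the number of aphids per plant is Poisson with log-mean $\alpha_0$, and compares a naive intensity-based estimate of $\alpha_0$ with a naive prevalence-based one. The sample size `n_plants`, the number of replicates `n_rep`, the bounding of the estimates away from 0 and 1, and the function name `compare_estimators` are all hypothetical choices made for illustration.

```r
# Illustrative simulation (assumptions): one homogeneous plot of n_plants plants,
# W ~ Poisson(lambda) aphids per plant with log(lambda) = alpha_0, no covariates
# and no random effects. This is NOT the hierarchical model fitted in the paper.
set.seed(1)
n_plants <- 50  # hypothetical number of plants sampled per plot
n_rep <- 2000   # hypothetical number of simulated plots

compare_estimators <- function(alpha0) {
  lambda <- exp(alpha0)
  errors <- replicate(n_rep, {
    w <- rpois(n_plants, lambda)
    # Intensity-based estimate: mean count per plant (bounded away from 0)
    lambda_count <- max(mean(w), 0.5 / n_plants)
    # Prevalence-based estimate: invert P(W > 0) = 1 - exp(-lambda);
    # the proportion is kept inside (0, 1) so that the logarithm stays finite
    p_hat <- min(max(mean(w > 0), 0.5 / n_plants), 1 - 0.5 / n_plants)
    lambda_prev <- -log(1 - p_hat)
    c(intensity = abs(log(lambda_count) - alpha0),
      prevalence = abs(log(lambda_prev) - alpha0))
  })
  rowMeans(errors)  # mean absolute error on the log scale
}

# One column per parameter set, one row per type of observation
mae <- sapply(c(A = -1, B = 1, C = 2), compare_estimators)
round(mae, 2)
```

Under these assumptions, one would expect the two errors to be close for set A and the prevalence-based error to be clearly larger for sets B and C, because almost every sampled plant is infested and the observed proportion carries little information on the underlying aphid density.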
# Conclusion
In order to evaluate pest treatment efficacy, numerous trials are conducted to monitor pest prevalence and intensity. Quite often, only one type of data is available and, when both prevalence and intensity are available, they are usually analyzed separately. In this paper, we propose an alternative approach based on a hierarchical statistical model able to analyze intensity and prevalence data simultaneously.
We successfully apply the model to a real data set including prevalence and intensity data collected to evaluate three pesticide treatments against aphids in sugar beets. The model is fitted to this data set using a Markov chain Monte Carlo algorithm, and convergence is quickly achieved after a few thousand iterations. Results show that the use of both prevalence and intensity data leads to a substantial reduction of the uncertainty in the parameter estimates, compared to the use of a single type of data.
Results obtained from simulated data confirm that, when pest prevalence and pest intensity are collected separately in different trials, the model parameters are more accurately estimated by combining both prevalence and intensity trials than by using only one type of trial. We also find that, when prevalence data are collected in all trials and intensity data are collected in a subset of trials, estimations and pest treatment ranking are more accurate using both types of data than using prevalence data only. Moreover, when only one type of observation can be collected in a pest survey or in an experimental trial, our analysis indicates that it is usually better to collect intensity data than prevalence data, especially in situations where all or most of the plants are expected to be infested. Finally, our simulations show that accurate results are unlikely to be obtained with fewer than 40 trials when assessing the efficacy of pest control treatments based on prevalence and intensity data.
Although our framework is illustrated here by comparing the efficacy of plant pest treatments, it could be applied to other areas of research in the future, in particular to optimize designs used in animal and human epidemiology. Note that the final choice of a design should depend on local constraints. As the model code is made fully available, we believe that it could be used by different institutes to compare many different designs in the future, not only the types of designs considered in our paper. Of particular interest is the ability of our model to optimize sample sizes, the impact of which depends on the relative importance of within-trial variability compared to between-trial variability.
# Author contributions {.appendix}
AF and DM designed the study. AF performed the computations. AF and DM wrote the paper.
# Funding {.appendix}
This work was partly funded by the project SEPIM (PNRI) and by the RMT SDMAA.
# Data availability {.appendix }
Simulated data and model parameters are available without restriction. The original experimental data may be available upon request.
# Acknowledgements {.appendix }
We are grateful to Anabelle Laurent, Elma Raaijmakers, Kathleen Antoons and to the institute ITB (https://www.itbfr.org/) for their comments on this project.
We are grateful to the INRAE MIGALE bioinformatics facility (MIGALE, INRAE, 2020. Migale bioinformatics Facility, doi: 10.15454/1.5572390655343293E12) for providing help and/or computing and/or storage resources.
The authors are thankful to the institutes that provided us with the data, namely the French Institut Technique de la Betterave, the sugar beet organisation of the Netherlands, and the Institut Royal Belge pour l'Amélioration de la Betterave.
# Supplementary material {#sec-supplementary-material .unnumbered}
{{< include published-202312-favrot-hierarchical-supp.qmd >}}
# References {.unnumbered}
::: {#refs}
:::