DIVERSITY_traits_imputation.Rmd

---
title: "DIVERSITY traits : imputation"
date: "26/10/2020"
output:
  html_document:
    number_sections: no
    toc: yes
    toc_float:
      collapsed: false
      smooth_scroll: false
---	


```{r, echo=FALSE}
library(knitr)
```

<br/>
<br/>


# Glossary

- **Distribution of missing data** = 
    - MCAR = missing completely at random
    - MAR = missing at random (when the presence of missing data for a given variable is related to the values of another variable)
    - MNAR = missing not at random

$\Rightarrow$ in functional trait databases, missingness is related to the frequency of the species and their abundances, and taxonomic and phylogenetic bias is commonplace $/!\setminus$

<br/>

- **Deletion method** =
    - delete species with missing data for the calculation of diversity indices
    - acceptable for estimation of the community-weighted mean trait value (CWM)  
    as long as it only concerns the minor species (should not exceed 20$\%$ of the total biomass of the community)

<br/>

- **Gower + PCoA** =
    - compute Gower distance (with missing data) and project the distance with a Principal Coordinate Analysis,  
    the axes being then used as functional traits
    - only relevant for functional diversity indices calculated from several traits
    - trait information gets lost and only multivariate approaches can be used

<br/>

- **Imputation method** = replace missing data with substituted values

- **Simple imputation** =  a single value is imputed for each missing datum

- **Multiple imputation** = Monte Carlo technique in which the missing values are replaced by $m > 1$ imputed values, combined to produce estimates, confidence intervals, missing data uncertainty 

- **Likelihood-based imputation** = imputation and analysis are conducted simultaneously


<br/>
<br/>


# Specific traits

<br/>

## 2014 : Tamme (Ecology)

*Measure the predictive power of simple plant traits to estimate species’ maximum dispersal distances :*

- dispersal distance is related **to a dispersal syndrome** (Willson 1993, Pärtel & Zobel 2007, Vittoz & Engler 2007), that may generally be deduced from seed morphology, but difficult to establish a direct association (Nogales 2007)
- **plant height** more important than seed mass in determining seed dispersal distances (Thomson et al. 2011)
- **maximum dispersal distance strongly correlated with mean dispersal distance** (Thomson et al. 2011) :

$$ log_{10}(\text{ maximumDist }) = 0.795 + 0.984 * log_{10}(\text{ meanDist })$$

<br/><br/>

<div style="text-align:center;">
<pre>depending on the set of species used and the trait data available,
different traits are important in describing dispersal distance patterns</pre>

<br/>

$\Rightarrow$ **(dispersal syndrome + growth form + terminal velocity)**  
explain $\sim$ 60$\%$ of the variation in maximum dispersal distances

$\Rightarrow$ **(dispersal syndrome + growth form)**  
explain $\sim$ 50$\%$ of the variation in maximum dispersal distances
</div>

with

* on average, maximum dispersal distances increase from :  
<div style="text-align:center;">
wind (no special mechanisms) < ballistic < ant < wind (special mechanisms) < animal  
herb < shrub < tree
</div>
* models with terminal velocity only apply to species with wind or ballistic dispersal syndrome
* in general, short-distance dispersal overestimated, and long-distance dispersal underestimated

<div style="text-align:center;">
$\Rightarrow$ **`dispeRsal` package**
</div>

<br/><br/>


# Comparison of methods

<br/>

## 2014 : Taugourdeau

*Effect of imputation methods on the evaluation of trait values at species level and on the subsequent calculation of functional diversity indices at community level :*

<div style="text-align:center;">
<pre>None of these methods was able to perform best for all the traits and indices</pre>

<pre>at the species level, the most accurate imputation method is not the same for all traits 
and in all cases, but dissimilarity and/or relationships method always the most accurate 
among the single imputation methods</pre>

<pre>key parameter to choose the adequate imputation method 
= the distribution of the trait value in the dataset
applying a transformation method to improve the distribution of the trait values prior to  
using a imputation method could be useful in improving the quality of the replacement</pre>

<pre>30% of the data missing = percentage of missing data as a limit 
for the utilization of these single imputation methods</pre>
</div>

- **Average method** = least accurate
- **Functional proximity between species** =
    - best method for traits with *unbalanced distribution*
    - but not an appropriate choice for a study on functional distance between species,  
    as functional distance would then be underestimated
    - less affected by the percentage of deleted data (but Gower dissimilarity cannot be calculated between two species if no trait for both species : not possible if missing data are too numerous)


- **Relationships between traits** =
    - best method for traits with *balanced distribution*
    - but the most sensitive to the level of missing data

- **MICE** = more accurate than all other methods for all traits except SLA
    - the correction model can be adapted to the distribution of the variable, but caution with unbalanced traits

<br/>

## 2014 : Penone

*Evaluate the performance of four approaches for estimating missing values in trait databases and test whether imputed datasets retain underlying allometric relationships among traits :*

<div style="text-align:center;">
<pre>adding phylogenetic information reduced the error for all approaches except kNN 
this result was stronger for missForest than for mice

whithout phylo, mice gave better results than missForest
this difference was not significant whith phylo</pre>

<pre>bias was lower when missing data were imputed rather than deleted,
especially when more than 30% of the data were missing.

with phylo, all approaches had comparable bias, 
significantly lower than bias in datasets with missing data

phylo reduced the bias for missForest but not for mice
</pre>

<pre>imputation approaches with phylogenetic variance-covariance matrix (Phylopars) 
and phylogenetic eigenvectors (missForest and mice) gave similar results

phylo did not improve estimation equally among traits,
being more important where phylogenetic signal is stronger 
and when there are no other traits with strong signal</pre>
</div>

<br/>

## 2015 : Schrodt

*A new approach (BHPMF) which imputes trait values based on the taxonomic hierarchy, structure within the trait matrix and trait–environment relationships at the same time as providing uncertainty estimates for each single trait prediction.*

<div style="text-align:center;">
<pre>prediction accuracy is not related to the number of entries per trait</pre>

<pre>BHPMF outperforms the species MEAN baseline in all aspects: 
RMSE and R2 of predicted versus observed entries are smaller and larger, respectively, 
for each individual trait, and trait–trait correlations are better retrieved 

aHPMF provides a concept to extrapolate from point measurements to species ranges 
accounting for intraspecific variability</pre>

<pre>explicitly taking environmental constraints into account adds little or no improvement to BHPMF</pre>

<pre>Advancing BHPMF to account for phylogenetic distances instead of taxonomy will improve 
opportunities for efficient out-of-sample predictions, at least in well-resolved clades</pre>
</div>

<br/>


## 2018 : Poyatos

*Assess the performance of different imputation methods to fill simulated gaps at different missingness levels in a spatially explicit plant trait dataset*

<div style="text-align:center;">
<pre>No method performed best consistently for the five studied traits, but, considering all traits 
and performance metrics, MICE informed by relevant ecological variables gave the best results.
However, at higher missingness (>30 %), species mean imputations and regression kriging 
tended to outperform MICE for some traits.</pre>

<pre>low missingness rates (10 %) = mice and kNN more accurate (NRMSE) than Mean
moderate/high missingness = mice and kNN comparable/outperformed by Mean, and specially by OrdKrig

trait covariation did not improve imputations at high missingness

results for OrdKrig, compared to mice and kNN, show that spatial structure, rather than
trait covariation, may provide more accurate trait imputations when gaps are frequent</pre>

<pre>Introducing auxiliary variables as predictors improved MICE performance substantially 
but these improvements were dependent on the specific predictor set and trait</pre>

<pre>
Using imputed or incomplete datasets did not lead to large differences 
in the studied trait relationships when missingness was < 50%
At high missingness, using imputed datasets led to comparatively larger 
departures from the relationships obtained with the complete dataset
No imputation method appeared to perform consistently better than others in preserving trait
relationships at high missingness levels and, under these conditions, using incomplete datasets
appeared to correctly reproduce the observed trait relationships in the complete dataset
</pre>
</div>

<br/>


## 2020 : Johnson

*Evaluate the performance of approaches for handling missing values when considering biased datasets*

<div style="text-align:center;">
<pre>Rphylopars = most accurate estimate of missing values, best preserve response–trait slope

estimates of missing data were still inaccurate, even with only 5% of values missing

Under severe biases, errors were high with every approach.
complete-case analysis frequently outperform Mice imputation and, to a lesser degree, BHPMF
</pre>

<pre>If the objective of the imputation is to produce estimates of missing values, 
single imputation is considered most effective
if the objective is to model imputed values against another variable, 
the added error in the multiple imputation is advantageous</pre>

<pre>Including phylogenetic information generally improved imputation performance in every method</pre>

<pre>Including the response in the imputation :
- when response–trait slope positive : decreased imputation error in all approaches, 
decreased slope error substantially in Mice (-> almost comparable with Rphylopars), 
increased slope error in Rphylopars and BHPMF
- when no relationship between trait and response : 
increased imputation and slope errors in every approach, but with a small effect

-> broadly advisable to include response with Mice, to exclude it with Rphylopars and BHPMF</pre>

<pre>Missingness, phylogenetic clustering and change in mean were important predictors of 
slope error, significant differences in slope error and imputation error.</pre>

<pre>there is no single best solution to deal with missing data
researchers need to assess the available data and consider the need for imputation versus 
limiting the scope of the study or completing analyses for separate groups. Use of data 
imputation should be scrutinized, checking for changes in the data before and after imputation, 
measuring: missingness, phylogenetic clustering, a change in mean and a change in slope</pre>

<pre>The threshold for deciding whether imputation is accurate depends on the research question.</pre>
</div>

- Currently, at least 160 packages for handling missing data available on the R-CRAN repository (Josse et al., 2020) $\rightarrow$ Testing all available imputation methods was not feasible
- imputation can only be successful if it accounts for the mechanism by which data are missing
- impact of phylogenetic signal strength on imputation performance already tested (Kim 2018, Molina-Venegas 2018); therefore, standardization of Pagel’s λ between phylogeny and traits at $\sim$ one

<br/> <br/>


# Synthesis

<br/>

## Methods description

<br/>

| METHODS | Group | Precisions | Use phylo | PAPER | SOFTWARE |
|-----------|------|---------------------------|---------|-----------|---------|
| mean/median | simple | replace with the mean/median of available trait values | No |
| kNN | simple | k-Nearest Neighbour | (eigenvectors from PCoA) | Troyanskaya 2001 | `VIM` pkg |
| random forest | simple | | (eigenvectors from PCoA) | | `missForest` pkg |
| dissimilarity | simple | *species with the same functional strategy have a similar set of functional traits* <br/> Gower distance + threshold | No | Westoby 2002, Diaz 2004, Taugourdeau 2014 | `FD` pkg |
| relationship | simple/multiple | regression models <br/> but also : kNN, random forest, matrix factorization... | (eigenvectors from PCoA) | Wrigth et al. 2004, 2006 | `dispeRsal` function (Tamme 2014) |
| ordinary/regression kriging | simple | | No | | `automap` pkg | 
| (R)phylopars | simple (but associated variance) | likelihood-based approach (`Phylopars`) using both phylogeny and allometric relationships among traits (phylogenetic variance-covariance matrix) | Yes | Bruggeman 2009, Goolsby 2017 | `Phylopars` soft, `Rphylopars` pkg |
| PMF | simple | Probabilistic Matrix Factorization, models a sparse matrix as the scalar product of 2 latent matrices to find a factorization minimizing the error between predicted and observed data | No | | |
| HPMF | simple | Hierarchical Probabilistic Matrix Factorization, coupled with phylogenetic information | (taxonomy) | Shan 2012 | |
| BHPMF | simple (but associated variance) | Bayesian Hierarchical Probabilistic Matrix Factorization, coupled with a Gibbs sampler | (taxonomy) | Schrodt 2015 | `BHMPF` pkg |
| MICE | multiple | Multivariate imputation by chained equations | (eigenvectors from PCoA) | Azur 2011 | `mice` pkg |
| jomo | multiple | JOint MOdelling approach for multiple imputation of multilevel data | | Quartagno 2019 | `jomo` pkg |
    
<br/><br/>

## Papers making comparisons

<br/>

| PAPERS |  Tamme (2014) | Taugourdeau (2014) | Penone (2014) |  Schrodt (2015) |  Poyatos (2018) |  Johnson (2020) |
|---|------------|-----|--------|------------|-----------|-------------|
| **no fam** | 102 | | | 358 | 4 | |
| **no sp** | 576 | 1054 | 273 | 14320 | 13 | 500 (sim) |
| **no traits (resp)** | 1 | 9 | 4 | 13 | 5 | 1 |
| **no traits (used)** | 5 | | | | | 4 |
| **no methods** | + relationship | + mean <br/> + median <br/> + dissimilarity <br/> + relationship <br/> + mice | + kNN <br/> + mice <br/> + missForest <br/> + Phylopars | + PMF <br/> + BHMPF <br/> + aHMPF | + mean and species mean <br/> + ordinary and regression kriging <br/> + kNN <br/> + mice | + BHMPF <br/> + Rphylopars <br/> + mice |
| **missing data** | | (sim) [10 prob of deletion $0.01 \rightarrow 0.46$] x [10 rep] | (sim) [$10\% \rightarrow 80\%$] x [10 rep] | (obs) mean/trait $\sim 79.9\%$ | (sim) [$10\% \rightarrow 80\%$] x [30 rep] | (sim) [2 resp-trait slopes] x [2 cor levels] x [4 bias types] x [2 bias severity] x [$5\% \rightarrow 80\%$] x [10 rep] |
| **evaluation** | $R^2$ | MRdAE | - NRMSE <br/> - effect of dataset/missing data/method on errors <br/> - slope deviation | - RMSE <br/> - $R^2$ <br/> - SMA regression <br/> - procrustes analysis | - NRMSE, KGE <br/> - correlation matrices deviation | - RMSE <br/> - slope deviation |
| **software** | `dispeRsal` function | 2 author's functions, `mice` pkg | `VIM`, `mice`, `missForest` pkg, `Phylopars` soft | `MATLAB` | `automap`, `VIM`, `mice` pkg| `BHMPF`, `Rphylopars`, `mice` pkg |

<br/><br/>

## Methods +/- (according to papers)

<br/>

| METHODS | Positive points | Negative points |
|---------|-------------------------------|-------------------------------|
| **mean/median** | + accurate for datasets with small percentages of missing values (Schafer 1999) | - least accurate (Taugourdeau 2014) |
| | | - ignore the variance of the imputed variables (Blonder 2016) 
| | | - severely altered trait distributions, introduced larger errors in selected trait correlations, tended to cause larger deviations in the correlation matrix (Poyatos 2018) |
| **kNN** | + can deal with categorical variables (nominal or ordinal) (Penone 2014) | - produced larger errors and induced more bias in the allometric relationship <br/> - must specify a value of the tuning parameter k which is difficult to determine a priori (Penone 2014) |
| | | - tends to introduce larger bias in bivariate trait relationships compared to MICE (Poyatos 2018) |
| **random forest** | + performed better without including phylogeny <br/> + can deal with categorical variables (nominal or ordinal) (Penone 2014) | |
| **dissimilarity** | + Gower dissimilarity can be computed with missing data | |
| | + most accurate when the trait distribution is unbalanced (Taugourdeau 2014) | - cannot be calculated between two species if no trait is documented for both species (Taugourdeau 2014) |
| **ordinary/regression kriging** | | - altered distributions and trait correlations more than mice but they performed similarly in terms of delta cormat at all missingness levels (Poyatos 2018) |
| **relationship** | | - depending on the set of species used and traits available, different traits are important (Tamme 2014) |
| | | - requires several traits per plant/species to be documented <br/> - in most case less accurate than the dissimilarity method <br/> - does not perform well on very unbalanced traits (like SNP) because the multilinear model is strongly governed by extreme values <br/> - very sensitive to the percentage of missing data (Taugourdeau 2014) |
| **(R)phylopars** | + uses a phylogeny and a sparse trait matrix to estimate simultaneously the across-species (phylogenetic) and within-species (phenotypic) trait covariance (similar to a phylogenetic mixed model) to reconstruct the ancestral state and impute missing values (Goolsby 2017) | - depends on the phylogenetic signal in a trait <br/> - cannot deal with categorical traits |
| **PMF** | | - like PCA, efficient if the original matrix is of low rank, i.e. if the axes of the original matrix provide strong correlations <br/> - accuracy worse than using species mean trait values to fill the gaps (Schrodt 2015) |
| **HPMF** | + satisfactory to predict trait values when information at the genus level is available (Shan 2012) <br/> + needs only at least one trait value per plant | |
| **BHPMF** | + On average, across all traits, outperforms PMF, MEAN and aHPMF, with MEAN being significantly more accurate than PMF <br/> + advantage over MEAN largest for ‘physiological traits’(leaf N, leaf P), and smaller for more ‘structural traits’ (seed mass, plant height) <br/> + BHPMF and MEAN capture these general trait–trait correlations, but BHPMF reproduces extreme values more accurately than MEAN and is therefore generally better at capturing the shape of the scatter of observed trait data (Schrodt 2015) | - presence of strong trait–trait correlations is a prerequisite for the accuracy of BHPMF (Schrodt 2015) |
| **MICE** | + more accurate than all other methods for all traits except for the specific leaf area (Taugourdeau 2014) | - affected by the percentage of missing data for 6 (whole subdatabase) and 7 (herbaceous subdatabase) traits (Taugourdeau 2014) |
| | + smaller error and bias as compared to other multiple imputation approaches (Ambler, Omar & Royston 2007) | |
| | + performed better without including phylogeny <br/> + can deal with categorical variables (nominal or ordinal) (Penone 2014) | - linear dependencies between variables cause fatal errors and should be eliminated before imputation (Penone 2014) |
| | + closely tracked observed trait distributions, introduced the least error in trait correlations under high missingness levels and yielded low delta cormat at extreme missingness levels (Poyatos 2018) <br/> + perform well when there is no true relationship between response and trait (Johnson 2020) | - perform poorly when there is a positive relationship between response and trait, even when including the response <br/> - phylogenetic information should only be included when data are missing with no bias or a weak bias(Johnson 2020) |
| **jomo** | | |

<br/>
<br/>


# Citations

<!-- - Audigier V, White I R, Jolani S, Debray T PA, Quartagno M, Carpenter J, van Buuren S and Resche-Rigon M. 2018. “Multiple Imputation for Multilevel Data with Continuous and Binary Variables.” Statistical Science 33 (2): 160–83. https://doi.org/10.1214/18-STS646 -->
- Azur M J, Stuart E A, Frangakis C and Leaf P J. 2011. “Multiple Imputation by Chained Equations: What Is It and How Does It Work?” International Journal of Methods in Psychiatric Research 20 (1): 40–49. https://doi.org/10.1002/mpr.329
<!-- - Blonder, Benjamin. 2016. “Do Hypervolumes Have Holes?” American Naturalist 187 (4): E93–105. https://doi.org/10.1086/685444 -->
- Diniz-Filho, J A F, Villalobos F and Bini L M. 2015. “The Best of Both Worlds: Phylogenetic Eigenvector Regression and Mapping.” Genetics and Molecular Biology 38 (3): 396–400.
- Goolsby, E W, Bruggeman J and Ané C. 2017. “Rphylopars: Fast Multivariate Phylogenetic Comparative Methods for Missing Data and within-Species Variation.” Methods in Ecology and Evolution 8 (1): 22–27. https://doi.org/10.1111/2041-210X.12612
<!-- - Huque Md H, Carlin J B, Simpson J A and Lee K J. 2018. “A Comparison of Multiple Imputation Methods for Missing Data in Longitudinal Studies 01 Mathematical Sciences.” BMC Medical Research Methodology 18 (1): 1–16. https://doi.org/10.1186/s12874-018-0615-6 -->
- Johnson T F, Isaac N J B, Paviolo A and González‐Suárez M. 2020. “Handling Missing Values in Trait Data.” Global Ecology and Biogeography, no. December 2019: 1–12. https://doi.org/10.1111/geb.13185
<!-- - Kim S W, Blomberg S P and Pandolfi J M. 2018. “Transcending Data Gaps: A Framework to Reduce Inferential Errors in Ecological Analyses.” Ecology Letters. https://doi.org/10.1111/ele.13089 -->
- Molina-Venegas R, Moreno-Saiz J C, Parga I C, Davies T J, Peres-Neto P R and Rodríguez M. 2018. “Assessing Among-Lineage Variability in Phylogenetic Imputation of Functional Trait Datasets.” Ecography 41 (10): 1740–49. https://doi.org/10.1111/ecog.03480
- Penone C, Davidson A D, Shoemaker K T, Di Marco M, Rondinini C, Brooks T M, Young B E, Graham C H and Costa G C. 2014. “Imputation of Missing Data in Life-History Trait Datasets: Which Approach Performs the Best?” Methods in Ecology and Evolution 5 (9): 961–70. https://doi.org/10.1111/2041-210X.12232
- Poyatos R, Sus O, Badiella L, Mencuccini M and Martínez-Vilalta J. 2018. “Gap-Filling a Spatially Explicit Plant Trait Database: Comparing Imputation Methods and Different Levels of Environmental Information.” Biogeosciences 15 (9): 2601–17. https://doi.org/10.5194/bg-15-2601-2018
- Quartagno M, Grund S and Carpenter J. 2019. “Jomo: A Flexible Package for Two-Level Joint Modelling Multiple Imputation.” R Journal 11 (2): 205–28. https://doi.org/10.32614/RJ-2019-028
- Schrodt F, Kattge J, Shan H, Fazayeli F, Joswig J, Banerjee A, Reichstein M, et al. 2015. “BHPMF - a Hierarchical Bayesian Approach to Gap-Filling and Trait Prediction for Macroecology and Functional Biogeography.” Global Ecology and Biogeography 24 (12): 1510–21. https://doi.org/10.1111/geb.12335
- Shan H, Kattge J, Reich P B, Banerjee A, Schrodt F and Reichstein M. 2012. “Gap Filling in the Plant Kingdom - Trait Prediction Using Hierarchical Probabilistic Matrix Factorization.” In Proceedings of the 29th International Conference on Machine Learning, ICML 2012, 2:1303–10.
- Tamme R, Götzenberger L, Zobel M, Bullock J M, Hooftman D AP, Kaasik A and Pärtel M. 2014. “Predicting Species’ Maximum Dispersal Distances from Simple Plant Traits.” Ecology 95 (2): 505–13. https://doi.org/10.1890/13-1000.1
- Taugourdeau S, Villerd J, Plantureux S, Huguenin-Elie O and Amiaud B. 2014. “Filling the Gap in Functional Trait Databases: Use of Ecological Hypotheses to Replace Missing Data.” Ecology and Evolution 4 (7): 944–58. https://doi.org/10.1002/ece3.989