Steps to impute multiple columns in a data frame using recursive partitioning and regression trees method implemented in R {dlookr} package
- Usually, we have an abundance matrix. Here we are going to use a data frame, so use the
tibble::rownames_to_column()
function to have the id column in your dataset; - This strategy is dependent on the {dlookr} package. I'm using the version dlookr_0.6.3;
- You already know something about the missing values in your dataset, that is, whether they are distributed at random or not;
- It is highly recommended to reduce the sparsity in your data before imputing, regardless of the quality of the method;
- Please, take a look at the {dlookr} R package documentation.
- We create a list to store the imputed columns;
- Loop through the columns to be imputed by applying the
dlookr::imputate_na
function from {dlookr} package to each column; - Return the list with the imputed columns
- data = a data frame containing the columns to be imputed and the id column;
- columns = the columns to be imputed. Here we use the column names from the second column to the last one;
- Here you can use the argument
colnames(your_dataframe)[2:ncol(your_dataframe)]
to iterate over the column names.
- Here you can use the argument
- id_column = the column to be used as id (e.g., gene, protein, phosphosite, etc.);
- method = the method to be used for imputation. Here we use the rpart method, but you can use any other method available in the {dlookr} package.
imputate_multiple_columns <- function(data, columns,
id_col, method = "rpart") {
imputation_list <- list()
for (i in columns) {
imputation_list[[i]] <- dlookr::imputate_na(data,
i, id_col,
method = method)
}
return(imputation_list)
}