
Memory optimization for large datasets #241

Open
JHHatfield opened this issue Sep 28, 2022 · 5 comments
Comments

@JHHatfield
Contributor

Some of the functions have relatively high memory requirements (e.g. formatOccData and occDetFunc). From what I can see these are mostly caused by the cast and merge steps. I have replaced some of the reshape2 functions in formatOccData with data.table equivalents. This seems to reduce the memory requirement and works as a small fix, but doing it comprehensively looks complex.

@AugustT
Member

AugustT commented Sep 28, 2022

@JHHatfield Nice to see you are still swimming these waters. It would be good to make a record of where else you see these changes being needed, to aid future work to overhaul the package and use data.table functions. Also, would you like to make a pull request with the changes you have already made?

@JHHatfield
Contributor Author

I have submitted my quick fix for formatOccData, which deals with the size limit hit by the reshape2 version of dcast. The issue is that data.table requires data.tables instead of data.frames, and the syntax differences mean a full overhaul would need a lot of changes. I got around it here by using setDT() and then setDF() to go from data.frame to data.table and back. I suppose the question is whether the memory usage is a big enough problem to warrant such changes.
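For illustration, a minimal sketch of the setDT()/setDF() pattern described above (the toy temp data frame and its columns are hypothetical, chosen to mirror the dcast call quoted later in the thread; this is not the actual formatOccData code):

```r
library(data.table)

# Hypothetical long-format input: one row per visit x species record
temp <- data.frame(
  visit        = c("v1", "v1", "v2"),
  species_name = c("a", "b", "a"),
  pres         = c(TRUE, TRUE, TRUE)
)

data.table::setDT(temp)  # convert data.frame -> data.table in place (by reference)

# data.table's dcast in place of reshape2::dcast, avoiding reshape2's size limit
spp_vis <- data.table::dcast(
  temp,
  visit ~ species_name,
  value.var     = "pres",
  fill          = FALSE,
  fun.aggregate = unique
)

data.table::setDF(spp_vis)  # convert back so downstream code still sees a data.frame
data.table::setDF(temp)
```

Because setDT()/setDF() convert by reference rather than copying, the round trip itself adds essentially no memory overhead.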

@03rcooke
Contributor

03rcooke commented Oct 5, 2022

The alternative option would be to use tidyverse functions (e.g., dplyr and tidyr) to replace reshape2::dcast. This would likely be less memory intensive than reshape2::dcast, but more memory intensive than data.table. However, it would be much easier to implement because it works with data frames.

Something like:

spp_vis <- dplyr::arrange(temp, species_name) %>%
    tidyr::pivot_wider(names_from = species_name, values_from = pres, values_fill = FALSE) %>%
    dplyr::arrange(visit)

rather than:

spp_vis <- reshape2::dcast(temp, formula = visit ~ species_name, value.var = "pres", fill = FALSE, fun.aggregate = unique)

I'm not sure the arrange calls are strictly necessary, but this way the outputs are identical.

@JHHatfield
Contributor Author

Sounds good, I will have a look. The quick fix for formatOccData works pretty well, but when I started to look at occDetFunc it became clear the same approach isn't really going to work there. I will look at memory usage when switching occDetFunc over to tidyverse functions. Although, what is the plan for the function going forward if you are bringing NIMBLE in?

@03rcooke 03rcooke added this to the Sprint October/November 2022 milestone Oct 10, 2022
@03rcooke
Contributor

I think most of the code in occDetFunc will stay the same; it'll just be that there is an option to run the model in NIMBLE rather than JAGS. So I think it's worth thinking about how we could reduce the memory use and increase the speed of all the old reshape2 bits of code. There's also the tidyfast package (https://github.com/TysonStanley/tidyfast), which has a dt_pivot_wider() function that I think basically runs data.table::dcast but fits more neatly into a pipeline that uses data frames.
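As a rough sketch of how dt_pivot_wider() might slot in (argument names are assumed to follow tidyr's pivot_wider conventions, and the toy temp data frame is hypothetical; the exact tidyfast signature should be checked against its documentation before use):

```r
library(tidyfast)
library(dplyr)

# Hypothetical long-format input, as in the earlier examples in this thread
temp <- data.frame(
  visit        = c("v1", "v1", "v2"),
  species_name = c("a", "b", "a"),
  pres         = c(TRUE, TRUE, TRUE)
)

# dt_pivot_wider() is assumed to wrap data.table::dcast under the hood,
# while composing with data.frame-based pipelines like the dplyr one above
spp_vis <- temp %>%
  tidyfast::dt_pivot_wider(names_from = species_name, values_from = pres) %>%
  dplyr::arrange(visit)
```

If this benchmarks close to plain data.table::dcast, it could give most of the memory saving without the setDT()/setDF() round trips.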
