
Memory optimization for large datasets #241

Open
JHHatfield opened this issue Sep 28, 2022 · 5 comments
Comments

@JHHatfield
Contributor

Some of the functions have relatively high memory requirements (e.g. formatOccData and occDetFunc). From what I can see these are mostly caused by the cast and merge steps. I have replaced some of the reshape2 functions in formatOccData with data.table equivalents. This seems to reduce the memory requirement and works as a small fix, but doing it comprehensively looks complex.

@AugustT
Member

AugustT commented Sep 28, 2022

@JHHatfield Nice to see you are still swimming these waters. It would be good to make a record of where else you see these changes being needed, to aid future work to overhaul the package and use data.table functions. Also, would you like to make a pull request with the changes you have already made?

@JHHatfield
Contributor Author

I have submitted my quick fix for formatOccData, which deals with the size limit hit by the reshape2 version of dcast. The issue is that data.table requires data.tables instead of data.frames, and the syntax differences mean a full overhaul would need a lot of changes. I got around it here by using setDT() and then setDF() to go from data.frame to data.table and back. I suppose the question is whether the memory usage is a big enough problem to warrant such changes.
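For illustration, a minimal sketch of the setDT()/setDF() pattern described above (the toy temp data frame and its columns are hypothetical, chosen to mirror the dcast call quoted later in the thread; this is not the actual formatOccData code):

```r
library(data.table)

# Hypothetical long-format input: one row per visit x species record
temp <- data.frame(
  visit        = c("v1", "v1", "v2"),
  species_name = c("a", "b", "a"),
  pres         = c(TRUE, TRUE, TRUE)
)

data.table::setDT(temp)  # convert data.frame -> data.table in place (by reference)

# data.table's dcast in place of reshape2::dcast, avoiding reshape2's size limit
spp_vis <- data.table::dcast(
  temp,
  visit ~ species_name,
  value.var     = "pres",
  fill          = FALSE,
  fun.aggregate = unique
)

data.table::setDF(spp_vis)  # convert back so downstream code still sees a data.frame
data.table::setDF(temp)
```

Because setDT()/setDF() convert by reference rather than copying, the round trip itself adds essentially no memory overhead.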

@03rcooke
Contributor

03rcooke commented Oct 5, 2022

The alternative option would be to use tidyverse functions (e.g., dplyr and tidyr) to replace reshape2::dcast. This would likely be less memory intensive than reshape2::dcast, but more memory intensive than data.table. However, it would be much easier to implement because it works with data frames.

Something like:

spp_vis <- dplyr::arrange(temp, species_name) %>%
    tidyr::pivot_wider(names_from = species_name, values_from = pres, values_fill = FALSE) %>%
    dplyr::arrange(visit)

rather than:

spp_vis <- reshape2::dcast(temp, formula = visit ~ species_name, value.var = "pres", fill = FALSE, fun.aggregate = unique)

I'm not sure the arrange calls are strictly necessary, but this way the outputs are identical.

@JHHatfield
Contributor Author

Sounds good, I will have a look. The quick fix for formatOccData works pretty well, but when I started to look at occDetFunc it became clear the same approach isn't really going to work there. I will look at memory usage when switching occDetFunc over to tidyverse functions. Although, what is the plan for the function going forward if you are bringing NIMBLE in?

@03rcooke 03rcooke added this to the Sprint October/November 2022 milestone Oct 10, 2022
@03rcooke
Contributor

I think most of the code in occDetFunc will stay the same; it'll just be that there is an option to run the model in NIMBLE rather than JAGS. So I think it's worth thinking about how we could reduce the memory use and increase the speed of all the old reshape2 bits of code. There's also the tidyfast package (https://github.com/TysonStanley/tidyfast), which has a dt_pivot_wider() function that I think basically runs data.table::dcast but fits more neatly into a pipeline that uses data frames.
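As a rough sketch of how dt_pivot_wider() might slot in (argument names are assumed to follow tidyr's pivot_wider conventions, and the toy temp data frame is hypothetical; the exact tidyfast signature should be checked against its documentation before use):

```r
library(tidyfast)
library(dplyr)

# Hypothetical long-format input, as in the earlier examples in this thread
temp <- data.frame(
  visit        = c("v1", "v1", "v2"),
  species_name = c("a", "b", "a"),
  pres         = c(TRUE, TRUE, TRUE)
)

# dt_pivot_wider() is assumed to wrap data.table::dcast under the hood,
# while composing with data.frame-based pipelines like the dplyr one above
spp_vis <- temp %>%
  tidyfast::dt_pivot_wider(names_from = species_name, values_from = pres) %>%
  dplyr::arrange(visit)
```

If this benchmarks close to plain data.table::dcast, it could give most of the memory saving without the setDT()/setDF() round trips.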
